Problem building scATAC-seq count matrice from bams

hildemann commented 3 years ago

When trying to follow the tutorial on creating scATAC-seq count matrices from bam files I encountered this issue in the second to last step: epi.ct.bld_atac_mtx(list_bam_files=list_cells, loaded_feat=peaks, output_file_name='test_ATAC_mtx.txt', path=path_to_play_data, writing_option='w', header=peaks_names)

This yielded the following error log :

`AttributeError Traceback (most recent call last)

in ----> 1 epi.ct.bld_atac_mtx(list_bam_files=list_cells, 2 loaded_feat=peaks, 3 output_file_name='test_ATAC_mtx.txt', 4 path=path_to_play_data, 5 writing_option='w', ~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/count_matrix/_atac_mtx.py in bld_atac_mtx(list_bam_files, loaded_feat, output_file_name, path, writing_option, header, mode, check_sq, chromosomes) 94 #for read in samfile.fetch(until_eof=True): 95 for read in samfile: ---> 96 line = str(read).split('\t') 97 if line[2][3:] in chromosomes: 98 keep_lines.append(line[2:4]) ~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in __str__(self) 462 463 def __str__(self): --> 464 return self.__repr__() 465 466 def _range_popper(self, interval_start, interval_stop=None, front=True): ~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in __repr__(self) 444 self.read_name, 445 "{}".format(self.flag), --> 446 "{}".format(self.reference_name, self.tid), 447 "{}".format(self.pos + 1), 448 "{}".format(self.mapq), **_AttributeError: 'AlignedSegment' object has no attribute 'reference_name'_**` At first I suspected this error to come from the missing bam index files, however this is not the cause. Is this a familiar problem? Is there a quick fix to the problem? Thanks for advice

DaneseAnna commented 3 years ago

Hi, I haven't seen this error before, but I am curious to figure it out. I am trying to reproduce your error. You used the bam files from Buenrostro et al. 2018 or did you aligned the data yourself? I am wondering if there is a difference in the bam header.

Does it breaks right away ?

hildemann commented 3 years ago

Thanks for the quick support! Yes I used the bam files from Buenrostro et al. 2018 however downloaded them without headers (sam-dump default options). Other than that the bam files are sorted and indexed. Are the headers needed for the tutorial? Yes it breaks right away.

DaneseAnna commented 3 years ago

Hi, sorry for the delay! I managed to reproduce your error but I don't know what is wrong, yet.

Using sam-dump I downloaded a very strange one of your bam file. However the format is ver strange (not the usual sample layout). Can you check if your bam file look similar to mine ? It might be the reason why it doesn't work.

head_SRR5356154.txt

hildemann commented 3 years ago

Hi, yes, it looks identical. Yes it's a very odd layout, which I didn't notice until now.

hildemann commented 3 years ago

Hi Anna,

is there any news on this issue, has the source been indentified yet? Or maybe put differently: What would be your way of downloading the Buenrostro bams and integrating them into Episcanpy? Or do you suspect the problem to be in the format of the data?

Thanks a lot for your help,

DaneseAnna commented 3 years ago

Hi,

Sorry, I did not think about it anymore as the problem seems to be the file format when downloaded from NCBI. There is a couple of options available to solve this, one option is to run re-aligned the data but it is long and fastidious. Another option is to download the aligned bam files from Chen et al. 2019 benchmarking (https://github.com/pinellolab/scATAC-benchmarking). They make the data available to download here https://www.dropbox.com/sh/8o8f0xu6cvr46sm/AAB6FMIDvHqnG6h7athgcm5-a/Buenrostro_2018.tar.gz?dl=0

Just one more thing. This dataset is plate based and it produce one cell per bam file. Thus it takes up a lot of memory and take a long time to build. Other methods available like the one from 10x Genomics produce multiplex bam files that can be used to build larger count matrices in a easier way.

I hope it helps, we already build matrices using the Buenrostro dataset. I can share a script with you if you need.

hildemann commented 3 years ago

Hi Anna,

thanks a lot for pointing out this github repository. Now having looked into the Buenrostro_2018.tar.gz, i tried to work with two of the bam files as followed (the rest of notebook is identical to your tutorial):

path_to_play_data ='~/buenrostroBams/sc-bams_nodup/' file_annot_name = '~/samData/GSE96769_PeakFile_20160207.bed'

As annotation file i used the original datasets Peakfile from ncbi. At first only trying on two of the bam files # list of the bam files you want to build a count matrix for list_cells =['BM1077-CLP-Frozen-160106-13.dedup.st.bam', 'BM1077-CLP-Frozen-160106-14.dedup.st.bam',]

peaks = epi.ct.load_features(file_annot_name) peaks_names = epi.ct.name_features(peaks)

And eventually the command: epi.ct.bld_atac_mtx(list_bam_files=list_cells, loaded_feat=peaks, output_file_name='test_ATAC_mtx.txt', path=path_to_play_data, writing_option='w', header=peaks_names)

Results yet again in : `--------------------------------------------------------------------------- AttributeError Traceback (most recent call last)

in ----> 1 epi.ct.bld_atac_mtx(list_bam_files=list_cells, 2 loaded_feat=peaks, 3 output_file_name='test_ATAC_mtx.txt', 4 path=path_to_play_data, 5 writing_option='w', ~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/count_matrix/_atac_mtx.py in bld_atac_mtx(list_bam_files, loaded_feat, output_file_name, path, writing_option, header, mode, check_sq, chromosomes) 94 #for read in samfile.fetch(until_eof=True): 95 for read in samfile: ---> 96 line = str(read).split('\t') 97 if line[2][3:] in chromosomes: 98 keep_lines.append(line[2:4]) ~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in __str__(self) 462 463 def __str__(self): --> 464 return self.__repr__() 465 466 def _range_popper(self, interval_start, interval_stop=None, front=True): ~/anaconda3/envs/bsh/lib/python3.8/site-packages/bamnostic/core.py in __repr__(self) 444 self.read_name, 445 "{}".format(self.flag), --> 446 "{}".format(self.reference_name, self.tid), 447 "{}".format(self.pos + 1), 448 "{}".format(self.mapq), **AttributeError: 'AlignedSegment' object has no attribute 'reference_name'**` Now does this have to do with the Annotation Peak file? I thought coming from the official ncbi page there should be nothing wrong with that. However, coming back to your kind offer it would be great if you were to share the script, with which you build the matrices from the dataset. Probably that would make things much clearer for me

DaneseAnna commented 3 years ago

Hey, sorry for the delay. It has been a couple of very busy days.

I think it is still the Buenrostro header that is a little weird. You can either use the pybedtool branch and the following script:

!pip install git+https://github.com/colomemaria/epiScanpy@pybedtools import episcanpy.api as epi windows = epi.ct.make_windows(5000) bamfile = ['/Users/annadanese/Desktop/BM1077-CMP-Frozen-160106-83.dedup.st.bam'] epi.ct.bld_atac_mtx_pybedtools(list_bam_files=bamfile, loaded_feat=windows, output_file_name='test.h5ad')

OR: we used the following script when we worked with these data. https://www.dropbox.com/s/i4hbhvzns5sfd8y/run_buenrostro_1_build_array.sh?dl=0

Let me know if any of it is working !

Best, Anna

hildemann commented 3 years ago

Hi @DaneseAnna ,

no worries, your help has been excellent. So after trying out what you suggested and working around a few minor problems i now have a anndata object with correct objects and the properties:

"AnnData object with n_obs × n_vars = 3 × 491436 obs: 'cell_name', 'label', 'cell_type', 'facs_label' uns: 'omic'"

Created on three bam files. All the objects are as they should be however, editing variables or loading the gene/transcript annotation doesn't work. The variables are still the numbers 0 to 491435, which makes it hard for me to do an analysis with the object.

I downloaded the data as suggested in tutorial with: !wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz -O data/data_tutorial_buenrostro/gencode.v19.annotation.gtf.gz !cd data/data_tutorial_buenrostro; gunzip gencode.v19.annotation.gtf

Then running the command (copied from the tutorial under section "Load gene/transcript annotation"): epi.tl.find_genes(adata, gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf', key_added='transcript_annotation', upstream=2000, feature_type='transcript', annotation='HAVANA', raw=False)

Threw

IndexError Traceback (most recent call last)

in ----> 1 epi.tl.find_genes(adata, 2 gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf', 3 key_added='transcript_annotation', 4 upstream=2000, 5 feature_type='transcript', ~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/tools/_find_genes.py in find_genes(adata, gtf_file, key_added, upstream, feature_type, annotation, raw) 43 line = line.split('_') 44 if line[0] not in raw_adata_features.keys(): ---> 45 raw_adata_features[line[0]] = [[int(line[1]),int(line[2]), feature_index]] 46 else: 47 raw_adata_features[line[0]].append([int(line[1]),int(line[2]), feature_index]) **IndexError: list index out of range** --------------------------------------------------------------------------- Is this a familiar error to you? Or do you know why it is thrown? Thanks a lot for the help

DaneseAnna commented 3 years ago

Hi, sorry ! I did not see your message. Which variable did you use? windows = epi.ct.make_windows(5000) or something else? Theoretically, the name of the variable, if specified, should be saved in adata.var_names.

Best, Anna

hildemann commented 3 years ago

No worries @DaneseAnna Well, i tried to specify the variable names by giving it:

"file_annot_name = '/nfs/home/students/vhildemann/samData/GSE96769_PeakFile_20160207.bed' peaks = epi.ct.load_features(file_annot_name) peaks_names = epi.ct.name_features(peaks)"

and passing option "loaded_feat=peaks" into the "bld_atac_mtx_pybedtools" command

However still there are these numbers only & adding transcript annotation fails, bc of the index error.

adata.var_names gives me:

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ... '491426', '491427', '491428', '491429', '491430', '491431', '491432', '491433', '491434', '491435'], dtype='object', length=491436)

Thanks

DaneseAnna commented 3 years ago

Ok, this shouldn't be. I think it is an error specific to the pybedtool function. I will fix it. In the mean time you can change it by writing adata.var_names = peaks. That should fix the problem as long as they are the same length. Then, epi.tl.find_genes should work if you have variable names. However it takes a long time to run and you might prefer to use it on a smaller feature space. Or, you can build a gene activity matrix that will give you a new AnnData object with genes as the feature space. It's api.tl.geneactivity

Best, Anna

hildemann commented 3 years ago

Hi @DaneseAnna ,

thank you for the suggestions ! They solved most of the problems. Finally the analysis is running almost as it should be. Now what is a interpretational question i couldn't solve on my own is, in the tutorial: What the function 'epi.pp.normalize_total(adata)' does and how it changes the visualizations of the data? ( umaps line 42 compared to umaps line 49) In the documentation i couldn't find a clear explanation on that.

However apart from that, still the last command: 'epi.pl.rank_feat_groups_matrixplot(adata)' is throwing:

"Warning: dendogram data not found(using key=dendogram_louvain).Running 'sc.tl.dendogram' with default parameters. For fine tuning it is recommended to run 'sc.tl.dendogram' independently. "

Followed by:

ValueError Traceback (most recent call last)

in ----> 1 epi.pl.rank_feat_groups_matrixplot(adata) ~/anaconda3/envs/bsh/lib/python3.8/site-packages/episcanpy/plotting/_scanpy_plotting.py in rank_feat_groups_matrixplot(adata, groups, n_features, groupby, key, show, save) 468 Are passed to :func:`~scanpy.pl.matrixplot`. 469 """ --> 470 sc.pl.rank_genes_groups_matrixplot(adata, groups, n_genes=n_features, groupby=groupby, 471 key=key, show=show, save=save) 472 ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_tools/__init__.py in rank_genes_groups_matrixplot(adata, groups, n_genes, groupby, key, show, save, **kwds) 637 """ 638 --> 639 _rank_genes_groups_plot( 640 adata, 641 plot_type='matrixplot', ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_tools/__init__.py in _rank_genes_groups_plot(adata, plot_type, groups, n_genes, groupby, key, show, save, **kwds) 406 elif plot_type == 'matrixplot': 407 from .._anndata import matrixplot --> 408 matrixplot(adata, gene_names, groupby, var_group_labels=group_names, 409 var_group_positions=group_positions, show=show, save=save, **kwds) 410 ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in matrixplot(adata, var_names, groupby, use_raw, log, num_categories, figsize, dendrogram, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, standard_scale, swap_axes, show, save, **kwds) 2181 2182 if dendrogram: -> 2183 dendro_data = _reorder_categories_after_dendrogram( 2184 adata, 2185 groupby, ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in _reorder_categories_after_dendrogram(adata, groupby, dendrogram, var_names, var_group_labels, var_group_positions) 3109 """ 3110 -> 3111 key = _get_dendrogram_key(adata, dendrogram, groupby) 3112 3113 dendro_info = adata.uns[key] ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/plotting/_anndata.py in _get_dendrogram_key(adata, dendrogram_key, groupby) 3195 "tuning it is recommended to run `sc.tl.dendrogram` independently." 3196 ) -> 3197 dendrogram(adata, groupby, key_added=dendrogram_key) 3198 3199 if 'dendrogram_info' not in adata.uns[dendrogram_key]: ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scanpy/tools/_dendrogram.py in dendrogram(adata, groupby, n_pcs, use_rep, var_names, use_raw, cor_method, linkage_method, optimal_ordering, key_added, inplace) 130 corr_matrix, method=linkage_method, optimal_ordering=optimal_ordering 131 ) --> 132 dendro_info = sch.dendrogram(z_var, labels=categories, no_plot=True) 133 134 # order of groupby categories ~/anaconda3/envs/bsh/lib/python3.8/site-packages/scipy/cluster/hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color) 3275 "'bottom', or 'right'") 3276 -> 3277 if labels and Z.shape[0] + 1 != len(labels): 3278 raise ValueError("Dimensions of Z and labels must be consistent.") 3279 ~/anaconda3/envs/bsh/lib/python3.8/site-packages/pandas/core/indexes/base.py in __nonzero__(self) 2147 2148 def __nonzero__(self): -> 2149 raise ValueError( 2150 f"The truth value of a {type(self).__name__} is ambiguous. " 2151 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." **ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().** --------------------------------------------------------------------------- Is this a common error? Both running 'sc.tl.dendogram' or 'a.all()' independently, hasn't worked for me. Thanks a lot!

colomemaria / epiScanpy

Problem building scATAC-seq count matrice from bams #69