calico / scBasset

Sequence-based Modeling of single-cell ATAC-seq using Convolutional Neural Networks.
Apache License 2.0

testing non-10X data set? #2

Closed willey2020 closed 2 years ago

willey2020 commented 2 years ago

Hello! I have a question about running scBasset on non-10x data sets. If I have BAM files, or only a peak count matrix, is there any way to import them into scBasset? Thank you very much!

hy395 commented 2 years ago

Hi, sorry for the late response. I have been working on improving scBasset for the last few months, and I have updated the package with a new tutorial. Yes, scBasset can be applied to non-10x input, as long as you convert your peak-by-cell matrix into anndata format, where .var contains "chr", "start", and "end" columns. scBasset can then do the preprocessing.
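A minimal sketch of the conversion described in this reply, assuming the peaks are named like "chr1:100-200", "chr1-100-200", or "chr1_100_200". The helper names `parse_peak` and `build_anndata` are hypothetical and not part of scBasset; only the "chr", "start", and "end" column names come from the reply. The input is assumed to be peaks x cells and is transposed, because anndata stores cells as rows (.obs) and peaks as columns (.var).

```python
import re

# Assumed peak-name styles: "chr1:100-200", "chr1-100-200", "chr1_100_200".
PEAK_RE = re.compile(r"^(?P<chr>.+?)[:_-](?P<start>\d+)[_-](?P<end>\d+)$")


def parse_peak(name):
    """Split one peak name into the chr/start/end fields expected in .var."""
    m = PEAK_RE.match(name)
    if m is None:
        raise ValueError(f"cannot parse peak name: {name!r}")
    return {"chr": m["chr"], "start": int(m["start"]), "end": int(m["end"])}


def build_anndata(counts, peak_names, cell_names):
    """Build an AnnData (cells x peaks) from a peaks x cells count matrix.

    Hypothetical helper, not part of scBasset.
    """
    # Third-party imports are kept local so parse_peak has no dependencies.
    import anndata
    import pandas as pd
    from scipy import sparse

    var = pd.DataFrame([parse_peak(p) for p in peak_names], index=list(peak_names))
    obs = pd.DataFrame(index=list(cell_names))
    # Transpose peaks x cells -> cells x peaks and store as sparse CSR.
    X = sparse.csr_matrix(counts).T.tocsr()
    return anndata.AnnData(X=X, obs=obs, var=var)
```

With a peak matrix exported from another tool, `build_anndata(matrix, peak_names, cell_names)` would return an object that can be written out with `.write_h5ad(...)`; whether further columns are needed depends on the scBasset version in use.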

willey2020 commented 2 years ago

Thank you so much!! Will try!

willey2020 commented 2 years ago

Hello! Thank you again for your answer! Sorry to bother you again. Could you give a quick suggestion on how to input non-10x data such as Buenrostro's raw data? Following your advice, instead of using sc.read_10x_h5 I used sc.read_csv to convert a peak matrix saved from Signac into anndata format. Everything looked fine until the preprocessing script failed with:

    Traceback (most recent call last):
      File "/home/gougou/scBasset/bin/scbasset_preprocess.py", line 60, in <module>
        main()
      File "/home/gougou/scBasset/bin/scbasset_preprocess.py", line 54, in main
        make_h5_sparse(ad, '%s/all_seqs.h5'%output_path, input_fasta)
      File "/home/gougou/scBasset/scbasset/utils.py", line 136, in make_h5_sparse
        m = m.tocoo().transpose().tocsr()
    AttributeError: 'numpy.ndarray' object has no attribute 'tocoo'

I found that the counts in the anndata built from 10x input are stored as a sparse csr_matrix, whereas my anndata holds an ndarray.

I think the example Buenrostro raw data is non-10x, so I guess it did not go through sc.read_10x_h5. Could you provide any guidance on how to import that type of data into an anndata that is accepted by your downstream pipeline? Thank you so much!

willey2020 commented 2 years ago

I just fixed the issue. The cause was that the ndarray inside the anndata has to be converted to sparse matrix format. I used sparse.csr_matrix to convert it, and after that the preprocessing works. Thank you very much again, and sorry for bothering you with this small issue.
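The fix described in this comment can be sketched as follows. This assumes the conversion is applied to the matrix held in `ad.X` rather than to the anndata object itself; `to_csr` is a hypothetical helper name, not a scBasset function. The last line of the demo repeats the call from the traceback (`tocoo().transpose().tocsr()`), which needs a SciPy sparse matrix.

```python
import numpy as np
from scipy import sparse


def to_csr(x):
    """Return x as a SciPy CSR matrix, whether it starts dense or sparse."""
    if sparse.issparse(x):
        return x.tocsr()
    return sparse.csr_matrix(x)


# Small dense matrix standing in for the ndarray that sc.read_csv produces.
dense = np.array([[0, 1, 0],
                  [2, 0, 3]])
m = to_csr(dense)

# The call that failed on an ndarray now succeeds on the sparse matrix.
flipped = m.tocoo().transpose().tocsr()
```

On a real object this would be applied as `ad.X = to_csr(ad.X)` before running the preprocessing script.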

hy395 commented 2 years ago

Hi,

Glad the problem is solved. Yes, scBasset assumes anndata.X is in sparse format. I'll add a note to the README file. Thanks!

willey2020 commented 2 years ago

Thank you again!