bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

How to make ATAC counts as bins #164

Open ttgump opened 4 months ago

ttgump commented 4 months ago

Hi, I have a question about using scGPT on ATAC data. Typical scATAC-seq data has binarized values (0 and 1), so how should the ATAC counts be binned? Should we use only two bins, a 0 bin and a 1 bin? Thanks.

subercui commented 4 months ago

I think that after preprocessing and peak calling, ATAC data can be mapped to small windows of genomic regions. Here is a related paper that may help by suggesting a number of tools: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3#Sec6 . To clarify, all the ATAC-seq data we used in the experiments are processed public datasets, taken after peak calling. @ChloeXWang, would you like to comment more on this question?

ttgump commented 4 months ago

Yes, I understand that ATAC-seq data will be a cell-by-peak matrix. My question is about the values in that matrix: they are binary, only 0 and 1. RNA-seq counts, by contrast, can be assigned to many bins (I think the default setting is 51 bins). How can we assign binary ATAC-seq counts to bins?

ChloeXWang commented 4 months ago

I see. The ATAC datasets we have been working with are not binary. Would you be able to access non-binary peak count matrices? Or would you mind elaborating a bit more on your usage scenario (e.g., clustering, integration)? We can see if there are any binning recommendations for your problem at hand.

ttgump commented 4 months ago

Yes, we can access the raw count matrix of the ATAC-seq data. Most values are 0, 1, or 2. Do you have any suggestions for binning?
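
To make the concern in this thread concrete, here is a minimal sketch of what per-cell quantile binning does to binary or low-count data. It is not scGPT's own preprocessing code. The helper `quantile_bin_row` is hypothetical, and the scheme (zeros stay in bin 0, nonzero values are split at evenly spaced quantiles into bins 1 to `n_bins - 1`) is an assumption chosen for illustration. The point is only that binning cannot create more distinct tokens than there are distinct input values, so a 51-bin setting may collapse to two or three occupied bins on this kind of data.

```python
import numpy as np


def quantile_bin_row(row, n_bins=51):
    """Hypothetical helper: quantile-bin one cell's peak counts.

    Assumed scheme (illustrative only, not taken from scGPT):
    zeros stay in bin 0; nonzero values get a bin in 1..n_bins-1
    based on evenly spaced quantiles of that cell's nonzero values.
    """
    row = np.asarray(row, dtype=float)
    binned = np.zeros(row.shape, dtype=np.int64)
    nonzero = row > 0
    if not nonzero.any():
        return binned
    values = row[nonzero]
    # n_bins - 1 edges, running from the min to the max nonzero value.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins - 1))
    # Bin index = number of edges <= value, so it lies in 1..n_bins-1.
    binned[nonzero] = np.searchsorted(edges, values, side="right")
    return binned


if __name__ == "__main__":
    binary_cell = np.array([0, 1, 0, 1, 1, 0, 0, 1])
    low_count_cell = np.array([0, 1, 2, 0, 1, 1, 2, 0])
    for name, cell in [("binary", binary_cell), ("low-count", low_count_cell)]:
        bins = quantile_bin_row(cell)
        print(name, "occupied bins:", np.unique(bins))
```

Under these assumptions, the number of occupied bins is bounded by the number of distinct values in the cell, whatever `n_bins` is set to. Whether a small bin count or a different encoding suits a given task (clustering, integration) is a question for the maintainers.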