dbrg77 / ISSAAC-seq


Question about Fig1B and ATAC/filtered_mtx/metrics.csv in this repository #2

Closed skelviper closed 9 months ago

skelviper commented 9 months ago

Hi!

First, I would like to express my appreciation for the ISSAAC-seq method and for how the GitHub repository is organized. It is clear and well structured, making it a great resource for researchers.

I have a question regarding the comparison with other methods, specifically about the ATAC/filtered_mtx/metrics.csv files within each directory. Should the columns in these files be interpreted as "cell name, fragments in ATAC peaks, number of ATAC peaks"? I noticed that for most cell types, which should be diploid, nCounts is twice nFeatures or more for some methods (e.g., 10x Multiome and ISSAAC-seq), while for others, like SHARE-seq, the ratio is roughly 1:1.

From my basic understanding, scATAC-seq experiments are prone to dropout, and detecting both alleles in a diploid cell can be challenging. Could you provide some insight into why nCounts is more than double nFeatures in some cases? How do the different experimental methods account for this variation?

Thank you for your time and assistance. Happy Spring Festival!

Zhiyuan Liu

dbrg77 commented 9 months ago

Hi @skelviper

Thanks for your question.

First, you are correct that scATAC-seq experiments are prone to dropout, because there are very few copies of DNA for each locus in a single cell: 2 to 4 copies in a diploid cell, depending on the cell cycle stage. On top of that, standard Tn5-based methods all have a 50% loss after tagmentation (see this Twitter thread), because a usable fragment needs two different adapters, one at each end.
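
The 50% figure can be checked with a short calculation. This is only a sketch, and it assumes (the assumption is not stated above) that each end of a fragment independently receives adapter A or adapter B with equal probability, and that only fragments with two different adapters are usable:

$$
P(\text{usable}) = P(A, B) + P(B, A) = \tfrac{1}{2} \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{2}
$$

Under that assumption, the remaining half of the fragments carry A/A or B/B and are lost.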

Regarding the "discrepancy" between methods in nCounts, I do not think it is due to the experimental method. The ATAC experiments in all the methods listed in the paper are the same, so it is unlikely that the method itself causes the difference. I think it is basically a technical matter: people use different standards for counting. This heavily influences how we should interpret the count matrix.

For our ISSAAC-seq and 10x Multiome data, we started with the FastQ files and performed peak calling and read counting ourselves: the properly mapped read pairs were treated as single-end reads, and peaks were called by MACS2. Then the reads in each peak were counted. I have explained my reasoning for doing this in this post, in this Twitter thread, and in some further discussion in MACS GitHub Discussion #435. Therefore, in the case of ISSAAC-seq and 10x Multiome, I'm sure that nCount means the number of reads in peaks, and nFeature means the number of peaks that have at least 1 read.
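
To make these two definitions concrete, here is a minimal sketch of how nCount and nFeature could be derived from a cell-by-peak count matrix. The function name `per_cell_metrics` and the toy matrix are hypothetical and for illustration only; this is not the code used in the repository.

```python
import numpy as np


def per_cell_metrics(counts):
    """Return (nCount, nFeature) per cell for a cells x peaks count matrix.

    nCount   = total number of reads falling in peaks for that cell
    nFeature = number of peaks with at least 1 read for that cell
    """
    counts = np.asarray(counts)
    n_count = counts.sum(axis=1)
    n_feature = (counts > 0).sum(axis=1)
    return n_count, n_feature


# Toy matrix (made-up numbers): 3 cells (rows) x 4 peaks (columns).
toy = np.array([
    [2, 0, 1, 4],
    [0, 0, 3, 0],
    [1, 1, 1, 1],
])

n_count, n_feature = per_cell_metrics(toy)
print(n_count)    # reads in peaks, per cell
print(n_feature)  # peaks with >= 1 read, per cell
```

With these definitions, nCount can exceed nFeature whenever a cell has more than one read in the same peak.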

For other technologies, the data were submitted to SRA. It was impossible to get the index reads from SRA, so we could not start from the FastQ files and go through our own analysis pipeline. Therefore, as you can see from the Snakefile, we just downloaded the count matrices provided by the authors. In this case, I'm not sure exactly how the peak calling was done or what the numbers actually mean. Maybe they are fragment counts, which in theory should be half of the read counts.
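
The difference between the two counting conventions can be sketched as follows. This is an illustration under stated assumptions, not any pipeline's actual implementation: the function `count_in_peak`, the 50 bp read length, and the coordinates are all hypothetical. Each fragment is counted once if it overlaps the peak, while each of its two end reads is counted separately when treated as a single-end read.

```python
def count_in_peak(fragments, peak_start, peak_end, read_len=50):
    """Return (n_fragments, n_reads) overlapping the peak [peak_start, peak_end).

    fragments: list of (start, end) half-open intervals on one chromosome.
    read_len:  assumed read length (hypothetical value for illustration).
    """
    n_fragments = 0
    n_reads = 0
    for frag_start, frag_end in fragments:
        # Fragment counting: one count if the fragment overlaps the peak.
        if frag_start < peak_end and frag_end > peak_start:
            n_fragments += 1
        # Read counting: the two end reads are treated as single-end reads.
        left_read = (frag_start, min(frag_start + read_len, frag_end))
        right_read = (max(frag_end - read_len, frag_start), frag_end)
        for r_start, r_end in (left_read, right_read):
            if r_start < peak_end and r_end > peak_start:
                n_reads += 1
    return n_fragments, n_reads


# Two fragments fully inside the peak, one hanging over the left edge.
fragments = [(1100, 1300), (1500, 1900), (900, 1200)]
print(count_in_peak(fragments, peak_start=1000, peak_end=2000))
```

In this toy case, fragments lying entirely inside the peak contribute two reads each, while the fragment crossing the peak boundary contributes only one, so the read count is close to, but not exactly, twice the fragment count.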

As for what read/fragment count we should expect in a peak, it depends. Sometimes we get large peaks (500 to 2000 bp) after peak calling. In this case, even in a single diploid cell, there can be many fragments/reads (more than 2) in a large peak.

What is the best practice? I don't know. There is this recent paper suggesting that fragment counts are superior. Maybe I should change our pipeline in the future.

I hope I answered your questions, and let me know if anything is still not clear.

Happy Chinese New Year!

Regards, Xi

skelviper commented 9 months ago

Hi Xi,

Your reply is very clear and detailed, thank you so much!

Zhiyuan