UcarLab / AMULET

A count based method for detecting doublets from single nucleus ATAC-seq (snATAC-seq) data.
https://ucarlab.github.io/AMULET/
GNU General Public License v3.0
29 stars 5 forks source link

Can AMULET identify doublets with relatively low read counts from cells with a wide range of read counts #17

Open Yydtqqqg opened 2 years ago

Yydtqqqg commented 2 years ago

Hi! I am applying AMULET on snATAC-seq data generated from my lab. For a dataset like this, after filtering with TSSE >= 10 and log10(UQ + 1) >= 3 (UQ means the number of uniquely mapped fragments), the cells still have a wide range of UQ.

屏幕快照 2022-04-04 14 23 37

Since the number of overlaps (genomic regions with >2 overlapping reads) for a cell positively correlates with the UQ of the cell, cells have high UQ tend to have a larger number of overlaps. Because the majority of cells fall in the range of 3 < log10(UQ + 1) < 4, I believe the majority of doublets are also there (at least with the same order of magnitude). However, with a doublet formed by two singlets each with UQ = 3000 and a singlet with UQ = 30000, AMULET would very probably locate more overlaps in the singlet than in the doublet. The cells whose UQ > 10000 tend to have more overlaps and are thus more likely to be identified as doublets. The AMULET paper indicates that 25K median valid reads per nucleus is optimal, but is there a way to maximize the probability of finding doublets in the range of 3 < log10(UQ + 1) < 4 in this dataset given that there are cells in range 4 < log10(UQ + 1) < 5?

ajt986 commented 2 years ago

Ideally, singlets with UQ = 30000 wouldn't be identified as a doublet by AMULET since there shouldn't be many instances of overlaps >2. If it is identified as a doublet by AMULET, then the overlaps are likely driven by repetitive regions, copy number variations, or sequencing errors. AMULET tries to take repetitive regions into account by filtering regions identified by repeat masker.

For better identifying doublets within 3 < log10(UQ + 1) < 4, you could try running AMULET on just those subset of cells. For this, you can provide a different singlecell.csv file with only cells in this range. It would adjust the background estimate used for the poisson distribution. Changing the statistical method to something that accounts for the total unique fragments within the cell does sound like a great way to improve the method!

Yydtqqqg commented 2 years ago

Thank you! I think adjusting the background estimate is the key here. The thing really confuses me is that AMULET finds some number of overlaps in many singlets. I plotted the number of overlaps for each cell in our dataset based on the OverlapSummary.txt file generated by the first step of AMULET:

屏幕快照 2022-04-04 15 15 54

There are 4084 cells in this dataset, and ~1000 of them has close to 0 overlaps, but the remaining cells more or less have some overlaps. One thing I know is that we sequenced very deep for our dataset. Should we adjust the expected number of overlaps from 2 to something bigger? I also noticed that the number of overlaps correlates positively to the read counts when I plotted read counts for the cells.

屏幕快照 2022-04-04 15 15 48

Is there a parameter in AMULET that can be adjusted to take sequencing depth into account (maybe normalizing by read counts)? If there is, I think this might be helpful to solve the problem.

ajt986 commented 2 years ago

The expected number (2) is based on the biology of how many copies of chromosomes are within diploid cells. What this means is that this number should only change based on the number of chromosome copies within the cells being sequenced. For example, finding multiplets in proliferating diploid cells will need to adjust this number since they may have up to 4 copies. Consequently, running AMULET on cells such as PBMCs may detect proliferating cells.

Unfortunately, AMULET doesn't have a built in option for accounting for sequencing depth/unique fragments within the cells. This would be a future work/extension of the method. However, I imagine that the AMULET output files that provide the overlap count information in conjunction with the number of unique reads within the cell can be easily utilized to apply a different statistical method for identifying the most probable multiplets.

Yydtqqqg commented 2 years ago

Thank you and good to know this! Can you suggest some statistical method for this situation?

Yydtqqqg commented 2 years ago

Hi! I think I found that reason why the number of overlaps correlates with the read counts for a cell. In the Overlaps.txt file, I found that >90% of the overlaps have >=3 fragments with exactly the same starting and ending position.

屏幕快照 2022-04-09 16 41 17

Here, each overlap in the head of the file seems to contain PCR duplicates. However, I believe that AMULET removes PCR duplicates and optical duplicates by checking the duplicate flag. Then, how could this situation happen and how can we solve it?

ajt986 commented 2 years ago

AMULET does remove the reads that are marked as duplicates. My first thoughts are that there is either a bug in the code that's producing these outputs, or the BAM file does not have the duplicates flagged. Do you mind sharing what version of AMULET you are running and what the inputs are (e.g., Cell Ranger ATAC 2.0 etc)?

Yydtqqqg commented 2 years ago

I am running AMULET-v1.1. The original inputs are count matrix . When I use this command to query if the BAM files have duplicates flagged samtools view XXX.bam | awk '(($2 >= 1024 && $2 < 2048) || ($2 >= 3072)){print}' | wc -l, it gives zero. What should I do in this case?

ajt986 commented 2 years ago

Processing the bam file to mark the duplicates would allow the tool to run. However, the algorithm to mark the duplicates would need to be aware that not all reads come from the same cell. Reads with the same start/end positions from different cells should not be marked as duplicates, but otherwise would be without this awareness. An alternative solution is to update AMULET to include an extra step to save the last processed read for each cell and skip reads with the same start/end position as the previous read from the corresponding cell. Hoping to find time to include this as an option for AMULET.

Yydtqqqg commented 2 years ago

Right! Could you suggest some tool that does not mark reads with identical start/end positions from different cells as duplicates? I am considering Picard, but I am not sure if it has such functionality.