Downstream analysis after running AMULET.sh

Erfan1369 commented 4 months ago

Dear Developers,

Thanks for introducing such a tool to detect multiplets in single-cell epigenomic data.

I’m completely new to using this tool and have some simple questions. I have successfully run AMULET.sh on node clusters and have the output, which includes 6 files, all in txt format:

MultipletCellIds_01, MultipletBarcodes_01, MultipletProbabilities, MultipletSummary, Overlaps, OverlapSummary

Except for the summary file, does the MultipletBarcodes_01 reflect the potential multiplet cells that must be removed from the original data? If so, is there any downstream analysis pipeline in AMULET to follow up? If not, what would be the next step? Do I need to remove these cells using other scATAC-seq analysis tools like Signac, ArchR, etc.? To ask the question more comprehensively, how can the unwanted multiplets be removed from the output of the Cell Ranger pipeline after detecting them with AMULET?

The Overlaps.txt provides useful information about the number of fragments for each cell in addition to the overlap information. In my output, the CellIds and Barcodes are similar (which I think is normal as each barcode is a representation of one cell). However, looking at the number of fragments shows a contradiction in the results: when the cells and barcodes are quite similar, why is the number of reads for each barcode and cell id slightly different? Is this normal? Or should I rerun the analysis to tune the parameters to remove the contradiction?

[Screenshot 2024-07-10 145106: excerpt of the Overlaps.txt table]

Thanks for your assistance.

ajt986 commented 4 months ago

Hello!

The barcodes that AMULET lists as multiplets should be the cell IDs in your scATAC-seq object. What I recommend is to create a new feature column that has values either 0 (barcode IS NOT in MultipletBarcodes_01) or 1 (barcode IS in MultipletBarcodes_01). Then, I would inspect the feature plots using this new feature column you created to see where the multiplets are in the object:

1. Identify clusters with a high proportion of multiplet-annotated cells and remove those clusters entirely. These are the heterotypic multiplets (derived from multiple cell types). Some multiplets in these clusters may go unannotated because the cells have lower read depth, so removing the whole cluster eliminates those potential false negatives.
2. After removing all heterotypic multiplet clusters, remove the remaining multiplet-annotated cells/barcodes. These are the homotypic multiplets (derived from the same cell type).
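
A minimal sketch of the flagging step described above, in plain Python. It assumes that MultipletBarcodes_01.txt holds one barcode per line and that you have exported barcode-to-cluster assignments from your own analysis. The function names and the toy barcodes are hypothetical and are not part of AMULET.

```python
from collections import defaultdict


def read_multiplet_barcodes(path):
    """Read one barcode per line (assumed layout of MultipletBarcodes_01.txt)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def flag_multiplets(barcodes, multiplets):
    """Return {barcode: 1 if AMULET flagged it, else 0}."""
    return {bc: int(bc in multiplets) for bc in barcodes}


def multiplet_fraction_per_cluster(cluster_of, flags):
    """Return {cluster: fraction of its cells flagged as multiplets}."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for bc, cluster in cluster_of.items():
        total[cluster] += 1
        flagged[cluster] += flags.get(bc, 0)
    return {cluster: flagged[cluster] / total[cluster] for cluster in total}


# Toy example with made-up barcodes and cluster labels.
cluster_of = {"AAAC-1": "c0", "AAAG-1": "c0", "CCCA-1": "c1", "CCCT-1": "c1"}
multiplets = {"CCCA-1", "CCCT-1"}  # in practice: read_multiplet_barcodes(...)
flags = flag_multiplets(cluster_of, multiplets)
fractions = multiplet_fraction_per_cluster(cluster_of, flags)
```

The 0/1 values in `flags` can be added as a metadata column in whichever scATAC-seq toolkit you use, and `fractions` gives a quick per-cluster view alongside the feature plots.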

For the Overlaps.txt table you shared, the Cell ID should be the same as the barcode in the default usage of the current implementation. Cell Ranger ATAC used to have another identifier for the cell, but in AMULET we later switched to using the barcode as the Cell ID. For the read information, "Number of Valid Reads" is different from "Total Number of Reads": valid reads are the ones that remain after filtering out low-quality mapped reads, duplicates, multi-secondary reads, etc. Sorry for the confusion here!

Hoping this clarifies everything! Asa

Erfan1369 commented 4 months ago

Thanks for your quick reply and suggestions,

I’m going to try what you have advised here. A few things to make sure I’m on the right road:

So, I must remove all the multiplets after filtering the data based on QC metrics like TSS enrichment score, fraction of fragments in peaks, etc.? Secondly, is it important to optimize the number of clusters to remove multiplets, or would casual clustering be sufficient? I guess after clustering there might be different scenarios: clusters with only a few (or no) multiplets, or clusters with a high coverage of multiplets (not all of the cells in the cluster are considered multiplets). What would be the best decision here? Do I remove only the cells (barcodes) that are detected as multiplets, or is it better to remove the whole cluster when it is “contaminated” with a high percentage of multiplets?

Sorry if it is not completely clear.

Erfan

ajt986 commented 4 months ago

Yes, I recommend running AMULET after initial cell QC so that the mean read depth distributions used in the Poisson distribution are driven by high quality cells and not by low quality ones.

For removing clusters with doublets, I'm mainly talking about clusters that are almost entirely annotated as doublets. When you find these clusters, you'll see that the cells not annotated as doublets are the ones with lower read depth. Essentially, the lower read depth cells are not saturated enough to show enough instances of > 2 reads overlapping (for diploid cells), and hence will be missed by AMULET. For clusters that meet these criteria, it is better to remove the entire cluster. I would check for these clusters at the overall level and again when clustering within specific cell type lineages. For example, with PBMCs, I would check for these clusters first when clustering total PBMCs. Then I would check clustering within just the CD4+ T, CD8+ T & NK, Myeloid, and B cell lineages (4 subsets) to clean out these clusters.

Of course, after cleaning out the clusters described above, you should remove any remaining cells that AMULET marked as doublets/multiplets. Homotypic multiplets will inflate the reads per cell and might impact other downstream analyses.
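
The two-step cleanup described above could be sketched roughly as follows. This is only an illustration: `cluster_threshold` is a hypothetical cutoff for "almost entirely annotated as doublets", not an AMULET default, and should be chosen by inspecting your own clusters.

```python
def remove_multiplets(cluster_of, multiplets, cluster_threshold=0.8):
    """Drop multiplet-dominated clusters, then any leftover flagged cells.

    cluster_of: {barcode: cluster label} from your own clustering.
    multiplets: set of barcodes flagged by AMULET.
    cluster_threshold: hypothetical cutoff, not an AMULET default.
    """
    sizes, flagged = {}, {}
    for bc, cluster in cluster_of.items():
        sizes[cluster] = sizes.get(cluster, 0) + 1
        flagged[cluster] = flagged.get(cluster, 0) + (bc in multiplets)

    # Step 1: whole clusters dominated by multiplets (heterotypic).
    bad_clusters = {c for c in sizes if flagged[c] / sizes[c] >= cluster_threshold}

    # Step 2: remaining flagged cells in the surviving clusters (homotypic).
    kept = {
        bc: c
        for bc, c in cluster_of.items()
        if c not in bad_clusters and bc not in multiplets
    }
    return kept, bad_clusters


# Toy data: c1 is 4/5 flagged, c0 is 1/4 flagged.
cluster_of = {
    "a1": "c0", "a2": "c0", "a3": "c0", "a4": "c0",
    "b1": "c1", "b2": "c1", "b3": "c1", "b4": "c1", "b5": "c1",
}
multiplets = {"a4", "b1", "b2", "b3", "b4"}
kept, bad_clusters = remove_multiplets(cluster_of, multiplets)
```

In the toy data, the unflagged cell `b5` is removed along with its cluster, which mirrors the point about low read depth cells that AMULET misses.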

Side Note: These multiplet clusters will be easier to see if you combine cells from multiple samples. But keep in mind that AMULET should only be run one sample at a time since the read depth distributions differ between samples. Also be careful with barcodes as they can be the same between samples and should include a prefix or suffix ID to distinguish barcodes between samples when combining.
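
As a small illustration of the barcode caveat above, here is one way to prefix barcodes with a sample ID before combining samples. The sample IDs, the separator, and the function names are made up for the example; the same prefix would also need to be applied to each sample's AMULET multiplet list so the two still match.

```python
def prefix_barcodes(barcodes, sample_id, sep="_"):
    """Prepend a sample ID so identical barcodes from different samples stay distinct."""
    return [f"{sample_id}{sep}{bc}" for bc in barcodes]


def combine_samples(samples):
    """samples: {sample_id: list of barcodes}. Returns one list of prefixed barcodes."""
    combined = []
    for sample_id, barcodes in samples.items():
        combined.extend(prefix_barcodes(barcodes, sample_id))
    if len(set(combined)) != len(combined):
        raise ValueError("duplicate barcodes remain after prefixing")
    return combined


# The same raw barcode appears in both (hypothetical) samples.
samples = {"S1": ["AAAC-1", "CCCA-1"], "S2": ["AAAC-1"]}
combined = combine_samples(samples)
```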

Erfan1369 commented 4 months ago

Thanks! Awesome!

Now I have a picture of how to perform the analysis with AMULET to get the real and right cells for downstream analysis.

Would you recommend optimizing the clustering before or after detecting multiplets?

Is there any implementation of AMULET to use in other tools like Signac, ArchR, etc.? Or do I need to create a new fragment and metadata file after QC processing for each sample to detect the multiplets with AMULET?

ajt986 commented 4 months ago

Speaking from my own experience, a less optimized clustering before removing multiplets was sufficient. I usually perform clustering in 2 passes, with the first pass removing multiplet/low quality clusters, followed by a more rigorous reclustering (from the beginning) with the remaining high quality cells.

As for other tools with an implementation of AMULET, I'm aware of scDblFinder having one: https://bioconductor.org/packages/release/bioc/html/scDblFinder.html

You shouldn't need to create a new fragment file, just provide a new single cell CSV file with the low QC cells excluded.
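
A rough sketch of producing such a filtered CSV with the Python standard library. It assumes the single cell CSV has a header row with a `barcode` column; the `is__cell_barcode` column in the toy input and the function name are assumptions for illustration, so check the column names your AMULET run actually expects.

```python
import csv
import io


def write_filtered_singlecell_csv(src, dst, keep_barcodes, barcode_column="barcode"):
    """Copy rows whose barcode passed QC; all other columns are left untouched.

    src, dst: open text file objects. Returns the number of rows written.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[barcode_column] in keep_barcodes:
            writer.writerow(row)
            kept += 1
    return kept


# Toy input standing in for a single cell CSV (column names are assumptions).
src = io.StringIO("barcode,is__cell_barcode\nAAAC-1,1\nAAAG-1,1\nCCCA-1,1\n")
dst = io.StringIO()
n_kept = write_filtered_singlecell_csv(src, dst, {"AAAC-1", "CCCA-1"})
```

With real files, `src` and `dst` would be `open(..., newline="")` handles, and `keep_barcodes` would come from the QC step in your analysis tool.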

UcarLab / AMULET

Downstream analysis after running AMULET.sh #24