caleblareau / maegatk

Mitochondrial Alteration Enrichment and Genome Analysis Toolkit
MIT License
17 stars 2 forks source link

Comparing mgatk and maegatk #8

Closed dy-lin closed 1 year ago

dy-lin commented 1 year ago

I'm looking to compare the calls between mgatk and maegatk. I've read the FAQ and know that they should produce very similar results. I'm currently working with scRNA-seq data from 10X Multiome kit, processed with Cellranger-ARC.

How is a comparison made? For mgatk, I can read the results into a Seurat object using ReadMGATK(). However this function fails for maegatk output as it has more columns. Is there a way to output a Seurat object RDS like there is a Signac object RDS for mgatk?

Are there any suggestions to make a scRNA via mgatk vs a scRNA via maegatk comparison? Ultimately I will compare scATAC via mgatk against scRNA via maegatk.

caleblareau commented 1 year ago

The tricky thing is that there isn't an obvious way of calling variants with maegatk since there is only 1 strand. In the MAESTER paper, there were a variety of filtering strategies to identify high-confidence variants. An mgatk versus maegatk comparison will show that most of the per cell, per variant allele counts will be similar.

The logic behind the ReadMGATK() --> variant calling in Seurat fails again because there is no strand concordance to call. @petervangalen do you have any advice on the best scripts to refer to call variants from maegatk output?

petervangalen commented 1 year ago

To call informative variants from maegatk output, I generated a heteroplasmy/allele frequency matrix for all possible variants (af.dm) and applied thresholds depending on the experimental question, such as "variants with 20-fold coverage, VAF of <10% in the bottom 10% of cells and VAF of >90% in top 10% of cells" (TenX_CellLineMix) or "variants present in at least 10 cells with a VAF of >50%" (CH_Sample) or "variants with differing presence between samples (or cell subsets) from the same patient" (GBM). This would yield on the order of 10 to 30 informative variants of which >80% transitions. Then I would consider cells to be part of a clone based on detecting these informative variants (e.g. heteroplasmy > 1%). As I said, this downstream analysis depends on the biological question and generally required a lot of iteration to find the best thresholds for each experiment. I am still looking for a more standardized approach.

We are also planning to compare scATAC vs. scRNA-derived variants, so it might be interesting to compare notes. Hope that helps!