caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit
http://caleblareau.github.io/mgatk
MIT License
101 stars 27 forks source link

Using call mode for bulk RNA-seq data. #40

Closed francat closed 3 years ago

francat commented 3 years ago

Hi there!

I am trying use mgatk's call mode to call variants on bulk RNA-seq data. It appears the following files are not generated.

.variant_stats.tsv.gz .cell_heteroplasmic_df.tsv.gz *.vmr_strand_plot.png

I used the following command on a directory that contains multiple aligned bams from different indviduals. mgatk call -i bams_og -n GTEX_4 -o /global/scratch/fran_catalan/Data/MitoData/GTEX_v8/Liver/mgatk_test/Liver_4 -c 24 -g hg38 --snake-stdout

A summary of .log files

cat Liver_4/logs/*log Wed Jul 21 14:51:45 PDT 2021: Starting analysis with mgatk Wed Jul 21 14:51:45 PDT 2021: mgatk will process 226 samples Wed Jul 21 14:52:43 PDT 2021: Processing samples with 24 threads Wed Jul 21 20:17:34 PDT 2021: mgatk successfully processed the supplied .bam files Wed Jul 21 20:40:40 PDT 2021: Successfully created final output files Wed Jul 21 20:40:53 PDT 2021: Intermediate files successfully removed.

Describe the sequencing assay being analyzed

I'm analyzing RNA bulk data from GTEX v8. I want to compare the variants I find through mgatk with the variants found in Ludwig 2020 et al (awesome paper btw!) I previously tried my own custom variant pipeline where I used gatk's mutect2 on mitochondrial mode, but it returned very few unique heteroplasmic variants. I

Clarify if the execution successful on the test data provided in the repository

The three extra files aren't generated when I run the call or tenx mode on the test data provided.

Additional context

Add any other context about the problem here, including for example if you have run the tool successfully in other contexts.

I'm hoping you might have some advice on how to best compare heteroplasmic variants between groups of individuals from the output that mgatk generates. Apologies that my question is a little off topic, mundane, and doesn't necessarily pertain to single cell, but thank you so much for your time!

caleblareau commented 3 years ago

Thanks for the detailed message!

In brief, the outputs aren't produced by design for call mode. The variant statistics that would be computed simply are appropriate only for multiple single-cells/samples for within a given donor. The output would mostly tell you which variants are on haplotypes that differ among the GTEx population based on the assumptions of the framework. This simply isn't the right approach for this data type.

I would recommend either freebayes (has worked well in my experience) or to check out the recent preprint from the Broad on doing mtDNA variant calling: https://gnomad.broadinstitute.org/news/2021-07-gnomad-v3-1-mitochondrial-dna-variants-manuscript/