Add specialized mtDNA analysis

williamrowell commented 2 years ago

Consider:

extracting all reads aligning to chrM
filtering reads with primary/supp alignments elsewhere (NUMTs)
realign with an aligner aware of circular references (graphmap, raptor)
call major variants, haplotypes
call minor variants?

crazysummerW commented 1 year ago

Hi, @williamrowell I am attempting to analyze PacBio mitochondrial data, but I have some confusion. pichuan told me that you might be able to give me some suggestions. Here is a specific description of my question: https://github.com/google/deepvariant/issues/686 I am a beginner in mitochondrial variant calling, and currently, I haven't found a very suitable open-source software. Could you provide me with some specific recommendations regarding aligning, filtering, and calling variants on PacBio mitochondrial data?

Looking forward to your reply. Thanks.

williamrowell commented 1 year ago

Repeating most of my response from the other ticket just for sake of documentation.

@crazysummerW DeepVariant works well for identifying SNVs and indels at 50% and higher in the sample, with the exception that DeepVariant misses variants in the first/last ~100 bp of chrM because of the window size. For calling the high frequency variants, you can also use gatk HaplotypeCaller, but since the tool was optimized for short reads, there's a lot of additional tweaking both for HaplotypeCaller and VariantFiltration to get good results for HiFi.

Callers that have been specifically designed and optimized for short reads, like Mutect2, do a great job of identifying low frequency heteroplasmic SNVs, but struggle with separating low frequency indels from sequencing errors. We've had some success with more general purpose low frequency variant callers like lofreq and freebayes, but we still need to explore some parameters before sharing our recommended workflow.

Can you describe what kind of analysis you're looking for?

What is the coverage depth for the mitochondria?
What is the library prep (WGS, target enrichment, isolated mitochondria, etc)?
What kinds of variants are you interested in (SNVs, small coding indels, small non-coding indels, SVs)?
If SVs, what size?
Are you looking for major variants, heteroplasmic variants, NUMTs?
If heteroplasmic variants, what is the minimum detection threshold you'd want?

crazysummerW commented 1 year ago

@williamrowell Thanks for your reply.

What is the coverage depth for the mitochondria?

I tested Revio WGS data. Based on mosdepth analysis, the average depth was about 30X for the whole genome and around 400X for the mitochondria. I have a question about that: in NGS sequencing, if the average depth for WGS is 30X, then the coverage of mitochondria typically ranges around 4000X. Why is it only 400X in PacBio sequencing data? What could be the reason for this difference?

What is the library prep (WGS, target enrichment, isolated mitochondria, etc)?

WGS.

What kinds of variants are you interested in (SNVs, small coding indels, small non-coding indels, SVs)?

SNVs, small coding indels, small non-coding indels, SVs.

If SVs, what size?

Greater than 50bp.

Are you looking for major variants, heteroplasmic variants, NUMTs?

Yes. But I am not very clear about the data processing of NUMTs.

If heteroplasmic variants, what is the minimum detection threshold you'd want?

About 5%.

Could you provide me with some specific recommendations about the analysis pipline for PacBio mitochondrial data?

cnolanpacb commented 1 year ago

Hi, glad you're interested in looking at the mitochondria with Hifi.

I have a question about that: in NGS sequencing, if the average depth for WGS is 30X, then the coverage of mitochondria typically ranges around 4000X. Why is it only 400X in PacBio sequencing data? What could be the reason for this difference?

This will be down to differences in the library prep between short read NGS and Hifi, we apply size selection prior to sequencing which is typically looking for much larger DNA fragments than short read size selection, so proportionally we sequence less total mtDNA due to losing mtDNA molecules which are sheared more than once, and thus are below our target range for fragments.

Greater than 50bp.

pbsv is our purpose built SV caller, which I can recommend for calling homoplasmic SVs.

Yes. But I am not very clear about the data processing of NUMTs.

NUMTs shouldn't be huge problem for Hifi data due to the read length. In order to eliminate NUMTs as a source of false positive heteroplasmic calls, you can filter your mitochondrial alignments for reads which also align to the nuclear DNA (ie supplementary alignments).

Could you provide me with some specific recommendations about the analysis pipline for PacBio mitochondrial data?

We are working on best-practice recommendations for mtDNA analysis.

Currently I would:

Follow the steps from the human-wgs workflow for alignment and variant calling.
Run lofreq (http://csb5.github.io/lofreq/) for low frequency variant calling, optionally filtering for supplementary alignments to remove NUMTs.
You will need to add indel qualities (error rates) to your BAM prior to calling using the indelqual command, in order to get insertion and deletion calls: lofreq indelqual --uniform 30,29 -f ${REF_FASTA} -o ${OUTPUT} ${INPUT_BAM}
Call variants with: lofreq call --call-indels -f ${REF_FASTA} -r chrM ${INDEL_QUAL_BAM}

crazysummerW commented 1 year ago

hi, @cnolanpacb Thank you very much for your response!

I have learned from the internet that the majority of NUMTs are less than 500 bps in length. Do you think filtering NUMTs is necessary for Hifi data? Additionally, I noticed that most NUMTs analysis is focused on NGS. Do you have any tool recommendations for analyzing NUMTs with Hifi data? I tested the mitochondrial analysis of HG002 using lofreq (without analyzing NUMTs), and it ran successfully. But I could not find a benchmark for mitochondrial analysis. Is there any method to analyze the accuracy and sensitivity of lofreq in mitochondrial analysis? Or do you have any test results to share?

williamrowell commented 1 year ago

Hi @crazysummerW,

I have learned from the internet that the majority of NUMTs are less than 500 bps in length. Do you think filtering NUMTs is necessary for Hifi data? Additionally, I noticed that most NUMTs analysis is focused on NGS. Do you have any tool recommendations for analyzing NUMTs with Hifi data?

Because of read length and the recommended alignment parameters used by pbmm2, NUMTs typically appear as structural variants, either as insertions of mtDNA sequence in reads that otherwise align to nuclear DNA, or as break ends between nuclear chromosomes and chrM. A structural variant caller like pbsv can be used to identify these events.

If you want to prevent NUMT alignments from causing false positive small variant calls, you can follow @cnolanpacb's suggestion and:

select alignments from all reads that align to chrM (this will include both real mitochondrial reads and NUMTs)
throw away any reads with a primary or supplementary alignment to a nuclear chromosome (to remove reads with NUMTs)
only call variants (using lofreq, for instance) from the remaining non-NUMT chrM reads

I tested the mitochondrial analysis of HG002 using lofreq (without analyzing NUMTs), and it ran successfully. But I could not find a benchmark for mitochondrial analysis. Is there any method to analyze the accuracy and sensitivity of lofreq in mitochondrial analysis? Or do you have any test results to share?

I'm not aware of a mtDNA truth set for HG002, but HPRC assemblies have been generated from many samples. These will typically include only a single "best" chrM for each sample.

We don't have any tests to share at the moment.

PacificBiosciences / pb-human-wgs-workflow-snakemake

Add specialized mtDNA analysis #106