benedictpaten / marginPhase

MIT License
34 stars 16 forks source link

Margin could not phase some HiFi variants with UL reads #22

Open mobinasri opened 1 year ago

mobinasri commented 1 year ago

For polishing HPRC diploid assemblies we need to phase the variants called by applying DeepVariant on the alignments of all HiFi reads to each haplotype. We are using WhatsHap and Margin to phase HiFi variants with Ultra Long reads (>100kb). Since Margin is multi-threaded it is boosting the speed of our pipeline versus WhatsHap which is not multi-threaded. So we are inclined to use Margin. However we noticed that there exist some variants phased by WhatsHap but not by Margin. I made a small set of three variants (per haplotype) to make this issue easily reproducible. I'm explaining the issues in more detail below and also attaching the IGV snapshots of the three variants left unphased in the paternal haplotype (hap1). The equivalent variants on the other haplotype (maternal or hap2) are having the same issue so I skip writing about them but they can be explored and investigated using the files and IGV sessions provided at the end of this post.

Examples:

  1. Location: HG002#1#JAHKSE010000019.1:39117238-39117397 This variant is happening in a TG dimer repeat. There is a deletion of length two that has be phased here (genotype = TTG/T). As I mentioned above this variant is called from the HiFi alignments, which are not shown in the screenshot. The UL reads are shown in two groups. Just for more informative visualization these two groups are phased naively in IGV by the nearest SNP which is phased by both WhatsHap and Margin. As expected the UL reads are not very clean here however there exist some signal for phasing and WhatsHap could use it to phase the variant but Margin could not. Top window is showing the variant record in the WhatsHap output and the bottom one is for Margin.

Example_1_Hap1

  1. Location: HG002#1#JAHKSE010000052.1:29343390-29343498 This variant similar to the first one is happening in a TG dimer repeat and the deletion is of length 2 (genotype = AGT/A). Again UL alignments are phased naively by the nearest phased SNP. Similar to the previous case it is not clean but there is some signal.

Example_2_Hap1

  1. Location: HG002#1#JAHKSE010000064.1:12689434-12689590 In this case we have two variants beside each other. The first one is a 5-bp deletion with the genotype of ACAAAG/A and the other one is a 1-bp insertion with the genotype of G/GA. Again the UL reads are phased by the nearest SNP. WhatsHap could phase both variants correctly however Margin leaves the insertion unphased.

Example_3_Hap1

The related files such as the raw and phased vcf files, assembly fasta files, UL alignments and HiFi alignments are all uploaded here: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/04ab5ae4-0170-11ee-904c-0a13c5208311--HPRC_Polishing/HPRC_Y1/HG002/margin_issues/

"igv_session.xml" can be downloaded locally and loaded in IGV to explore the variants explained above.

Below are the WDL files used for running Margin and WhatsHap: https://github.com/miramastoras/hpp_production_workflows/blob/79d7f972700ec7f87f0a920c1eb2cf34cc03c692/QC/wdl/tasks/marginPhase.wdl https://github.com/miramastoras/hpp_production_workflows/blob/79d7f972700ec7f87f0a920c1eb2cf34cc03c692/QC/wdl/tasks/whatsHapPhase.wdl