KolmogorovLab / hapdup

Pipeline to convert a haploid assembly into diploid

Incorrect genotype for large deletion #14

Closed: GuillaumeHolley closed this issue 2 years ago

GuillaumeHolley commented 2 years ago

Hi,

I have used Hapdup to make a haplotype-resolved assembly from Illumina-corrected ONT reads (haploid assembly made with Flye 2.9) and I am particularly interested in a large 32kb deletion. Here is a screenshot of IGV (from top to bottom: HAP1, HAP2 and haploid assembly):

[IGV screenshot: hapdup]

I believe the position and size of the deletion are nearly correct. However, the deletion is called homozygous while it should be heterozygous. I have assembled this proband and its parents with Hifiasm using 30x PacBio HiFi: all three assemblies support a heterozygous call in the proband. I can also see from the corrected ONT reads that there is support for a heterozygous call. Finally, the additional contig visible in the haploid assembly also suggests a heterozygous call.

Hence, my question is: even if MARGIN manages to correctly separate reads with the deletion from reads without it, can Flye's polishing actually "fix" such a large event in one of the haplotype assemblies?

Thanks, Guillaume

GuillaumeHolley commented 2 years ago

I realized that maybe the issue is not with the polishing but with the haploid assembly in the first place. Since there are two contigs in the haploid assembly, one for each allele of the deletion (even though the reference-allele contig only covers the deletion region), long reads map to their respective alleles, so Pepper-MARGIN never actually sees reference-allele reads mapping to the contig with the deletion.

GuillaumeHolley commented 2 years ago

I had hoped that redoing my haploid assembly with Flye using the extra parameter max_bubble_length=300000 would help, but it did not. Maybe the issue is with the reads themselves.
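For reference, a re-run along these lines could be invoked as below; Flye accepts configuration overrides through its `--extra-params` option. The paths, read file name, and thread count are placeholders, and `--nano-corr` is assumed here because the reads are Illumina-corrected ONT:

```shell
# Hypothetical re-run of the haploid assembly with a larger bubble limit.
# Input file, output directory, and thread count are placeholders.
flye --nano-corr corrected_reads.fastq.gz \
     --out-dir flye_bubble300k \
     --threads 32 \
     --extra-params max_bubble_length=300000
```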

GuillaumeHolley commented 2 years ago

I believe the problem is most likely not the length of the deletion but the large number of heterozygous SNPs around it (which are likely true positives according to the PacBio HiFi assemblies I have). Locally, these clusters of SNPs around the breakpoints might exceed the error rate Flye allows and prevent it from merging the bubble, no matter its size.

GuillaumeHolley commented 2 years ago

To continue with my updates on this issue: I tried to get rid of the extra haplotig with purge_dups, but unfortunately the software mischaracterized the coverage cutoff thresholds and did not remove it. I could set those thresholds manually, but I need to automate this task. Instead, I wrote a very simple script which computes the mean coverage of the reads mapped to the haploid assembly and removes contigs whose coverage is below 50% of that mean. I am not sure it is a great solution, but it got rid of the extra haplotig.

With this done, however, I just transformed the problem into another one. The deletion is 32kb but my read N50 is closer to 20kb, which means that many reads that were mapping to the extra haplotig without spanning its boundaries now do not map anywhere. Since Hapdup does polishing but not reassembly, the result is still the same.
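A coverage filter like the one described above could be sketched as follows. This is not the author's actual script; it assumes per-contig statistics in the format produced by `samtools coverage` (tab-separated, with `#rname`, `endpos`, and `meandepth` columns), and the 50% cutoff is taken from the comment above:

```python
#!/usr/bin/env python3
"""Drop putative extra haplotigs whose read coverage is well below the
assembly-wide mean. A minimal sketch, assuming input from
`samtools coverage reads.bam > cov.tsv`; the input format and the
0.5 cutoff fraction are assumptions, not anything Hapdup provides."""

import csv
import sys


def keep_contigs(rows, frac=0.5):
    """Return names of contigs whose mean depth is at least
    frac * (length-weighted mean depth over the whole assembly)."""
    total_bases = sum(r["length"] for r in rows)
    mean_depth = sum(r["depth"] * r["length"] for r in rows) / total_bases
    return [r["name"] for r in rows if r["depth"] >= frac * mean_depth]


def parse_samtools_coverage(stream):
    """Parse the tab-separated output of `samtools coverage`."""
    rows = []
    for rec in csv.DictReader(stream, delimiter="\t"):
        rows.append({"name": rec["#rname"],
                     "length": int(rec["endpos"]),
                     "depth": float(rec["meandepth"])})
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: samtools coverage reads.bam > cov.tsv
    #        python filter_haplotigs.py cov.tsv > keep.txt
    with open(sys.argv[1]) as f:
        for name in keep_contigs(parse_samtools_coverage(f)):
            print(name)
```

The kept contig names could then be fed to `seqkit grep` or a similar tool to subset the assembly FASTA before rerunning Hapdup.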

GuillaumeHolley commented 2 years ago

At this point, I believe this is more of a Flye-related problem than a Hapdup-related problem, so I will close here and reopen on the Flye GitHub repository.

mikolmogorov commented 2 years ago

Sorry for the late response - will answer in the new thread!

GuillaumeHolley commented 2 years ago

The following issue provides a solution for this problem. The key elements are: