Question about Hifiasm algorithm

cjain7 commented 8 months ago

In the review article "Genome assembly in the telomere-to-telomere era", @lh3 and @richarddurbin have mentioned that:

When constructing an overlap graph, we discard a read contained in longer reads. This apparently straightforward step may lead to assembly gaps...To alleviate this problem, hifiasm tries to rescue a contained read if having the read would patch an assembly gap. This heuristic works in simple cases but is not always reliable.

Can you please give some intuition for why this method is not reliable? Is this method difficult to implement, or is there a fundamental issue with this approach? Knowing your insights would be helpful.

Thanks!

richarddurbin commented 8 months ago

There is no universally optimal route to assembly of variable-length reads, even when they are error-free. Figures 3 and 4 in the manuscript illustrate some of the issues. Fixes in one direction lead to potential problems in another direction. The best heuristics depend on the distribution of read lengths, the distribution of repeat lengths in the genome and the distribution of coverage.

Richard

From: Chirag Jain. Date: Tuesday, 2 January 2024 at 04:31. To: chhylp123/hifiasm. Cc: Richard Durbin. Subject: [chhylp123/hifiasm] Question about Hifiasm algorithm (Issue #586)


lh3 commented 8 months ago

Can you please give some intuition for why this method is not reliable? Is this method difficult to implement, or is there a fundamental issue with this approach?

Suppose there is an 80kb homozygous region. You have one 100kb read on the paternal haplotype and many ~20kb reads on the maternal haplotype. Most of the ~20kb reads would be contained in the 100kb read. Hifiasm, in my understanding, only attempts to rescue one or a couple of reads. That would not work in this example, because closing the gap on the maternal haplotype requires building a path over the contained reads and rescuing all of them. In addition, here we know there are only two haplotypes. In satellites we may have multiple repeat haplotypes, and rescuing contained reads will be even harder in that case.
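
The example above can be sketched as a toy model. This is only an illustration, not hifiasm's code: reads are plain intervals on a haplotype, and the names `Read`, `is_contained`, `drop_contained` and `rescues_needed` are all hypothetical. It also assumes things the comment does not state: 10kb heterozygous flanks on each side of the homozygous region, maternal reads tiled every 10kb, and a 5kb minimum overlap.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Read:
    hap: str    # "pat" or "mat"
    start: int  # start coordinate on the haplotype, in bp
    end: int    # exclusive end coordinate

# Assumed layout: an 80kb homozygous region with 10kb heterozygous flanks.
HOM = (10_000, 90_000)

def is_contained(a, b):
    """True if read a would look like a substring of read b in this toy model."""
    if a == b or not (b.start <= a.start and a.end <= b.end):
        return False
    # Sequences match if both reads come from the same haplotype, or if
    # read a lies entirely inside the region shared by both haplotypes.
    return a.hap == b.hap or (HOM[0] <= a.start and a.end <= HOM[1])

def drop_contained(reads):
    """Standard overlap-graph preprocessing: discard every contained read."""
    return [r for r in reads if not any(is_contained(r, o) for o in reads)]

def rescues_needed(left, right, candidates, min_ovlp):
    """Greedily chain candidate reads from `left` until `right` is reached.

    Returns the rescued reads in order, or None if the gap cannot be closed.
    """
    end, used = left.end, []
    while end - right.start < min_ovlp:
        nxt = [c for c in candidates
               if c.start <= end - min_ovlp and c.end > end]
        if not nxt:
            return None
        best = max(nxt, key=lambda c: c.end)  # extend as far as possible
        used.append(best)
        end = best.end
    return used

pat = Read("pat", 0, 100_000)                       # one 100kb paternal read
mat = [Read("mat", s, s + 20_000)                   # 20kb maternal reads
       for s in range(0, 90_000, 10_000)]
reads = [pat] + mat

kept = drop_contained(reads)
contained = [r for r in reads if r not in kept]

# The maternal reads that survive are the two reaching into the flanks.
mat_kept = sorted((r for r in kept if r.hap == "mat"), key=lambda r: r.start)
left, right = mat_kept[0], mat_kept[-1]
path = rescues_needed(left, right, contained, min_ovlp=5_000)

print("kept reads:", len(kept))
print("contained reads:", len(contained))
print("reads to rescue:", None if path is None else len(path))
```

Under these assumptions only the two maternal reads that extend into the heterozygous flanks survive containment removal, and the chain that reconnects them has to pass through every contained maternal read. Rescuing one or two of them leaves the maternal haplotype broken.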
