chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Scaffolding dual assembly #216

Open olekto opened 2 years ago

olekto commented 2 years ago

Hi, now that hifiasm 'generates a pair of haplotype-resolved assemblies', what is the best way of scaffolding these? This has been on my mind a bit lately, and this might not be the best place to ask, but I'm certain that many of the knowledgeable people that develop and use hifiasm have some input.

If you take both assemblies, merge them into a single FASTA file, and then map HiC reads to it, would that give a good enough signal for most scaffolders? Would many of the read pairs map to multiple places, get a low mapping quality, and be filtered out even though they could have been informative? A quick test with SALSA led to quite poor scaffolding, with the scaffold N50 only about 15% as large as when scaffolding the primary contigs.
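To put a number on the multi-mapping concern, one could tally how many read pairs survive a mapping-quality filter after aligning to the merged assembly. The snippet below is only a minimal sketch: the MAPQ values are made up, and both the function name `summarize_pair_filtering` and the threshold of 10 are assumptions for illustration, not settings taken from SALSA or any other scaffolder.

```python
from collections import Counter


def summarize_pair_filtering(pairs, min_mapq=10):
    """Count Hi-C read pairs kept or dropped by a MAPQ filter.

    pairs: iterable of (mapq_mate1, mapq_mate2) tuples.
    A pair is kept only if BOTH mates reach min_mapq (assumed threshold).
    """
    counts = Counter()
    for mapq1, mapq2 in pairs:
        if min(mapq1, mapq2) >= min_mapq:
            counts["kept"] += 1
        else:
            counts["dropped"] += 1
    total = counts["kept"] + counts["dropped"]
    fraction_kept = counts["kept"] / total if total else 0.0
    return counts["kept"], counts["dropped"], fraction_kept


# Toy, made-up MAPQ values: mates in regions shared by both haplotypes
# tend to receive MAPQ 0 when both copies are present in the reference.
example_pairs = [(60, 60), (0, 60), (0, 0), (60, 3), (42, 37)]
kept, dropped, fraction_kept = summarize_pair_filtering(example_pairs)
print(f"kept={kept} dropped={dropped} fraction_kept={fraction_kept:.2f}")
```

Running the same tally on alignments to the primary contigs and to the merged assembly would show how much Hi-C signal the merged reference loses to the filter.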

Further, if you scaffold each haplotype-resolved assembly on its own, could that lead to errors? For many species there are likely smaller or larger rearrangements between the two homologous chromosomes, so mapping all HiC data to each assembly could confuse the signal.

Is there a way to split the HiC reads into two piles, one for each haplotype? One could run something like DipAsm, but that seems a bit involved and messy.
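One naive way to picture such a split, sketched here under loud assumptions: the reads have been mapped to a merged assembly whose contig names carry hypothetical `hap1_`/`hap2_` prefixes, and a pair is assigned to a haplotype only when both mates map confidently to that same haplotype. The function names, the prefixes, and the MAPQ threshold are all illustrative and not part of hifiasm or DipAsm.

```python
def haplotype_of(contig, prefixes=("hap1_", "hap2_")):
    """Return 'hap1' or 'hap2' based on a (hypothetical) contig-name prefix."""
    for prefix in prefixes:
        if contig.startswith(prefix):
            return prefix.rstrip("_")
    return None


def split_pairs(pairs, min_mapq=10):
    """Sort Hi-C read pairs into per-haplotype piles.

    pairs: iterable of (read_name, contig1, mapq1, contig2, mapq2).
    A pair goes to a haplotype pile only if both mates pass the MAPQ
    threshold and land on contigs of the same haplotype.
    """
    piles = {"hap1": [], "hap2": [], "unassigned": []}
    for name, contig1, mapq1, contig2, mapq2 in pairs:
        hap_a, hap_b = haplotype_of(contig1), haplotype_of(contig2)
        if min(mapq1, mapq2) >= min_mapq and hap_a is not None and hap_a == hap_b:
            piles[hap_a].append(name)
        else:
            piles["unassigned"].append(name)
    return piles


# Toy, made-up alignments
example = [
    ("r1", "hap1_ctg1", 60, "hap1_ctg2", 60),  # confidently hap1
    ("r2", "hap2_ctg1", 60, "hap2_ctg1", 45),  # confidently hap2
    ("r3", "hap1_ctg1", 0, "hap2_ctg1", 0),    # multi-mapped, ambiguous
    ("r4", "hap1_ctg3", 60, "hap2_ctg3", 60),  # mates on different haplotypes
]
piles = split_pairs(example)
print({name: len(reads) for name, reads in piles.items()})
```

The obvious weakness is that pairs from regions where the haplotypes are identical end up unassigned, which is the same loss of signal raised above.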

Any answers and pointers are much appreciated.

Thank you.

Sincerely, Ole

m-jahani commented 2 years ago

That is my question too. Please let me know if you find an answer. Thanks, Mojtaba

olekto commented 2 years ago

What I have been doing so far is using all HiC data to scaffold each assembly/haplotype by itself. If there are large differences between the haplotypes, the resulting scaffolding might not be optimal, but I cannot really think of any other way of doing this with the tools I am aware of. So I take hap1 and scaffold with SALSA, Juicer, yahs, pins, or whatever, and do the same with hap2 with the same data.

Ole

m-jahani commented 2 years ago

Yes, I am doing the same thing. However, the results are not promising so far. There are large differences between the two haplotypes.

chhylp123 commented 2 years ago

You could probably try this solution: https://github.com/baozg/phased-assembly-check. We have not run it ourselves, but the approach seems reasonable.

olekto commented 2 years ago

Doesn't that require trio data? We don't have that for our species. I guess it is not straightforward to do something similar with just HiFi + HiC.


chhylp123 commented 2 years ago

I think Hi-C alone could work. The idea is to use Hi-C data to identify contigs that have been assigned to the wrong haplotype.
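As a rough illustration of that idea, and not the actual algorithm of the linked repository, one could count for each contig how many Hi-C links go to contigs of its own haplotype versus the other one, and flag contigs dominated by cross-haplotype links. The function name `flag_suspect_contigs`, the input layout, and the `min_links` cutoff are assumptions made for this sketch.

```python
from collections import defaultdict


def flag_suspect_contigs(links, assignment, min_links=10):
    """Flag contigs whose Hi-C links mostly go to the other haplotype.

    links: iterable of (contig_a, contig_b, n_pairs) link counts.
    assignment: dict mapping contig name -> 'hap1' or 'hap2'.
    min_links: minimum total links before a contig is judged (assumed cutoff).
    """
    same = defaultdict(int)   # links to contigs of the same haplotype
    other = defaultdict(int)  # links to contigs of the other haplotype
    for contig_a, contig_b, n_pairs in links:
        if contig_a == contig_b:
            continue  # self-links carry no phasing information here
        if contig_a not in assignment or contig_b not in assignment:
            continue
        target = same if assignment[contig_a] == assignment[contig_b] else other
        target[contig_a] += n_pairs
        target[contig_b] += n_pairs
    suspects = []
    for contig in assignment:
        total = same[contig] + other[contig]
        if total >= min_links and other[contig] > same[contig]:
            suspects.append(contig)
    return sorted(suspects)


# Toy, made-up data: ctgC is labelled hap2 but links mostly to hap1 contigs.
assignment = {"ctgA": "hap1", "ctgB": "hap1", "ctgC": "hap2", "ctgD": "hap2"}
links = [
    ("ctgA", "ctgB", 50),
    ("ctgC", "ctgD", 5),
    ("ctgC", "ctgA", 40),
    ("ctgC", "ctgB", 35),
]
suspects = flag_suspect_contigs(links, assignment)
print(suspects)
```

A real check would also need to normalise for contig length and restrict the comparison to contigs from the same chromosome, which this sketch ignores.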

paul-havlak-driscolls commented 10 months ago

I'm wondering about this, too. The question is how to get only the Hi-C reads you want for that haplotype, without all the noise of other reads. What I plan to try is as follows:

One reason I haven't tried this myself yet is that my challenging genomes are auto-tetraploid, so read "four haplotypes" instead of "two phases" in the above. There are still many interesting subtleties to resolve.