Diploid assembly with trios

ssgross commented 1 year ago

Hi,

I have a 60x ultra-long (UL) ONT human dataset with an error rate of about 5%. I want to make a diploid assembly, and I understand the new phased assembly mode should work for that. I also have 30x Illumina data for both parents, and it seems that data should (in theory) be very useful for increasing the quality of the diploid assembly. One option is to trio bin the reads and run two separate Shasta haploid assemblies, but the relatively high error rate makes naive k-mer-based binning difficult or impossible in some regions. This idea could probably be greatly improved by doing all-vs-all read alignment and using the alignments to figure out which k-mers are likely to be errors. Shasta is obviously already doing all-vs-all read alignment, so it might be more natural to incorporate this step into the assembly process (similar to how it is done in hifiasm). Furthermore, I think there are probably additional opportunities for the assembler to make smarter downstream decisions about how to assemble the two haplotypes if there is no hard commitment at the very beginning to which reads go with which haplotype.
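For concreteness, the naive k-mer-based binning mentioned above could look roughly like the sketch below. This is only an illustration under simplifying assumptions: the function names (`parent_specific_kmers`, `bin_read`) and the `min_hits` threshold are hypothetical, parental k-mers are taken from in-memory strings rather than counted from 30x Illumina reads, and reverse complements are ignored. It is not how Shasta or any existing trio binning tool is implemented.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return (maternal-only, paternal-only) k-mer sets."""
    mat = {km for s in maternal_seqs for km in kmers(s, k)}
    pat = {km for s in paternal_seqs for km in kmers(s, k)}
    return mat - pat, pat - mat

def bin_read(read, mat_only, pat_only, k, min_hits=2):
    """Assign a read to 'maternal', 'paternal', or 'unassigned'.

    min_hits is a hypothetical threshold: reads with too few
    parent-specific k-mers are left unassigned.
    """
    hits = Counter()
    for km in kmers(read, k):
        if km in mat_only:
            hits["maternal"] += 1
        elif km in pat_only:
            hits["paternal"] += 1
    m, p = hits["maternal"], hits["paternal"]
    if max(m, p) < min_hits or m == p:
        return "unassigned"
    return "maternal" if m > p else "paternal"
```

With a 5% error rate, many of the informative k-mers in a read are destroyed by sequencing errors, so reads from regions with few parent-specific k-mers tend to fall into the unassigned bin. That is the weakness described above.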

So, I was wondering whether you have thought about this or a similar idea, and whether it makes sense to you. I was considering implementing a smarter up-front trio binning as described above (using read-vs-read alignments), but it occurred to me that tighter integration with the assembler might be better. Thanks!
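To make the alignment-informed idea a bit more concrete, here is a minimal sketch of filtering a read's k-mers by support from overlapping reads before using them for binning. It assumes the overlapping reads have already been found by some all-vs-all alignment step (not shown), and it treats the presence of a k-mer anywhere in an overlapping read as support, which is a simplification of real positional agreement. The function name `supported_kmers` and the `min_support` parameter are hypothetical.

```python
from collections import Counter

def kmer_set(seq, k):
    """Return the set of k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def supported_kmers(read, overlapping_reads, k, min_support=2):
    """Keep k-mers of `read` that occur in at least `min_support`
    overlapping reads; the rest are treated as likely errors.

    `overlapping_reads` is assumed to come from a prior all-vs-all
    alignment step. Alignment coordinates are ignored here.
    """
    read_kmers = kmer_set(read, k)
    support = Counter()
    for other in overlapping_reads:
        # Count each overlapping read at most once per k-mer.
        support.update(read_kmers & kmer_set(other, k))
    return {km for km, n in support.items() if n >= min_support}
```

Only the k-mers that survive this filter would then be compared against the parent-specific k-mer sets, so a single erroneous base in one read no longer produces spurious or missing parental hits.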

paoloczi commented 1 year ago

To avoid adding too much complexity to Shasta, I prefer to keep it working with a single genome at a time, without attempting to use trio information in the assembler.

I agree with you that it is best to postpone for as long as possible the decision of which haplotype each read belongs to, for the reasons you allude to. For that reason, in Shasta Mode 2 assembly, reads are never assigned to haplotypes. An earlier version of phased diploid assembly ("Mode 1" assembly, now defunct) attempted to assign reads to haplotypes (without using trio information), but I was never able to get it to work to my satisfaction.

You might be interested in work, currently in progress and unpublished, that @rlorigro and @meredith705 have been doing. They have been using trio and Hi-C information to add longer-range phasing to a Shasta phased diploid assembly. They have been able to produce significantly longer phasing blocks than the already long ones assembled by Shasta with Ultra-Long (UL) reads. I will leave it to the three of you to connect if there is mutual interest, and by all means please feel free to continue the discussion here.

rlorigro commented 1 year ago

Hi @ssgross,

As Paolo mentioned, Shasta currently does phasing on a data structure similar to a variation graph, inferring which of the bubbles created by the reads are real. It proposes well-supported local haplotypes, and we then have two methods for labelling and chaining them together, using either linked reads (from the same individual) or parental k-mers. We are consistently getting chromosome-scale phasing out of this pipeline, with switch and Hamming rates below 1%. If you are interested, please check out our repository: https://github.com/rlorigro/GFAse
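For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how switch and Hamming rates are commonly computed for one phase block. It assumes each heterozygous site is encoded as 0 or 1 (which haplotype carries a given allele), that a trusted truth phasing is available, and it uses the usual textbook definitions. The function names are hypothetical, and the exact evaluation used for the GFAse results may differ.

```python
def switch_rate(truth, pred):
    """Fraction of adjacent site pairs where the predicted phase
    flips relative to the truth."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    if len(truth) < 2:
        return 0.0
    agree = [t == p for t, p in zip(truth, pred)]
    switches = sum(a != b for a, b in zip(agree, agree[1:]))
    return switches / (len(agree) - 1)

def hamming_rate(truth, pred):
    """Fraction of sites assigned to the wrong haplotype, taking the
    better of the two possible haplotype labelings."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    n = len(truth)
    if n == 0:
        return 0.0
    mismatches = sum(t != p for t, p in zip(truth, pred))
    return min(mismatches, n - mismatches) / n
```

The two numbers capture different failures: a single switch in the middle of a block gives a low switch rate but a Hamming rate near 50%, while an isolated wrong site costs two switches but only one Hamming mismatch.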

It is at an early stage of maturity, so if you have any questions about using it, just email me at my username @ucsc.edu and we can start a discussion. If you are interested in the trio pipeline, @meredith705 primarily handles the k-mer processing, while I am focused on the linked reads at the moment.

paoloczi commented 1 year ago

Shasta development moved to a new repository (see the README for more information). If additional discussion is needed, feel free to open a new issue in the new repository.

chanzuckerberg/shasta, issue #297