Short unphased regions in phased assembly

drs commented 2 years ago

Hello,

I am assembling the small genome of a diploid protist with Shasta 0.8.0 in phased mode. So far the results look very promising, with good haplotype separation and very good genome representation.

My issue is that some paths contain very small unphased regions that separate the phased blocks (39 bp, 201 bp and 714 bp in the path shown below). I don't need to properly phase the whole molecule (e.g. the bubble on the left can be phased with either haplotype for my purpose), but these very short breaks between haplotype blocks are problematic because I suspect that wrong phasing across them will introduce errors in the protein annotations.

Screenshot_20220124_174009

Looking at the documentation, I have seen that the next version will implement a new phasing method that will hopefully fix these. In the meantime, I would really appreciate any input on how to avoid these small unphased regions.

Also, I was wondering whether any output file provides the identity of the reads that are present in each haplotype? These might be useful for resolving incomplete phasing using Nanopore reads phased with WhatsHap.

Thank you, Samuel

paoloczi commented 2 years ago

Yes, this is one of the known shortcomings of phased assembly in Shasta 0.8.0, listed in the 0.8.0 documentation. Many (but not all) of these will go away in 0.9.0, which is imminent (weeks). In the meantime, there is nothing that can be done.

However, in many cases the two large bubbles flanking one of these artifacts belong to the same phasing component (3rd numeric portion of the phased segment name PR.bubbleChain.position.component.haplotype). If that is the case, the haplotypes on the two sides can be considered phased with respect to each other. I don't know if this helps you.
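As a rough illustration of the check described above, here is a minimal Python sketch that parses phased segment names of the form PR.bubbleChain.position.component.haplotype and tests whether two of them share a phasing component. The function and class names are hypothetical, the example segment names are made up, and the sketch assumes that component numbers are only comparable within the same bubble chain.

```python
from typing import NamedTuple, Optional


class PhasedSegment(NamedTuple):
    """Numeric fields of a name like PR.bubbleChain.position.component.haplotype."""
    bubble_chain: int
    position: int
    component: int
    haplotype: int


def parse_phased_name(name: str) -> Optional[PhasedSegment]:
    """Return the parsed fields, or None if the name is not a PR.* phased segment."""
    parts = name.split(".")
    if len(parts) != 5 or parts[0] != "PR":
        return None
    try:
        return PhasedSegment(*(int(p) for p in parts[1:]))
    except ValueError:
        return None


def same_phasing_component(name_a: str, name_b: str) -> bool:
    """True if both names are phased segments in the same bubble chain and component.

    Assumption: component numbers are compared only within one bubble chain.
    """
    a = parse_phased_name(name_a)
    b = parse_phased_name(name_b)
    if a is None or b is None:
        return False
    return a.bubble_chain == b.bubble_chain and a.component == b.component


if __name__ == "__main__":
    # Made-up names for two bubbles flanking a short unphased region.
    print(same_phasing_component("PR.3.10.2.0", "PR.3.14.2.1"))
```

Applied to the two large bubbles on either side of a short unphased segment, a True result corresponds to the case where the haplotypes on the two sides can be considered phased with respect to each other.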

Shasta phased assembly works by phasing the bubbles, not the reads. This means that we avoid the question of assigning reads to haplotypes. This has some computational advantages, but it also means that we don't have that information, and therefore there is no way to extract it or write it out.

Since you are using Bandage, I suggest loading Assembly-Phased.csv in Bandage. Using the Bandage csv functionality, you will then be able to display various pieces of potentially useful information. Also, the colors will be more useful.

drs commented 2 years ago

Thank you for the very fast reply and for this information. I am looking forward to version 0.9.0!

For this first small genome, I think that the PR.bubbleChain.position.component.haplotype names will allow me to manually build the haplotype assembly with Bandage. Unfortunately, in cases where the ends of the paths (the ends of the chromosomes) differ (as shown below), the segment name does not start with PR. but with UR., or is just a number.

Screenshot_20220124_193840

Do you think that this will also be solved in 0.9.0?

paoloczi commented 2 years ago

Near the ends of assembled regions, coverage tapers down, and as a result assembly becomes difficult. This is a more fundamental problem and will not change in 0.9.0. It is possible that the two grey regions there correspond to haplotypes, but Shasta did not create a bubble there because they do not re-converge back together. To check this, you could map those two regions to each other and see if the portions near the red 97-base segment are similar.

paoloczi commented 2 years ago

Shasta 0.9.0 is out and many of these artifacts should have disappeared.

paoloczi commented 2 years ago

I am closing this due to lack of discussion. Feel free to reopen it or create a new issue if you still see too many artifacts in phased assembly with Shasta 0.9.0.

chanzuckerberg / shasta

Short unphased regions in phased assembly #277