chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

fasta header with no sequence #270

Closed colindaven closed 2 years ago

colindaven commented 2 years ago

Hi,

Thanks for the great tool and impressive docs; I've learned a lot. I'm looking at phased PromethION assemblies and keep seeing this message:

Warning: a fasta header with no sequence was encountered.

I guess this is just an empty allele that is part of a bubble (i.e., a heterozygous deletion), but I would be interested to hear more.

Thanks

paoloczi commented 2 years ago

That message was written out by old Shasta releases when a read with zero length (no sequence) was encountered in one of the input files. This should never happen with fasta files generated by ONT base callers.

I may be able to provide more information if you post here the Shasta release you are using and the entire log output (stdout of the assembly process).

I am not sure I understand your comment regarding bubbles, because this message refers to input fasta files containing reads, not an output of the assembly. Please clarify.

colindaven commented 2 years ago

Sorry, my bad. This is a new Shasta release, 0.8.0 I believe.

shasta-Linux-0.8.0 --version
Shasta Release 0.8.0
Linux version

I'm using bbmap readlength.sh and bbstats.sh to check the output of the phased Shasta assembly. So the warning comes from bbmap, not from Shasta.

I am wondering why Shasta outputs phased assemblies that include fasta headers with no sequence. Is this expected behaviour, or just an artifact of phasing errors in short regions or similar? These are generally 30-40X PromethION human assemblies.

Thanks.

paoloczi commented 2 years ago

Shasta phased assembly outputs the assembly in three different representations - detailed, phased, and haploid - see the section entitled "Mode 2 assembly output" here for more information.

You must be using the detailed assembly representation, in which all small bubbles representing heterozygous loci are explicitly kept. The names of the two segments of each bubble are identical except for their endings, .0 and .1. This makes it easy to locate each pair of segments that make up a bubble. However, for heterozygous insertions or deletions, one of the two sides will be empty, as you observed.
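As a rough illustration of the naming convention described above, the following Python sketch pairs segments whose names differ only in a trailing .0 or .1 and reports the bubbles where exactly one side is empty. The helper names `read_fasta` and `indel_bubbles` and the segment names in the demo are hypothetical, not part of Shasta, and real Shasta segment names will look different.

```python
from collections import defaultdict
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs. A header with no sequence lines yields ""."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def indel_bubbles(records):
    """Return {stem: {"0": seq, "1": seq}} for bubbles with exactly one empty side.

    Assumption: the two sides of a bubble share a name stem and end in .0 and .1.
    """
    sides = defaultdict(dict)
    for name, seq in records:
        stem, dot, suffix = name.rpartition(".")
        if dot and suffix in ("0", "1"):
            sides[stem][suffix] = seq
    return {
        stem: pair
        for stem, pair in sides.items()
        if len(pair) == 2 and (pair["0"] == "") != (pair["1"] == "")
    }


# Hypothetical segment names, for demonstration only.
demo = io.StringIO(">segA.0\nACGT\n>segA.1\n>segB.0\nAC\n>segB.1\nAG\n>segC\nTTTT\n")
for stem, pair in indel_bubbles(read_fasta(demo)).items():
    print(stem, {side: len(seq) for side, seq in pair.items()})
```

For a real assembly, an open file handle for the detailed-representation FASTA would take the place of the `io.StringIO` demo input.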

If you are trying to map the assembly to a reference for analysis, it may be easier to use the phased representation, or even the haploid representation, depending on what your goal is. Output files BubbleChains.csv and PhasingRegions.csv describe how the three representations are related to each other. The documentation currently does not describe these two files, but I can add that information if we find that you need it for your purposes.

The phased representation will contain much longer stretches of sequence, which are much easier to map unambiguously. Each segment represents a phased haplotype, and can often be hundreds of Kb to multiple Mb in length.

colindaven commented 2 years ago

Thanks for the insights. My goal is an overview of structural variation across the various samples.

Running Minigraph on the Shasta haploid assemblies seems to have been useful for this. I will rename the sequences and rerun based on comments from the minigraph author.

I don't think I need additional docs on how the BubbleChains.csv and PhasingRegions.csv files relate to each other, but thanks for the offer.

I may also retry Minigraph with the phased representation (possibly after screening out 0 length sequences and renaming all sequences to be unique).
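The screening and renaming step mentioned above could look roughly like the Python sketch below. It drops zero-length records and gives every remaining record a unique ID built from a prefix and a running counter, keeping the original name as the header description. The function names, the `prefix` parameter, and the ID format are assumptions chosen for illustration and are not required by Minigraph or Shasta.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs. A header with no sequence lines yields ""."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clean_fasta(src, dst, prefix="sample", width=60):
    """Copy records from src to dst, skipping empty ones and renaming the rest.

    New IDs are "<prefix>_<counter>", so they are unique even when the
    original names repeat. Returns (kept, dropped).
    """
    kept = dropped = 0
    for name, seq in read_fasta(src):
        if not seq:
            dropped += 1  # header with no sequence: skip it
            continue
        kept += 1
        dst.write(f">{prefix}_{kept:06d} {name}\n")
        for i in range(0, len(seq), width):
            dst.write(seq[i:i + width] + "\n")
    return kept, dropped


# Hypothetical input with one empty record and one repeated name.
src = io.StringIO(">a.0\nACGT\n>a.1\n>a.0\nGG\n")
dst = io.StringIO()
print(clean_fasta(src, dst, prefix="s1"))
print(dst.getvalue(), end="")
```

Using a different prefix per sample keeps sequence names unique across samples as well as within each file.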

Lastly, I'm sure I'll have to map phased or haploid assemblies to the genome and visualize as well.

Thanks

paoloczi commented 2 years ago

Sounds like a good plan.

Using the phased representation will be important if you want good sensitivity to heterozygous structural variants. In the haploid representation, each heterozygous structural variant has only a 50% chance of being present.

I am working on an improved version of mode 2 (phased) assembly that should result in phased assemblies of generally better quality and with fewer artifacts. That is a couple of months away and should be part of the next Shasta release (probably 0.9.0). When it becomes available, you should consider trying it out if you are still working on this project.