chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads
Other
272 stars 59 forks source link

Assertion failed: readCount > 0 at void ChanZuckerberg::shasta::Assembler::findAlignmentCandidatesLowHash(std::size_t, double, std::size_t, std::size_t, std::size_t, std::size_t, std::size_t) in /home/paolo/shasta/src #122

Closed SaumikDana closed 4 years ago

SaumikDana commented 4 years ago

Assertion failed: readCount > 0 at void ChanZuckerberg::shasta::Assembler::findAlignmentCandidatesLowHash(std::size_t, double, std::size_t, std::size_t, std::size_t, std::size_t, std::size_t) in /home/paolo/shasta/src

Can you help figure out this issue?

paoloczi commented 4 years ago

Sorry about that cryptic message - I will replace it with a better one. The message means that there are no usable reads for assembly. It is possible that all of your reads are shorter than the read length threshold (controlled by command line option --Reads.minReadLength, default value 10000 bases). There should be a message telling you how many reads were discarded due to short length or a couple other possible reasons. If you still have problems figuring out what is going on please attach the entire log output (stdout).

SaumikDana commented 4 years ago

I did change the read threshold to 100, and got assembly.fasta file. But that file is empty. I have pasted the output log file herewith.

ss.txt

paoloczi commented 4 years ago

Shasta assembly parameters are optimized for standard nanopore reads (typical read length 30 kb) at coverage 60x. The output you sent shows that your average read length is only 800 bases. Do you know why your reads are so short? Is it possible that something went wrong in your sequencing? I would look at that first. You can look at files Binned-ReadLengthHistogram.csv and ReadLengthHistogram.csv to get an idea of the read length distribution.

If you really want to try and assemble using these reads, it may be possible doing it with Shasta, but some tweaking of assembly parameters will be necessary. I am happy to help with this if you want to try, but I cannot guarantee that it will work as we don't know what's in the data.

The first thing to do would be to reduce --Align.minAlignedMarkerCount from its default 100 to something around 10 or 20. This parameter controls the minimum length, in markers, of an alignment between two reads for that alignment to be used in the assembly. Each aligned marker corresponds to roughly 30 bases, so with the default 100 you are asking for 3 kb of overlap between two reads. This is reasonable if the reads are 10 kb long or more, but for your distribution of read lengths would cause most alignments to be lost. This is a probable reason for your empty assembly results.

Other parameter adjustments may also be necessary, especially if your coverage is very different from the standard 60x expected for a Shasta assembly. You have 4 Mb of sequenced bases - what is the expected size of your genome?

If you want to try this experiment, please send me the assembly log output again, plus file AssemblySummary.html in the assembly directory, and we can iterate from there.

paoloczi commented 4 years ago

I also looked at the code to improve the error message written when no usable reads are available in the input files. The current code does write a good error message in that case and terminates execution. Looking at the time when that message was added, I conclude that you are probably using Shasta's initial release 0.1.0. Please switch to the current release 0.4.0 when you have a chance. You can download it from here.

SaumikDana commented 4 years ago

I did the following

wget https://github.com/chanzuckerberg/shasta/releases/download/0.4.0/shasta-Linux-0.4.0 chmod ugo+x shasta-Linux-0.4.0

and then this: ./shasta-Linux-0.4.0 --input first_region.fasta --Reads.minReadLength=100 --Align.minAlignedMarkerCount=10

But I got the following error file: FATAL: kernel too old /var/spool/torque/mom_priv/jobs/2632675.sug-moab.SC: line 10: 14090 Aborted ./shasta-Linux-0.4.0 --input first_region.fasta --Reads.minReadLength=100 --Align.minAlignedMarkerCount=10

Any suggestions?

paoloczi commented 4 years ago

If you are running on an old Linux kernel, download this executable instead:

https://github.com/chanzuckerberg/shasta/releases/download/0.4.0/shasta-OldLinux-0.4.0

This should work on pretty old Linux versions.

SaumikDana commented 4 years ago

Still an empty Assembly.fasta file..I have attached the log file and the html file (renamed as txt file to enable uploading here)

ss.txt

AssemblySummary.txt

paoloczi commented 4 years ago

Of course I did not expect the new release to fix the assembly. See my long comment about 5 hours ago regarding assembly parameters.

paoloczi commented 4 years ago

Oops, I just noticed that you also reduced --Align.minAlignedMarkerCount from 100 to 10 as I suggested. Let me take a better look.

Do you know why your nanopore reads are so short?

SaumikDana commented 4 years ago

It is capture data..The fasta file is the reads corresponding to a SV corresponding to a region in genome..

paoloczi commented 4 years ago

The number of alignments is still very low (only 1200 alignments for 4000 reads), so there are a few possibilities.

One possibility is that coverage is very low. Do you have an estimate for the length of the capture region? Is it possible that the capture protocol has high bias - that is, some regions have much higher coverage than others?

And do you know why the reads are so short? If you trimmed them, it would be best for Shasta to work with the original, untrimmed reads.

Another possibility is that these reads have higher error rate than typical nanopore reads. If that is the case, other assembly parameters would also have to be tweaked.

I would be happy to run a few experiments if you can give me access to the input data. Of course I would not use them for any other purpose or share them with anybody, and I would delete them when done.

fritzsedlazeck commented 4 years ago

Hi @paoloczi , thanks for your fast feedback. This is all catpure data for very small regions thus the read size is not huge. I realize that this is outside of what the assembler was designed for.

I would not assume that they have higher error rates. If we map them to the genome they look fine and map to the region. What @SaumikDana is trying to run is an assembly of just the first region of reads that predicted a certain variant.

If you think that is worth the trouble, I am happy to share that region.

Thanks Fritz

paoloczi commented 4 years ago

Thanks for the explanation @fritzsedlazeck . Does this mean that the reads were trimmed? If so, it would be best to run Shasta giving it as input the full length, untrimmed reads. In that case, assembly should be possible, as long as there is enough coverage (40x to 60x). If you could give me access to untrimmed reads for one region, I can experiment. If for some reason you want to work with trimmed reads, I can try that too, but I am not sure how well that would work.

This sounds like an interesting experiment.

fritzsedlazeck commented 4 years ago

the reads were not trimmed, its just the capture that was done over this particular region. There are a few in there that are a bit longer... Whats the best email to send them to? Thanks Fritz

paoloczi commented 4 years ago

My email address is not published on GitHub, but I found your email address at Baylor and I will email you personally there.

fritzsedlazeck commented 4 years ago

That works. Thanks Fritz

paoloczi commented 4 years ago

This issue was resolved after some private interaction via e-mail. With some tweaking of assembly parameters, Shasta was able to assemble the structural variant in question.