ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

metaviral fails without error notification #646

Open geboro opened 3 years ago

geboro commented 3 years ago

I'm running metaviral SPAdes on low-complexity samples (2-5 viruses), and while it works on most of them, three consistently fail without warnings. The log file reports no errors, and the scaffolds.fasta and contigs.fasta files are even produced, but they are empty (size zero). I had previously recovered complete viral genomes (19 kbp) from these very same samples with SPAdes 3.13, but with v3.15, even using normal SPAdes, I get hundreds of scaffolds <3000 kbp.

In the output I only find assembly_graph_after_simplification.gfa, but none of the final assembly graph files. In the K99 directory I find the edges_before_XXX.fasta files, but all the components and final_contigs files are empty, so I guess something is failing in this step.

Cheers!

asl commented 3 years ago

Tagging metaviralSPAdes' author for the troubleshooting @Dmitry-Antipov

Dmitry-Antipov commented 3 years ago

Hi. Could you please send us the spades.log file (either to spades.support@cab.spbu.ru or as an attachment here)?

geboro commented 3 years ago

Thanks. Here it is: spades.log. Cheers!

snayfach commented 2 years ago

I have run into the same issue as @geboro. I tested several SPAdes modules on 25 isolate MiSeq viral libraries. All modules produce a contigs.fasta file except when using the --metaviral flag, in which case the file is empty for several samples. SPAdes runs to completion without errors. There is a warning about the insert length, but this appears in the log files of successful runs as well: spades.log

Dmitry-Antipov commented 2 years ago

Hi. It is actually normal for metaviralSPAdes not to detect any viral-like contigs in some samples, namely those with no circular (and specific linear) paths that meet certain conditions on coverage and length. However, this should not happen for isolate viral libraries. Is it possible that there are quasispecies or groups of related species in these libraries? This can be seen if you look at (or send us) the graph before all metaviral procedures: .../gdFB431/K127/assembly_graph_after_simplification.gfa

In this specific case you have very high average coverage - that may also prevent viralSPAdes from finding complete viruses in the data.

snayfach commented 2 years ago

I don't think the high coverage is a problem. Many other libraries had similar coverage (500-1000x) and finished without issues.

I've attached the assembly graph file. I'd be very interested in determining whether this library (or others) contained closely related, but distinct, viral strains/species.

Update: I took a look at the assembly_graph_after_simplification.gfa files. In the two libraries that failed to yield a finished assembly, there was a low ratio of segments (S) to links (L) (mean=4.25) relative to the rest of the libraries (mean=38x). I think this answers the question and indicates that there was strain variation in these two libraries, but I'd welcome any insights you may have.
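For anyone wanting to reproduce the segment-to-link ratio described above, here is a minimal sketch. It assumes a plain-text GFA 1.0 file in which segment records start with `S` and link records start with `L`; the function names are made up for illustration and are not part of SPAdes.

```python
from collections import Counter


def gfa_record_counts(lines):
    """Count GFA records by record type (the first tab-separated field)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        counts[line.split("\t", 1)[0]] += 1
    return counts


def segment_link_ratio(lines):
    """Return (#S records) / (#L records); infinity if the graph has no links."""
    counts = gfa_record_counts(lines)
    if counts["L"] == 0:
        return float("inf")
    return counts["S"] / counts["L"]


# Tiny made-up graph: 3 segments and 2 links.
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tGGCA",
    "S\t3\tTTGA",
    "L\t1\t+\t2\t+\t2M",
    "L\t2\t+\t3\t-\t2M",
]
print(segment_link_ratio(example))  # 1.5
```

On a real run, the same function can be fed an open file handle, e.g. `segment_link_ratio(open("assembly_graph_after_simplification.gfa"))`. A low ratio only says that the graph is densely connected; it is a rough indicator, not proof of strain variation.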

Dmitry-Antipov commented 2 years ago

Yes, this looks like a case with multiple strains: we can see three bulges of similar length and a complex region in the graph.

At lower coverage metaviralSPAdes could output one of these strains (the one with higher coverage), but here the coverage is too high: metaviralSPAdes has a 600x cutoff for its edge removal procedure.

As for the segment-to-link ratio, it more likely reflects the low-coverage trash contigs (which were removed from the picture above). There are lots of isolated trash contigs with low coverage, and the higher the dataset coverage, the more of them there will be.

snayfach commented 2 years ago

Thanks, this is resolved as far as I'm concerned. I might suggest adding a warning or something to the log file that indicates why no final assembly is output. That might help future users.

Also, if you have any pointers for extracting this information from the assembly graph (number and size of bulges), that would be great. The goal is to flag assemblies that might contain multiple strains.
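One possible starting point for counting bulges is sketched below. This is a rough heuristic, not the bulge definition SPAdes uses internally: it only reports "simple" bulges, i.e. two or more segments that each have exactly one predecessor and one successor and share both. It assumes a GFA 1.0 file whose `S` lines carry the actual sequence (not `*`); all function names are hypothetical.

```python
from collections import defaultdict

FLIP = {"+": "-", "-": "+"}


def parse_gfa(lines):
    """Return (segment lengths, successor map over oriented segments)."""
    lengths = {}
    succ = defaultdict(set)
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "S":
            lengths[f[1]] = len(f[2])  # assumes the sequence is present
        elif f[0] == "L":
            a, b = (f[1], f[2]), (f[3], f[4])
            succ[a].add(b)
            # Every link implies the reverse-complement link as well.
            succ[(b[0], FLIP[b[1]])].add((a[0], FLIP[a[1]]))
    return lengths, succ


def simple_bulges(lines):
    """List groups of alternative segments sharing one predecessor and one successor."""
    lengths, succ = parse_gfa(lines)
    pred = defaultdict(set)
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    groups = defaultdict(set)
    for node in list(pred):
        if len(pred[node]) == 1 and len(succ.get(node, ())) == 1:
            p = next(iter(pred[node]))
            s = next(iter(succ[node]))
            groups[(p, s)].add(node[0])
    bulges = set()
    for names in groups.values():
        if len(names) >= 2:
            # Both strands describe the same bulge; the set removes the duplicate.
            bulges.add(tuple(sorted((n, lengths[n]) for n in names)))
    return sorted(bulges)


# Made-up graph: segment 1 branches into 2 and 3, which rejoin at 4.
example = [
    "S\t1\tAAAA",
    "S\t2\tCCCCC",
    "S\t3\tCCGGCC",
    "S\t4\tTTTT",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
]
print(simple_bulges(example))  # [(('2', 5), ('3', 6))]
```

Each reported tuple lists the alternative segments with their lengths in bp. Nested bulges, multi-segment alternative paths, and other complex regions are not detected by this sketch, so a visual check of the graph is still worthwhile.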

asl commented 2 years ago

Note that SPAdes 3.15.4 includes dedicated diagnostics for empty output here, so it won't come as a surprise :)

snayfach commented 2 years ago

For the last assembly (shown above) I downsampled the library to 650x coverage and the program output a circular genome. However, I'm dealing with one last tricky phage library for which metaviralSPAdes won't complete even after downsampling. The coverage is ~500x and the expected genome size is ~75 kbp. Looking at the assembly graph, there are 6 small bulges (<2 kbp) that do not have abnormally high coverage.
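For readers unsure how much to downsample, the arithmetic can be sketched as follows. It assumes coverage is estimated as total sequenced bases divided by expected genome size, which is read depth and not the same number as the k-mer coverage SPAdes reports. The function names and the example numbers are hypothetical.

```python
import random


def downsample_fraction(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so that expected depth is about target_coverage."""
    current = total_bases / genome_size
    if current <= target_coverage:
        return 1.0  # already at or below the target, keep everything
    return target_coverage / current


def sample_pairs(pairs, fraction, seed=0):
    """Keep each read pair with probability `fraction` (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [p for p in pairs if rng.random() < fraction]


# Hypothetical library: 150 Mbp of reads over a 75 kbp genome is 2000x depth.
frac = downsample_fraction(150_000_000, 75_000, 500)
print(frac)  # 0.25

# Read pairs are sampled together so that mates stay in sync.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
kept = sample_pairs(pairs, frac)
```

Sampling whole pairs rather than individual reads keeps the forward and reverse files consistent, which paired-end assembly depends on.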

I've attached the log and assembly graph: spades.log assembly_graph_after_simplification.gfa.zip

mchlou commented 1 year ago

Hello, I am new to this. Does this mean we need to downsample our data? If so, does this affect the final assembly or even the number of viral species we could find? Sorry if this is a stupid question. I hope someone could also give me a reference for @snayfach's statement "there was a low ratio of segments (S) to links (L) (mean=4.25) relative to the rest of the libraries (mean=38x)". I really don't understand this.