ablab / quast

Genome assembly evaluation tool
http://quast.sf.net

ValueError: invalid literal for int() with base 10 #73

Closed nick-youngblut closed 6 years ago

nick-youngblut commented 6 years ago

I'm using quast 5.0.0 py27pl526ha92aebf_1 bioconda

I get the following error when running metaquast.py on my metagenome:

ValueError: invalid literal for int() with base 10: '476978:53'

I'm using ~800 reference genomes.

The log of the run is attached: metaquast.log

alexeigurevich commented 6 years ago

Hi! It looks like a bug in parsing minimap2 output. Seems like it is rather specific, so we can't reproduce it without your help. Could you please send us raw minimap2 output from /ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/minimap_output/metabat2_low_PE-003-contigs_broken.coords_tmp or at least one of the input files, e.g. /ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool//bins_DASTool_bins/metabat2_low_PE.003.contigs.fa Thank you!

nick-youngblut commented 6 years ago

Thanks for looking into this issue! There is no minimap_output directory under metaquast/runs_per_reference/1302858/. Here's the contigs input file: metabat2_low_PE.003.contigs.fa.zip

alexeigurevich commented 6 years ago

Sorry, I gave you the incorrect path. The minimap output should be in metaquast/runs_per_reference/1302858/contigs_reports/minimap_output, so the correct full path is /ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/contigs_reports/minimap_output/metabat2_low_PE-003-contigs_broken.coords_tmp

By the way, we checked your contigs and found nothing suspicious there, so could you please attach one of your references, too: /ebio/abt3_projects/databases/simulated_metagenomes/source_genomes/1302858.fasta

nick-youngblut commented 6 years ago

Sorry for the delay. Here are the files. Thanks again for helping with this issue!

1302858.fasta.zip metabat2_low_PE-003-contigs_broken.coords_tmp.zip

alexeigurevich commented 6 years ago

Hmm, according to the metaquast.log, QUAST crashed while trying to parse 476978:53 as an integer value. And according to the log, QUAST was parsing the output for reference 1302858 located in raw_coords_fpath='/ebio/abt3_projects/vadinCA11/data/metagenome/si...put/metabat2_low_PE-003-contigs_broken.coords_tmp'. We looked into this file and there is no 476978:53 inside it! (Only 476978 is present there, followed by a tab character, which is parsed correctly.)

Could you please run grep -r "476978:53" /ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/contigs_reports/minimap_output/ to check whether this error-causing fragment is present in some other minimap output file? If you find something, please attach the corresponding file here.

I suspect that the entire issue could be due to an input/output problem on your machine. To check that, you could rerun exactly the same command and see whether the error appears again. Note that MetaQUAST will reuse already processed stages of the pipeline, so the rerun should take less time than the original run. Let us know how it goes!
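For a broader check than a grep for one literal string, the scan can be sketched in Python. This is a minimal sketch, not part of QUAST: it assumes the files in minimap_output are tab-separated PAF-style records (as the lines in this thread appear to be), with integer values in the query length/start/end and target length/start/end/matches/block length/mapping quality columns. The names `INT_COLUMNS`, `bad_fields`, and `scan_dir` are hypothetical helpers introduced for illustration.

```python
import os

# Assumption: 0-based indices of the PAF columns that should hold plain integers
# (query length/start/end, target length/start/end, matches, block length, mapq).
INT_COLUMNS = (1, 2, 3, 6, 7, 8, 9, 10, 11)


def bad_fields(line):
    """Return (column index, value) pairs for columns that are not plain integers.

    A missing column is reported with the value None. Blank lines are ignored.
    """
    stripped = line.rstrip("\n")
    if not stripped:
        return []
    fields = stripped.split("\t")
    problems = []
    for idx in INT_COLUMNS:
        if idx >= len(fields):
            problems.append((idx, None))
        elif not (fields[idx].isascii() and fields[idx].isdigit()):
            problems.append((idx, fields[idx]))
    return problems


def scan_dir(dirpath):
    """Yield (path, line number, problems) for every suspicious line under dirpath."""
    for root, _dirs, files in os.walk(dirpath):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, errors="replace") as handle:
                for lineno, line in enumerate(handle, 1):
                    problems = bad_fields(line)
                    if problems:
                        yield path, lineno, problems
```

Calling `scan_dir` on the minimap_output directory and printing each result would list any line where a coordinate column holds something like 476978:53 instead of a plain number, whichever number is affected.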

nick-youngblut commented 6 years ago

I ran grep -r "476978:53" /ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/contigs_reports/minimap_output/, and there were no hits.

I did get 2 hits when running the following:

$ grep -r "476978" /ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/contigs_reports/minimap_output/
/ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/contigs_reports/minimap_output/metabat2_low_PE-003-contigs_broken.coords_tmp:coassemble_21386_1   2631    0   2603    +   1302858_1302858.PRJNA192621.CP006647    907294  476978  479581  2568    2603    60      NM:i:35 ms:i:2393   AS:i:2393   nn:i:0  tp:A:P  cm:i:239    s1:i:2331   s2:i:0  dv:f:0.0052 cg:Z:2603M  cs:Z::821*tg:1*at*ag:2*ag:2*ag:9*ga:6*tc:18*ga:2*ct:3*ct:41*tc:24*tc:3*ga:4*ct*ct:2*tc:9*ag:22*cg:4*ct:1416*ga:5*ga:18*gt*ct:8*ag:7*tg:10*ag:11*tc*ga:46*ga:32*tg:11*ct:7*ag:1*ag*ga:3*gt:20
/ebio/abt3_projects/vadinCA11/data/metagenome/simulated_metagenomes/shallow_sequencing/llmga/bin_refine/DAS_Tool/metaquast/runs_per_reference/1302858/contigs_reports/minimap_output/metabat2_low_PE-003-contigs.coords_tmp:coassemble_21386    2631    0   2603    +   1302858_1302858.PRJNA192621.CP006647    907294  476978  479581  2568    2603    60  NM:i:35ms:i:2393    AS:i:2393   nn:i:0  tp:A:P  cm:i:239    s1:i:2331   s2:i:0  dv:f:0.0052 cg:Z:2603M  cs:Z::821*tg:1*at*ag:2*ag:2*ag:9*ga:6*tc:18*ga:2*ct:3*ct:41*tc:24*tc:3*ga:4*ct*ct:2*tc:9*ag:22*cg:4*ct:1416*ga:5*ga:18*gt*ct:8*ag:7*tg:10*ag:11*tc*ga:46*ga:32*tg:11*ct:7*ag:1*ag*ga:3*gt:20

When I re-ran metaquast, it did complete successfully. However, the resulting report.html file does not show any tables/plots. Maybe that's due to the high number of reference genomes.

alexeigurevich commented 6 years ago

> I did get 2 hits when running the following:

"476978" itself is fine, so there is nothing wrong with it being present somewhere. The problem is with the "number:number" pattern, which cannot be parsed as an integer value.
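The difference between the two strings can be illustrated with a short Python sketch. It is only an illustration of how `int()` behaves, not QUAST's actual parsing code, and `parse_coord` is a hypothetical helper name.

```python
def parse_coord(field):
    """Return the field as an int, or None if int() cannot parse it."""
    try:
        return int(field)
    except ValueError:
        # int("476978:53") raises ValueError: invalid literal for int() with base 10
        return None


for value in ("476978", "476978:53"):
    print(repr(value), "->", parse_coord(value))
```

A plain digit string converts cleanly, while anything containing a colon raises the same ValueError that appears in the original report.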

> When I re-ran metaquast, it did complete successfully.

That is great! So my suggestion that the original issue was due to a temporary I/O issue looks likely to be true.

> However, the resulting report.html file does not show any tables/plots. Maybe that's due to the high number of reference genomes.

This looks like a separate issue. You have a really huge number of references, but MetaQUAST should still be able to process them correctly. Could you please attach the report.html and metaquast.log files here?

nick-youngblut commented 6 years ago

The other reports (eg., combined_reference/report.html) do show the data when I view them in the Chrome browser. It just seems to be the main report.html file. The main report file and the log are attached.

metaquast.log.zip report.html.zip

alexeigurevich commented 6 years ago

We found the cause of the problem, finally! It is a known bug in 5.0.0 that is fixed here, and the fix will be available starting with 5.0.1 (planned for this week). Sorry, I completely forgot about this fix and didn't realize that it is related to your issue, too.

The issue occurs only in MetaQUAST and only when using the --split-scaffolds option. It is caused by an incomplete renaming of --scaffolds to --split-scaffolds in v.5.0.0. There is a simple workaround: just use the short version of this option (-s) in your command! It is the same in both v.4 and v.5. Please rerun your assessment and everything should be fine this time (you can use the same output dir to reuse already generated results and speed up the overall evaluation).

Note that the minimap2 issue that you originally reported is also caused by this --split-scaffolds problem. Also note that you probably don't need --split-scaffolds/-s anymore: starting from v.5, we always report scaffold gap misassemblies and other scaffold-related metrics when we see stretches of N's in the input assemblies. In v.4 they were calculated only if the corresponding option was specified. Thus, the only thing that --split-scaffolds/-s now does is add "_broken" versions of the input assemblies to the evaluation.

nick-youngblut commented 6 years ago

Thanks for figuring out the issue! I did try re-running with -s instead of --split-scaffolds, and it seemed to complete successfully.

alexeigurevich commented 6 years ago

Great to hear that!