Maggi-Chen / Inspector

A tool for evaluating long-read de novo assembly results
MIT License
21 stars 9 forks source link

Inspector-correct not writing properly corrected structural errors into contig_corrected.fa #3

Open mmontonerin opened 2 years ago

mmontonerin commented 2 years ago

I am finding a new type of error during the correction.

It seems like the jobs get sent properly to flye, and start getting assembled, but then immediately after looks for the result, and claims that they don't exist. In the meantime, if I do top, I see the Flye jobs running, but the inspector claims that has already finished, so those results, I guess, will be saved in draft_assembly.fasta but not on the final contig_corrected.fa.

When checking with inspector the inspector corrected assembly, we see the small errors very nicely corrected, but not the structural errors. So it might be because they get never written into that final corrected fasta.

This is how the error looks in one of these cases:

[2021-11-23 21:32:05] INFO: Extending reads [2021-11-23 22:19:49] INFO: Overlap-based coverage: 10 [2021-11-23 22:19:49] INFO: Median overlap divergence: 0.109189 0% 90% 100% [2021-11-23 22:35:36] INFO: Assembled 2 disjointigs [2021-11-23 22:35:36] INFO: Generating sequence 0% 10% 20% 30% 50% 60% 70% 80% 100% [2021-11-23 22:35:36] ERROR: Caught unhandled exception: Can't open /space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg0031504646246463133col/00-assembly/draft_assembly.fasta [2021-11-23 22:35:36] ERROR: flye-modules(+0x3ab73) [0x562558de4b73] [2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(+0xacf6f) [0x7f0ee0e07f6f] [2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(+0xacfb1) [0x7f0ee0e07fb1] [2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(cxa_rethrow+0) [0x7f0ee0e0819a] [2021-11-23 22:35:36] ERROR: flye-modules(+0xc5e4) [0x562558db65e4] [2021-11-23 22:35:36] ERROR: flye-modules(+0x373e5) [0x562558de13e5] [2021-11-23 22:35:36] ERROR: flye-modules(+0x179d2) [0x562558dc19d2] [2021-11-23 22:35:36] ERROR: /lib64/libc.so.6(__libc_start_main+0xed) [0x7f0ee03b934d] [2021-11-23 22:35:36] ERROR: flye-modules(+0x17b79) [0x562558dc1b79] [2021-11-23 22:35:36] ERROR: Command '['flye-modules', 'assemble', '--reads', 'Inspector_MS_nextdenovo/assemble_workspace/read_ass_ctg0031504646246463133__col.fa', '--out-asm', '/space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg0031504646246463133col/00-assembly/draft_assembly.fasta', '--config', '/space/Software/final_env/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg', '--log', '/space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg0031504646246463133col/flye.log', '--threads', '4', '--min-ovlp', '5000']' returned non-zero exit status -6 [2021-11-23 22:35:36] ERROR: Pipeline aborted

Maggi-Chen commented 2 years ago

We set a maximal runtime of 10 minutes for each run of Flye, as we occasionally observed endless Flye process. The expected local assembly contig length is less than 100kbp, and assembly with Flye should finish within 10 minutes. If you go into the directory 'assemble_workspace' in your output folder, there should be FASTA files named as 'read_asscontigname...'. Can you try running Flye directly with one of the FATSA as input, and see if Flye can generate the contig? Also, can you record how long does it take to finish? If all the Flye assembly took more than 10min, I should probably change the hard cutoff to an adjustable parameter. Thanks.

Best, Maggi

mmontonerin commented 2 years ago

What we observe is that it takes some hours (in some cases half a day). The small ones work, but not the bigger re-arrangements, because it is multiple Flye jobs for one contig.

KPapac commented 2 years ago

We set a maximal runtime of 10 minutes for each run of Flye, as we occasionally observed endless Flye process. The expected local assembly contig length is less than 100kbp, and assembly with Flye should finish within 10 minutes. If you go into the directory 'assemble_workspace' in your output folder, there should be FASTA files named as 'read_asscontigname...'. Can you try running Flye directly with one of the FATSA as input, and see if Flye can generate the contig? Also, can you record how long does it take to finish? If all the Flye assembly took more than 10min, I should probably change the hard cutoff to an adjustable parameter. Thanks.

Best, Maggi

I tried to run flye directly with one of the reads that did not work for me. I noticed that I get a few 'read_ass_contig...' files that have a semicolumn within the name, for example: 'read_ass_ctg000790228711228712270;82col.fa' .

When I run flye for that read: flye --polish-target ../contig_corrected.fa --nano-raw 'read_ass_ctg000790__228711__228712__270;82__col.fa' --out-dir test &>> test.log

I get an error in the test.log that minimap failed to execute and in the minimap.stderr I get:

[M::mm_idx_gen::7.8301.40] collected minimizers [M::mm_idx_gen::9.6931.70] sorted minimizers [M::main::9.6931.70] loaded/built the index for 321 target sequence(s) [M::mm_mapopt_update::10.2161.67] mid_occ = 331 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 321 [M::mm_idx_stat::10.520*1.65] distinct minimizers: 21069919 (74.75% are singletons); average occurrences: 2.182; average spacing: 5.330 ERROR: failed to open file 'read_ass_ctg000790228711228712__270' [M::main] Version: 2.17-r941 [M::main] CMD: flye-minimap2 /space/no_backup/Kostas/inspector_filtering/Inspector_A28_nextdenovo_insp_corrected/assemble_workspace/test_fail/chunks_1.fasta read_ass_ctg000790228711228712270 [M::main] Real time: 10.567 sec; CPU: 17.369 sec; Peak RSS: 1.569 GB /bin/bash: 82col.fa: command not found [samfaipath] build FASTA index...

When renaming 'read_ass_ctg000790228711228712270;82col.fa' to something without a semicolumn the "flye --polish-target" command works and produces a 'polished_1.fasta' file. This one took about 1.5 minutes to finish.

I also tried with reads that failed and did not have any semicolumns. The reads were:

read_ass_ctg000820_pilon1415195614151957151col.fa read_ass_ctg000820_pilon2460972024609721235col.fa read_ass_ctg000820_pilon59357925935920128exp.fa read_ass_ctg000820_pilon1629295116292952438col.fa read_ass_ctg000820_pilon2608646526086466115col.fa read_ass_ctg000820_pilon59446885944878190exp.fa read_ass_ctg000820_pilon163203163577374exp.fa read_ass_ctg000820_pilon263362802633635777exp.fa read_ass_ctg000820_pilon70340777034078756col.fa read_ass_ctg000820_pilon171396931713969468col.fa read_ass_ctg000820_pilon28437792844287508exp.fa read_ass_ctg000820_pilon84742718474272375col.fa read_ass_ctg000820_pilon2002869320028694124col.fa read_ass_ctg000820_pilon3368594336859579col.fa read_ass_ctg000820_pilon95960979596924827exp.fa read_ass_ctg000820_pilon2034524920345692443exp.fa read_ass_ctg000820_pilon4093977409405982exp.fa read_ass_ctg000820_pilon2292957422929575201col.fa read_ass_ctg000820_pilon51862655186386121exp.fa

I ran: flye --polish-target ../pilon.fasta --nano-raw assemble_workspace/read_ass_ctg000820_pilon__* --out-dir test &>> test.log and it produced a 'polished_1.fasta'.

Here is the flye.log:

[2021-12-09 15:07:22] root: INFO: Running Flye polisher [2021-12-09 15:07:22] root: DEBUG: Cmd: /space/Software/Inspector_kostas/inspector_conda/bin/flye --polish-target ../pilon.fasta --nano-raw assemble_workspace/ read_ass_ctg000820_pilon1415195614151957151col.fa assemble_workspace/read_ass_ctg000820_pilon1629295116292952438col.fa assemble_workspace/ read_ass_ctg000820_pilon163203163577374exp.fa assemble_workspace/read_ass_ctg000820_pilon171396931713969468col.fa assemble_workspace/ read_ass_ctg000820_pilon2002869320028694124col.fa assemble_workspace/read_ass_ctg000820_pilon2034524920345692443exp.fa assemble_workspace/ read_ass_ctg000820_pilon2292957422929575201col.fa assemble_workspace/read_ass_ctg000820_pilon2460972024609721235col.fa assemble_workspace/ read_ass_ctg000820_pilon2608646526086466115col.fa assemble_workspace/read_ass_ctg000820_pilon263362802633635777exp.fa assemble_workspace/ read_ass_ctg000820_pilon28437792844287508exp.fa assemble_workspace/read_ass_ctg000820_pilon3368594336859579col.fa assemble_workspace/ read_ass_ctg000820_pilon4093977409405982exp.fa assemble_workspace/read_ass_ctg000820_pilon51862655186386121exp.fa assemble_workspace/ read_ass_ctg000820_pilon59357925935920128exp.fa assemble_workspace/read_ass_ctg000820_pilon59446885944878190exp.fa assemble_workspace/ read_ass_ctg000820_pilon70340777034078756col.fa assemble_workspace/read_ass_ctg000820_pilon84742718474272375col.fa assemble_workspace/ read_ass_ctg000820_pilon95960979596924827exp.fa --out-dir test [2021-12-09 15:07:22] root: INFO: Polishing genome (1/1) [2021-12-09 15:07:28] root: INFO: Running minimap2 [2021-12-09 15:09:35] root: INFO: Separating alignment into bubbles [2021-12-09 15:16:31] root: DEBUG: Generated 123488 bubbles [2021-12-09 15:16:31] root: DEBUG: Split 235335 long bubbles [2021-12-09 15:16:31] root: DEBUG: Skipped 235540 empty bubbles [2021-12-09 15:16:31] root: DEBUG: Skipped 4 bubbles with long branches [2021-12-09 15:16:31] root: INFO: Alignment error rate: 0.074903 [2021-12-09 15:16:31] root: INFO: Correcting bubbles

This took 11 minutes and 14 seconds to run, so over the 10 minute threshold.

Maggi-Chen commented 2 years ago

Hello mmontonerin and KPapac,

Thank you for the information. In the latest release, I have added a parameter "--flyetimeout" for inspector-correct.py to set the maximal allowed runtime for each Flye job, with a default value of 20 min. Since we are correcting the structural errors individually, the expected contig length from each Flye local assembly is smaller than 50kbp (or roughly twice of the read length), which means each Flye job should finish within minutes. In KPapac's case, one job finished within 11 minutes, which is totally reasonable. But we never expect Flye to run for hours here. Based on our benchmark, when a Flye job is running for too long, it usually means there is something wrong, and eventually Flye does not generate the correct contig sequence (possible because of insufficient reads, or this region is really messy). We actually set this 10min cutoff initially to kill these kind of jobs, so that we don't waste hours in waiting. So I would not suggest to set a super large value for timeout, unless you are using really long reads (i.e., Nanopore ultralong reads).

Also, ';' in the file name can really cause problems in executing commands... I have modified the code to avoid semicolon in the file names. It should no longer be a problem. Thank you for pointing this out.

Best, Maggi

KPapac commented 2 years ago

Hey Maggi,

I tried with --flyetimeout 1800 (30 minutes).

/space/Software/Inspector_kostas/inspector-correct.py -i Inspector_A28_nextdenovo_unpolished --datatype nano-raw -o Inspector_A28_nextdenovo_insp_corrected --flyetimeout 1800 -t 20 &>> my_inspector-correct_out.log

In the my_inspector-correct_out.log I still got the same errors. my_inspector-correct_out.log.

I guess we do have quite long nanopore reads, as quite a few are over 50kb. If I understand correctly, in the 'read_ass_ctg00108020137942013902108exp.fa' should be the nanopore reads that overlap the area that should be fixed? If so, then if I do awk '/^>/ {next} {seqlen = length($0); print seqlen}' read_ass_ctg001080__2013794__2013902__108__exp.fa | sort -n to get a sorted list of the length of the sequences in the fasta file, I have more than half of the reads (82 in total) above 10kb, with the largest being at 92.6kb. Note also that we used guppy 4.3.4 as the basecaller, which is supposed to decrease the error rate in the reads.

Interestingly, the first flye job that failed according to the 'my_inspector-correct_out.log' failed after about 20 minutes after it began being processed. Last successful flye job ended at 22:26:13 and the first error occurred at 22:46:45. That shouldn't have happened I assume, since I put a 30 minute threshold for the flye job to exit.