ekimb / rust-mdbg

minimizer-space de Bruijn graphs (mdBG) for whole genome assembly
MIT License
176 stars 18 forks source link

Assembly graph looks right, but no sequence in the simplified gfa or its fasta. #28

Closed mildlyhuman closed 2 years ago

mildlyhuman commented 2 years ago

Hello,

I tried to replicate the Zymo D6331 assembly described in the mDBG manuscript. The assembly and the graph simplification finished successfully, but sequences were all placeholders. Somehow I can get correct output for the example case. I wonder what I did wrong.

A related question: I did not seem to find the assemblies from the manuscript. Could you please point me to them?

Thanks in advance; relevant info listed below.

expected result

Raw assembly graph, simplified assembly graph populated with sequences, and a fasta file of the simplified graph.

what did not work

Raw assembly graph is produced. Simplified graph is produced, but not sequences (all "*"). The fasta file also contain "*" instead of contig sequences.

environment

OS: Ubuntu Linux

mDBG installation: mamba install -c bioconda rust-mdbg, also git clone for graph simplification scripts.

mDBG version: bioconda points to 1.0.1, but rust-mdbg --version prints rust-mdbg 0.1.0. I am positive that there is only one rust-mdbg executable in my $PATH (by checking with command -v).

Other tools:

input data

Downloaded SRR13128014 then seqtk seq -UA thereads.fq.gz | pigz -p32 - > input.fa.gz.

zcat input.fa.gz | head -4 | less -S gives:

>m64015_200911_223407/2/ccs
AGTCTTGCTTTCAATCCAATTTATAAAAGATCCTATAGCACCAATTTTTATGGTATTAAAGGAGCCAACATGTTGTCTTTCATATGTTGCTTGTCCAGCATCAGCTAAATATACAAGTTG[....]
>m64015_200911_223407/75433587/ccs
CTCTTTGATCGATCAGGCTTCGGAATCCGGCATTATTTCTATGCTGATGGAAGTTGCCGGCATTCTGGGTGCCATGATGGTCGGTGTTCTGATTGCTTCCAATATCAAGATCAACATTGC[....]

the run

rust-mdbg input.fa.gz -k 40 -l 12 --density 0.004 --threads 48  -p asm2 1>dbg_STDOUT 2>dbg_STDERR
/path/rust-mdbg/utils/magic_simplify asm2 1>>dbg_STDOUT 2>>dbg_STDERR

... which is stored in a file and executed via . job.sh. (I have also tried the -k 21 -l 14 --density 0.003 which was used for ATCC dataset.)

output files

ls | grep fa$ gives:

asm2.gfa
asm2.msimpl.fa
asm2.msimpl.gfa

ls | grep asm2 | grep sequences | wc -l gives: 48

head -6 asm2.msimpl.gfa gives:

S       utg0000001      *       LN:i:2916334
A       utg0000001      0       +       utg0000001      0       0
A       utg0000001      1685167 +       utg0000017      0       0
A       utg0000001      1909424 -       utg0000217      0       0
A       utg0000001      2119534 -       utg0000049      0       0
A       utg0000001      2353766 -       utg0000108      0       0

head -6 asm2.msimpl.fa says:

>utg0000001
*
>utg0000002
*
>utg0000003
*

dbg_STDOUT says:

Input file: input.fa.gz 
Format: FASTA 
Warning: Using default minimum k-mer abundance value (2). 
Warning: Using default pre-simp value (0.01). 
Parsing input sequences... 
Number of reads: 1978852 
Number of nodes before abundance filter: 4090733 
Number of nodes after abundance filter: 343441 
Number of edges: 682146 
Pre-simp = 0.01: 970 edges removed. 
Total execution time: 103.983139133s 
Maximum RSS: 1.6727676GB 
Done parsing unitigs GFA, got 2075 unitigs. 
Done parsing original GFA, with 311224 k-min-mers. 
Done parsing .sequences file, recorded 311224 sequences. 
Total execution time: 4.496757754s 
Maximum RSS: 0.13537979 GB 

dbg_STDERR says:

^M5244821984 / 5244821984 [==============================] 100.00 % 54334367.95/s ^MConverted reads to k-min-mers.                                                  + command -v gfatools
+ /usr/bin/time gfatools asm asm2.gfa -t 10,50000 -t 10,50000 -b 100000 -b 100000 -t 10,50000 -b 100000 -b 100000 -b 100000 -t 10,50000 -b 100000 -t 10,50000 -b 1000000 -t 10,150000 -b 1000000 -u
[M::main] Version: 0.3-r157-dirty
[M::main] CMD: gfatools asm -t 10,50000 -t 10,50000 -b 100000 -b 100000 -t 10,50000 -b 100000 -b 100000 -b 100000 -t 10,50000 -b 100000 -t 10,50000 -b 1000000 -t 10,150000 -b 1000000 -u asm2.gfa
[M::main] Real time: 1.563 sec; CPU: 1.564 sec
1.44user 0.12system 0:01.65elapsed 94%CPU (0avgtext+0avgdata 135648maxresident)k
53448inputs+19768outputs (0major+38040minor)pagefaults 0swaps
+ python /path/bin/rust-mdbg/utils/gfa_break_loops.py asm2.tmp1.gfa
+ [[ ! asm2 == *--old-behavior* ]]
+ cargo run --manifest-path /path/bin/rust-mdbg/utils/../Cargo.toml --release --bin to_basespace -- --gfa asm2.tmp2.gfa --sequences asm2
    Finished release [optimized] target(s) in 0.63s
     Running `/path/bin/rust-mdbg/target/release/to_basespace --gfa asm2.tmp2.gfa --sequences asm2`
+ mv asm2.tmp2.gfa.complete.gfa asm2.tmp2.gfa
++ ls -Lon asm2.tmp2.gfa
++ awk '{ print $4 }'
+ filesize=63262102
+ current_graph=asm2.tmp2.gfa
+ ((  filesize > 1000000 ))
+ /usr/bin/time gfatools asm asm2.tmp2.gfa -t 10,50000 -b 100000 -t 10,100000 -b 1000000 -t 10,150000 -b 1000000 -u
[M::main] Version: 0.3-r157-dirty
[M::main] CMD: gfatools asm -t 10,50000 -b 100000 -t 10,100000 -b 1000000 -t 10,150000 -b 1000000 -u asm2.tmp2.gfa
[M::main] Real time: 0.192 sec; CPU: 0.192 sec
0.10user 0.08system 0:00.23elapsed 79%CPU (0avgtext+0avgdata 67132maxresident)k
123560inputs+176outputs (0major+16999minor)pagefaults 0swaps
+ current_graph=asm2.tmp3.gfa
+ ((  filesize > 100000000 ))
+ mv asm2.tmp3.gfa asm2.msimpl.gfa
+ [[ asm2 != *\-\-\k\e\e\p* ]]
+ rm -rf asm2.tmp1.gfa asm2.tmp2.gfa
+ bash /path/bin/rust-mdbg/utils/gfa2fasta.sh asm2.msimpl
rchikhi commented 2 years ago

Hi, we didn't publish the assemblies made in the manuscript, as they're of draft quality.

One thing we didn't make clear in the article, and sorry about that, it's that we didnt use magic_simplify in metagenomes, but instead a modified version. Running https://github.com/ekimb/rust-mdbg/blob/master/utils/magic_simplify_meta instead of magic_simplify should produce our results, modulo version differences.

It is a bit odd that you're getting '*'s instead of sequences. If it is still the case after using magic_simplify_meta, let us know.

mildlyhuman commented 2 years ago

Thank you very much for the clarification. The magic_simplify_meta outputed sequences correctly and the graph looked better in Zymo D6331.