cmks / DAS_Tool


`Contigs of contig2bin files not found in assembly` error however all contigs present in both? #90

Closed jfy133 closed 1 year ago

jfy133 commented 1 year ago

Hello,

I have hit an issue similar to one of those in https://github.com/cmks/DAS_Tool/issues/78, where I get this error:

Analyzing assembly 
Error: Contigs of contig2bin files not found in assembly: 36 
k141_271 flag=1 multi=3.0000 len=1018,
k141_273 flag=1 multi=7.0000 len=1463,
k141_837 flag=1 multi=2.0000 len=1035,
k141_1014 flag=1 multi=2.0000 len=1083,
k141_350 flag=1 multi=2.0000 len=1336,... 

However, when I search the assembly FASTA for each of the 36 contigs listed in the contig2bin file, I find all of them. I searched with grep, so these should be exact matches, and the contig2bin file is the direct output of CONCOCT.

I was wondering if anyone could help identify where the mismatch is happening...?

dastool_contig2binerror.zip

jfy133 commented 1 year ago

Ok @alexhbnr has kindly identified the problem for me:

In line 457, the script DAS_Tool.R parses the file MEGAHIT-DASTool-test_minigut_sample2.seqlength and splits the contig names on the first space. Therefore, the original contig name k141_271 flag=1 multi=3.0000 len=1018 is shortened to k141_271. 

However, this is only done when parsing the *.seqlength file but not the file test_minigut_sample2.tsv.

In line 482, it then compares the contig names between the *.seqlength table and the test_minigut_sample2.tsv file and doesn't find any overlaps because the former are shortened.
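The mismatch described above can be reproduced in miniature. This is only a sketch of the logic as described in this comment, not DAS Tool's actual R code; the variable names are hypothetical, and the two headers are taken from the error message.

```python
# Hypothetical miniature of the two tables; not DAS Tool's actual code.
fasta_headers = [
    "k141_271 flag=1 multi=3.0000 len=1018",
    "k141_273 flag=1 multi=7.0000 len=1463",
]

# The contig2bin TSV keeps the full header as the contig name.
contig2bin_names = list(fasta_headers)

# Seqlength-side parsing: keep only the text before the first space.
seqlength_names = {header.split(" ", 1)[0] for header in fasta_headers}

# Comparing full names against shortened names finds no overlap,
# so every contig is reported as missing from the assembly.
missing = [name for name in contig2bin_names if name not in seqlength_names]
print(f"Contigs of contig2bin files not found in assembly: {len(missing)}")
```

A grep for the full header still succeeds against the FASTA, which is why the contigs appear to be present in both files.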

There are two possible fixes: either we shorten the contig names in the TSV file that we provide, which avoids the splitting issue, or we make a PR to DAS Tool so that it performs the splitting on both tables.
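As a rough illustration of the first option, the contig names in the contig2bin TSV could be cut at the first space before the file is passed to DAS Tool. This is a sketch under assumptions: the function name `shorten_contig_names` and the bin label `bin.1` are made up for the example, and the TSV is assumed to have exactly two tab-separated columns (contig name, bin).

```python
def shorten_contig_names(tsv_text: str) -> str:
    """Cut each contig name at its first space, leaving the bin column as-is.

    Assumes a two-column, tab-separated contig2bin table (contig, bin).
    """
    lines = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        contig, bin_id = line.split("\t", 1)
        short_name = contig.split(" ", 1)[0]
        lines.append(f"{short_name}\t{bin_id}")
    return "\n".join(lines) + "\n"


# Hypothetical example row; "bin.1" is a made-up bin label.
example = "k141_271 flag=1 multi=3.0000 len=1018\tbin.1\n"
print(shorten_contig_names(example), end="")
```

The second option would apply the same first-space split inside DAS Tool itself, to both tables, so that users would not need to preprocess their files.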

I'm not sure yet where the contig name shortening is happening in our pipeline (I've certainly not done this actively), but I will have a look and report back if it's my fault.

cmks commented 1 year ago

I've pushed a fix for this issue into the master branch. You can try running your above example again using the attached modified contig2bin file (which matches the single copy genes found on your fasta file). test_minigut_sample2.tsv.zip

jfy133 commented 1 year ago

Hi @cmks, thanks for the fast fix! Unfortunately, I accidentally deleted the pipeline run (nf-core/mag) results where I hit the error. I will run it again to reproduce the error and test the fix, but I'm at a workshop for the next couple of days, so it might take a bit of time to confirm the fix (sorry about that!)

jfy133 commented 1 year ago

Ok, fortunately one of the workshop sessions was not necessary for me, so I was able to test this. I can confirm it works - thank you very much!

jfy133 commented 1 year ago

Ah, one more question @cmks: do you have a rough ETA for a patch release containing the fix?

For my particular purpose I would need the DAS_Tool bioconda recipe to be updated so that I can use it in the pipeline I'm working on.

tanaes commented 1 year ago

I'll also need an updated Conda recipe; I just ran into this myself!

cmks commented 1 year ago

Done: https://github.com/cmks/DAS_Tool/releases/tag/1.1.6. The bioconda recipe should update automatically after some time.