cmks / DAS_Tool

das tool halts after single copy gene detection #97

Closed rgtaketani closed 11 months ago

rgtaketani commented 1 year ago

Dear Christian,

I have been using DAS Tool quite successfully for the past year. Currently, however, I am working with a group of larger datasets (assembly file >30 GB and 1700 bins) that I am not able to run DAS Tool on. Unfortunately, the error does not state clearly what the problem is. I am running it on a 32-core HPC node with 370 GB of RAM running CentOS; I also tried with 750 GB of RAM. How can I remedy this problem?

My command looks like this:

DAS_Tool -i ${WORKDIR}/mh_metabat_scaf2bin.tsv,${WORKDIR}/mh_concoct_scaf2bin.tsv \
             -l MH_metabat,MH_concoct \
             -c /home/data/mmeg_mags/Highfield/megahit/${SAMPLE}/final.contigs.fa \
             -o ${WORKDIR}/DASToolRun \
             --write_bin_evals 0 \
             --threads 32 \
             --score_threshold 0.6 \
             --write_bins 1 \
             --search_engine diamond \
             --db_directory ${WORKDIR}/db \
             --debug

The error is below.

Running DAS Tool using 32 threads.
predicting genes using Prodigal V2.6.3: February, 2016
identifying single copy genes using diamond version 2.0.13
Error in setnames(ans, col.names) :
  Can't assign 2 names to a 1 column data.table
Calls: cherry_pick -> fread -> setnames
Execution halted

DASToolRun_DASTool.log

I am uploading the log file. Thanks, Rodrigo

cmks commented 1 year ago

Hi Rodrigo, it is very likely that this bug has been fixed in the latest version. If updating doesn't fix the problem, I'm happy to look into it further. Please note that there were some small changes to the command line: --write_bin_evals and --write_bins are now used without arguments:

DAS_Tool -i ${WORKDIR}/mh_metabat_scaf2bin.tsv,${WORKDIR}/mh_concoct_scaf2bin.tsv \
             -l MH_metabat,MH_concoct \
             -c /home/data/mmeg_mags/Highfield/megahit/${SAMPLE}/final.contigs.fa \
             -o ${WORKDIR}/DASToolRun \
             --write_bin_evals \
             --threads 32 \
             --score_threshold 0.6 \
             --write_bins \
             --search_engine diamond \
             --db_directory ${WORKDIR}/db \
             --debug
rgtaketani commented 1 year ago

Hi Christian,

Thanks for the reply. I have installed the latest version, but I still get the error below.

Error in setnames(ans, col.names) : 
  Can't assign 2 names to a 1 column data.table
Calls: %>% -> fread -> setnames
In addition: Warning message:
In fread(paste0(arguments$outputbasename, ".seqlength"), sep = "\t",  :
  Column name 'V2' (colClasses[[2]][1]) not found
Execution halted

log file

DAS Tool 1.1.6
2023-06-26 16:45:00

Parameters:
--bins  /home/data/mmeg_mags/Highfield/das_tool/merged/mh_metabat_scaf2bin.tsv,/home/data/mmeg_mags/Highfield/das_tool/merged/mh_concoct_scaf2bin.tsv
--contigs       /home/data/mmeg_mags/Highfield/megahit/merged/final.contigs.fa
--outputbasename        /home/data/mmeg_mags/Highfield/das_tool/merged/DASToolRun
--labels        MH_metabat,MH_concoct
--search_engine diamond
--proteins      NULL
--write_bin_evals       TRUE
--write_bins    TRUE
--write_unbinned        FALSE
--threads       32
--score_threshold       0.6
--duplicate_penalty     0.6
--megabin_penalty       0.5
--dbDirectory   /home/data/mmeg_mags/Highfield/das_tool/merged/db
--resume        FALSE
--debug TRUE
--version       FALSE
--help  FALSE

Dependencies:
prodigal        /home/data/mmeg_ngs/rodrigo_taketani/condaenvs/das_tool/bin/prodigal
diamond /home/data/mmeg_ngs/rodrigo_taketani/condaenvs/das_tool/bin/diamond
pullseq /home/data/mmeg_ngs/rodrigo_taketani/condaenvs/das_tool/bin/pullseq
ruby    /home/data/mmeg_ngs/rodrigo_taketani/condaenvs/das_tool/bin/ruby
usearch
blastp  /home/data/mmeg_ngs/rodrigo_taketani/condaenvs/das_tool/bin/blastp

binTab:
k127_48525612   bin.100 MH_metabat
k127_11439299   bin.100 MH_metabat
....
k127_8073959    108     MH_concoct
k127_8073967    87      MH_concoct
k127_8073970    121     MH_concoct
k127_8073975    38      MH_concoct
k127_8073996    31      MH_concoct
k127_8074012    342     MH_concoct
k127_8074017    118     MH_concoct

Database directory does not exist or is incomplete:/home/data/mmeg_mags/das_tool/merged/db

                  Attempting to extract /home/data/mmeg_ngs/rodrigo_taketani/condaenvs/das_tool/share/das_tool-1.1.6-0/db.zip
Analyzing assembly

Thank you for your help.

Regards, Rodrigo
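
The fread warning above suggests DAS Tool is reading DASToolRun.seqlength as a two-column, tab-separated table and is only finding one column. A quick way to check how many tab-separated fields each row of that file actually has (a plain awk one-liner, not part of DAS Tool; the file name is taken from the log above):

awk -F'\t' 'NF != 2 {print "line " NR " has " NF " field(s): " $0}' DASToolRun.seqlength | head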

rgtaketani commented 1 year ago

Hi Christian,

I was comparing files between the failed samples and the successful ones and noticed a difference in the *.seqlength file. In the failed runs the contig length is not written in a second column (after a tab); it is written on the following line instead. See the example below.

head -20 DASToolRun.seqlength
k127_0 flag=1 multi=1.9464 len=463
463
k127_4844412 flag=1 multi=3.0000 len=518
518
k127_6459216 flag=1 multi=2.0000 len=539
539
k127_12918428 flag=1 multi=2.0000 len=379
379
k127_9688822 flag=1 multi=1.0000 len=720
720
k127_22607246 flag=1 multi=2.0000 len=438
438
k127_48444094 flag=1 multi=2.0000 len=475
475
k127_35525670 flag=1 multi=2.0000 len=600
600
k127_30681261 flag=1 multi=2.0000 len=379
379
k127_16148034 flag=1 multi=2.0000 len=463
463

Could this be related to the error?

Regards, Rodrigo

cmks commented 1 year ago

Yes, this is the problem. Contig lengths should be written as a tab-separated table. I'm not sure why there is a newline instead of a tab in your file. Is this happening only for one specific assembly? Can you please check whether the example data causes the same issue (https://github.com/cmks/DAS_Tool/tree/master#examples-running-das-tool-on-sample-data)? You can also just execute the contig_length.sh script on the example assembly and compare the output format:

src/contig_length.sh sample_data/sample.human.gut_contigs.fa contig_lengths.tsv

This is what the result should look like.
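
For illustration (reusing the contig names and lengths from the head output above, not actual sample-data output), a correctly formatted length table should look roughly like this, with the contig name and its length on the same line, separated by a tab:

k127_0	463
k127_4844412	518
k127_6459216	539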

rgtaketani commented 1 year ago

I have run DAS Tool successfully with other datasets. It ran on another dataset on the same day (about a third of the same data, but assembled with a different assembler). I tried the sample_data and it worked fine. I used contig_length.sh and it produced the desired output. I also tried contig_length.sh on a tiny subset of my assembly and it produced the two columns. However, it did not run on my full dataset because the assembly was too big for sed (34 GB).

sed: regex input buffer length larger than INT_MAX

EDIT: I ran it on a larger machine and it gave me the same two-line output.
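
If sed's input buffer limit keeps blocking contig_length.sh on the full 34 GB assembly, one possible workaround (a sketch, not DAS Tool's own script) is to compute the lengths directly with awk, which streams the FASTA line by line and keeps only the part of each header before the first space:

awk '/^>/ {if (name) print name "\t" len; name = substr($1, 2); len = 0; next} {len += length($0)} END {if (name) print name "\t" len}' final.contigs.fa > contig_lengths.tsv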

rgtaketani commented 1 year ago

I think I identified the issue. My assembly FASTA headers look like this: >k127_9688822 flag=1 multi=1.0000 len=720. I removed everything after the space so they look like this: >k127_9688822. I ran contig_length.sh and the output looked fine (two columns). Since Fasta_to_contigs2bin.sh removes everything after the space, I imagine I will not have issues after that.

I just don't understand why I only have this issue with larger files?
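
For reference, one way to do that renaming up front (a hedged sketch; keep a copy of the original assembly and regenerate the contigs2bin tables from the renamed file so contig names match everywhere) is a single awk pass over the FASTA:

awk '/^>/ {print $1; next} {print}' final.contigs.fa > final.contigs.renamed.fa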

rgtaketani commented 1 year ago

This worked. It is odd that the error didn't happen with smaller assemblies.