Open LilyAnderssonLee opened 1 year ago
I have built the DIAMOND database from the RefSeq complete non-redundant protein sequences (~87 GB).
The Taxprofiler process was terminated when `--run_diamond` was turned on, due to a lack of memory on the server. I suspect this happened because of the usage of BLASTX under DIAMOND and, for some reason, we cannot use blastn/blastx on hasta when the reference is too large.
@sofstam We need to address this issue with scilifelab IT since Blast will be used in validating Taxprofiler results in the future.
:point_up: Memory issue has been resolved.
Some error messages from the tests:
Diamond UPPMAX database doesn't work.
Diamond version 2.1.8 has an error.
Conclusions from standalone tests. Database: mentioned above (complete_nonredundant_protein_db). Diamond version 2.0.15 (the same version as the one in nf-core/taxprofiler v1.1.0) works fine.
Conclusions from `nextflow run nf-core/taxprofiler`:
withName: 'DIAMOND_BLASTX' {
    cpus   = { check_max( 36 * task.attempt, 'cpus' ) }
    memory = { check_max( 120.GB * task.attempt, 'memory' ) }
    time   = { check_max( 72.h * task.attempt, 'time' ) }
}
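To make the retry behaviour of this config easier to reason about, here is a minimal Python sketch of the `base * task.attempt` pattern combined with a cap, which is how `check_max` is generally used in nf-core pipelines. The function name `scaled_request` and the cap values are hypothetical and chosen only for illustration; the real caps come from the pipeline's maximum-resource parameters on the server.

```python
# Sketch of how a resource request grows with each retry and is capped.
# scaled_request and the MAX_* values are illustrative assumptions,
# not part of nf-core/taxprofiler.

def scaled_request(base, attempt, cap):
    """Return the amount requested on a given attempt, never above the cap."""
    return min(base * attempt, cap)

# Hypothetical caps for one server
MAX_CPUS = 36
MAX_MEMORY_GB = 180
MAX_TIME_H = 168

for attempt in (1, 2, 3):
    cpus = scaled_request(36, attempt, MAX_CPUS)
    mem = scaled_request(120, attempt, MAX_MEMORY_GB)
    hours = scaled_request(72, attempt, MAX_TIME_H)
    print(f"attempt {attempt}: cpus={cpus}, memory={mem} GB, time={hours} h")
```

With these assumed caps, the CPU request stays at 36 on every attempt, while memory and time grow until they hit their limits.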
The time taken for this process is determined by the size of the input files. In most of our routine cases, the unmapped reads from Bowtie2/align are smaller than 2.5 GB. Here is a reference from my tests:
read1/read2 of one sample ~ 13 GB: 2 d 15 h; ~2.5 GB: 6 h; ~9 GB: 24 h
The running time also increases with the size of the database. For instance, it takes about 28 h for read1/read2 of 2.5 GB using the RefSeq protein data.
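As a rough way to check whether a sample is likely to fit within the 72 h time limit, the three timings listed above can be interpolated. This is only a sketch: the piecewise-linear model and the helper names `estimate_hours` and `fits_time_limit` are assumptions, and the numbers apply to that test setup only, since the run time changes with the database (28 h for 2.5 GB with the RefSeq protein data).

```python
# Timings reported in the tests above:
# (size of read1/read2 in GB, wall time in hours); 2 d 15 h = 63 h.
OBSERVED = [(2.5, 6.0), (9.0, 24.0), (13.0, 63.0)]

def estimate_hours(size_gb, points=OBSERVED):
    """Piecewise-linear estimate; extrapolates from the nearest end segment."""
    pts = sorted(points)
    # Walk the segments and stop at the first one whose right end covers size_gb.
    # If none does, the last segment is kept and used for extrapolation.
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if size_gb <= x1:
            break
    hours = y0 + (y1 - y0) * (size_gb - x0) / (x1 - x0)
    return max(hours, 0.0)  # avoid negative estimates for very small inputs

def fits_time_limit(size_gb, limit_h=72.0):
    """True if the estimated run time is within the first-attempt time limit."""
    return estimate_hours(size_gb) <= limit_h

for size in (2.5, 9.0, 11.0, 13.0):
    print(f"{size} GB: ~{estimate_hours(size):.1f} h, fits 72 h: {fits_time_limit(size)}")
```

The estimate is only as good as three data points allow, so it is better treated as a sanity check than as a prediction.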
Shall we update this config and ask in taxprofiler to change the label of the process?
I plan to discuss that in the Slack channel once I finish all tests.
Yes, for us, we need to update the above config.
Sounds great!
So from the practical point of view, we should use a complete non-redundant protein database. Update the config by adding these lines.
process {
    withName: 'DIAMOND_BLASTX' {
        cpus   = { check_max( 36 * task.attempt, 'cpus' ) }
        memory = { check_max( 72.GB * task.attempt, 'memory' ) }
        time   = { check_max( 72.h * task.attempt, 'time' ) }
    }
}
@sofstam What do you think?
DIAMOND is a program for finding homologs of protein and DNA sequences in a reference database.
Run DIAMOND and compare with Kraken2 results.
TO DO:
1. Build the protein database
2. Run DIAMOND for clinical samples within the #196939
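For the standalone part of these two steps, the DIAMOND calls could be assembled roughly as follows. This is a sketch, not the commands actually used in the tests: all file and database names are hypothetical, the thread count simply mirrors the 36 CPUs in the config, and a database meant for taxonomic comparison with Kraken2 would need additional taxonomy options at build time that are not shown here.

```python
import shlex

def makedb_cmd(fasta, db):
    """argv for building a DIAMOND database from a protein FASTA file."""
    return ["diamond", "makedb", "--in", fasta, "--db", db]

def blastx_cmd(db, query, out, threads=36):
    """argv for a translated search of DNA reads against the protein database."""
    return ["diamond", "blastx", "--db", db, "--query", query,
            "--out", out, "--threads", str(threads)]

# Hypothetical file names, for illustration only
commands = [
    makedb_cmd("proteins.faa", "refseq_protein"),
    blastx_cmd("refseq_protein", "sample_unmapped.fastq.gz", "sample.diamond.tsv"),
]

# Print the commands instead of running them, so they can be reviewed first
for cmd in commands:
    print(shlex.join(cmd))
```

Building the argument lists separately from running them makes it easy to log the exact command for each sample before submitting it.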