Test DIAMOND - Githubissues

LilyAnderssonLee commented 1 year ago

DIAMOND is a program for finding homologs of protein and DNA sequences in a reference database.

Run DIAMOND and compare with Kraken2 results.

TO DO:
1. Build the protein database.
2. Run DIAMOND for clinical samples within the #196939

LilyAnderssonLee commented 1 year ago

I have built the DIAMOND database from the RefSeq complete non-redundant protein sequences (~87 GB).
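For reference, a minimal sketch of how a `diamond makedb` call with taxonomy information could be assembled. This is not necessarily how the database above was built; all file names are hypothetical placeholders, and only the `--in`, `--db`, `--taxonmap`, `--taxonnodes` and `--taxonnames` options of `diamond makedb` are used.

```python
import shlex


def makedb_command(fasta, db, taxonmap, taxonnodes, taxonnames):
    """Assemble a `diamond makedb` command that embeds taxonomy in the database."""
    return [
        "diamond", "makedb",
        "--in", fasta,               # protein FASTA input
        "--db", db,                  # output database name
        "--taxonmap", taxonmap,      # accession-to-taxid mapping file
        "--taxonnodes", taxonnodes,  # NCBI taxonomy nodes.dmp
        "--taxonnames", taxonnames,  # NCBI taxonomy names.dmp
    ]


# All paths below are hypothetical placeholders, not the files used in this issue.
cmd = makedb_command(
    fasta="complete_nonredundant_protein.faa.gz",
    db="complete_nonredundant_protein_db",
    taxonmap="prot.accession2taxid.FULL.gz",
    taxonnodes="nodes.dmp",
    taxonnames="names.dmp",
)

# Print the command so it can be inspected before running it,
# e.g. with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the argument list first makes it easy to confirm that the taxonomy options are present before starting a long database build.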

LilyAnderssonLee commented 1 year ago

The Taxprofiler process was terminated when --run_diamond was turned on due to a lack of memory on the server.

I suspect this happened because of the usage of BLASTX under DIAMOND, and for some reason, we cannot use blastn/blastx on hasta when the reference is too large.

@sofstam We need to address this issue with SciLifeLab IT, since BLAST will be used to validate Taxprofiler results in the future.

LilyAnderssonLee commented 1 year ago

:point_up: Memory issue has been resolved.

Some error messages from the tests:

The DIAMOND database on UPPMAX doesn't work.
- Error: Options require taxonomy information included in the database. Please use the respective options to build this information into the database when running diamond makedb: taxonomy mapping information (--taxonmap option), taxonomy nodes information (--taxonnodes option)

DIAMOND version 2.1.8 fails while loading the query sequences.
- Loading query sequences... Error: Unequal number of sequences in paired read files.
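One way to check whether the paired read files themselves are unbalanced (rather than the error being specific to version 2.1.8) is to count the records in each file. The sketch below assumes standard four-line FASTQ records, optionally gzip-compressed; the function names are illustrative only.

```python
import gzip


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def pairs_match(read1, read2):
    """Return (counts_equal, n_read1, n_read2) for a pair of FASTQ files."""
    n1 = count_fastq_records(read1)
    n2 = count_fastq_records(read2)
    return n1 == n2, n1, n2
```

Calling `pairs_match("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (hypothetical file names) returns whether the counts agree together with both counts. If they agree, the input files are probably not the cause of the error.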

Conclusions from standalone tests. Database: mentioned above (complete_nonredundant_protein_db). Diamond version 2.0.15 (the same version as the one in nf-core/taxprofiler v1.1.0) works fine.

DIAMOND takes significantly longer than Kraken2, approximately 10 times as long, as stated in the paper "Benchmarking Metagenomics Tools for Taxonomic Classification".

Conclusions from nextflow run nf-core/taxprofiler:

The DIAMOND_BLASTX process was killed because it hit the maximum time limit. DIAMOND_BLASTX is labeled as process_medium, so we should increase its maximum CPUs, memory, and time.
To do that, clone the taxprofiler repo and modify the DIAMOND_BLASTX entry in base.config:

    withName: 'DIAMOND_BLASTX' {
        cpus   = { check_max( 36     * task.attempt, 'cpus'   ) }
        memory = { check_max( 120.GB * task.attempt, 'memory' ) }
        time   = { check_max( 72.h   * task.attempt, 'time'   ) }
    }
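To illustrate what this configuration requests on each retry, here is a small Python sketch, assuming the intended expressions are `value * task.attempt`. The `check_max` below is a simplified stand-in for the nf-core helper (it just caps the request at a limit), and the ceilings are hypothetical values, not the actual limits on our cluster.

```python
def check_max(requested, limit):
    """Simplified stand-in: cap a requested resource at a configured limit."""
    return min(requested, limit)


# Hypothetical ceilings, standing in for the pipeline's max CPU, memory and time settings.
MAX_CPUS = 36
MAX_MEMORY_GB = 180
MAX_TIME_H = 240


def diamond_blastx_resources(attempt):
    """Resources requested for DIAMOND_BLASTX on a given attempt number."""
    return {
        "cpus": check_max(36 * attempt, MAX_CPUS),
        "memory_gb": check_max(120 * attempt, MAX_MEMORY_GB),
        "time_h": check_max(72 * attempt, MAX_TIME_H),
    }


for attempt in (1, 2, 3):
    print(attempt, diamond_blastx_resources(attempt))
```

With these assumed ceilings, a retry mainly buys more wall time: CPUs are already at the cap on the first attempt and memory reaches its cap on the second.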

The time taken by this process is determined by the size of the input files. In most of our routine cases, the unmapped reads from Bowtie2/align are smaller than 2.5 GB. Here are the timings from my tests for reference:

read1/read2 of one sample:
- ~2.5 GB: 6 h
- ~9 GB: 24 h
- ~13 GB: 2 d 15 h

The run time also increases with the size of the database. For instance, it takes about 28 h for read1/read2 of 2.5 GB using the RefSeq protein data.
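As a quick sanity check of the timings above against the 72 h limit, the sketch below converts each test to hours per GB and checks whether it fits in a single attempt. The input sizes are approximate, so the rates are only rough indications.

```python
# (input size in GB, wall time in hours) from the tests above; 2 d 15 h = 63 h.
timings = [(2.5, 6.0), (9.0, 24.0), (13.0, 63.0)]

TIME_LIMIT_H = 72  # time requested for the first attempt in the config


def hours_per_gb(size_gb, hours):
    """Wall time per GB of input reads."""
    return hours / size_gb


def fits_first_attempt(hours, limit_h=TIME_LIMIT_H):
    """True if the run would finish within the first attempt's time limit."""
    return hours <= limit_h


for size_gb, hours in timings:
    rate = hours_per_gb(size_gb, hours)
    print(f"{size_gb} GB: {hours} h, {rate:.2f} h/GB, "
          f"fits in {TIME_LIMIT_H} h: {fits_first_attempt(hours)}")
```

All three tests fit within 72 h, but the largest sample is already close to the limit, and the hours per GB are not constant across the three tests.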

sofstam commented 1 year ago

Shall we update this config and ask in taxprofiler to change the label of the process?

LilyAnderssonLee commented 1 year ago

I plan to discuss that in the Slack channel once I finish all tests.

Yes, for us, we need to update the above config.

sofstam commented 1 year ago

Sounds great!

LilyAnderssonLee commented 1 year ago

So from the practical point of view, we should use a complete non-redundant protein database. Update the config by adding these lines.

    process {
        withName: 'DIAMOND_BLASTX' {
            cpus   = { check_max( 36    * task.attempt, 'cpus'   ) }
            memory = { check_max( 72.GB * task.attempt, 'memory' ) }
            time   = { check_max( 72.h  * task.attempt, 'time'   ) }
        }
    }

@sofstam What do you think?

genomic-medicine-sweden / taxprofiler

Test DIAMOND #27