genomic-medicine-sweden / taxprofiler

Taxonomic profiling of shotgun metagenomic data
https://nf-co.re/taxprofiler
MIT License

Test DIAMOND #27

Open LilyAnderssonLee opened 1 year ago

LilyAnderssonLee commented 1 year ago

DIAMOND is a program for finding homologs of protein and DNA sequences in a reference database.

Run DIAMOND and compare with Kraken2 results.

TO DO:

1. Build the protein database.
2. Run DIAMOND for clinical samples within the #196939

LilyAnderssonLee commented 1 year ago

I have built the DIAMOND database from the RefSeq complete non-redundant protein sequences (~87 GB).

LilyAnderssonLee commented 1 year ago

The Taxprofiler process was terminated when --run_diamond was turned on due to a lack of memory on the server.

I suspect this happened because DIAMOND is run in BLASTX mode; for some reason, we cannot use blastn/blastx on hasta when the reference is too large.

@sofstam We need to address this issue with SciLifeLab IT, since BLAST will be used to validate Taxprofiler results in the future.

LilyAnderssonLee commented 1 year ago

:point_up: Memory issue has been resolved.

Some error messages from the tests:

Conclusions from standalone tests, using the database mentioned above (complete_nonredundant_protein_db): DIAMOND version 2.0.15 (the same version as the one in nf-core/taxprofiler v1.1.0) works fine.

Conclusions from nextflow run nf-core/taxprofiler:

The time taken by this process is determined by the size of the input files. In most of our routine cases, the unmapped reads from Bowtie2/align are smaller than 2.5 GB. Here are reference timings from my tests:

Input size (read1/read2 of one sample) and running time:

- ~2.5 GB: 6 h
- ~9 GB: 24 h
- ~13 GB: 2 d 15 h

The running time also increases with the size of the database. For instance, it takes about 28 h for read1/read2 of 2.5 GB using the RefSeq protein data.

sofstam commented 1 year ago

Shall we update this config and ask in taxprofiler to change the label of the process?

LilyAnderssonLee commented 1 year ago

I plan to discuss that in the Slack channel once I finish all tests.

Yes, for us, we need to update the above config.

sofstam commented 1 year ago

Sounds great!

LilyAnderssonLee commented 1 year ago

So, from a practical point of view, we should use a complete non-redundant protein database and update the config by adding these lines:

process {
    withName: 'DIAMOND_BLASTX' {
        cpus   = { check_max( 36 * task.attempt, 'cpus' ) }
        memory = { check_max( 72.GB * task.attempt, 'memory' ) }
        time   = { check_max( 72.h * task.attempt, 'time' ) }
    }
}
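One thing worth checking alongside these lines: in nf-core pipelines of this generation, `check_max` clamps every request to `params.max_cpus`, `params.max_memory` and `params.max_time`, so the 36 CPUs, 72 GB and 72 h requested above only take effect if those caps are at least that high. A minimal sketch of the caps, assuming the standard nf-core parameter names; the values are placeholders chosen for illustration, not the actual limits on hasta:

```groovy
// Illustrative caps only; replace with the real limits of the cluster.
params {
    max_cpus   = 36        // assumed: covers the 36 CPUs requested on the first attempt
    max_memory = '144.GB'  // assumed: leaves room for a retry at 72.GB * 2
    max_time   = '144.h'   // assumed: leaves room for a retry at 72.h * 2
}
```

With these example values, a retried task (`task.attempt` = 2) would get 144 GB and 144 h, while the CPU request of 72 would be clamped back to 36. Both blocks can live in one custom file (here hypothetically named `custom.config`) passed to the run with Nextflow's `-c` option.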

@sofstam What do you think?