genomic-medicine-sweden / taxprofiler

Taxonomic profiling of shotgun metagenomic data
https://nf-co.re/taxprofiler
MIT License

Test DIAMOND #27

Open LilyAnderssonLee opened 1 year ago

LilyAnderssonLee commented 1 year ago

DIAMOND is a program for finding homologs of protein and DNA sequences in a reference database.

Run DIAMOND and compare with Kraken2 results.

TO DO:

1. Build the protein database.
2. Run DIAMOND for clinical samples within the #196939

LilyAnderssonLee commented 1 year ago

I have built the DIAMOND DB from the RefSeq complete non-redundant protein sequences (~87 GB).

LilyAnderssonLee commented 1 year ago

With --run_diamond turned on, the Taxprofiler process was terminated because the server ran out of memory.

I suspect this happened because DIAMOND runs in BLASTX mode, and for some reason we cannot use blastn/blastx on hasta when the reference is too large.

@sofstam We need to address this issue with SciLifeLab IT, since BLAST will be used to validate Taxprofiler results in the future.

LilyAnderssonLee commented 11 months ago

:point_up: Memory issue has been resolved.

Some error messages from the tests:

Conclusions from standalone tests. Database: mentioned above (complete_nonredundant_protein_db). Diamond version 2.0.15 (the same version as the one in nf-core/taxprofiler v1.1.0) works fine.

Conclusions from nextflow run nf-core/taxprofiler:

The time taken for this process is determined by the size of the input files. In most of our routine cases, the unmapped reads from Bowtie2/align are smaller than 2.5 GB. Here are the timings from my tests, for reference:

read1/read2 size of one sample and running time:

- ~2.5 GB: 6 h
- ~9 GB: 24 h
- ~13 GB: 2 d 15 h

The running time also increases with the size of the database. For instance, it takes about 28 h for read1/read2 of 2.5 GB using the RefSeq protein data.
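To make the timings above easier to compare, here is a rough sketch that converts them to hours per GB of input. It only uses the three read1/read2 data points listed above (the 28 h RefSeq run is left out because it used a different database), and the helper name `hours_per_gb` is just for illustration. Three points are not enough to fit a real model, so treat the numbers as a ballpark.

```python
# Reported DIAMOND_BLASTX wall-clock times from the tests above.
# Each entry: (approximate read1/read2 size in GB, running time in hours).
timings = [
    (2.5, 6.0),
    (9.0, 24.0),
    (13.0, 2 * 24 + 15),  # 2 d 15 h = 63 h
]


def hours_per_gb(size_gb, hours):
    """Running time normalised by input size (illustrative helper)."""
    return hours / size_gb


for size_gb, hours in timings:
    rate = hours_per_gb(size_gb, hours)
    print(f"{size_gb:5.1f} GB  {hours:5.1f} h  {rate:.2f} h/GB")
```

The rate comes out at roughly 2.4 h/GB for the two smaller inputs and about 4.8 h/GB for the 13 GB sample, so the running time seems to grow faster than linearly with input size in these tests.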

sofstam commented 11 months ago

Shall we update this config and ask in taxprofiler to change the label of the process?

LilyAnderssonLee commented 11 months ago

I plan to discuss that in the Slack channel once I finish all tests.

Yes, for us, we need to update the above config.

sofstam commented 11 months ago

Sounds great!

LilyAnderssonLee commented 11 months ago

So from a practical point of view, we should use a complete non-redundant protein database. Update the config by adding these lines:

    process {
        withName: 'DIAMOND_BLASTX' {
            cpus   = { check_max( 36    * task.attempt, 'cpus'   ) }
            memory = { check_max( 72.GB * task.attempt, 'memory' ) }
            time   = { check_max( 72.h  * task.attempt, 'time'   ) }
        }
    }
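For anyone not familiar with how these requests behave on retries, here is a small Python sketch of the idea (it is not the actual Groovy code). It assumes that `check_max` simply caps each request at the pipeline-wide limit, which is my understanding of the nf-core template, and that each request scales with `task.attempt`. The limits `MAX_CPUS`, `MAX_MEMORY_GB` and `MAX_TIME_H` are made-up values for illustration, not the settings on hasta.

```python
# Hypothetical site limits (made up for this example); in a real run they
# would come from the pipeline's max_cpus / max_memory / max_time settings.
MAX_CPUS = 48
MAX_MEMORY_GB = 180
MAX_TIME_H = 120


def requested(attempt):
    """Resources requested for DIAMOND_BLASTX on a given attempt.

    Each base value (36 CPUs, 72 GB, 72 h) is multiplied by the attempt
    number and then capped at the corresponding limit.
    """
    return {
        "cpus": min(36 * attempt, MAX_CPUS),
        "memory_gb": min(72 * attempt, MAX_MEMORY_GB),
        "time_h": min(72 * attempt, MAX_TIME_H),
    }


for attempt in (1, 2, 3):
    print(attempt, requested(attempt))
```

With these example limits, the first attempt asks for 36 CPUs, 72 GB and 72 h, and later attempts are capped at the limits rather than growing without bound.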

@sofstam What do you think?