WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

Memory limitations #226

Closed gaferguz closed 1 year ago

gaferguz commented 1 year ago

Hi there, I've been trying for a while to annotate contigs assembled from high-coverage fastq files using different DRAM versions (1.3.5 and 1.4.0.rc3), trusting the memory requirements specified on the DRAM wiki page:

If KOfam is used to annotate KEGG and UniRef90 is not used, then less than 50 GB of RAM is required. DRAM can be run with any number of processors on a single node.

These are the numbers of input sequences for the samples that I'm trying to annotate locally (my system has 64 GB of RAM), with an average length of around 600-700 bp. The files range from 70 MB to 647 MB in size:

Plot11_CoAs.fa: 935196 sequences
Plot1617_CoAs.fa: 729800 sequences
Plot28_CoAs.fa: 127948 sequences
Plot31_CoAs.fa: 840509 sequences
Plot3637_CoAs.fa: 830278 sequences

DRAM.py annotate -i './*CoAs.fa' -o annotation --threads 8 --custom_fasta_loc /home/bioinformatica/Desktop/DRAM/DRAM_data/SCycDB_2020Mar_unique.fasta --custom_db_name SCycDB --custom_fasta_loc /home/bioinformatica/Desktop/DRAM/DRAM_data/NCyc_unique.fasta --custom_db_name NCyc --min_contig_size 900

2022-10-13 14:36:48,533 - Retrieved database locations and descriptions
2022-10-13 14:36:48,533 - Annotating Plot31_CoAs
2022-10-13 14:45:52,902 - Turning genes from prodigal to mmseqs2 db
2022-10-13 14:46:00,783 - Getting hits from kofam
2022-10-13 19:56:49,948 - Getting forward best hits from peptidase
2022-10-13 20:07:43,200 - Getting reverse best hits from peptidase
2022-10-13 20:08:04,572 - Getting descriptions of hits from peptidase
2022-10-13 20:08:05,673 - Getting hits from pfam
2022-10-13 20:13:20,861 - Getting hits from dbCAN
2022-10-13 20:18:55,260 - Getting hits from SCycDB
2022-10-13 20:18:55,260 - Getting forward best hits from SCycDB
2022-10-13 20:33:25,543 - Getting reverse best hits from SCycDB
2022-10-13 20:34:47,763 - Getting descriptions of hits from SCycDB
2022-10-13 20:34:49,929 - Getting hits from NCyc
2022-10-13 20:34:49,929 - Getting forward best hits from NCyc
2022-10-13 20:47:24,416 - Getting reverse best hits from NCyc
2022-10-13 20:48:06,088 - Getting descriptions of hits from NCyc
2022-10-13 20:48:12,689 - Merging ORF annotations

After getting all hits, DRAM seems to get stuck at the "Merging ORF annotations" step until the process is finally killed.

I would like to know the minimum memory recommendation for a dataset like mine.

rmFlynn commented 1 year ago

At this point, you are unlikely to benefit from lowering threads. You may want to try annotating your FASTA files separately and using the merger tool on the results. I will remove the language you pointed out from the docs; we should not make promises when the size of the FASTA clearly changes the requirements. You may also need to split some of the larger FASTAs.
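
For the splitting step, a minimal sketch in plain Python (standard library only) might look like the following. It is not part of DRAM: the function name `split_fasta`, the `_partNNN.fa` naming scheme, and the default of 200,000 records per chunk are all assumptions made for illustration, and the chunk size that fits in a given amount of RAM would have to be found by trial.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, max_records=200_000):
    """Split a FASTA file into chunks of at most max_records sequences.

    Hypothetical helper, not a DRAM function. Returns the chunk paths in order.
    """
    fasta_path = Path(fasta_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    out = None          # handle of the chunk currently being written
    n_in_chunk = 0      # number of records already in that chunk

    with fasta_path.open() as handle:
        for line in handle:
            if line.startswith(">"):
                # Open a new chunk at the first record, or when the
                # current chunk already holds max_records records.
                if out is None or n_in_chunk >= max_records:
                    if out is not None:
                        out.close()
                    name = f"{fasta_path.stem}_part{len(chunk_paths) + 1:03d}.fa"
                    chunk = out_dir / name
                    chunk_paths.append(chunk)
                    out = chunk.open("w")
                    n_in_chunk = 0
                n_in_chunk += 1
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)

    if out is not None:
        out.close()
    return chunk_paths
```

Calling, for example, `split_fasta("Plot11_CoAs.fa", "chunks")` would write `chunks/Plot11_CoAs_part001.fa`, `chunks/Plot11_CoAs_part002.fa`, and so on. Each chunk could then be annotated in its own `DRAM.py annotate` run and the per-chunk outputs combined with the merger tool mentioned above. Records are never split across files, so multi-line sequences stay intact.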