cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0

Want to download all the required databases manually #84

Open bloomarun opened 11 months ago

bloomarun commented 11 months ago

Hello. The db_builder provided is great, but it is pretty slow. I want to download all the databases manually (using a multi-threaded download tool like axel) and then build them using the builder script. How can I go about that? I am missing out on the best annotator out there due to small glitches like these...

bdaisley commented 11 months ago

@bloomarun A quick, short-term solution for me was running microbeannotator_db_builder until Step 6 (Download TrEMBL Proteins) and then cancelling it during the download with a keyboard interrupt (Ctrl+C). I then manually downloaded the TrEMBL proteins (Step 6) and TrEMBL annotations (Step 7) using aria2c, but any multi-threaded download tool should work.

Example code:

#Manual Step 6 download for TrEMBL proteins:
aria2c -x 16 -s 16 https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz

#Manual Step 7 download for TrEMBL annotations:
aria2c -x 16 -s 16 https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
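A multi-connection download of files this large can end up truncated if it gets interrupted, so it may be worth testing both archives before going further. Below is a minimal sketch using `gzip -t`; the function name `check_gzip` is made up for illustration and is not part of MicrobeAnnotator.

```shell
# Hypothetical helper: report whether a downloaded .gz archive is intact.
# gzip -t tests the compressed stream without writing any output file.
check_gzip() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT or missing: $1" >&2
        return 1
    fi
}

# Example usage, once both downloads have finished:
# check_gzip uniprot_trembl.fasta.gz && check_gzip uniprot_trembl.dat.gz
```

If a check fails, re-running the same aria2c command in the same directory should normally let it continue the partial download rather than start over, provided its control file is still there.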

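The TrEMBL proteins file needs to be saved in the "protein_db" directory as "uniprot_trembl.fasta" (create the "protein_db" directory if it is not already present in the main MicrobeAnnotator database directory, i.e. the one passed to -d). Note that the download is gzip-compressed while the target name has no .gz extension, so the file presumably needs to be decompressed (e.g. with gunzip) before it is put in place. Example: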

~/MicrobeAnnotator_DB/protein_db/uniprot_trembl.fasta

The TrEMBL annotations file needs to be saved in the "temp_trembl_dat_files" directory as "uniprot_trembl.dat.gz" (create the "temp_trembl_dat_files" directory if it is not already present in the main MicrobeAnnotator database directory). Example:

~/MicrobeAnnotator_DB/temp_trembl_dat_files/uniprot_trembl.dat.gz
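The file placement described above can be sketched as a small shell helper. This is only an illustration under a few assumptions: the function name `place_trembl_files` is hypothetical, both downloads are assumed to sit in one directory under their original names, and the protein FASTA is assumed to be needed uncompressed because the target name given above has no .gz extension.

```shell
# Hypothetical helper: move manually downloaded TrEMBL files into the
# layout described above.
#   $1 = directory holding the two downloaded .gz files
#   $2 = main MicrobeAnnotator database directory
place_trembl_files() {
    # Create both target directories if they do not exist yet.
    mkdir -p "$2/protein_db" "$2/temp_trembl_dat_files" || return 1

    # Assumption: the builder expects an uncompressed FASTA here.
    gunzip -c "$1/uniprot_trembl.fasta.gz" \
        > "$2/protein_db/uniprot_trembl.fasta" || return 1

    # The annotations archive stays gzipped and keeps its name.
    mv "$1/uniprot_trembl.dat.gz" \
        "$2/temp_trembl_dat_files/uniprot_trembl.dat.gz"
}

# Example usage (paths are illustrative):
# place_trembl_files "$PWD" "$HOME/MicrobeAnnotator_DB"
```

`gunzip -c` leaves the original .fasta.gz untouched, which costs disk space but means a failed decompression does not destroy the download.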

After doing this, resume the microbeannotator_db_builder script at Step 8 (Parse TrEMBL Annotations) as follows (run it from the directory that contains MicrobeAnnotator_DB, or pass the full path to -d):

microbeannotator_db_builder -d MicrobeAnnotator_DB -m diamond -t 22 --step 8
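Before resuming, a quick check that both files are where they should be can save a failed run. A minimal sketch follows; `trembl_ready` is a hypothetical name, and the function only prints the resume command shown above rather than running it.

```shell
# Hypothetical pre-flight check: confirm both TrEMBL files exist and are
# non-empty, then print the resume command from the post above.
#   $1 = main MicrobeAnnotator database directory
trembl_ready() {
    for f in "$1/protein_db/uniprot_trembl.fasta" \
             "$1/temp_trembl_dat_files/uniprot_trembl.dat.gz"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty: $f" >&2
            return 1
        fi
    done
    echo "Ready. Resume with:"
    echo "microbeannotator_db_builder -d $1 -m diamond -t 22 --step 8"
}

# Example usage (path is illustrative):
# trembl_ready "$HOME/MicrobeAnnotator_DB"
```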

This approach cut my download time from ~72 hours to ~3 hours. The rest of the download steps are negligible in comparison, so I didn't bother with multi-threading for them, but I imagine the same could be done.

Hope this helps speed things up for you!