Putnam-Lab / Lab_Management


Making nr.dmnd file from NCBI for Diamond functional annotation step #52

Closed: daniellembecker closed this issue 1 year ago

daniellembecker commented 1 year ago

@AHuffmyer was recently working on her Mcap functional annotation and ran into an issue when making an updated nr.dmnd file: she wanted to download the most recent nr database in FASTA format from NCBI and use it to build a DIAMOND-formatted nr database, following Step 2 (Identify homologous sequences) of this protocol.

When following the protocol step:

Go to the sbatch_executables subdirectory in the Putnam Lab shared folder and run the scripts download_nr_database.sh and make_diamond_nr_db.sh, in that order (NNN is the job ID returned by the first submission):

$ sbatch download_nr_database.sh
Submitted batch job NNN
$ sbatch -d afterok:NNN make_diamond_nr_db.sh
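For context, download_nr_database.sh is essentially a batch wrapper around a wget of the NCBI archive into the shared databases directory. A minimal sketch, assuming typical Putnam Lab SBATCH settings (the header values here are guesses; the wget URL matches the one used later in this thread):

#!/bin/bash
#SBATCH --job-name="download_nr"
#SBATCH -t 24:00:00
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/shared/databases

# Fetch the latest non-redundant (nr) protein FASTA from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz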

She was running into this error in the script output:

Masking sequences...  [45.297s]
Writing sequences...  [2.931s]
Hashing sequences...  [0.406s]
Loading sequences...  [17.964s]
Masking sequences...  [45.622s]
Writing sequences...  [2.955s]
Hashing sequences...  [0.474s]
Loading sequences...  [18.119s]
Masking sequences...  [45.512s]
Writing sequences...  [2.644s]
Hashing sequences...  [0.395s]
Loading sequences...  [17.528s]
Masking sequences...  [45.368s]
Writing sequences...  [2.696s]
Hashing sequences...  [0.394s]
Loading sequences...  [38.254s]
Error: Inflate error.
diamond v2.0.0.138 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org/

No such file or directory
Error: Error opening file nr.dmnd
STOP Fri Dec 16 20:53:30 EST 2022
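DIAMOND's 'Inflate error' generally indicates that the gzipped archive it is reading is corrupt or truncated, so a quick first check is gzip's built-in integrity test (standard gzip usage; the path assumes the shared databases directory):

gzip -t /data/putnamlab/shared/databases/nr.gz && echo "archive OK" || echo "archive corrupt or truncated"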

After discussion, we thought it might be an issue with @AHuffmyer's permissions, but @daniellembecker re-ran the scripts on December 19th, 2022 and hit the same 'Inflate error'.

@daniellembecker then suspected that the nr and nr.gz databases needed to be updated, since they had last been downloaded the previous year, so she deleted the old files and re-downloaded them.
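NCBI publishes an .md5 checksum alongside each FASTA archive, so a fresh download can be verified before rebuilding the database (standard md5sum usage; paths assume the shared databases directory):

cd /data/putnamlab/shared/databases
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz.md5
md5sum -c nr.gz.md5   # prints "nr.gz: OK" when the download is intact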

@hputnam revisited this issue on January 4th, 2023 and is currently running the scripts to see whether it worked.

hputnam commented 1 year ago

My guess is this was a space issue that did not allow the file to be expanded/created.
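If space is the suspect, checking free space and the size of the existing archives before submitting is a cheap sanity check (standard coreutils; the mount point is inferred from the paths in this thread):

df -h /data/putnamlab                        # free space on the shared filesystem
du -sh /data/putnamlab/shared/databases/nr*  # size of the existing nr files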

I ran the following on 20230103

interactive 

cd /data/putnamlab/shared/databases

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

exit

I would not recommend this due to the amount of time it takes, but it works for troubleshooting.
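If an interactive download like this is interrupted, wget's standard -c flag resumes the partial file instead of starting over, which also avoids leaving a truncated nr.gz behind:

wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz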

nano /data/putnamlab/hputnam/Ahya_Fun_Annot/scripts/make_diamond_db.sh

#!/bin/bash
#SBATCH --job-name="make_diamond_db" #CHANGE_NAME
#SBATCH -t 24:00:00
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/shared/databases
#SBATCH -p putnamlab

module load DIAMOND/2.0.0-GCC-8.3.0 # Load DIAMOND

# Build the DIAMOND-formatted database directly from the gzipped FASTA
diamond makedb --in /data/putnamlab/shared/databases/nr.gz -d nr
# Print database statistics to confirm nr.dmnd was written correctly
diamond dbinfo -d /data/putnamlab/shared/databases/nr.dmnd

sbatch /data/putnamlab/hputnam/Ahya_Fun_Annot/scripts/make_diamond_db.sh
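Progress and completion can be checked with the usual SLURM tools before inspecting the output file (standard squeue/sacct usage; 206358 is the job ID from the run below):

squeue -u $USER                                      # still running?
sacct -j 206358 --format=JobID,State,Elapsed,MaxRSS  # exit state and peak memory
tail /data/putnamlab/shared/databases/slurm-206358.out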

This successfully completed as seen in /data/putnamlab/shared/databases/slurm-206358.out

2023-01-03 21:30:57 (3.42 MB/s) - ‘uniprot_trembl.fasta.gz’ saved [57043853550]

Building a new DB, current time: 01/04/2023 02:15:45
New DB name:   /glfs/brick01/gv0/putnamlab/shared/databases/trembl_20230103
New DB title:  uniprot_trembl.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 229580745 sequences in 6419.68 seconds.

STOP Wed Jan 4 04:14:31 EST 2023

mv nr.dmnd 20230104_nr.dmnd

The updated DB for Diamond blast can be found at /data/putnamlab/shared/databases/20230104_nr.dmnd
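For downstream use, the renamed database is what gets passed to diamond blastx in the annotation step. A minimal usage sketch, with standard DIAMOND flags (the query and output filenames are hypothetical placeholders):

module load DIAMOND/2.0.0-GCC-8.3.0

# query_transcripts.fasta is a hypothetical input; substitute the real query file
diamond blastx \
  -d /data/putnamlab/shared/databases/20230104_nr.dmnd \
  -q query_transcripts.fasta \
  -o query_vs_nr.tsv \
  --outfmt 6 \
  -e 1e-5 \
  --max-target-seqs 1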

AHuffmyer commented 1 year ago

To fix the space issue, did you delete files prior to running the script? What changed between the previous runs and this run that freed up the space?

hputnam commented 1 year ago

The only thing I did was delete nr.dmnd. It is also possible that other people cleared space in their personal directories, which freed up space for everyone.