IARCbioinfo / PVAmpliconFinder


BLAST nt Configuration / Installation #4

Closed cwarden45 closed 4 years ago

cwarden45 commented 4 years ago

Hi Alexis,

This relates to issue #3, but I thought I should separate this out so that it would be easier for others to find the answer to this specific question.

I have created an ~/.ncbirc file with the following information:

[BLAST]
BLASTDB=/path/to/PVAmpliconFinder/databases

Just to be extra safe, I also ran export BLASTDB=/path/to/PVAmpliconFinder/databases before running PVAmpliconFinder.

Based upon changes in the error messages, I believe the folder (PVAmpliconFinder/databases) is being successfully specified, but the issue is with finding the BLAST index files.
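
As a quick check (a rough sketch; it assumes blastdbcmd from the same BLAST+ installation that PVAmpliconFinder uses, and it will only succeed once the index files are actually in place):

# If BLASTDB is set correctly and the index files are present, this prints a
# summary of the nt database instead of a "not found" error
blastdbcmd -db nt -info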

I am currently trying to download the nt files as follows:

./update_blastdb.pl --decompress nr
./update_blastdb.pl --decompress taxdb

I think the taxdb files were downloaded OK.

The combined output is as follows:

Connected to NCBI
Downloading nr (39 volumes) ...
Downloading nr.00.tar.gz... [OK]
Downloading nr.01.tar.gz...Failed to download nr.01.tar.gz.md5!
Decompressing nr.00.tar.gz ... [OK]
Connected to NCBI
Downloading taxdb.tar.gz... [OK]
Decompressing taxdb.tar.gz ... [OK]

I previously tried downloading the files without --decompress and then extracting the .tar.gz files. However, PVAmpliconFinder still didn't find the BLAST nt reference files.

I also see that there was a typo (nr instead of nt) in the database name argument, so I am going to try that again. If that doesn't work, I can go back to downloading the compressed files (where I don't think I had a typo, but I encountered some sort of issue). For example, I am seeing the expected number of volumes in the corrected download (27, from nt.00.tar.gz to nt.26.tar.gz).
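
For clarity, these are the corrected commands I am re-running (only the database name changes from nr to nt):

./update_blastdb.pl --decompress nt
./update_blastdb.pl --decompress taxdb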

Did you do something different when you set up your BLAST nt database?

Thank you very much.

Sincerely, Charles

cwarden45 commented 4 years ago

FYI, the --decompress download with the right name looks better so far:

Connected to NCBI
Downloading nt (27 volumes) ...
Downloading nt.00.tar.gz... [OK]
Downloading nt.01.tar.gz... [OK]
Downloading nt.02.tar.gz... [OK]
Downloading nt.03.tar.gz... [OK]
Downloading nt.04.tar.gz... [OK]
Downloading nt.05.tar.gz... [OK]
Downloading nt.06.tar.gz... [OK]
Downloading nt.07.tar.gz... [OK]
Downloading nt.08.tar.gz... [OK]
Downloading nt.09.tar.gz... [OK]
Downloading nt.10.tar.gz... [OK]
Downloading nt.11.tar.gz... [OK]
Downloading nt.12.tar.gz... [OK]
Downloading nt.13.tar.gz... [OK]
Downloading nt.14.tar.gz... [OK]
Downloading nt.15.tar.gz... [OK]
Downloading nt.16.tar.gz... [OK]
Downloading nt.17.tar.gz... [OK]
Downloading nt.18.tar.gz... [OK]
Downloading nt.19.tar.gz... [OK]
Downloading nt.20.tar.gz... [OK]
Downloading nt.21.tar.gz... [OK]
Downloading nt.22.tar.gz... [OK]
Downloading nt.23.tar.gz... [OK]
Downloading nt.24.tar.gz... [OK]
Downloading nt.25.tar.gz... [OK]
Downloading nt.26.tar.gz... [OK]
Decompressing nt.00.tar.gz ... [OK]
Decompressing nt.01.tar.gz ... [OK]
Decompressing nt.02.tar.gz ... [OK]
Decompressing nt.03.tar.gz ... [OK]
Decompressing nt.04.tar.gz ... [OK]
Decompressing nt.05.tar.gz ... [OK]
Decompressing nt.06.tar.gz ... [OK]
Decompressing nt.07.tar.gz ... [OK]
Decompressing nt.08.tar.gz ... [OK]
Decompressing nt.09.tar.gz ... [OK]
Decompressing nt.10.tar.gz ... [OK]
Decompressing nt.11.tar.gz ... [OK]
Decompressing nt.12.tar.gz ... [OK]
Decompressing nt.13.tar.gz ... [OK]
Decompressing nt.14.tar.gz ... [OK]
Decompressing nt.15.tar.gz ... [OK]
Decompressing nt.16.tar.gz ... [OK]
Decompressing nt.17.tar.gz ... [OK]
Decompressing nt.18.tar.gz ... [OK]
Decompressing nt.19.tar.gz ... [OK]
Decompressing nt.20.tar.gz ... [OK]
Decompressing nt.21.tar.gz ... [OK]
Decompressing nt.22.tar.gz ... [OK]
Decompressing nt.23.tar.gz ... [OK]
Decompressing nt.24.tar.gz ... [OK]
Decompressing nt.25.tar.gz ... [OK]
Decompressing nt.26.tar.gz ... [OK]

If this strategy fixes the problem, then I will close the ticket.

If not, then I will provide additional information for troubleshooting.

I apologize for not catching this sooner.

Thank you again!

cwarden45 commented 4 years ago

I am still encountering an issue, but I will post most of the details on the main thread. Essentially, I am getting this error message indicating that the nt database is not recognized:

Indexed BLAST database error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1595737360567/work/blast/c++/src/algo/blast/api/blast_dbindex.cpp", line 793: Error: (CDbIndex_Exception::bad index creation option) BLAST::ncbi::blast::CIndexedDb_New::CIndexedDb_New() - no database volume has an index

NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1595737360567/work/blast/c++/src/algo/blast/api/blast_dbindex.cpp", line 1006: Error: (CDbIndex_Exception::bad index creation option) BLAST::ncbi::blast::CIndexedDb_Old::CIndexedDb_Old() - no index file specified or index 'nt*' not found.

If I am correct that the BLAST configuration is causing the problem, then I will summarize the solution here.

However, in the meantime, I will close this ticket.

cwarden45 commented 4 years ago

As a troubleshooting update, I tested using a pre-existing version of BLAST using export PATH=/opt/ncbi-blast-2.4.0+/bin:$PATH.

However, that generates a different error message:

##########################################
##  Sequence identification : BLAST ##
##########################################
pool8-oral-pathogen_S8_L001
pool6-oral-pathogen_S6_L001
pool1-skin-pathogen_S1_L001
pool4-skin-pathogen_S4_L001
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
pool5-skin-pathogen_S5_L001
pool7-oral-pathogen_S7_L001
pool2-skin-pathogen_S2_L001
pool3-skin-pathogen_S3_L001
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
Done

Since the type of the database is at least being recognized (even if it is not a valid version 4 database for this older BLAST version), that means the database files are being found successfully.

However, I still don't have a solution to get PVAmpliconFinder.sh working quite yet.
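
In case it helps anyone else hitting the "Not a valid version 4 database" error: my understanding is that the databases currently distributed via update_blastdb.pl are in the newer version 5 format, which the older BLAST+ 2.4.0 cannot read. A quick sketch for checking which binaries are actually being picked up:

# Confirm which blastn is on the PATH and its version; reading version 5
# databases requires a newer BLAST+ release than 2.4.0
which blastn
blastn -version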

cwarden45 commented 4 years ago

I saw this response in another discussion group, so I tested downloading the latest version of BLAST+.

I then modified the PATH to use that version with export PATH=/opt/ncbi-blast-2.10.0+/bin:$PATH.

However, that gets me back to the earlier error message:

Indexed BLAST database error: NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_260005_130.14.18.128_9008__PrepareRelease_Linux64-Centos_1575413971/c++/compilers/unix/../../src/algo/blast/api/blast_dbindex.cpp", line 793: Error: BLAST::ncbi::blast::CIndexedDb_New::CIndexedDb_New() - no database volume has an index

NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_260005_130.14.18.128_9008__PrepareRelease_Linux64-Centos_1575413971/c++/compilers/unix/../../src/algo/blast/api/blast_dbindex.cpp", line 1006: Error: BLAST::ncbi::blast::CIndexedDb_Old::CIndexedDb_Old() - no index file specified or index 'nt*' not found.
cwarden45 commented 4 years ago

As a partial response, I have downloaded the FASTA files and I am testing re-indexing those from scratch.

However, if I try to index the database within the Docker image that I created for PVAmpliconFinder, then I get the following error message:

No volumes were created.

Error: mdb_env_open: Invalid argument

Based upon this discussion, I think there may be some space limitation.
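
To test the space idea, this is a quick check I can run inside the container (a sketch; the path is the placeholder I have been using for my databases folder):

# Free space on the filesystem that holds the BLAST database files
df -h /path/to/PVAmpliconFinder/databases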

So, I am going to try indexing the file outside of the Docker image and then use the same version of BLAST+ for PVAmpliconFinder. If that works, I will post to confirm.

cwarden45 commented 4 years ago

To help others with troubleshooting, I found a solution to the most direct error message above (if I index the reference using another computer).

While the process was actually split between 2 computers, these are the commands that were used:

wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
mv nt nt.fa

export PATH=/opt/ncbi-blast-2.10.0+/bin:$PATH
makeblastdb -in nt.fa -dbtype nucl -out nt
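
A couple of sanity checks on the rebuilt database (a sketch, run from within the databases folder):

# The alias file should tie all of the rebuilt volumes together
ls nt.nal
# This should report sequence and letter counts if the database is usable
blastdbcmd -db nt -info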

However, I am still getting an error message from PVAmpliconFinder:

Indexed BLAST database error: NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_330141_130.14.18.128_9008__PrepareRelease_Linux64-Centos_1589299866/c++/compilers/unix/../../src/algo/blast/api/blast_dbindex.cpp", line 793: Error: (CDbIndex_Exception::bad index creation option) BLAST::ncbi::blast::CIndexedDb_New::CIndexedDb_New() - no database volume has an index

NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_330141_130.14.18.128_9008__PrepareRelease_Linux64-Centos_1589299866/c++/compilers/unix/../../src/algo/blast/api/blast_dbindex.cpp", line 1006: Error: (CDbIndex_Exception::bad index creation option) BLAST::ncbi::blast::CIndexedDb_Old::CIndexedDb_Old() - no index file specified or index 'nt*' not found.

As far as I can tell, the database files should all be there (even though there are 78 instead of 27 volumes):

nt.00.nhr
nt.00.nin
nt.00.nsq
nt.01.nhr
nt.01.nin
nt.01.nsq
nt.02.nhr
nt.02.nin
nt.02.nsq
nt.03.nhr
nt.03.nin
nt.03.nsq
nt.04.nhr
nt.04.nin
nt.04.nsq
nt.05.nhr
nt.05.nin
nt.05.nsq
nt.06.nhr
nt.06.nin
nt.06.nsq
nt.07.nhr
nt.07.nin
nt.07.nsq
nt.08.nhr
nt.08.nin
nt.08.nsq
nt.09.nhr
nt.09.nin
nt.09.nsq
nt.10.nhr
nt.10.nin
nt.10.nsq
nt.11.nhr
nt.11.nin
nt.11.nsq
nt.12.nhr
nt.12.nin
nt.12.nsq
nt.13.nhr
nt.13.nin
nt.13.nsq
nt.14.nhr
nt.14.nin
nt.14.nsq
nt.15.nhr
nt.15.nin
nt.15.nsq
nt.16.nhr
nt.16.nin
nt.16.nsq
nt.17.nhr
nt.17.nin
nt.17.nsq
nt.18.nhr
nt.18.nin
nt.18.nsq
nt.19.nhr
nt.19.nin
nt.19.nsq
nt.20.nhr
nt.20.nin
nt.20.nsq
nt.21.nhr
nt.21.nin
nt.21.nsq
nt.22.nhr
nt.22.nin
nt.22.nsq
nt.23.nhr
nt.23.nin
nt.23.nsq
nt.24.nhr
nt.24.nin
nt.24.nsq
nt.25.nhr
nt.25.nin
nt.25.nsq
nt.26.nhr
nt.26.nin
nt.26.nsq
nt.27.nhr
nt.27.nin
nt.27.nsq
nt.28.nhr
nt.28.nin
nt.28.nsq
nt.29.nhr
nt.29.nin
nt.29.nsq
nt.30.nhr
nt.30.nin
nt.30.nsq
nt.31.nhr
nt.31.nin
nt.31.nsq
nt.32.nhr
nt.32.nin
nt.32.nsq
nt.33.nhr
nt.33.nin
nt.33.nsq
nt.34.nhr
nt.34.nin
nt.34.nsq
nt.35.nhr
nt.35.nin
nt.35.nsq
nt.36.nhr
nt.36.nin
nt.36.nsq
nt.37.nhr
nt.37.nin
nt.37.nsq
nt.38.nhr
nt.38.nin
nt.38.nsq
nt.39.nhr
nt.39.nin
nt.39.nsq
nt.40.nhr
nt.40.nin
nt.40.nsq
nt.41.nhr
nt.41.nin
nt.41.nsq
nt.42.nhr
nt.42.nin
nt.42.nsq
nt.43.nhr
nt.43.nin
nt.43.nsq
nt.44.nhr
nt.44.nin
nt.44.nsq
nt.45.nhr
nt.45.nin
nt.45.nsq
nt.46.nhr
nt.46.nin
nt.46.nsq
nt.47.nhr
nt.47.nin
nt.47.nsq
nt.48.nhr
nt.48.nin
nt.48.nsq
nt.49.nhr
nt.49.nin
nt.49.nsq
nt.50.nhr
nt.50.nin
nt.50.nsq
nt.51.nhr
nt.51.nin
nt.51.nsq
nt.52.nhr
nt.52.nin
nt.52.nsq
nt.53.nhr
nt.53.nin
nt.53.nsq
nt.54.nhr
nt.54.nin
nt.54.nsq
nt.55.nhr
nt.55.nin
nt.55.nsq
nt.56.nhr
nt.56.nin
nt.56.nsq
nt.57.nhr
nt.57.nin
nt.57.nsq
nt.58.nhr
nt.58.nin
nt.58.nsq
nt.59.nhr
nt.59.nin
nt.59.nsq
nt.60.nhr
nt.60.nin
nt.60.nsq
nt.61.nhr
nt.61.nin
nt.61.nsq
nt.62.nhr
nt.62.nin
nt.62.nsq
nt.63.nhr
nt.63.nin
nt.63.nsq
nt.64.nhr
nt.64.nin
nt.64.nsq
nt.65.nhr
nt.65.nin
nt.65.nsq
nt.66.nhr
nt.66.nin
nt.66.nsq
nt.67.nhr
nt.67.nin
nt.67.nsq
nt.68.nhr
nt.68.nin
nt.68.nsq
nt.69.nhr
nt.69.nin
nt.69.nsq
nt.70.nhr
nt.70.nin
nt.70.nsq
nt.71.nhr
nt.71.nin
nt.71.nsq
nt.72.nhr
nt.72.nin
nt.72.nsq
nt.73.nhr
nt.73.nin
nt.73.nsq
nt.74.nhr
nt.74.nin
nt.74.nsq
nt.75.nhr
nt.75.nin
nt.75.nsq
nt.76.nhr
nt.76.nin
nt.76.nsq
nt.77.nhr
nt.77.nin
nt.77.nsq
nt.fa
nt.nal
nt.ndb
nt.not
nt.ntf
nt.nto

I noticed that export BLASTDB=/path/to/databases doesn't work if I try to run this step on another computer. So, I am copying the ~/.ncbirc file over to the server where I am running the analysis (with a modified path).

Also, I noticed that PVAmpliconFinder can pick up in the middle of analysis, but you need to go and delete the folder for whatever steps did not complete correctly (such as the blast_result folder, if you are having problems at the step for BLAST analysis).
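
For example, to force the BLAST step to be re-run on the next attempt (a sketch; the output folder name is from my test run):

# Delete the incomplete output so PVAmpliconFinder redoes the BLAST step
rm -r /path/to/PVAmpliconFinder/test_out/blast_result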

Since the BLAST indexing message did not clearly identify the cause, I am testing running the BLAST step with 256M of RAM and 1 core (versus 16MB of RAM with 1-2 cores). However, I am still getting the same error message.

cwarden45 commented 4 years ago

I looked into the PVAmpliconFinder code and I tested removing -use_index true from the blastn command.

The blastdb files are still present and the path is still being successfully set with the ~/.ncbirc file, but that results in the following error message:

BLAST Database error: No alias or index file found for nucleotide database [nt] in search path [/path/to/PVAmpliconFinder/test_out/vsearch::/path/to/PVAmpliconFinder/PVAmpliconFinder/databases:]

I think that relates to this discussion, but the only idea it gave me to try was removing the ".fa" extension from the original sequence file (which I didn't think was used anyway, since I avoided having .fa in all the other names by using the -out parameter for makeblastdb).

As I would have guessed, I still get the same error if I do that.
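
Given that error message, the other thing I am checking is whether the alias file is actually visible from the listed search path (a sketch; the path is copied from the error message above):

# If this fails, blastn has no way to resolve the "nt" database name
ls /path/to/PVAmpliconFinder/PVAmpliconFinder/databases/nt.nal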

cwarden45 commented 4 years ago

There was a likely contributing factor to the last troubleshooting attempt.

I had previously added an extra line to the ~/.ncbirc file, but I had forgotten to update the path between different computers.

So, this is the configuration file (which didn't fix the problem in itself):

[BLAST]
BLASTDB=/path/to/PVAmpliconFinder/PVAmpliconFinder/databases
DATA_LOADERS=blastdb

I am currently not getting an error message, but blastn is taking >20 minutes for 1 sample.

I am also not sure how removing -use_index true affects other steps, but I will provide an update (with either another error message or the successful result).

cwarden45 commented 4 years ago

I think part of the issue may be that I need to create an additional index using makembindex (as discussed here).

For example, I am testing the effect of adding this command:

makembindex -input nt -iformat blastdb

My VM froze if I tried to let blastn run without -use_index true.
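
To test the new index outside of the pipeline, I can run blastn directly with the indexed-search option (a sketch; test.fa is just a stand-in for one of the demo FASTA files):

# megablast search that uses the makembindex output; if the index was built
# correctly, this should no longer report "no database volume has an index"
blastn -task megablast -db nt -use_index true -query test.fa -out test_blast.txt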

cwarden45 commented 4 years ago

I am still working on troubleshooting, but I thought it might be good to mention a few updates:

1) I was able to successfully create a megablast indexed reference (with makembindex), without any error messages. So, that is good.

2) I was previously able to download the regular BLAST index with update_blastdb.pl. However, I am currently having difficulties with that (an error occurs during the download of the 2nd volume). I am also not certain whether I even need that (i.e., the smaller reference set that comes with the taxonomy download).

3) If I run PVAmpliconFinder on the megablast indexed reference, then I get a different error message.

Also, this takes a long time to get to the point of generating that error message (~2 days for 1 sample).

I currently can't find that error message in the log file (after shutting down the VM and Docker image), but I can tell roughly how long it took from the timestamp of the empty blast result for the 1st demo file.

So, I am going to see if using a cluster with more computational resources helps (as well as keep re-trying the update_blastdb.pl style of reference downloading, followed by the additional index step, once that is successful). Even if I still have a problem, I will try to provide more specifics about the error message.

cwarden45 commented 4 years ago

Some more updates:

2) I can re-download the smaller set of files with the taxonomy database. I am not sure if it was the main cause, but I think moving the files between folders may have affected some of the permissions.

3) If I use a different computer for the BLAST step, I can now get non-empty BLAST files. However, I think I need to use that alternative reference set with the taxonomy information:

Warning: [blastn] Taxonomy name lookup from taxid requires installation of taxdb database with ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

I think the decompressed version of the taxdb file is the same as what was downloaded with update_blastdb.pl in the other folder (with fewer nt volumes). However, I am going to test whether that other set of re-downloaded files works. If so, I will provide some output and close the main ticket.
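
For completeness, the taxdb files only need to be extracted into a folder that is on BLASTDB (a sketch; the path is my usual placeholder, and the URL is the one from the warning above):

cd /path/to/PVAmpliconFinder/databases
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
# Provides taxdb.btd and taxdb.bti next to the nt volumes
tar -xzf taxdb.tar.gz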

cwarden45 commented 4 years ago

I think I have some additional questions for the "Advanced Analysis," but I think the above solution worked for the BLAST part (using the update_blastdb.pl files, adding an extra megablast index, and running the BLAST step on a computer with more computational resources, but where I don't have root privileges for installation).

I hope this can be helpful for others.