Taxonomy.qza file not created by import_database.sh

Ksherriff commented 3 years ago

Hi again,

I am trying to make our own database that we could use with this script. I downloaded the Epi2me fasta file through the link you provided in the description so that I could format my fasta file in an identical fashion. I ran both fasta files through the import script with seemingly no errors until the end. The sequence.qza file is created but the script cannot find the necessary files to create the taxonomy.qza files. Is there something additional that needs to be downloaded/completed before running the script? Thanks!

MaestSi commented 3 years ago

Hi, the taxonomy.qza is not created for your custom database, but is it created for the Epi2me fasta file? Simone

Ksherriff commented 3 years ago

The taxonomy file is not created for either of the fasta files that I used. Here is the output from running the script.

./Import_database.sh sequencetest.fasta
WARNING: A conda environment already exists at '/home/kylacochrane/miniconda3/envs/entrez_qiime_env'
Remove existing environment (y/[n])? n

CondaSystemExit: Exiting.

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already satisfied: numpy in /home/kylacochrane/miniconda3/envs/entrez_qiime_env/lib/python2.7/site-packages (1.16.6)
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already satisfied: cogent in /home/kylacochrane/miniconda3/envs/entrez_qiime_env/lib/python2.7/site-packages (1.9)
Error: cannot find file nodes.dmp at ./taxonomy/taxdump/nodes.dmp
Imported sequencetest.fasta as DNAFASTAFormat to /home/kylacochrane/MetONTIIME/sequencetest_sequence.qza
Usage: qiime tools import [OPTIONS]

Import data to create a new QIIME 2 Artifact. See https://docs.qiime2.org/ for usage examples and details on the file types and associated semantic types that can be imported.

Options:
  --type TEXT  The semantic type of the artifact that will be created upon importing. Use --show-importable-types to see what importable semantic types are available in the current deployment. [required]
  --input-path PATH  Path to file or directory that should be imported. [required]
  --output-path ARTIFACT  Path where output artifact should be written. [required]
  --input-format TEXT  The format of the data to be imported. If not provided, data must be in the format expected by the semantic type provided via --type.
  --show-importable-types  Show the semantic types that can be supplied to --type to import data into an artifact.
  --show-importable-formats  Show formats that can be supplied to --input-format to import data into an artifact.
  --help  Show this message and exit.

                There was a problem with the command:

(1/1) Invalid value for '--input-path': Path '/home/kylacochrane/MetONTIIME/sequencetest_accession_taxonomy.txt' does not exist.

MaestSi commented 3 years ago

It looks like the taxonomy folder doesn’t contain all the required information. Please try removing it and repeating the import, first with the Epi2me fasta file. Simone

Ksherriff commented 3 years ago

Deleted the Taxonomy folder and ran the script again. No taxonomy.qza, but here is the full output of running it after deleting that folder.

./Import_database.sh sequencetest.fasta
WARNING: A conda environment already exists at '/home/kylacochrane/miniconda3/envs/entrez_qiime_env'
Remove existing environment (y/[n])? n

CondaSystemExit: Exiting.

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already satisfied: numpy in /home/kylacochrane/miniconda3/envs/entrez_qiime_env/lib/python2.7/site-packages (1.16.6)
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already satisfied: cogent in /home/kylacochrane/miniconda3/envs/entrez_qiime_env/lib/python2.7/site-packages (1.9)
--2021-03-01 16:05:35-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
=> ‘nucl_gb.accession2taxid.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.229, 165.112.9.228, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.229|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy/accession2taxid ... done.
==> SIZE nucl_gb.accession2taxid.gz ... 2029193044
==> PASV ... done. ==> RETR nucl_gb.accession2taxid.gz ... done.
Length: 2029193044 (1.9G) (unauthoritative)

nucl_gb.accession2taxid.gz 100%[=====================================================================================================>] 1.89G 1.99MB/s in 9m 5s

2021-03-01 16:14:41 (3.55 MB/s) - ‘nucl_gb.accession2taxid.gz’ saved [2030615476]

gzip: nucl_gb.accession2taxid.gz: invalid compressed data--format violated
--2021-03-01 16:15:18-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
=> ‘nucl_wgs.accession2taxid.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.12, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy/accession2taxid ... done.
==> SIZE nucl_wgs.accession2taxid.gz ... 3778046892
==> PASV ... done. ==> RETR nucl_wgs.accession2taxid.gz ... done.
Length: 3778046892 (3.5G) (unauthoritative)

nucl_wgs.accession2taxid.gz 100%[=====================================================================================================>] 3.54G 2.51MB/s in 22m 39s

2021-03-01 16:37:57 (2.67 MB/s) - ‘nucl_wgs.accession2taxid.gz’ saved [3800754460]

gzip: nucl_wgs.accession2taxid.gz: invalid compressed data--format violated
cp: cannot stat 'nucl_gb.accession2taxid': No such file or directory
tail: cannot open 'nucl_wgs.accession2taxid' for reading: No such file or directory
rm: cannot remove 'nucl_gb.accession2taxid': No such file or directory
rm: cannot remove 'nucl_wgs.accession2taxid': No such file or directory
--2021-03-01 16:38:34-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
=> ‘taxdump.tar.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.228, 130.14.250.13, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.228|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy ... done.
==> SIZE taxdump.tar.gz ... 54580443
==> PASV ... done. ==> RETR taxdump.tar.gz ... done.
Length: 54580443 (52M) (unauthoritative)

taxdump.tar.gz 100%[=====================================================================================================>] 52.05M 1.61MB/s in 26s

2021-03-01 16:39:01 (1.98 MB/s) - ‘taxdump.tar.gz’ saved [54580443]

citations.dmp
delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
gc.prt
readme.txt
Traceback (most recent call last):
  File "./entrez_qiime/entrez_qiime.py", line 503, in <module>
    main()
  File "./entrez_qiime/entrez_qiime.py", line 148, in main
    args.infile_acc2taxid_path, ncbi_full_taxonomy, merged_taxids, deleted_taxids)
  File "./entrez_qiime/entrez_qiime.py", line 383, in obtain_nodes_for_each_accession
    discard_header_line = next(acc2taxid)
StopIteration
Imported sequencetest.fasta as DNAFASTAFormat to /home/kylacochrane/MetONTIIME/sequencetest_sequence.qza
Usage: qiime tools import [OPTIONS]

Import data to create a new QIIME 2 Artifact. See https://docs.qiime2.org/ for usage examples and details on the file types and associated semantic types that can be imported.

Options:
  --type TEXT  The semantic type of the artifact that will be created upon importing. Use --show-importable-types to see what importable semantic types are available in the current deployment. [required]
  --input-path PATH  Path to file or directory that should be imported. [required]
  --output-path ARTIFACT  Path where output artifact should be written. [required]
  --input-format TEXT  The format of the data to be imported. If not provided, data must be in the format expected by the semantic type provided via --type.
  --show-importable-types  Show the semantic types that can be supplied to --type to import data into an artifact.
  --show-importable-formats  Show formats that can be supplied to --input-format to import data into an artifact.
  --help  Show this message and exit.

                There was a problem with the command:

(1/1) Invalid value for '--input-path': Path '/home/kylacochrane/MetONTIIME/sequencetest_accession_taxonomy.txt' does not exist.

MaestSi commented 3 years ago

Hi, it looks like you have an issue with the gunzip command. In particular, the Import_database.sh script creates a taxonomy folder and downloads the files nucl_gb.accession2taxid.gz and nucl_wgs.accession2taxid.gz. The files are then decompressed with gunzip and merged into a single nucl_merged.accession2taxid file:

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
gunzip nucl_wgs.accession2taxid.gz
cp nucl_gb.accession2taxid nucl_merged.accession2taxid
tail -n+2 nucl_wgs.accession2taxid >> nucl_merged.accession2taxid

You may try going to the taxonomy folder, running only the first two lines of this code, and checking whether the .gz file fails to decompress. I verified that it works with both gunzip (gzip) 1.6 and gunzip (gzip) 1.10. What gunzip version are you running (gunzip --version)? Simone
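The decompression check suggested above can also be done without unpacking anything. The following is a minimal sketch, assuming a Bash shell; `check_gz` is a hypothetical helper name and is not part of Import_database.sh. It relies on `gzip -t`, which reads the whole archive and tests its integrity without writing the decompressed file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Import_database.sh): test a .gz archive
# before decompressing it, so a corrupted download is reported early.
check_gz() {
    local archive="$1"
    if [ ! -s "$archive" ]; then
        echo "MISSING_OR_EMPTY: $archive"
        return 2
    fi
    # gzip -t verifies the compressed stream and its CRC; it writes no output file.
    if gzip -t "$archive" 2>/dev/null; then
        echo "OK: $archive"
    else
        echo "CORRUPT: $archive"
        return 1
    fi
}

# Example (run inside the taxonomy folder):
#   check_gz nucl_gb.accession2taxid.gz && gunzip nucl_gb.accession2taxid.gz
```

If the helper reports CORRUPT, the problem is most likely in the downloaded file rather than in the rest of the script.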

Ksherriff commented 3 years ago

Current version of gunzip is 1.10. I ran the first two lines of code and got an error. Here are the outputs.

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
--2021-03-02 14:35:52-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
=> ‘nucl_gb.accession2taxid.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.229, 130.14.250.13, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.229|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy/accession2taxid ... done.
==> SIZE nucl_gb.accession2taxid.gz ... 2029193044
==> PASV ... done. ==> RETR nucl_gb.accession2taxid.gz ... done.
Length: 2029193044 (1.9G) (unauthoritative)

nucl_gb.accession2t 100%[===================>] 1.90G 2.91MB/s in 11m 18s

2021-03-02 14:47:10 (2.87 MB/s) - ‘nucl_gb.accession2taxid.gz’ saved [2039120116]

gunzip nucl_gb.accession2taxid.gz

gzip: nucl_gb.accession2taxid.gz: invalid compressed data--format violated

MaestSi commented 3 years ago

This is very strange. What is the exact size of the gzip files you downloaded? I can't tell whether the issue is with the files or with gunzip. What system are you running on? Simone
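The exact size matters because the wget logs earlier in this thread already show a discrepancy: the server reported SIZE 2029193044 for nucl_gb.accession2taxid.gz, while the saved files were 2030615476 and 2039120116 bytes. A minimal sketch of that comparison follows, assuming Bash; `size_matches` is a hypothetical helper name, and the expected value has to be read from the current wget log because the files on the server change over time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the on-disk size of a download with the
# size the server announced (the "==> SIZE ..." line in the wget log).
size_matches() {
    local file="$1" expected="$2" actual
    # Arithmetic expansion strips any padding that wc may print.
    actual=$(( $(wc -c < "$file") ))
    if [ "$actual" -eq "$expected" ]; then
        echo "size OK: $actual bytes"
    else
        echo "size MISMATCH: expected $expected bytes, got $actual bytes"
        return 1
    fi
}

# Example with the value from the log in this thread:
#   size_matches nucl_gb.accession2taxid.gz 2029193044
```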

Ksherriff commented 3 years ago

The file was 2.0 GB, and I am running Ubuntu 20. Since gunzip didn't work, I tried extracting the file from the file manager instead, by right-clicking it and selecting "Extract Here", and that seemed to work? It extracted into a file that is now 3.4 GB. I am going to try it on the second file and see where it takes me.

Ksherriff commented 3 years ago

Doing that on the second file turned it from 3.4 GB (compressed) into 1.1 GB, which is very odd.

MaestSi commented 3 years ago

Yes, that is strange. The most important file is the one without wgs in the file name. In case you have issues with the other one, you can just go on with this one.

Ksherriff commented 3 years ago

The file with WGS in the name is the one that decompressed into a much smaller file, so I assume something went wrong there. Using the two files and the rest of the code you provided, I combined the files and ran the script. Same result at the end: it did not create the taxonomy file. I can try switching gunzip to version 1.6 and see if that is the issue.

Ksherriff commented 3 years ago

I tested gunzip on some other compressed files and it works correctly, so I don't think that is the issue. It looks like the issue might be that the files are not downloading properly or are corrupted. Is there an alternative location to download them from?

MaestSi commented 3 years ago

I don't know, actually. You may consider downloading them with a Windows system and then copying them to the server where you want to perform the analyses. I tested it on 2 servers and it worked, so I think it may be an issue related to proxy settings (?).
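One option that may help when a proxy interferes with FTP is sketched below. It rests on two assumptions that are not confirmed anywhere in this thread: that the same path is also served over HTTPS, and that a `.md5` checksum file sits next to each archive. `BASE_URL`, `verify_md5` and `fetch_verified` are hypothetical names. The sketch uses `wget -c` to resume a partial transfer and `md5sum` to confirm the result.

```shell
#!/usr/bin/env bash
# Assumption: the archives are also reachable over HTTPS at this path and a
# "<name>.md5" file is published next to each one. Neither is confirmed here.
BASE_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid"

# Compare a file's MD5 digest with the first field of a checksum file.
verify_md5() {
    local file="$1" md5_file="$2" expected actual
    expected=$(awk '{print $1; exit}' "$md5_file")
    actual=$(md5sum "$file" | awk '{print $1}')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Download one archive (resuming partial transfers with -c) and verify it.
fetch_verified() {
    local name="$1"
    wget -c "$BASE_URL/$name" || return 1
    wget -O "$name.md5" "$BASE_URL/$name.md5" || return 1
    verify_md5 "$name" "$name.md5"
}

# Example:
#   fetch_verified nucl_gb.accession2taxid.gz && gunzip nucl_gb.accession2taxid.gz
```

A failed checksum means the file should be downloaded again, ideally from a different network.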

Ksherriff commented 3 years ago

What are the file sizes I should be seeing for the unzipped file of each of these?

MaestSi commented 3 years ago

These are my file sizes:
nucl_gb.accession2taxid: 10404278721 Bytes (9.7 GB)
nucl_wgs.accession2taxid: 23378439865 Bytes (22 GB)
Simone
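These numbers can serve as a rough lower bound when checking a decompressed file, although the exact sizes will differ as the files are updated. The sketch below assumes Bash; `check_acc2taxid` is a hypothetical helper, and the header pattern (a first line starting with "accession" and containing "taxid") is an assumption about the file layout. An empty file is worth catching explicitly, since the StopIteration traceback earlier in the thread occurs where entrez_qiime.py tries to read the header line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a decompressed accession2taxid file
# before it is merged. min_bytes is a rough threshold chosen by the user.
check_acc2taxid() {
    local file="$1" min_bytes="$2" actual header
    if [ ! -s "$file" ]; then
        echo "EMPTY_OR_MISSING: $file"
        return 2
    fi
    # Assumed layout: the first line is a header naming the columns.
    header=$(head -n 1 "$file")
    case "$header" in
        accession*taxid*) ;;
        *) echo "BAD_HEADER: $file"; return 3 ;;
    esac
    actual=$(( $(wc -c < "$file") ))
    if [ "$actual" -lt "$min_bytes" ]; then
        echo "TOO_SMALL: $file ($actual bytes, expected at least $min_bytes)"
        return 1
    fi
    echo "OK: $file ($actual bytes)"
}

# Example, using the size reported above as a rough lower bound:
#   check_acc2taxid nucl_gb.accession2taxid 10404278721
```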

Ksherriff commented 3 years ago

I took my computer home and attempted the download again. This time gunzip worked and the file size was correct, so it looks like the issue was with my work's internet connection. I am going to run the rest of the code you posted earlier and then give the entire script a shot. I will let you know how it goes.

Ksherriff commented 3 years ago

Everything worked. Issue was just poor internet all along. Thanks again for all the help. Closing issue.

MaestSi / MetONTIIME

Taxonomy.qza file not created by import_database.sh #28