bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.07k stars 182 forks source link

Issue with loading taxonomy names in Diamond #719

Open joseailtoncruz opened 1 year ago

joseailtoncruz commented 1 year ago

I'm facing an issue while using Diamond (version X.X.X) to create a database using the "refseq_AA_13_4_23.fasta" file and the taxonomy mapping file "prot.accession2taxid.FULL". The database creation process seems to be progressing well until the step of loading taxonomy names.

However, during this step, I receive the error message "Loaded taxonomy names for 0 taxon ids". I have carefully checked the taxonomy files and confirmed that they are correct in terms of content and format. I'm also certain that the files are located in the correct directories.

I have followed the installation and configuration instructions provided in the official Diamond documentation and also ensured that I'm using the correct version of the taxonomy files compatible with the version of Diamond I'm using.

I would like to seek help from the Diamond community in resolving this issue. I have tried various suggested solutions, such as updating Diamond, checking the structure of the taxonomy files, and specifying the full path to the files, but the error persists.

I would greatly appreciate if someone could provide me with additional guidance or alternative suggestions to solve this problem. I'm open to any information or ideas that may help me resolve this issue.

Here are some additional details about the environment in which I'm running Diamond:

Certainly! Here's the translated version of the text for your GitHub post:

Title: Issue with loading taxonomy names in Diamond

Description:

I'm facing an issue while using Diamond (v2.1.6.160) to create a database using the "refseq_AA_13_4_23.fasta" file and the taxonomy mapping file "prot.accession2taxid.FULL". The database creation process seems to be progressing well until the step of loading taxonomy names.

However, during this step, I receive the error message "Loaded taxonomy names for 0 taxon ids". I have carefully checked the taxonomy files and confirmed that they are correct in terms of content and format. I'm also certain that the files are located in the correct directories.

I have followed the installation and configuration instructions provided in the official Diamond documentation and also ensured that I'm using the correct version of the taxonomy files compatible with the version of Diamond I'm using.

I would like to seek help from the Diamond community in resolving this issue. I have tried various suggested solutions, such as updating Diamond, checking the structure of the taxonomy files, and specifying the full path to the files, but the error persists.

I would greatly appreciate if someone could provide me with additional guidance or alternative suggestions to solve this problem. I'm open to any information or ideas that may help me resolve this issue.

Here are some additional details about the environment in which I'm running Diamond:

Operating system: WSL2 , Ubuntu Number of CPUs: 12 n Available memory: 32 Thank you in advance for any help or insights you can provide!

"/diamond$ ./diamond makedb --in refseq_AA_13_4_23.fasta --db testes --taxonmap prot.accession2taxid.FULL --taxonnames taxdmp/names.dmp --taxonnodes taxdmp/nodes.dmp diamond v2.1.6.160 (C) Max Planck Society for the Advancement of Science Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 12

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: refseq_AA_13_4_23.fasta Opening the database file... [0.044s] Loading sequences... [5.863s] Masking sequences... [11.804s] Writing sequences... [1.194s] Writing accessions... [2.074s] Hashing sequences... [0.372s] Loading sequences... [0s] Writing trailer... [0.046s] Loading taxonomy names... [0.652s] Loaded taxonomy names for 0 taxon ids. Loading taxonomy mapping file"

bbuchfink commented 1 year ago

Please try again with the latest version.

joseailtoncruz commented 1 year ago

Dear! Download the new version of Diamond, but get the following result (follow below). But even so the taxonomy of the isolates is not appearing (I followed the command line below)

diamond makedb: ($ ./diamond makedb --in refseq_AA_13_4_23.fasta --db testes --taxonmap prot.accession2taxid.FULL --taxonnames taxdmp/names.dmp --taxonnodes taxdmp/nodes.dmp diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 12

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: refseq_AA_13_4_23.fasta Opening the database file... [0.148s] Loading sequences... [7.994s] Masking sequences... [10.767s] Writing sequences... [0.653s] Writing accessions... [1.317s] Hashing sequences... [0.244s] Loading sequences... [0s] Writing trailer... [0.038s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable): UniRef prefix 0 gi|xxx| prefix 0 xxx| prefix 0 |xxx suffix 0 .xxx suffix 3004191 :PDB= suffix 0

Loading taxonomy names... [2.598s] Loaded taxonomy names for 2514661 taxon ids. Loading taxonomy mapping file... [3126.06s] Joining accession mapping... [2109.06s] Writing taxon id list... [0.253s]

Accession parsing rules triggered for mapping file seqids (use --no-parse-seqids to disable): UniRef prefix 0 gi|xxx| prefix 0 xxx| prefix 0 |xxx suffix 0 .xxx suffix 5445834623 :PDB= suffix 13

Building taxonomy nodes... [0.011s] 3058564 taxonomy nodes processed. Number of nodes assigned to rank: no rank 781365 superkingdom 4 kingdom 13 subkingdom 1 superphylum 1 phylum 308 subphylum 31 superclass 6 class 505 subclass 168 infraclass 19 cohort 5 subcohort 3 superorder 57 order 1836 suborder 372 infraorder 135 parvorder 26 superfamily 896 family 10147 subfamily 3240 tribe 2349 subtribe 587 genus 107046 subgenus 1785 section 532 subsection 41 series 9 species group 358 species subgroup 134 species 2057516 subspecies 28458 varietas 9651 forma 671 strain 45825 biotype 17 clade 952 forma specialis 750 genotype 21 isolate 1322 morph 12 pathogroup 5 serogroup 148 serotype 1237 subvariety 0

Closing the input file... [0s] Closing the database file... [0.025s]

Database sequences 3004191 Database letters 825752078 Accessions in database 3004191 Entries in accession to taxid file 5446562075 Database accessions mapped to taxid 2989599 Database sequences mapped to taxid 2989598 Database hash 0b7dc7cd42cf68f73c62592d9d99007b Total time 5262s)

diamond blastx ($ ./diamond blastx -d testes.dmnd -q final.contigs.fa -o resul tado_teste.XML --outfmt 5 -e 0.0000000001 diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 12

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Temporary directory:

Target sequences to report alignments for: 25

Opening the database... [1.373s] Database: testes.dmnd (type: Diamond database, sequences: 3004191, letters: 825752078) Block size = 2000000000 Opening the input file... [0.099s] Opening the output file... [0s] Loading query sequences... [0.001s] Masking queries... [0.018s] Building query seed set... [0.005s] Algorithm: Query-indexed Building query histograms... [0.002s] Seeking in database... [0s] Loading reference sequences... [18.04s] Initializing temporary storage... [0.001s] Building reference histograms... [1.826s] Allocating buffers... [0s] Processing query block 1, reference block 1/1, shape 1/2. Building reference seed array... [1.34s] Building query seed array... [0.002s] Computing hash join... [0.19s] Searching alignments... [0.246s] Deallocating memory... [0s] Processing query block 1, reference block 1/1, shape 2/2. Building reference seed array... [1.214s] Building query seed array... [0.002s] Computing hash join... [0.165s] Searching alignments... [0.002s] Deallocating memory... [0s] Deallocating buffers... [0.002s] Clearing query masking... [0s] Computing alignments... Loading trace points... [0.007s] Sorting trace points... [0s] Computing alignments... [1.002s] Deallocating buffers... [0s] Loading trace points... [0s] [1.011s] Deallocating reference... [0.003s] Loading reference sequences... [0.014s] Deallocating buffers... [0s] Deallocating queries... [0s] Loading query sequences... [0s] Closing the input file... [0s] Closing the output file... [0s] Closing the database... [0s] Cleaning up... [0s] Total time = 25.597s Reported 52 pairwise alignments, 52 HSPs. 4 queries aligned.)

bbuchfink commented 1 year ago

The XML output format does not support taxonomic information, you have to use the tabular format.

katievigil commented 1 year ago

Hi,

I don't quite understand how to build the database so my .tsv output has all the taxonomic names of the organisms I aligned with? Here is my output. I was just following what others put in their codes.

(/lustre/project/taw/share/conda-envs/diamond-env) [kvigil@cypress2 medaka]$ diamond --outfmt --taxonmap --taxonnames --taxonnodes blastx -d viralprotein -q consensus.fasta -o ONR030223oysterspooled_blastx.tsv --ultra-sensitive Error: Invalid command: outfmt. To print help message: diamond help (/lustre/project/taw/share/conda-envs/diamond-env) [kvigil@cypress2 medaka]$ diamond --outfmt 6 --taxonmap --taxonna mes --taxonnodes blastx -d viralprotein -q consensus.fasta -o ONR030223oysterspooled_blastx.tsv --ultra-sensitive Error: Invalid command: outfmt. To print help message: diamond help (/lustre/project/taw/share/conda-envs/diamond-env) [kvigil@cypress2 medaka]$ diamond makedb --in /lustre/project/taw/kvigil/Reference/refseq_viral_proteins/viral.1.protein.faa.gz -d viralprotein --taxonmap prot.accession2taxid.FULL --taxonnames taxdmp/names.dmp --taxonnodes taxdmp/nodes.dmp diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 40

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: /lustre/project/taw/kvigil/Reference/refseq_viral_proteins/viral.1.protein.faa.gz Opening the database file... [0.355s] Loading sequences... [10.585s] Masking sequences... [0.655s] Writing sequences... [0.928s] Writing accessions... [0.796s] Hashing sequences... [0.062s] Loading sequences... [0s] Writing trailer... [0.015s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable): UniRef prefix 0 gi|xxx| prefix 0 xxx| prefix 0 |xxx suffix 0 .xxx suffix 592770 :PDB= suffix 0

Loading taxonomy names... Error opening file taxdmp/names.dmp: No such file or directory

katievigil commented 1 year ago

Hi I ended up creating a new database with the taxonomic files downloaded from NCBI, and I still get an error when I run diamond:

(/lustre/project/taw/share/conda-envs/diamond-env) [kvigil@cypress2 Reference]$ diamond makedb --in /lustre/project/taw/kvigil/Reference/refseq_viral_proteins/viral.1.protein.faa.gz -d viralprotein --taxonmap /lustre/project/taw/kvigil/Reference/prot.accession2taxid.FULL.gz --taxonnames /lustre/project/taw/kvigil/Reference/names.dmp --taxonnodes /lustre/project/taw/kvigil/Reference/nodes.dmp diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of TuebingenDocumentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 40

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Database input file: /lustre/project/taw/kvigil/Reference/refseq_viral_proteins/viral.1.protein.faa.gz Opening the database file... [0.356s] Loading sequences... [13.78s] Masking sequences... [0.838s] Writing sequences... [1.351s] Writing accessions... [5.399s] Hashing sequences... [0.059s] Loading sequences... [0s] Writing trailer... [0.015s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable): UniRef prefix 0 gi|xxx| prefix 0 xxx| prefix 0 |xxx suffix 0 .xxx suffix 592770 :PDB= suffix 0

Loading taxonomy names... [13.939s] Loaded taxonomy names for 2534199 taxon ids. Loading taxonomy mapping file... [2889.8s] Joining accession mapping... [1960.23s] Writing taxon id list... [0.121s]

Accession parsing rules triggered for mapping file seqids (use --no-parse-seqids to disable): UniRef prefix 0 gi|xxx| prefix 0 xxx| prefix 0 |xxx suffix 0 .xxx suffix 6018257598 :PDB= suffix 13

Building taxonomy nodes... [0.025s] 3081562 taxonomy nodes processed. Number of nodes assigned to rank: no rank 786050 superkingdom 4 kingdom 13 subkingdom 1 superphylum 1 phylum 309 subphylum 31 superclass 6 class 515 subclass 169 infraclass 19 cohort 5 subcohort 3 superorder 57 order 1870 suborder 373 infraorder 135 parvorder 26 superfamily 900 family 10247 subfamily 3247 tribe 2358 subtribe 586 genus 107794 subgenus 1808 section 533 subsection 41 series 9 species group 360 species subgroup 134 species 2074184 subspecies 28800 varietas 9768 forma 682 strain 46038 biotype 17 clade 957 forma specialis 751 genotype 22 isolate 1333 morph 12 pathogroup 5 serogroup 152 serotype 1237 subvariety 0

Closing the input file... [0.01s] Closing the database file... [0.019s]

Database sequences 592770 Database letters 142010059 Accessions in database 592770 Entries in accession to taxid file 6019041334 Database accessions mapped to taxid 581744 Database sequences mapped to taxid 581744 Database hash 1e97674dc58cabad3c9a59305f745bf1 Total time 4888s

(/lustre/project/taw/share/conda-envs/diamond-env) [kvigil@cypress2 medaka]$ diamond blastx -d viralprotein.dmnd -q consensus.fasta -o ONR030223oysterspooled_blastx.tsv --ultra-sensitive diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

CPU threads: 40

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Temporary directory:

Target sequences to report alignments for: 25

Opening the database... Error: Incomplete database file. Database building did not complete successfully.

katievigil commented 1 year ago

I am running diamond version 2.1.8 which I downloaded a few days ago.

katievigil commented 1 year ago

Hi I ended up getting this to work, I had the wrong path to my makedb with all the taxonomic names! So excited yey!

katievigil commented 1 year ago

Hi! Is it possible to get diamond to export the taxonomic families? I ended up meganizing my .daa files and loading them into MEGAN, but my issue with megan is that the data will not export into a nice .tsv file similar to the Diamond output, but I need the taxonomic families. Thanks!

katievigil commented 1 year ago

Hi All,

I ended up creating a Python script that basically will scan your diamond output accession numbers using the Bio.Entrez package. https://biopython.org/docs/1.75/api/Bio.Entrez.html

All you have to do is copy and paste your accession numbers into a .csv file and run this script. Just change the path to your .csv accession number file.

from Bio import Entrez

import sys import time

Replace with your own email address

Entrez.email = "kvigil@tulane.edu"

Function to retrieve taxonomy information for an accession number

def retrieve_taxonomy(accession_number): try: handle = Entrez.efetch(db="protein", id=accession_number, retmode="xml") record = Entrez.read(handle) taxonomy = record[0]["GBSeq_taxonomy"] return taxonomy except Exception as e: print(f"Error retrieving taxonomy for {accession_number}: {str(e)}") return None

if len(sys.argv) != 2: print("Usage: python blastxpython.py input_file") sys.exit(1)

input_file = sys.argv[1]

Read accession numbers from a file (one per line)

with open("/path/to/accession.csv", "r", encoding="ISO-8859-1") as file: accession_numbers = file.read().splitlines()

Retrieve and print the taxonomy information for each accession number

for accession_number in accession_numbers: taxonomy = retrieve_taxonomy(accession_number) if taxonomy: print(f"Accession: {accession_number}, Taxonomy: {taxonomy}")

Introduce a delay of 1 second between requests

time.sleep(1)
IdoBar commented 1 year ago

Hi @lucyintheskyzzz

Hi! Is it possible to get diamond to export the taxonomic families? I ended up meganizing my .daa files and loading them into MEGAN, but my issue with megan is that the data will not export into a nice .tsv file similar to the Diamond output, but I need the taxonomic families. Thanks!

How did you meganized your daa file? Using the GUI or the command line? I've been trying to do it and it keeps failing...

katievigil commented 1 year ago

Hi @IdoBar This is what I did see below. I used Ubuntu command line:

module load anaconda3/2020.07 conda update diamond Source activate diamond-env #created a virtual environment to run diamond

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip #Download NCBI taxonomic names zip file, you have to unzip this for the viral database to work https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz #download NCBI accession #s with taxonomic IDs from NCBI

make viral database to run in Diamond with taxonomic names from NCBI, This took 4888s using 40 threads on HPC to create

diamond makedb --in /lustre/project/taw/kvigil/Reference/refseq_viral_proteins/viral.1.protein.faa.gz -d viralprotein --taxonmap /lustre/project/taw/kvigil/Reference/prot.accession2taxid.FULL.gz --taxonnames /lustre/project/taw/kvigil/Reference/names.dmp --taxonnodes /lustre/project/taw/kvigil/Reference/nodes.dmp

execute diamond with viral database that has all the taxonomic names downloaded

diamond blastx -d /lustre/project/taw/kvigil/Reference/viralprotein.dmnd -q consensus.fasta -o ONR030223oysterspooledblastxdenovo.tsv --ultra-sensitive -f 6 qseqid sseqid pident length mismatch evalue bitscore staxids sscinames sskingdoms skingdoms sphylums stitle

katievigil commented 1 year ago

@IdoBar I updated my python script to actually save all the viral taxonomy from NCBI in a .csv file. Note that each accession number takes at least 1 second to populate the taxonomy, so if you have 80000 rows like I do this will take up to 24 hours to complete, but its worth, because desperate measures LOL

I created a separate .csv file of only accession numbers from my Diamond output then ran my python script see code below:

Also, I have a meeting with the MEGAN people to see how to export the families from MEGAN, because I can't figure this out either.

import csv from Bio import Entrez import sys import time

Replace with your own email address

Entrez.email = "kvigil@tulane.edu"

Function to retrieve taxonomy information for an accession number

def retrieve_taxonomy(accession_number): try: handle = Entrez.efetch(db="protein", id=accession_number, retmode="xml") record = Entrez.read(handle) taxonomy = record[0]["GBSeq_taxonomy"] return taxonomy except Exception as e: print(f"Error retrieving taxonomy for {accession_number}: {str(e)}") return None

def main(input_file, output_file):

Read accession numbers from the input file (one per line)

with open(input_file, "r", encoding="ISO-8859-1") as file:
    accession_numbers = file.read().splitlines()

# Create a CSV file and write data to it
with open(output_file, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)

    # Write the header row
    csv_writer.writerow(["Accession", "Taxonomy"])

    # Retrieve and write the taxonomy information for each accession number
    for accession_number in accession_numbers:
        taxonomy = retrieve_taxonomy(accession_number)
        if taxonomy:
            csv_writer.writerow([accession_number, taxonomy])

print("Taxonomy information has been written to", output_file)

if name == "main": if len(sys.argv) != 3: print("Usage: python oysterdenovotest.py input_file output_file") sys.exit(1)

input_file = sys.argv[1]
output_file = sys.argv[2]

main(input_file, output_file)

Introduce a delay of 1 second between requests

time.sleep(3)
katievigil commented 1 year ago

@IdoBar Diamond does not provide the families, it stops at Phylum. I used GUI to meganize my .daa files.

katievigil commented 1 month ago

@IdoBar where you able to get families with the latest Diamond version?