Adding a custom seqid2taxid file to the NCBI accession2taxid causes error

Dear flextaxd team,

Thanks for an awesome tool! I use it to combine the NCBI and GTDB taxonomies for kraken2 pathogen classification. Recently I have been trying to add EuPathDB (http://ccb.jhu.edu/data/eupathDB/) to get clean eukaryotic genomes into the database. I first tried simply adding the genomes to my NCBI genomes path, hoping they would all be annotated in the database. However, it seems there have been some changes to the fasta headers: not all of them are recognized, and the unrecognized ones are written to the .flextaxdNotAdded file.

To circumvent this, I modified the seqid2taxid file (downloaded with EuPathDB) to look like the accession2taxid files from NCBI (modified file attached: reduced_seqid2taxid_duplicate_no_univec.txt.gz) and concatenated it with the other NCBI accession2taxid files. Somehow flextaxd recognizes that this is not the original file and throws an error:

Traceback (most recent call last):
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/bin/flextaxd", line 8, in <module>
    sys.exit(main())
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/custom_taxonomy_databases.py", line 279, in main
    read_obj.parse_genomeid2taxid(args.genomes_path,args.genomeid2taxid)
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ReadTaxonomyNCBI.py", line 100, in parse_genomeid2taxid
    raise TypeError("The supplied annotation file does not seem to be the ncbi nucl_gb.accession2taxid.gz")
TypeError: The supplied annotation file does not seem to be the ncbi nucl_gb.accession2taxid.gz

Code to build the NCBI database:

flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/complete.accession2taxid_w_eupath_univec.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path genomes/refseq/

Do you have suggestions on how to solve this?

Kind regards, Morten

Dear Morten,

Thanks for your kind words. We are really happy that you find FlexTaxD useful, in particular for implementing further options for database resources.

To your question: I believe the function requires the file name to end with "accession2taxid.gz". The check is: if not annotation_file.endswith("accession2taxid.gz"): Only the ending is checked, so I would suggest renaming your file to complete_w_eupath_univec.accession2taxid.gz, and it should then work fine. I use this check because the read function is naive and would crash or make logic errors unless the correct file is supplied.
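
As a minimal sketch of why the original file name is rejected while the suggested one passes, the snippet below reproduces only the suffix test quoted above. The helper names check_annotation_name and is_accepted are hypothetical and are not part of flextaxd; the real parse_genomeid2taxid may do more than this.

```python
# Sketch of the file-name check only; not the actual flextaxd implementation.
REQUIRED_SUFFIX = "accession2taxid.gz"


def check_annotation_name(annotation_file: str) -> None:
    """Raise TypeError when the file name does not end with the required suffix."""
    if not annotation_file.endswith(REQUIRED_SUFFIX):
        raise TypeError(
            "The supplied annotation file does not seem to be the "
            "ncbi nucl_gb.accession2taxid.gz"
        )


def is_accepted(annotation_file: str) -> bool:
    """Hypothetical convenience wrapper: True when the name passes the check."""
    try:
        check_annotation_name(annotation_file)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    for name in (
        "source/ncbi/complete.accession2taxid_w_eupath_univec.gz",  # name from the failing command
        "source/ncbi/complete_w_eupath_univec.accession2taxid.gz",  # suggested rename
    ):
        print(name, "accepted" if is_accepted(name) else "rejected")
```

Only the end of the name matters, so any label can be placed in front of "accession2taxid.gz".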

The read function for the accession2taxid file expects the annotations to match the header of each sequence inside the fasta files. If this is not the case for the EuPathDB data and you have annotations of "filename" to taxid instead, you can use the regular --genomeid2taxid function (without -tt NCBI). The input file must then contain filename\ttaxid\n on each line; see the specification in the genome2taxid format.
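
A filename-to-taxid file of that shape could be produced along the following lines. This is only a sketch: the function names, the example file names and the taxids are placeholders, and whether the "filename" column should include the file extension is an assumption to verify against the genome2taxid format specification.

```python
from typing import Mapping


def format_genomeid2taxid(mapping: Mapping[str, int]) -> str:
    """Return one 'filename<TAB>taxid' line per entry, each ending in a newline."""
    lines = []
    for filename, taxid in mapping.items():
        # A tab or newline inside the name would corrupt the two-column layout.
        if "\t" in filename or "\n" in filename:
            raise ValueError(f"invalid character in filename: {filename!r}")
        lines.append(f"{filename}\t{int(taxid)}\n")
    return "".join(lines)


def write_genomeid2taxid(mapping: Mapping[str, int], out_path: str) -> None:
    """Write the mapping to out_path as plain text."""
    with open(out_path, "w", newline="\n") as handle:
        handle.write(format_genomeid2taxid(mapping))


if __name__ == "__main__":
    # Placeholder entries for illustration only; replace with real data.
    example = {
        "example_genome_1.fna.gz": 1001,
        "example_genome_2.fna.gz": 1002,
    }
    print(format_genomeid2taxid(example), end="")
```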

Kind regards, David

FOI-Bioinformatics / flextaxd

Adding a custom seqid2taxid file to the NCBI accession2taxid causes error #55