Issues with replacing NCBI bacterial and archeal taxonomies with GTDB taxonomies

MortenEneberg commented 2 years ago

Hi,

Thanks a lot for an excellent tool!

I want to create kraken2 and krakenuniq databases with the human genome, viral refseq complete genomes, fungi refseq genomes (also not complete!), and GTDB bacterial & archeal genomes. I followed the walk-through of your example - which is close to the same - and tried to implement what I needed to do on the way.

I already have all of the NCBI genomes as .fna files located in the folder called "genomes" below. They are concatenated into fasta files for each of the domains (e.g. bacteria.fna).

I merged the two NCBI accession2taxid files into one and used that to generate the database: zcat nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz | gzip > complete.accession2taxid.gz

However, it seems not to complete the action when running:

flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/complete.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path genomes

2021-11-17 12:41:29,380 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
Creating a new FlexTaxD database databases/NCBI_GTDB_merge.db using source/ncbi/nodes.dmp, press any key to continue...
2021-11-17 12:41:32,371 custom_taxonomy_databases [INFO ]  Loading module: ReadTaxonomyNCBI
2021-11-17 12:41:57,251 DatabaseConnection [INFO ]  databases/NCBI_GTDB_merge.db opened successfully.
2021-11-17 12:41:57,254 custom_taxonomy_databases [INFO ]  Parse taxonomy
2021-11-17 12:41:57,255 ReadTaxonomyNCBI [INFO ]  Parse names file source/ncbi/names.dmp
2021-11-17 12:42:43,363 ReadTaxonomyNCBI [INFO ]  Parse nodes file source/ncbi/nodes.dmp
2021-11-17 12:43:51,524 ReadTaxonomyNCBI [INFO ]  Parsing ncbi accession2taxid, genome_path: genomes

Traceback (most recent call last):

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/bin/flextaxd", line 8, in <module>
    sys.exit(main())

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/custom_taxonomy_databases.py", line 279, in main
    read_obj.parse_genomeid2taxid(args.genomes_path,args.genomeid2taxid)

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ReadTaxonomyNCBI.py", line 97, in parse_genomeid2taxid
    self.parse_genebank_file(filepath,filename)

  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ReadTaxonomyNCBI.py", line 81, in parse_genebank_file
    refseqid = f.readline().split(b" ")[0].lstrip(b">")

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/gzip.py", line 390, in readline
    return self._buffer.readline(size)

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():

  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)

gzip.BadGzipFile: Not a gzipped file (b'>N')

Hope you can help me

Kind regards, Morten

davve2 commented 2 years ago

Dear Morten,

Thank you, I´m glad to here you find FlexTaxD useful!

I have seen a similar error when I was accidentally using a file that did have .gz ending, but was not actually zipped. Doublecheck your source file completed.genomeid2taxid.gz. It could also be that this process was corrupted for some reason. Perhaps try the merge once more. One final check I would do is to run the program with either of the two files to exclude other sources of the error than the merge of the two files.

I will be away for the rest of this week, but if you cannot find a solution until next week, I will look further into the issue.

Kind regards, David

MortenEneberg commented 2 years ago

Hi David,

Thanks for getting back so quickly!

I did try running:

flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/nucl_gb.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path genomes

But with the same error.. A NCBI_GTDB_merge.db file is created in both cases, but seems to have an error to it when continuing the walk-through

Should I format my genomes in any specific way? Is it wrong to concatenate them into a few files rather than the ~35000 files?

Kind regards, Morten

davve2 commented 2 years ago

Hi Morten,

I´ve been away on a meeting,

I implemented the function base on the usage of single files. I can´t remember exactly my implementation and if it could allow merged files and how. Right now I don´t have that time, but I will have a look and if it is possible to implement that function without too much change in the code.

Kind regads, David

MortenEneberg commented 2 years ago

Hi David,

Thanks for getting back. I will try to use the individual files to see if it solves the problem when I have the time.

Kind regards, Morten

MortenEneberg commented 2 years ago

Hi again,

Unfortunately it didnt fix the error - the exact same error message occured.

Kind regards, Morten

davve2 commented 2 years ago

Hi,

Reading the error message again I realize that the corruption seems to be happening earlier. Already during the parse of the accession.gz, not during parsing of the genomes. Did you validate the downloaded file accession file? How does the tail of the accession file look? Did you use wget or rsync to download the file? There is an md5 checksum file at NCBI that can be downloaded to verify that the downloaded file is not corrupt. Once I actually stumbled upon a problem with the original downloaded file from NCBI (that I downloaded several times using rsync). I could see that the file was corrupted through missmatching md5 sum files. A few days later the download was ok.

Kind regards, David

MortenEneberg commented 2 years ago

Dear David,

I tried downloading both accession2taxid files again and the md5 checksum files verify that neither is not corrupted. Unfortunately, I still get the same error message - both when using the complete.accession2taxid file and either of the nucl_gb and nucl_wgs.

I also tried doing the build with just a few fungal genomes (21 Refseq complete genomes) in the genomes_path - still with the same error

Kind regards, Morten

davve2 commented 2 years ago

Dear Morten,

Ok I will have to look into this a bit more.

Is it possible for you to send me the result files from flextaxd (include the logs). The files that you used for your last try (for the genomes the names are enough if the file gets to big), include the folder structure though if possible. Also if you could add versions of python and flextaxd.

Also try if you can replicate the error with the --test parameter. If so that would speed up the debug step significantly

I will try to reproduce the error locally.

MortenEneberg commented 2 years ago

Dear David,

Thanks for looking into it!

I sent you the data via WeTransfer on your emails attached to the program. Hope it works, otherwise let me know. The data should be available for a week (till December 1st)

In the log file folder you find several logs, where the one from 17-11 contains logs for both building NCBI database and GTDB databases.

In the databases folder you find outputs from all three databases. The NCBI_GTDB_merged one is built on the fungal genomes that I described above. The size of the NCBI_GTDB_merged file was always 169MB for all of the builds I tried.

Kind regards, Morten

flextaxd --version
custom_taxonomy_databases: version 0.4.2
Maintaner group: FOI bioinformatics group (bioinformatics@foi.se, david.sundell@foi.se)
Github: https://github.com/FOI-Bioinformatics/flextaxd

python --version
Python 3.8.6

davve2 commented 2 years ago

Dear Morten,

I found the issue, I could replicate the error with the files that you sent, thanks!

To start with it seems like the merged database has no annotated genomes. All genoms that you sent were fungi, so neither GTDB bac nor GTDB arr would have any annotations to those genomes. That means that the annotation has to come from the original NCBI database with genome2taxid annotations.

flextaxd --db databases/NCBI_GTDB_merge.db --stats
Tree statistics
                                        Nodes: 2380896
                                        Links: 2380896
                                        Genomes: 0

Going through your files, I discovered that, when I implemented the read genomeid2taxid function I made the assumption that all genomic files would be zipped on disk (this is the way we keep them, and usually the way they come on download). This is what causes the error since the genome files you sent were unzipped. The quick solution would therefore be to gzip all your genomes on disk ( use pigz or parallell zips to speed up the process).

If you think it is a nessesary, I could add a function that it detects weather the file is zipped or not and add this to the next relsease of flextaxd. Alternatively I will at minimum add an proper error message that explains that genomic files need to be zipped on disk.

This is the result after gzipping the files.

genomes folder read, 21 sequence files found
2021-11-25 10:30:38,162 ReadTaxonomyNCBI [INFO ]  Printing non added genome id´s (GCF) to ./FlexTaxD.not_added
2021-11-25 10:30:38,164 ReadTaxonomyNCBI [INFO ]  Genomes not matching any annotation 0
2021-11-25 10:30:38,165 custom_taxonomy_databases [INFO ]  Nodes in taxonomy tree 12923 number of taxonomies 12923
2021-11-25 10:30:38,165 custom_taxonomy_databases [INFO ]  --- process finished in 5 minutes 57.38753008842468 seconds---

2021-11-25 10:30:38,165 custom_taxonomy_databases [INFO ]  --- Time summary  5 minutes 57.3877694606781 seconds---

And the statistics

flextaxd --db test.db --stats
Tree statistics
                                        Nodes: 12923
                                        Links: 12923
                                        Genomes: 21

Kind regards, David

MortenEneberg commented 2 years ago

Dear David,

It seems to work now! Thanks for all your help

flextaxd --db databases/NCBI_GTDB_merge.db --stats
Tree statistics
                                        Nodes: 2380896
                                        Links: 2380896
                                        Genomes: 34670

Kind regards, Morten

FOI-Bioinformatics / flextaxd

Issues with replacing NCBI bacterial and archeal taxonomies with GTDB taxonomies #47