Closed MortenEneberg closed 2 years ago
Dear Morten,
Thank you, I'm glad to hear you find FlexTaxD useful!
I have seen a similar error when I accidentally used a file that had a .gz ending but was not actually gzipped. Double-check your source file completed.genomeid2taxid.gz. It could also be that the merge step was corrupted for some reason, so perhaps try the merge once more. One final check I would do is to run the program with each of the two files separately, to rule out sources of error other than the merge of the two files.
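A quick way to test for the "named .gz but not gzipped" case, and for a truncated or damaged archive, is to check the gzip magic bytes and then read the file to the end. The sketch below uses only the Python standard library; `check_gzip` is a hypothetical helper name, not part of FlexTaxD.

```python
import gzip
import zlib


def check_gzip(path, chunk_size=1 << 20):
    """Return 'ok', 'not_gzip', or 'corrupt' for the file at path.

    Hypothetical helper for illustration; not a FlexTaxD function.
    """
    # A gzip stream always starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        if handle.read(2) != b"\x1f\x8b":
            return "not_gzip"
    # Decompress the whole file in chunks; truncated or damaged
    # archives raise an exception somewhere along the way.
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return "corrupt"
    return "ok"
```

Calling `check_gzip("complete.accession2taxid.gz")` on the merged file would then report which of the three cases applies. On the command line, `gzip -t` performs a comparable integrity test.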
I will be away for the rest of this week, but if you have not found a solution by next week, I will look further into the issue.
Kind regards, David
Hi David,
Thanks for getting back so quickly!
I did try running:
flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/nucl_gb.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path genomes
But I get the same error. An NCBI_GTDB_merge.db file is created in both cases, but it seems to be faulty when I continue the walk-through.
Should I format my genomes in any specific way? Is it wrong to concatenate them into a few files rather than the ~35000 files?
Kind regards, Morten
Hi Morten,
I've been away at a meeting.
I implemented the function based on the use of single files. I can't remember the implementation exactly, or whether and how it could allow merged files. I don't have the time right now, but I will have a look and see if it is possible to implement that function without too much change in the code.
Kind regards, David
Hi David,
Thanks for getting back. I will try to use the individual files to see if it solves the problem when I have the time.
Kind regards, Morten
Hi again,
Unfortunately, it didn't fix the error: the exact same error message occurred.
Kind regards, Morten
Hi,
Reading the error message again, I realize that the corruption seems to happen earlier: already while parsing the accession.gz file, not while parsing the genomes. Did you validate the downloaded accession file? What does the tail of the accession file look like? Did you use wget or rsync to download the file? There is an md5 checksum file at NCBI that can be downloaded to verify that the downloaded file is not corrupt. I once stumbled upon a problem with the original file from NCBI (which I downloaded several times using rsync): the mismatching md5 sums showed that the file was corrupted. A few days later the download was OK.
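The checksum comparison can be scripted as below. This is a minimal sketch that assumes the .md5 file follows the usual `md5sum` layout (hex digest first, then the file name); `md5sum` and `matches_md5_file` are illustrative helper names, not part of FlexTaxD or any NCBI tool.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(path, md5_path):
    """Compare a file against a checksum file.

    Assumption: the first whitespace-separated token in md5_path
    is the expected hex digest.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5sum(path) == expected
```

For example, `matches_md5_file("nucl_gb.accession2taxid.gz", "nucl_gb.accession2taxid.gz.md5")` would return `False` for a corrupted download, assuming the checksum file sits next to the data file under that name.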
Kind regards, David
Dear David,
I tried downloading both accession2taxid files again, and the md5 checksum files confirm that neither is corrupted. Unfortunately, I still get the same error message, both when using the complete.accession2taxid file and when using either nucl_gb or nucl_wgs alone.
I also tried doing the build with just a few fungal genomes (21 RefSeq complete genomes) in the genomes_path, still with the same error.
Kind regards, Morten
Dear Morten,
OK, I will have to look into this a bit more.
Could you send me the result files from flextaxd (including the logs), that is, the files you used for your last try? For the genomes, the names are enough if the files get too big, but include the folder structure if possible. Please also add the versions of python and flextaxd.
Also check whether you can replicate the error with the --test parameter. If so, that would speed up the debugging significantly.
I will try to reproduce the error locally.
Dear David,
Thanks for looking into it!
I sent you the data via WeTransfer to the email addresses listed for the program. I hope it works; otherwise let me know. The data should be available for a week (until December 1st).
In the log file folder you will find several logs; the one from 17-11 contains logs for building both the NCBI and the GTDB databases.
In the databases folder you will find the outputs for all three databases. The NCBI_GTDB_merged one is built on the fungal genomes that I described above. The size of the NCBI_GTDB_merged file was always 169MB for all of the builds I tried.
Kind regards, Morten
flextaxd --version
custom_taxonomy_databases: version 0.4.2
Maintaner group: FOI bioinformatics group (bioinformatics@foi.se, david.sundell@foi.se)
Github: https://github.com/FOI-Bioinformatics/flextaxd
python --version
Python 3.8.6
Dear Morten,
I found the issue. I could replicate the error with the files that you sent, thanks!
To start with, it seems that the merged database has no annotated genomes. All the genomes that you sent were fungi, so neither GTDB bac nor GTDB arr would have any annotations for those genomes. That means that the annotation has to come from the original NCBI database with genome2taxid annotations.
flextaxd --db databases/NCBI_GTDB_merge.db --stats
Tree statistics
Nodes: 2380896
Links: 2380896
Genomes: 0
Going through your files, I discovered that when I implemented the read genomeid2taxid function, I assumed that all genome files would be gzipped on disk (this is how we keep them, and usually how they come on download). This is what causes the error, since the genome files you sent were uncompressed. The quick solution would therefore be to gzip all your genomes on disk (use pigz or parallel gzip jobs to speed up the process).
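If pigz is not at hand, the compression step can be done with the Python standard library. This is a rough, single-threaded sketch; the helper name `gzip_genomes` and the list of FASTA suffixes are assumptions for illustration, and like the `gzip` command it removes each original file once the compressed copy has been written.

```python
import gzip
import shutil
from pathlib import Path


def gzip_genomes(genomes_dir, suffixes=(".fna", ".fa", ".fasta")):
    """Gzip every uncompressed FASTA file under genomes_dir.

    Hypothetical helper; the suffix list is an assumption and files
    already ending in .gz are left untouched.
    """
    compressed = []
    # sorted() materialises the listing before any file is modified.
    for path in sorted(Path(genomes_dir).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            target = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # drop the uncompressed original
            compressed.append(target)
    return compressed
```

Running `gzip_genomes("genomes")` would turn, for example, `genomes/x.fna` into `genomes/x.fna.gz`. For tens of thousands of genomes, pigz or several parallel gzip jobs will be considerably faster.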
If you think it is necessary, I could add a function that detects whether a file is gzipped or not and include it in the next release of flextaxd. Otherwise, I will at minimum add a proper error message explaining that genome files need to be gzipped on disk.
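Such a detection function could look roughly like the sketch below, which picks the opener based on the gzip magic bytes rather than the file extension. This only illustrates the idea and is not the actual FlexTaxD implementation; `open_maybe_gzip` is a hypothetical name.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzipped text file for reading.

    Illustrative only: the decision is based on the two gzip magic
    bytes (0x1f 0x8b) at the start of the file, not on the extension.
    """
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == b"\x1f\x8b"
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

With this, the same parsing loop (`for line in open_maybe_gzip(genome_file): ...`) would work for both compressed and uncompressed genome files.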
This is the result after gzipping the files.
genomes folder read, 21 sequence files found
2021-11-25 10:30:38,162 ReadTaxonomyNCBI [INFO ] Printing non added genome id´s (GCF) to ./FlexTaxD.not_added
2021-11-25 10:30:38,164 ReadTaxonomyNCBI [INFO ] Genomes not matching any annotation 0
2021-11-25 10:30:38,165 custom_taxonomy_databases [INFO ] Nodes in taxonomy tree 12923 number of taxonomies 12923
2021-11-25 10:30:38,165 custom_taxonomy_databases [INFO ] --- process finished in 5 minutes 57.38753008842468 seconds---
2021-11-25 10:30:38,165 custom_taxonomy_databases [INFO ] --- Time summary 5 minutes 57.3877694606781 seconds---
And the statistics
flextaxd --db test.db --stats
Tree statistics
Nodes: 12923
Links: 12923
Genomes: 21
Kind regards, David
Dear David,
It seems to work now! Thanks for all your help
flextaxd --db databases/NCBI_GTDB_merge.db --stats
Tree statistics
Nodes: 2380896
Links: 2380896
Genomes: 34670
Kind regards, Morten
Hi,
Thanks a lot for an excellent tool!
I want to create kraken2 and krakenuniq databases with the human genome, viral RefSeq complete genomes, fungal RefSeq genomes (including incomplete ones!), and GTDB bacterial & archaeal genomes. I followed the walk-through of your example, which is close to what I need, and tried to adapt it along the way.
I already have all of the NCBI genomes as .fna files located in the folder called "genomes" below. They are concatenated into fasta files for each of the domains (e.g. bacteria.fna).
I merged the two NCBI accession2taxid files into one and used that to generate the database:
zcat nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz | gzip > complete.accession2taxid.gz
However, it seems not to complete the action when running:
Hope you can help me
Kind regards, Morten