FOI-Bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more )
GNU General Public License v3.0
65 stars 8 forks source link

Flextaxd - test database inspect file has no genome names #49

Closed MortenEneberg closed 2 years ago

MortenEneberg commented 2 years ago

Dear Flextaxd team, Thanks for a great tool! I am running into issues again building the merged NCBI and GTDB databases and hope that you can help me.

I followed the 'Walkthrough merge NCBI and GTDB' for most but did use krakenuniq to download NCBI genomes instead. What I wish to create is a database with NCBI Refseq Human genome, NCBI Refseq complete viruses, NCBI Refseq (All) fungi and GTDB bacteria. To do this I downloaded GTDB genomes from the GTDB repository and made a script to extract only the bacterial genomes (archaea and bacterial genomes are present in a confusing subfolder system together). All NCBI genomes were downloaded with krakenuniq-download and comes as .fna that I have gzipped.

krakenuniq-download --db $WD/krakenuniq_31db208/ --threads 30 refseq/fungi/Any refseq/viral refseq/vertebrate_mammalian/Chromosome/species_taxid=9606 

Below I have attached the inspect file of the test database, the merged NCBI_GTDB database file and the log for the build of the test database. inspect.txt NCBI_GTDB_merge.zip build_kraken_logsFlexTaxD-create-Dec-02-2021.log

I dont understand why the inspect file does not have genome names associated (as is the case for your example).

Kind regards, Morten

davidsundell commented 2 years ago

Dear Morten.

This happens typically if the nodes and names files are incorrectly formatted for kraken. I discovered this bug and figured out that kraken removes at least one trailing column at the end of the dmp files. Default flextaxd uses the most simple format without unnessesary data and will only export two columns. Kraken would then remove the name column. To account for this i added a --dbprogram parameter to the `flextaxd --dump´ where it is possible to add kraken2. This functions simply adds two empty columns to the end of the nodes and names file so that kraken2 strips emtpy columns and not the name column.

Make sure that you do

flextaxd -db <yourdb> --dbprogram kraken2 --dump (-o taxonomy)

before running kraken. I hope this is the solution, otherwise please get back to me and I will investigate the issue further.

Kind regards, David

MortenEneberg commented 2 years ago

Dear David,

It worked, thanks a lot for the excellent support that you provide for this tool!

And sorry for the inconvenience of bothering you with such a simple issue..

Have a wonderful day

Morten

davidsundell commented 2 years ago

Dear Morten,

No problem, the tool is supposed to be easy to use. So I think it would be an excellent addition to discuss this potential issue in the text on the wiki, or potentially add a warning in the terminal output if the inspect file has only digits in the name column. No harm done and have a great day!

David

MortenEneberg commented 2 years ago

Dear David,

Sorry for reopening the issue, but I still encounter issues building the kraken2 database. Now I do manage to build a database, but only the GTDB bacteria are added (according to the inspect file 100% of sequences have Bacteria as parent). In the nodes.dmp and names.dmp fungi, virus and human are present too which is cool, but in the seqid2taxid.map they are missing.

I built the database with

flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy_kraken2 --genomes_path "/shared-nfs/MEN/silico_reads/gtdb_202/krakenuniq_gtdb_202_no_dust/library/" -p 30 --verbose --logs build_kraken_logs --create_db --db_name kraken2_gtdb --dbprogram kraken2

In the --genomes_path, files have the following structure. The bacterial genomes are from GTDB and have a different name and folder structure than the ones downloaded from NCBI (downloaded with krakenuniq). I wonder if that makes the difference? I "opened" some of the folders in the sketch below to examplify how it looks

library ----bacteria ---GCF_903989465.1_genomic.fna.gz ---GCF_903283865.1_genomic.fna.gz ---GCA_903989422.1_genomic.fna.gz
----virus
---scaffold
---complete
---contig
---chromosome
----fungi
---scaffold
---Suillus_fuscotomentosus_strain_FC203-tax1912939-GCF_016647785.1_Suifus1_genomic.fna.gz
---Penicillium_arizonense_strain_CBS_141311-tax1835702-GCF_001773325.1_ASM177332v1_genomic.fna.gz
---complete
---contig
---chromosome
----vertebrate_mammalian
---Chromosome

No errors were reported during the build. Attached you find these files:

inspect.txt names.txt seqid2taxid.zip nodes.txt

I hope that you can help me with the issue!

Kind regards, Morten

davidsundell commented 2 years ago

Dear Morten,

The structure seems ok, it does a full walk of your library and pics up anything with fasta,fna or fa so that should not be the problem.

I think I would need the database file (and the log) to ensure all annotations are correct in the database. Is the two genomes annotated with their full name "Suillus_fuscotomentosus_strain_FC203-tax1912939-GCF_016647785.1_Suifus1_genomic.fna" for example, otherwise they won´t be recognized if they are only annotated with their GCF number.

Try to build a new database and stepwise annotate some of the subfolders that are not working (fungi for example) when building the database use --test parameter (only uses 10 genomes) and check if that seems to be working.

Kind regards, David

MortenEneberg commented 2 years ago

Dear David,

I attached the database here NCBI_GTDB_merge3.zip

I opened the database in a DB browser, and it seems that all the GTDB genomes are annotated with the GCF numbers while the NCBI genomes have their "real name" (e.g. Homo sapiens).

I did a new kraken2 build from the database attached above, and the names, nodes and seq2taxid files are zipped below: database.zip

And the inspect file inspect.txt

And the log file build_kraken_logsFlexTaxD-create-Dec-09-2021.log. In the log file it says that 57311 genomes were found, but only 45555 were successfully added. 57311 corresponds to all the genomes that I wish to add and 45555 corresponds to all GTDB bacterial genomes that were added.

What do you mean with the stepwise annotation?

Kind regards, Morten

davidsundell commented 2 years ago

I´m not sure how you have downloaded the genomes, but typically NCBI genomes comes in a folder refseq/genbank with only GCF names, and a "human readable" folder with links including structure down to species and the full name of the genomes.

The name in the database and the name on disk have to match to use flextaxd (it can be without .fasta in the database). One option would be to update the names with a custom script. Annotations can be added using the genomeid2taxid function.

MortenEneberg commented 2 years ago

Dear David,

Merry christmas!

Now I tried downloading with ncbi-genome-download to follow your guide completely, however I dont get any "human readable" folder using this method of downloading?

Furthermore, when replacing the bacterial taxonomy with NCBI, I get the following error message:

Traceback (most recent call last):
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/bin/flextaxd", line 8, in <module>
    sys.exit(main())
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/custom_taxonomy_databases.py", line 295, in main
    modify_obj.update_database()
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/ModifyTree.py", line 477, in update_database
    self.taxonomydb.validate_tree()
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 247, in validate_tree
    res = self.check_parent()
  File "/space/sharedbin_ubuntu_14_04/software/flextaxd/0.4.2-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 338, in check_parent
    QUERY = "SELECT parent,rank_i FROM tree WHERE child in ({children})".format(children=",".join(map(str,list(*child_w_dp))))
TypeError: list expected at most 1 argument, got 32

I attached my log file: FlexTaxD-Dec-29-2021.log

This is the code I ran so far:

NCBI_GENOME_DOWNLOAD="/shared-nfs/MEN/silico_reads/gtdb_202/NCBI_GENOME_DOWNLOAD/"

cd $NCBI_GENOME_DOWNLOAD
module load ncbi-genome-download/0.3.1-foss-2020b

ncbi-genome-download -p 40 -r 50 -l complete -F fasta -s refseq bacteria --genera "Yersinia"
ncbi-genome-download -p 40 -r 50 -l all -F fasta -s refseq fungi
ncbi-genome-download -p 40 -r 50 -l complete -F fasta -s refseq viral
ncbi-genome-download --taxids 9606 --assembly-level chromosome vertebrate_mammalian

module purge

cd "/shared-nfs/MEN/silico_reads/GTDB_NCBI_MIX/"
module load flextaxd/0.4.2-foss-2020b-Python-3.8.6

mkdir databases
flextaxd -db databases/NCBI_GTDB_merge.db -tf source/ncbi/nodes.dmp -tt NCBI --genomeid2taxid source/ncbi/complete.accession2taxid.gz --verbose --logs NCBI_GTDB_merge_log --genomes_path $NCBI_GENOME_DOWNLOAD

flextaxd -db databases/bac120_gtdb.db -tf source/gtdb/bac120_taxonomy_r202.tsv -tt QIIME --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log --taxonomy_type NCBI

flextaxd -db databases/NCBI_GTDB_merge.db -md databases/bac120_gtdb.db --parent Bacteria --replace --verbose --logs NCBI_GTDB_merge_log

Kind regards, Morten

MortenEneberg commented 2 years ago

Dear David,

Did you have a chance to look at it?

Kind regards, Morten

davve2 commented 2 years ago

Dear Morten,

I´m on holiday, but I will look again at this issue next week!

Kind regards David

MortenEneberg commented 2 years ago

Dear David,

Sounds great, sorry for disturbing your holidays - it wasnt meant to be rude

Kind regards, Morten

davve2 commented 2 years ago

Dear Morten,

I unfortunately had covid (and high priority work) the after coming back from holidays so I could not attend this issue, I will look into it now and hope to find a solution during this week.

Best regards, David

MortenEneberg commented 2 years ago

Dear David,

Sorry to hear, hope you will be well soon!

I just managed to build the Kraken2 hybrid database today :-) Somehow, it seems important to download everything but the GTDB genomes through ncbi-genome-download to achieve the correct format.

Now, I am strugling with KrakenUniq, but that is a memory issue - the database requires 1.7Tb RAM to build, and I dont think you can help out with that.

Thanks for amazing support

Kind regards, Morten