Closed andrewjmc closed 4 years ago
Did you use the flextaxd-create --dbprogram ganon option?
I have a feature that adds the sequence length to the header, but that won't guarantee the headers become unique (although a clash is unlikely). I think your idea of putting GCF/GCA in the header is even better, so this will be implemented in the next update!
I'm just wondering, since I built a ganon database the other day with no problem, but perhaps my database was not big enough (I used a smaller one for speed)!
I did use the --dbprogram ganon option (ganon-build is running now). I haven't seen sequence lengths appended in the ganon file (I do see the lengths in the seqid2taxid.map file).
@andrewjmc Would you share the code you used to create a ganon database with flextaxd?
I followed the walk-through here to an extent:
https://github.com/FOI-Bioinformatics/flextaxd/wiki/Walkthrough---merge-NCBI-with-GTDB
I did it a bit differently as I described here: https://github.com/FOI-Bioinformatics/flextaxd/issues/17
With the additional information gained, I would make a few more changes and I think it would look like this (not tested and could contain silly errors):
mkdir -p seq/ncbi
cd seq/ncbi
#Get NCBI genomes (human, viruses and fungi)
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq viruses -H
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq fungi -H
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq -t 9606 vertebrate_mammalian -H
cd ../
mkdir gtdb
cd gtdb
#Get GTDB representative genomes
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/gtdb_genomes_reps.tar.gz
tar -xvf gtdb_genomes_reps.tar.gz
cd ../../
#Get GTDB taxonomies
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv
#Get taxdump files
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
unzip taxdmp.zip
#Create separate GTDB databases -- NB points to **all** genomes, not just representative ones
flextaxd -db databases/bac120_gtdb.db -tf bac120_taxonomy.tsv -tt QIIME --verbose --log NCBI_GTDB_merge_log
flextaxd -db databases/ar122_gtdb.db -tf ar122_taxonomy.tsv -tt QIIME --verbose --log NCBI_GTDB_merge_log
#Create NCBI database
flextaxd -db databases/NCBI_taxonomy.db -tf nodes.dmp -tt NCBI --genomeid2taxid nucl_gb.accession2taxid.gz --verbose --log NCBI_GTDB_merge_log --genomes_path seq/
#Remove those elements which don't have downloaded genomes (but preserve top level domains with -tt)
flextaxd -db databases/NCBI_taxonomy.db --clean_database --verbose --log NCBI_GTDB_merge_log -tt
#Create new NCBI database for merging
cp databases/NCBI_taxonomy.db databases/NCBI_GTDB_merge.db
#Merge in archaea and bacteria in turn (unsure if --replace matters here because the hierarchy will already be empty for archaea and bacteria)
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/bac120_gtdb.db --parent Bacteria --replace --verbose --logs NCBI_GTDB_merge_log
#Create taxonomies for ganon
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram ganon --dump
#Create ganon databases (uses the merged database created above)
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path seq -p 24 --verbose --log build_ganon_logs --create --db_name NCBI_GTDB_FT_merge_ganon --dbprogram ganon
Hope this helps!
I've been working on a fix for this issue and will push it tomorrow (I still have a few tests I want to run).
I encountered errors like
[contig_179238] skipped - duplicated sequence identifier)
during ganon database building. Many GTDB reference sequences (https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/) contain contig names like this, and clearly the numbers sometimes clash between genomes.
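For anyone who wants to see how many identifiers clash before starting a build, here is a rough sketch that lists every sequence identifier appearing more than once across a directory of gzipped FASTA files. It is not part of flextaxd or ganon; the function name find_duplicate_ids and the *.fna.gz file pattern are my own assumptions.

```shell
# Hypothetical helper, not part of flextaxd or ganon.
# $1: directory searched recursively for *.fna.gz files (assumed naming).
find_duplicate_ids() {
    # Decompress every matching file to one stream, then take the first
    # whitespace-delimited token of each header line as the sequence ID.
    find "$1" -type f -name '*.fna.gz' -exec gzip -cd {} + |
        awk '/^>/ {
            id = substr($1, 2)            # drop the leading ">"
            if (seen[id]++ == 1) print id # report each duplicate once
        }'
}
```

Running find_duplicate_ids seq/gtdb would print one clashing identifier per line; no output means no clashes among the files found.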
Given that this issue will likely persist in GTDB (due to the use of MAGs, I assume), is it worth detecting such clashes during kraken/ganon database building and prepending GCA/GCF names when producing the .fna.gz files for the kraken/ganon build steps?
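A minimal sketch of the prepending idea, assuming each file name starts with its assembly accession (for example GCA_000000001.1_genomic.fna.gz, a made-up accession). The function name is hypothetical and this is not necessarily how flextaxd would implement it; it only shows that putting the accession in front of each header makes contig names unique across genomes.

```shell
# Hypothetical helper, not part of flextaxd or ganon.
# $1: gzipped FASTA whose file name is assumed to start with a GCA_/GCF_ accession.
prefix_headers_with_accession() {
    local acc
    # Extract e.g. GCA_000000001.1 from GCA_000000001.1_genomic.fna.gz.
    # If the name does not match, the whole file name is used as the prefix.
    acc=$(basename "$1" | sed -E 's/^(GC[AF]_[0-9]+\.[0-9]+).*/\1/')
    # Rewrite only header lines: ">contig_1 ..." becomes ">ACCESSION_contig_1 ...".
    gzip -cd "$1" |
        awk -v acc="$acc" '/^>/ { $0 = ">" acc "_" substr($0, 2) } { print }'
}
```

The output goes to standard output, so it could be piped through gzip -c into a new .fna.gz file rather than overwriting the download.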