Non-unique contig names for GTDB

andrewjmc commented 4 years ago

I encountered errors like "[contig_179238] skipped - duplicated sequence identifier" during ganon database building.

Many GTDB reference sequences (https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/) contain contig names like this, and clearly the numbers sometimes clash.
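For anyone wanting to check how widespread the clashes are before building, here is a minimal sketch. It assumes the genomes are gzipped FASTA files ending in .fna.gz somewhere under one directory, and that the sequence identifier is the first whitespace-delimited token of each header line. The function name find_duplicate_contig_ids is made up for illustration and is not part of flextaxd or ganon.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of flextaxd or ganon):
# print every sequence identifier that occurs more than once
# across all *.fna.gz files under the given directory.
find_duplicate_contig_ids() {
    local genomes_dir="$1"
    # Decompress every file, keep only the first token of each header
    # line (without the leading '>'), then report repeated identifiers.
    find "$genomes_dir" -type f -name '*.fna.gz' -print0 \
        | xargs -0 -r gzip -cd \
        | awk '/^>/ { sub(/^>/, "", $1); print $1 }' \
        | sort \
        | uniq -d
}
```

Calling find_duplicate_contig_ids seq/gtdb would print one clashing identifier per line. An empty result would suggest the identifiers are already unique, at least under the assumptions above.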

Given that this issue will likely persist in GTDB (due to the use of MAGs, I assume), is it worth detecting duplicates during kraken/ganon database building and prepending the GCA/GCF accession to contig names when producing the .fna.gz files for the kraken/ganon build steps?
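To make the suggestion concrete, a rough sketch of the renaming step could look like the following. It assumes each genome file name starts with its assembly accession (for example GCA_000001.1_genomic.fna.gz, which is an assumption about the naming, not something checked against every GTDB release). The function name prefix_contigs_with_accession is hypothetical, and this is not how flextaxd implements it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of flextaxd): prepend the GCA/GCF
# accession taken from the file name to every FASTA header, so that
# identifiers such as contig_179238 become unique across genomes.
prefix_contigs_with_accession() {
    local in_file="$1" out_file="$2"
    local accession
    # Assumption: the file name begins with the accession, e.g. GCA_000001.1
    accession=$(basename "$in_file" | grep -oE '^GC[AF]_[0-9]+\.[0-9]+') || {
        echo "no GCA/GCF accession in file name: $in_file" >&2
        return 1
    }
    # Rewrite header lines only; sequence lines pass through unchanged.
    gzip -cd "$in_file" \
        | awk -v acc="$accession" '/^>/ { sub(/^>/, ">" acc "_") } { print }' \
        | gzip -c > "$out_file"
}
```

With this, a header like >contig_1 would become >GCA_000001.1_contig_1. Any mapping file that lists sequence identifiers (such as seqid2taxid.map) would need to use the same renamed identifiers, otherwise the build step would not find them.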

davve2 commented 4 years ago

Did you use the flextaxd-create --dbprogram ganon function?

I have a feature that adds the sequence length to the header, but that won't guarantee the identifiers become unique (although a clash is unlikely). I think your idea of putting GCF/GCA in the header is even better, so this will be implemented in the next update!

I'm just wondering, since I built a ganon database the other day with no problem, but perhaps my database was not big enough (I used a smaller one for speed)!

andrewjmc commented 4 years ago

I did use the --dbprogram ganon function (ganon-build running now). Haven't seen sequence lengths appended in the ganon file (I do see the lengths in the seqid2taxid.map file).

SilasK commented 4 years ago

@andrewjmc Would you share the code you used to create the ganon database with flextaxd?

andrewjmc commented 4 years ago

I followed the walk-through here to an extent:

https://github.com/FOI-Bioinformatics/flextaxd/wiki/Walkthrough---merge-NCBI-with-GTDB

I did it a bit differently as I described here: https://github.com/FOI-Bioinformatics/flextaxd/issues/17

With the additional information gained, I would make a few more changes and I think it would look like this (not tested and could contain silly errors):

mkdir -p seq/ncbi
cd seq/ncbi
#Get NCBI genomes (human, viruses and fungi)
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq viruses -H
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq fungi -H
ncbi-genome-download -p 20 -r 50 -l complete,chromosome -F fasta -s refseq -t 9606 vertebrate_mammalian -H
cd ../
mkdir gtdb
cd gtdb
#Get GTDB representative genomes
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/gtdb_genomes_reps.tar.gz
tar -xvf gtdb_genomes_reps.tar.gz
cd ../../

#Get GTDB taxonomies
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv

#Get taxdump files
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
unzip taxdmp.zip

#Create separate GTDB databases -- NB points to **all** genomes, not just representative ones
flextaxd -db databases/bac120_gtdb.db -tf bac120_taxonomy.tsv -tt QIIME --verbose --log NCBI_GTDB_merge_log
flextaxd -db databases/ar122_gtdb.db -tf ar122_taxonomy.tsv -tt QIIME --verbose --log NCBI_GTDB_merge_log

#Create NCBI database
flextaxd -db databases/NCBI_taxonomy.db -tf nodes.dmp -tt NCBI --genomeid2taxid nucl_gb.accession2taxid.gz --verbose --log NCBI_GTDB_merge_log --genomes_path seq/
#Remove those elements which don't have downloaded genomes (but preserve top level domains with -tt)
flextaxd -db databases/NCBI_taxonomy.db --clean_database --verbose --log NCBI_GTDB_merge_log -tt

#Create new NCBI database for merging
cp databases/NCBI_taxonomy.db databases/NCBI_GTDB_merge.db

#Merge in archaea and bacteria in turn (unsure if replace matters here because hierarchy will already be empty for archaea and bacteria)
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log
flextaxd -db databases/NCBI_GTDB_merge.db -md databases/bac120_gtdb.db --parent Bacteria --replace --verbose --logs NCBI_GTDB_merge_log

#Create taxonomies for ganon
flextaxd -db databases/NCBI_GTDB_merge.db -o taxonomy --dbprogram ganon --dump

#Create ganon databases (uses the merged database created above)
flextaxd-create -db databases/NCBI_GTDB_merge.db -o taxonomy --genomes_path seq -p 24 --verbose --log build_ganon_logs --create --db_name NCBI_GTDB_FT_merge_ganon --dbprogram ganon

Hope this helps!

davve2 commented 4 years ago

I've been working on a fix for this issue and will push it tomorrow (I still have a few tests I want to run).

FOI-Bioinformatics / flextaxd

Non-unique contig names for GTDB #22