Open nick-youngblut opened 4 years ago
Hi Nick, I'm aware of this issue and I'll try to support a dynamic length in future versions. For now, you can also change the max length by editing src/data/taxonomy.h:45 and setting enum { max_accesion_len = 14 }; to a higher value.
Thanks for letting me know how to get around the issue! I've already converted all of the accessions (e.g., GB_GCA_002778965.1 to GCA002778965.1), which works for diamond makedb v0.9.30.131.
diamond makedb is stating that my taxonomy includes a lot of "no rank" nodes:
[...]
Accession mappings = 24706
Loading taxonomy nodes... [0.608s]
Loading taxonomy names... [0.517s]
Loaded taxonomy names for 182187 taxon ids.
Writing taxon id lists... [17.373s]
82930832 sequences mapped to taxonomy, 82930832 total mappings.
Building taxonomy nodes... [0.001s]
180131 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank 140636
superkingdom 2
kingdom 0
subkingdom 0
superphylum 0
phylum 151
subphylum 0
superclass 0
class 152
subclass 0
infraclass 0
cohort 0
subcohort 0
superorder 0
order 158
suborder 0
infraorder 0
parvorder 0
superfamily 0
family 159
subfamily 0
tribe 0
subtribe 0
genus 169
subgenus 0
section 0
subsection 0
series 0
species group 0
species subgroup 0
species 170
subspecies 38534
varietas 0
forma 0
However, most of the nodes in my nodes.dmp files are "subspecies" (n=145904). Any ideas on how to troubleshoot this? Does diamond expect anything special in the names.dmp file? My custom names.dmp file is rather minimal.
If you make the files available to me, I can look into it.
I've copied the files to /tmp/global2/nyoungblut/gtdb_diamond_db/. I believe that you have read access. The nodes/names dump files were created with my simple code in gtdb_to_taxdump. The screenlog.0 file shows the log for my diamond makedb job. Thanks!!
This happens because the nodes.dmp is implicitly assumed to be sorted on the taxid. I will remove this restriction in a future release, but for now you can simply sort your file.
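For anyone hitting the same thing, here is a minimal Python sketch of that sort. It assumes the standard NCBI taxdump layout, where the taxid is the first "|"-delimited field on every line, and it assumes that numeric order is what is wanted. The function names (taxid_of, sort_dmp_lines, sort_dmp_file) are made up for this example and are not part of diamond or gtdb_to_taxdump.

```python
def taxid_of(line: str) -> int:
    """Return the taxid, i.e. the first '|'-delimited field of a .dmp line."""
    return int(line.split("|", 1)[0].strip())


def sort_dmp_lines(lines):
    """Return the non-empty lines sorted numerically by taxid.

    sorted() is stable, so multiple names.dmp rows that share a taxid
    keep their original relative order.
    """
    rows = [ln.rstrip("\n") + "\n" for ln in lines if ln.strip()]
    return sorted(rows, key=taxid_of)


def sort_dmp_file(src: str, dst: str) -> None:
    """Read a nodes.dmp or names.dmp file and write a taxid-sorted copy."""
    with open(src) as fin:
        rows = sort_dmp_lines(fin)
    with open(dst, "w") as fout:
        fout.writelines(rows)
```

Calling sort_dmp_file("nodes.dmp", "nodes.sorted.dmp") (the output name is only a placeholder) writes a sorted copy and leaves the original file untouched. The key is an integer on purpose: a plain text sort would put taxid 10 before taxid 2.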
That worked! Thanks for looking into the problem! I've changed my gtdb_to_taxdump code to order the nodes.dmp (and names.dmp) by taxID, so no worries about changing diamond.
I'd like to use diamond makedb on a custom taxonomy created from the GTDB (via gtdb_to_taxdump). With diamond v0.9.30.131, I'm getting the error: Error: Accession exceeds supported length. The GTDB accessions include a prefix (e.g., GB_GCA_002778965.1), which is likely causing the issue. For now, I'll just strip off the prefixes in the fasta, acc2taxid, and names.dmp files. It would be great if future versions of diamond allowed for such prefixes in the accessions.
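A rough Python sketch of the prefix stripping, in case it is useful to others. It assumes GTDB accessions look like GB_GCA_002778965.1 or RS_GCF_..., and it also drops the underscore after GCA/GCF so the result is 14 characters long, matching the conversion to GCA002778965.1 and the max_accesion_len = 14 limit mentioned in this thread. The function name shorten_accession and the regular expression are my own and are not part of diamond or gtdb_to_taxdump.

```python
import re

# Assumed GTDB accession shape: "GB_" or "RS_" prefix, then GCA_/GCF_,
# then digits, a dot, and a version number.
_GTDB_ACCESSION = re.compile(r"^(?:GB|RS)_(GC[AF])_(\d+\.\d+)$")


def shorten_accession(accession: str) -> str:
    """Strip the GTDB prefix and inner underscore from an accession.

    Accessions that do not match the assumed pattern are returned unchanged.
    """
    match = _GTDB_ACCESSION.match(accession)
    if match is None:
        return accession
    return match.group(1) + match.group(2)


if __name__ == "__main__":
    print(shorten_accession("GB_GCA_002778965.1"))
```

The same function would need to be applied consistently to the fasta headers, the acc2taxid file, and names.dmp so that the accessions still match across all three.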