bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0

Error: Accession exceeds supported length #339

Open nick-youngblut opened 4 years ago

nick-youngblut commented 4 years ago

I'd like to use diamond makedb on a custom taxonomy created from the GTDB (via gtdb_to_taxdump). With diamond v0.9.30.131, I'm getting the error: Error: Accession exceeds supported length. The GTDB accessions include a prefix (e.g., GB_GCA_002778965.1), which is likely causing the issue.

For now, I'll just strip off the prefixes in the fasta, acc2taxid, and names.dmp files. It would be great if future versions of diamond allowed for such prefixes in the accessions.

bbuchfink commented 4 years ago

Hi Nick, I'm aware of this issue and I'll try to support a dynamic length in future versions. For now, you can also change the max length by editing src/data/taxonomy.h:45 and setting enum { max_accesion_len = 14 }; to a higher value.
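Before rebuilding diamond with a larger limit, it can help to see which accessions are actually too long. The following is a minimal sketch, not part of diamond: the helper name too_long is hypothetical, and it assumes the limit of 14 quoted above is inclusive (an accession of exactly 14 characters is accepted).

```python
# Limit quoted in this thread for diamond v0.9.30.131 (max_accesion_len = 14).
# Assumption: an accession of exactly this length is still accepted.
MAX_ACCESSION_LEN = 14


def too_long(accessions, limit=MAX_ACCESSION_LEN):
    """Return the accessions whose length exceeds the given limit."""
    return [acc for acc in accessions if len(acc) > limit]


if __name__ == "__main__":
    # The prefixed GTDB accession is 18 characters long and gets reported.
    print(too_long(["GB_GCA_002778965.1", "GCA002778965.1"]))
```

Running the helper over the accession column of the mapping file before calling diamond makedb shows how many entries would need shortening.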

nick-youngblut commented 4 years ago

Thanks for letting me know how to get around the issue! I've already converted all of the accessions (e.g., GB_GCA_002778965.1 to GCA002778965.1), which works for diamond makedb v0.9.30.131.
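The conversion described above can be sketched roughly as follows. This is an illustration, not the code actually used: the function name shorten_accession is hypothetical, and the pattern assumes GTDB accessions carry a GB_ or RS_ prefix followed by GCA_ or GCF_ and a versioned number. Only the GB_GCA_ case appears in this thread.

```python
import re

# Assumed GTDB accession layout: <GB|RS>_<GCA|GCF>_<digits>.<version>
_GTDB_ACCESSION = re.compile(r"^(?:GB|RS)_(GC[AF])_(\d+\.\d+)$")


def shorten_accession(acc):
    """Drop the GTDB prefix and the inner underscore; leave other strings unchanged."""
    match = _GTDB_ACCESSION.match(acc)
    if match is None:
        return acc
    return match.group(1) + match.group(2)


if __name__ == "__main__":
    print(shorten_accession("GB_GCA_002778965.1"))
```

The same mapping has to be applied consistently to the fasta headers and the accession-to-taxid file so that the shortened accessions still match each other.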

nick-youngblut commented 4 years ago

diamond makedb is stating that my taxonomy includes a lot of "no rank" nodes:

[...]
Accession mappings = 24706
Loading taxonomy nodes...  [0.608s]
Loading taxonomy names...  [0.517s]
Loaded taxonomy names for 182187 taxon ids.
Writing taxon id lists...  [17.373s]
82930832 sequences mapped to taxonomy, 82930832 total mappings.
Building taxonomy nodes...  [0.001s]
180131 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank           140636
superkingdom      2
kingdom           0
subkingdom        0
superphylum       0
phylum            151
subphylum         0
superclass        0
class             152
subclass          0
infraclass        0
cohort            0
subcohort         0
superorder        0
order             158
suborder          0
infraorder        0
parvorder         0
superfamily       0
family            159
subfamily         0
tribe             0
subtribe          0
genus             169
subgenus          0
section           0
subsection        0
series            0
species group     0
species subgroup  0
species           170
subspecies        38534
varietas          0
forma             0

However, most of the nodes in my nodes.dmp file are "subspecies" (n=145904). Any ideas on how to troubleshoot this? Does diamond expect anything special in the names.dmp file? My custom names.dmp file is rather minimal.

bbuchfink commented 4 years ago

If you make the files available to me, I can look into it.

nick-youngblut commented 4 years ago

I've copied the files to /tmp/global2/nyoungblut/gtdb_diamond_db/. I believe that you have read access. The nodes/names dump files were created with my simple code in gtdb_to_taxdump. The screenlog.0 file shows the log for my diamond makedb job. Thanks!!

bbuchfink commented 4 years ago

This happens because diamond implicitly assumes that nodes.dmp is sorted by taxid. I will remove this restriction in a future release, but for now you can simply sort your file.
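One way to sort the file is sketched below. It is a hedged example rather than the fix applied in gtdb_to_taxdump: the function names are hypothetical, and it assumes the usual taxdump layout in which the taxid is the first "|"-delimited field of every line. The sort is numeric, so taxid 10 comes after taxid 2.

```python
def sort_dmp_lines(lines):
    """Return taxdump lines ordered by the integer taxid in their first field."""
    return sorted(lines, key=lambda line: int(line.split("|", 1)[0].strip()))


def sort_dmp_file(src, dst):
    """Read a .dmp file, sort it by taxid, and write the result to dst."""
    with open(src) as fin:
        lines = [line for line in fin if line.strip()]  # skip blank lines
    with open(dst, "w") as fout:
        fout.writelines(sort_dmp_lines(lines))


if __name__ == "__main__":
    example = [
        "10\t|\t2\t|\tspecies\t|\n",
        "2\t|\t1\t|\tsuperkingdom\t|\n",
        "1\t|\t1\t|\tno rank\t|\n",
    ]
    for line in sort_dmp_lines(example):
        print(line, end="")
```

Because Python's sort is stable, lines that share a taxid keep their original relative order, so the same helper can be applied to names.dmp.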

nick-youngblut commented 4 years ago

That worked! Thanks for looking into the problem! I've changed my gtdb_to_taxdump code to order nodes.dmp (and names.dmp) by taxID, so no worries about changing diamond.