banhbio / Taxonomy.jl

Julia package to handle the NCBI Taxonomy database.
MIT License

Precompute name2taxids dictionary #43

Open AntonOresten opened 9 months ago

AntonOresten commented 9 months ago

Howdy!

I want to start by saying that I've found this package to be very convenient and useful! My only issue is the time complexity of the name2taxids function. It does a linear search through the db.names dictionary (of type Dict{Int, String}), accumulating all IDs that match the name, so every query scans the whole dictionary, which will be slow for larger datasets. I found that you can essentially invert the db.names dictionary to get a name => taxids dictionary (of type Dict{String, Vector{Int}}). It can take a couple of seconds to create, but that one-time cost is far outweighed by the minutes or even hours that might otherwise be spent doing a linear search for every query. I reckon a function for creating such a dictionary would be nice to have. It's rather trivial to do manually, but it requires accessing internals that are not user-facing.

This is what I've been doing:

name_to_taxids = Dict{String, Vector{Int}}()
for (taxid, name) in db.names
    push!(get!(name_to_taxids, name, Int[]), taxid)
end
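
To make the idea concrete, here is a minimal sketch of how the loop above could be wrapped in a reusable function. The names build_name_index and lookup are hypothetical and not part of Taxonomy.jl, and toy_names is a made-up stand-in for db.names so that the snippet runs on its own:

```julia
# Hypothetical helper (not part of Taxonomy.jl): invert a taxid => name
# dictionary into a name => taxids index.
function build_name_index(names::AbstractDict{Int,String})
    index = Dict{String,Vector{Int}}()
    for (taxid, name) in names
        # get! creates the empty vector only the first time a name is seen.
        push!(get!(() -> Int[], index, name), taxid)
    end
    # Dict iteration order is unspecified, so sort each vector for
    # reproducible results.
    foreach(sort!, values(index))
    return index
end

# Hypothetical lookup: returns an empty vector for unknown names.
lookup(index, name) = get(index, name, Int[])

# Toy stand-in for db.names; the taxids and names are illustrative only.
toy_names = Dict(1 => "root", 2 => "Bacteria", 3 => "Bacteria", 4 => "Archaea")

name_index = build_name_index(toy_names)

lookup(name_index, "Bacteria")   # [2, 3]
lookup(name_index, "Eukaryota")  # Int[]
```

Building the index is a single pass over all names, and each lookup afterwards is a hash-table access instead of a full scan. The trade-off is the extra memory for the second dictionary.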

Cheers!

banhbio commented 9 months ago

Thank you for the kind comment and valuable feedback! I had also been thinking that the name2taxids function could be improved, and I think your ideas are very good. I am very busy at the moment and have almost no time to devote to development, but I definitely want to improve on this point. Of course, opinions on a more detailed implementation, or PRs, are welcome!