fedarko closed this 5 years ago
So a problem with searching for exact matches of ranks is that -- at least in the Byrd dataset -- there are some taxa where it's useful to search over a substring of their names. The example I've run into is Staph phages, which share the substring `Staphylococcus_phage` but have various suffixes after it (e.g. `Staphylococcus_phage_Sb_1`) that prevent an exact match. I guess we could support inexact searching through something like edit distance -- there seem to be a few good options for JS libraries that might be useful here.
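To make the difference concrete, here's a minimal sketch comparing exact and substring matching. The `taxa` array and the function names are hypothetical (only `Staphylococcus_phage_Sb_1` comes from the data mentioned above), and this isn't tied to how the search is actually implemented.

```javascript
// Illustrative names only: the second and third entries are made up.
const taxa = [
  "Staphylococcus_phage_Sb_1",
  "Staphylococcus_phage_example_2",
  "Staphylococcus_aureus",
];

// Exact search: a name matches only if it equals the query.
function exactMatch(names, query) {
  return names.filter((name) => name === query);
}

// Substring search: a name matches if the query occurs anywhere in it.
function substringMatch(names, query) {
  return names.filter((name) => name.includes(query));
}

console.log(exactMatch(taxa, "Staphylococcus_phage"));
console.log(substringMatch(taxa, "Staphylococcus_phage"));
```

With these names the exact search finds nothing for `Staphylococcus_phage`, while the substring search returns both phage entries.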
We should be careful about time complexity, though: if we allow more complex searches (which will take longer than exact-match searches), then it'd be best to keep exact-match searching available as a lightweight alternative for users with large data and/or less powerful environments.
Currently thinking I'm going to set up rank-based search (search by exact rank) and text-based search (search by exact text), and maybe add in "fuzzy" searching as a potential third option.
Addendum to the above example: some phages have names that don't even follow that pattern (e.g. a lot of phages called "Streptococcusphage..." but also some phages called "Streptococcus_pyogenesphage..."), which makes fuzzy searching all the more important.
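As a rough sketch of what edit-distance-based "fuzzy" searching could look like without a library: the function below computes the smallest edit distance between the query and any substring of a name (the usual Levenshtein table, but with the first row set to zero so a match can start anywhere). The names `minSubstringDistance` and `fuzzySearch` and the `maxDistance` threshold are assumptions for illustration, not an existing API.

```javascript
// Smallest Levenshtein distance between `query` and any substring of `text`.
function minSubstringDistance(query, text) {
  // Row for the empty query: starting a match at any position costs nothing.
  let prev = new Array(text.length + 1).fill(0);
  for (let i = 1; i <= query.length; i++) {
    // Matching i query characters against nothing costs i deletions.
    const curr = [i];
    for (let j = 1; j <= text.length; j++) {
      const cost = query[i - 1] === text[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j - 1] + cost, // match or substitution
        prev[j] + 1,        // query character with no counterpart in text
        curr[j - 1] + 1     // extra text character inside the match
      );
    }
    prev = curr;
  }
  // The match may end anywhere in the text, so take the best final cell.
  return Math.min(...prev);
}

// Keep names that contain something within `maxDistance` edits of the query.
function fuzzySearch(names, query, maxDistance) {
  return names.filter((name) => minSubstringDistance(query, name) <= maxDistance);
}

// The query below has a typo (a missing "c") but still matches with one edit.
console.log(fuzzySearch(["Staphylococcus_phage_Sb_1"], "Staphylococus_phage", 1));
```

Each comparison costs time proportional to the query length times the name length, so this is noticeably heavier than an exact or substring check, which is the reason to keep the exact modes around as the cheap option.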
Alternatively, just allow users to specify the ranks to search for (delimited by spaces, commas, semicolons, or whatever), and search for those ranks without making any guarantees as to their actual level. That's probably the more convenient solution due to discrepancies in classifications.
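A minimal sketch of that idea, assuming each feature carries its taxonomy as an array of rank strings. The `features` data, the `id`/`ranks` fields, and the function names are all hypothetical.

```javascript
// Split user input on any run of whitespace, commas, or semicolons.
function parseQuery(input) {
  return input.split(/[\s,;]+/).filter((term) => term.length > 0);
}

// Return the ids of features where any rank, at any level, equals a search term.
function searchAnyLevel(features, input) {
  const terms = new Set(parseQuery(input));
  return features
    .filter((feature) => feature.ranks.some((rank) => terms.has(rank)))
    .map((feature) => feature.id);
}

// Made-up features for illustration.
const features = [
  { id: "f1", ranks: ["Viruses", "Caudovirales", "Staphylococcus_phage_Sb_1"] },
  { id: "f2", ranks: ["Bacteria", "Firmicutes", "Staphylococcus"] },
  { id: "f3", ranks: ["Bacteria", "Proteobacteria", "Escherichia"] },
];

console.log(searchAnyLevel(features, "Staphylococcus; Viruses"));
```

Because the terms go into a `Set`, each rank lookup is a constant-time exact match, so this stays cheap even with many features.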