Open dosumis opened 4 years ago
Analysis
List of all special characters currently in labels:
{'∩', '+', 'α', '[', '(', ']', 'β', 'γ', '{', "'", '\', '/', '.', ',', ')', '-', '}', 'ε', '_', '&'} # TODO - check synonyms
see https://gist.github.com/dosumis/7ecf8626bf88efb43134b59953341dc2
Example from UX testing (already talked about this but posting here):
Searching for 'MBON09' does not find the synonym 'MBON-09' of 'mushroom body output neuron 9'. Splitting on letter/number boundaries would solve this, if I understand correctly. But I assume always splitting on letter/number boundaries would cause problems for terms that don't need to be split, for example 'AV5 neuron'. Is it possible to apply more 'aggressive' tokenisation only when there are few results, or when part of the searched string matches a type?
I think it's possible to have multiple indexes following different tokenizations (so one index would have MBON 09 and another MBON09). Will have to check though.
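To illustrate the multi-index idea, here is a toy tokenizer sketch (plain Python, not Solr itself — the function and its option are hypothetical stand-ins for two analyzer configurations). One configuration leaves 'MBON09' as a single token, while the other also splits on letter/number boundaries so the query tokens line up with those of the synonym 'MBON-09':

```python
import re

def tokenize(text, split_letter_number=False):
    """Toy tokenizer: lowercase, split on non-alphanumeric characters,
    and optionally split on letter/number boundaries (mimicking a more
    'aggressive' analyzer in a second index)."""
    tokens = re.split(r"[^A-Za-z0-9]+", text.lower())
    if split_letter_number:
        split_tokens = []
        for t in tokens:
            # break runs of letters and runs of digits into separate tokens
            split_tokens.extend(re.findall(r"[a-z]+|[0-9]+", t))
        tokens = split_tokens
    return [t for t in tokens if t]

# Default analyzer: the query and the synonym produce different tokens.
print(tokenize("MBON-09"))  # ['mbon', '09']
print(tokenize("MBON09"))   # ['mbon09']

# Letter/number splitting: both now yield the same tokens, so the
# query 'MBON09' can match the synonym 'MBON-09'.
print(tokenize("MBON09", split_letter_number=True))  # ['mbon', '09']
```

It also shows the downside mentioned above: with aggressive splitting, 'AV5 neuron' becomes ['av', '5', 'neuron'], so a query for the bare token 'av' would match it too — hence the idea of consulting the aggressive index only as a fallback.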
Building the new VFB-solr for pipeline 2 gives us an opportunity to revisit tokenization (how Solr splits strings into separately searchable units).
As I understand it, autosuggest finds matches from the start of tokens, so deciding what to split on can have a major effect on what is discoverable.
Example decisions:
What special characters should we split on? Should we split on number/letter boundaries? Should we split on case boundaries?
Anything else?
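For reference, Solr's `WordDelimiterGraphFilterFactory` already supports the splits listed above as switchable options. A sketch of what the analyzer chain might look like in the schema (field type name and surrounding context are placeholders, and the exact flag values would need tuning and testing):

```xml
<!-- Hypothetical field type for the autosuggest index -->
<fieldType name="text_autosuggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1"    <!-- split on special characters: MBON-09 -> MBON, 09 -->
            generateNumberParts="1"
            splitOnNumerics="1"      <!-- split on letter/number boundaries: MBON09 -> MBON, 09 -->
            splitOnCaseChange="1"    <!-- split on case boundaries -->
            preserveOriginal="1"/>   <!-- also keep MBON09 as a single token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

`preserveOriginal="1"` keeps the unsplit token alongside the parts, which may partially address the 'AV5 neuron' concern without needing a second index, though it would still make terms discoverable from bare number tokens.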
CC @Clare72 @admclachlan @Robbie1977 - Please add examples of any names you can think of that might require some special tokenization