Decide on extensions to tokeniser settings

VirtualFlyBrain / vfb-solr

Solr config for the Virtual Fly Brain

Decide on extensions to tokeniser settings #3

Open dosumis opened 4 years ago

dosumis commented 4 years ago

Building the new VFB-solr for pipeline 2 gives us an opportunity to revisit tokenization (how Solr splits strings into separately searchable units).

As I understand it, autosuggest finds matches from the start of tokens, so deciding what to split on can have a major effect on what is discoverable.
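To make the point about matching from the start of tokens concrete, here is a minimal sketch in plain Python. It is not Solr's implementation; `tokenize` and `suggest` are hypothetical helpers that only mimic prefix matching on tokens, and the split patterns are assumptions chosen for illustration.

```python
import re

def tokenize(label, split_pattern=r"\s+"):
    """Split a label into lowercase tokens on the given regex (whitespace by default)."""
    return [t.lower() for t in re.split(split_pattern, label) if t]

def suggest(query, labels, split_pattern=r"\s+"):
    """Return labels that have at least one token starting with the query."""
    q = query.lower()
    return [lab for lab in labels
            if any(tok.startswith(q) for tok in tokenize(lab, split_pattern))]

labels = ["MBON-09", "mushroom body output neuron 9"]

# Whitespace-only splitting: "MBON-09" is a single token, so "09" is not a token prefix.
print(suggest("09", labels))              # []

# Also splitting on hyphens makes "09" a token of its own, so it becomes discoverable.
print(suggest("09", labels, r"[\s\-]+"))  # ['MBON-09']
```

The same label is found or missed depending only on the split pattern, which is why the choice of split characters matters for autosuggest.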

Example decisions:

What special characters should we split on?

Should we split on number/letter boundaries?

Should we split on case boundaries?

Anything else?

CC @Clare72 @admclachlan @Robbie1977 - Please add examples of any names you can think of that might require some special tokenization

dosumis commented 4 years ago

Analysis

List of all special characters currently in labels:

{'∩', '+', 'α', '[', '(', ']', 'β', 'γ', '{', "'", '\', '/', '.', ',', ')', '-', '}', 'ε', '_', '&'} # TODO - check synonyms

see https://gist.github.com/dosumis/7ecf8626bf88efb43134b59953341dc2
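For readers who want to repeat this kind of analysis on synonyms, the following is a rough sketch of how such a set could be collected. It is not the code from the gist; `special_characters` is a hypothetical helper, the sample labels are made up for illustration, and "special" is assumed here to mean anything other than ASCII letters, ASCII digits, and whitespace (which is why Greek letters count).

```python
def special_characters(labels):
    """Collect every character that is not an ASCII letter, ASCII digit, or whitespace."""
    found = set()
    for label in labels:
        for ch in label:
            is_ascii_alnum = ch.isascii() and ch.isalnum()
            if not is_ascii_alnum and not ch.isspace():
                found.add(ch)
    return found

# Illustrative sample only; the real input would be all labels (and synonyms).
sample_labels = ["MBON-09", "AV5 neuron", "α/β Kenyon cell", "DN1 (posterior)"]
print(sorted(special_characters(sample_labels)))
```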

admclachlan commented 4 years ago

Example from UX testing (already talked about this but posting here):

Searching for 'MBON09' does not find the synonym 'MBON-09' of 'mushroom body output neuron 9'. If I understand correctly, splitting on letter/number boundaries would solve this. But I assume that always splitting on letter/number boundaries would cause problems for terms that don't need to be split, for example 'AV5 neuron'. Is it possible to apply more 'aggressive' tokenisation when there are few results, or when part of the search string matches a type?
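The trade-off can be sketched in a few lines of Python. This is only an illustration of letter/number splitting, not Solr's behaviour; `split_letter_number` is a hypothetical function, and treating hyphens as separators is an assumption.

```python
import re

def split_letter_number(text):
    """Split on whitespace and hyphens, then at every letter/digit boundary."""
    tokens = []
    for part in re.split(r"[\s\-]+", text):
        # Runs of ASCII letters, runs of digits, and runs of anything else become tokens.
        tokens.extend(re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]+", part))
    return [t.lower() for t in tokens]

print(split_letter_number("MBON09"))      # ['mbon', '09']
print(split_letter_number("MBON-09"))     # ['mbon', '09']
print(split_letter_number("AV5 neuron"))  # ['av', '5', 'neuron']
```

The query and the synonym now tokenize identically, but 'AV5' is broken into 'av' and '5', which is the kind of unwanted split described above. In Solr, options such as `splitOnNumerics`, `splitOnCaseChange`, and `preserveOriginal` on the word delimiter filter control this sort of behaviour; whether they fit this schema would need checking.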

dosumis commented 4 years ago

I think it's possible to have multiple indexes following different tokenizations (so one index would have MBON 09 and another MBON09). Will have to check though.
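As a rough illustration of the idea (not a statement about how Solr would do it), the sketch below indexes the same labels under two tokenizers and only falls back to the more aggressive one when the conservative one returns too few hits. All function names and the `min_hits` threshold are hypothetical.

```python
import re

def conservative(text):
    """Split on whitespace only."""
    return [t.lower() for t in text.split()]

def aggressive(text):
    """Keep runs of ASCII letters and runs of digits as separate tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+|\d+", text)]

def build_index(labels, tokenizer):
    """Map each label to its token list under the given tokenizer."""
    return {label: tokenizer(label) for label in labels}

def search(query, index, tokenizer):
    """Return labels where every query token is a prefix of some label token."""
    q_tokens = tokenizer(query)
    return [label for label, toks in index.items()
            if q_tokens and all(any(t.startswith(q) for t in toks) for q in q_tokens)]

def search_with_fallback(query, labels, min_hits=1):
    """Try the conservative index first; use the aggressive one if hits are scarce."""
    strict = search(query, build_index(labels, conservative), conservative)
    if len(strict) >= min_hits:
        return strict
    return search(query, build_index(labels, aggressive), aggressive)

labels = ["MBON-09", "AV5 neuron"]
print(search_with_fallback("MBON09", labels))  # ['MBON-09'] via the aggressive index
print(search_with_fallback("AV5", labels))     # ['AV5 neuron'] via the conservative index
```

In this sketch 'AV5' is never split unless the conservative index fails to match, which is one way to get the 'aggressive only when needed' behaviour asked about earlier in the thread.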