clarin-eric / VLO

Virtual Language Observatory
GNU General Public License v3.0

Use string literals in suggestion query #193

Open twagoo opened 6 years ago

twagoo commented 6 years ago

Report by @vidiecan:

In the input field, write 
pdt-
you will get a list of hints but instead of `-` you will get ``

This is about the search bar and the autocomplete/suggester functionality. Certain punctuation marks and/or other characters are ignored or 'parsed out': not only `-` but also e.g. `.`, `"` and `!`. This seems to be a Lucene default and is fine for queries, but it is confusing in the suggester because the feedback is immediate.

twagoo commented 6 years ago

Discovered by @dietervu: spaces in the query trigger similar behaviour, and on top of that a character rendering issue. For example (on Firefox 63.0 on macOS):

[Screenshots: 2018-10-29 at 11.02.54, 11.03.01 and 11.03.05]

teckart commented 6 years ago

The default separator of the suggester (U+001E) is now explicitly set to a space character. This solves the aforementioned "polish parliamentary" problem.

teckart commented 6 years ago

Regarding the original request: Currently, Solr's Standard Tokenizer is used, which splits on all punctuation characters and removes them. We could replace it (for example) with the Whitespace Tokenizer, which preserves all of these characters. The drawback would be the generation of many "artificial" tokens with leading/trailing punctuation characters ("'Use this resource!'" -> "'use", "this", "resource!'"). I expect a rather chaotic suggester output, as those forms (and all their variations for different quotation and punctuation marks) would often occur together with their "clean" forms. We could also define our own pattern for splitting text into tokens (using the PatternTokenizer), but it is unrealistic to find a pattern that works in every case (for example, think of "TüBa-D/S", which is a valid resource name).
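To make the trade-off concrete, here is a minimal Python sketch of the two splitting strategies. It is only an approximation for illustration: the function names are hypothetical, the `\w+` regex is a rough stand-in for punctuation-based splitting (Solr's Standard Tokenizer follows Unicode word-break rules and will differ in detail), and lowercasing is assumed to happen in a separate filter.

```python
import re

def split_on_punctuation(text):
    # Rough stand-in (assumption) for a tokenizer that splits on punctuation
    # and drops it: keep only runs of word characters.
    return [token.lower() for token in re.findall(r"\w+", text)]

def split_on_whitespace(text):
    # Stand-in for whitespace-only tokenization: punctuation stays attached
    # to the neighbouring token.
    return [token.lower() for token in text.split()]

for sample in ["'Use this resource!'", "pdt-whatever", "TüBa-D/S"]:
    print(sample)
    print("  punctuation split:", split_on_punctuation(sample))
    print("  whitespace split: ", split_on_whitespace(sample))
```

In this sketch the punctuation-based split breaks "TüBa-D/S" into three fragments, while the whitespace split keeps it whole but also keeps the quotation marks and exclamation mark attached to "'use" and "resource!'".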

twagoo commented 6 years ago

Is there no way to tokenise on white space and then strip out leading/trailing punctuation?

teckart commented 6 years ago

We could add a PatternReplaceFilter to remove those. But if this is the solution to the issue, I apparently didn't understand the problem in the first place. So the set of characters that trigger tokenization should be reduced (for example by [-'!]) or limited to "whitespace only", and those characters should then be removed in leading and trailing positions? Could you provide some examples and their desired tokenization?

twagoo commented 6 years ago

Now I'm confused as well. Some examples of what I believe we might be aiming for:

polish parliamentary -> polish, parliamentary
Use this resource! -> use, this, resource
pdt-whatever -> pdt-whatever

However, the original complaint is indeed referring to a (related but?) different problem: the user types `pdt-` and gets suggestions for `pdt...` but none for `pdt-...`. Tokenisation on the hyphen is part of the (supposed) problem, but the query parser ignoring the hyphen probably is as well. I'm not sure if this can be solved without undesirable side effects.

teckart commented 6 years ago

This sounds as if a WhitespaceTokenizer combined with a filter that removes all leading/trailing non-alphanumeric characters might actually be the solution. I would just implement it in the issue193 branch and deploy it on my test machine - let's see what the results look like and if there are any problems when retrieving the actual records!

teckart commented 5 years ago

This analyzer approach (whitespace tokenization + removing leading and trailing non-word characters) seems to improve the quality of suggestions (tested with the problematic cases in this issue and many ad hoc examples in comparison with the production VLO).
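For readers who want to try the idea outside Solr, the approach can be sketched in a few lines of Python. This is an illustration under assumptions, not the configuration used in the issue193 branch: the `analyze` function and the `EDGE_PUNCTUATION` pattern are hypothetical names, the regex treats the underscore as punctuation (an assumption), and lowercasing is assumed to be part of the chain.

```python
import re

# Leading or trailing run of characters that are neither letters nor digits.
# [\W_] is "non-word characters plus underscore" (assumption: strip "_" too).
EDGE_PUNCTUATION = re.compile(r"^[\W_]+|[\W_]+$")

def analyze(text):
    tokens = []
    for raw in text.split():  # split on whitespace only
        # Strip punctuation at both ends, keep it inside the token.
        token = EDGE_PUNCTUATION.sub("", raw).lower()
        if token:  # drop tokens that consisted only of punctuation
            tokens.append(token)
    return tokens

for sample in ["polish parliamentary", "Use this resource!",
               "pdt-whatever", "TüBa-D/S"]:
    print(sample, "->", analyze(sample))
```

Punctuation inside a token survives, so "pdt-whatever" and "TüBa-D/S" stay intact. A bare trailing hyphen is still stripped in this sketch, so typing "pdt-" on its own reduces to "pdt".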