twagoo opened this issue 6 years ago
Discovered by @dietervu: spaces in the query trigger similar behaviour, and on top of that a character rendering issue. For example (on Firefox 63.0 on MacOS):

"polish" -> suggests "polish", ...
"polish par" -> suggests "polishparliamentary", ... (multi-word suggestions appear with the terms glued together, because the separator character between them does not render as a space)
The default separator of the suggester (U+001E) is now explicitly set to a space character. This solves the aforementioned "polish parliamentary" problem.
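For reference, U+001E is the default separator of Lucene's FreeTextSuggester, so the change presumably boils down to something like this in solrconfig.xml (a sketch; the suggester name, field and n-gram setting are placeholders rather than the actual VLO configuration):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <!-- FreeTextSuggester joins the terms of multi-word suggestions with a
         separator byte that defaults to U+001E (record separator) -->
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest_text</str>
    <str name="suggestFreeTextAnalyzerFieldType">text_suggest</str>
    <str name="ngrams">2</str>
    <!-- explicitly override the default separator with a plain space -->
    <str name="separator"> </str>
  </lst>
</searchComponent>
```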
Regarding the original request: currently, Solr's Standard Tokenizer is used, which splits on all punctuation characters and removes them. We could replace it (for example) with the Whitespace Tokenizer, which preserves all of these characters. The drawback would be the generation of many "artificial" tokens with leading/trailing punctuation characters ("'Use this resource!'" -> "'use", "this", "resource!'"). I expect rather chaotic suggester output, as those forms (and all their variations for different quotation and punctuation marks) would often occur together with their "clean" forms. We could also define our own pattern for splitting text into tokens (using the PatternTokenizer), but it is unrealistic to find a pattern that works in every case (think, for example, of "TüBa-D/S", which is a valid resource name).
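For illustration, such a whitespace-based field type would look roughly like this in the schema (just a sketch; the field type name is made up):

```xml
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, keeping punctuation attached to the tokens:
         "'Use this resource!'" -> "'use" | "this" | "resource!'" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```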
Is there no way to tokenise on white space and then strip out leading/trailing punctuation?
We could add a PatternReplaceFilter to remove those. But if this is the solution to the issue, I apparently didn't understand the problem in the first place. So the set of characters that triggers a token split should be reduced (for example by [-'!]) or limited to "whitespace only", and those characters should then be removed in leading and trailing positions? Could you provide some examples and their desired tokenization?
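For reference, such a filter could be added to the analyzer chain roughly as follows (only a sketch; the character class simply mirrors the [-'!] example above):

```xml
<!-- strip leading and trailing punctuation from each token,
     e.g. "'use" -> "use" and "resource!'" -> "resource" -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^[-'!]+|[-'!]+$"
        replacement=""
        replace="all"/>
```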
Now I'm confused as well. Some examples of what I believe we might be aiming for:
"polish parliamentary" -> "polish", "parliamentary"
"Use this resource!" -> "use", "this", "resource"
"pdt-whatever" -> "pdt-whatever"
However, the original complaint is indeed referring to a (related but?) different problem: the user types "pdt-" and gets suggestions for "pdt..." but none for "pdt-...". Tokenisation on the hyphen is part of the (supposed) problem, but the query parser, which ignores the hyphen, probably contributes as well. I'm not sure whether this can be solved without undesirable side effects.
This sounds as if a WhitespaceTokenizer plus a filter that removes all leading/trailing non-alphanumeric characters might actually be the solution. I'll just implement it in the issue193 branch and deploy it on my test machine - let's see what the results look like and whether there are any problems when retrieving the actual records!
This analyzer approach (whitespace tokenization + removal of leading and trailing non-word characters) seems to improve the quality of the suggestions (tested with the problematic cases from this issue and many ad-hoc examples, in comparison with the production VLO).
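For the record, the resulting analyzer amounts to something like the following (a sketch: the names are placeholders, the exact regex may differ from what is on the issue193 branch, and the lowercase filter is an assumption):

```xml
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenize on whitespace only, so e.g. "pdt-whatever" stays one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- remove leading and trailing non-word (non-letter, non-digit) characters:
         "'use" -> "use", "resource!'" -> "resource"; "TüBa-D/S" is kept intact -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$"
            replacement=""
            replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```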
Report by @vidiecan:
This is about the search bar and the autocomplete/suggester functionality. Certain punctuation marks and/or other characters are ignored or 'parsed out': not only '-' but also e.g. '"' and '!'. This seems to be a Lucene default and is fine for queries, but not so great for the suggester, where the immediate feedback makes it confusing.