TAMULib / SAGE

Search Aggregation Engine
MIT License

Issue 481: Use case insensitive filter and add case insensitive string type. #496

Closed kaladay closed 1 year ago

kaladay commented 1 year ago

Description

A filter cannot be added to a solr.StrField. According to the Solr documentation, a filter can only be added to a tokenized field, and a solr.StrField does not allow tokenization. This change uses a solr.TextField instead. Several fields need case-insensitive searches, so two new types that use the KeywordTokenizer are added: string_ci and strings_ci. The KeywordTokenizer is essentially a pretend tokenizer. It treats the whole string as a single token, which is effectively the same as not having a tokenizer. The documentation even references the KeywordTokenizer as the method of disabling tokenization.

Fields that should be case insensitive are moved from string to string_ci and from strings to strings_ci, respectively.
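For illustration, the new types described above might look roughly like the following. This is a minimal sketch, not the definition from this PR's diff: the attribute choices (omitNorms, sortMissingLast) and the field name `title` are assumptions, and only the type names string_ci and strings_ci come from the description.

```xml
<!-- Sketch only: attribute values are assumed, not taken from the PR diff. -->
<fieldType name="string_ci" class="solr.TextField" omitNorms="true" sortMissingLast="true">
  <analyzer>
    <!-- KeywordTokenizer emits the entire input as one token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- A single analyzer applies at both index and query time,
         so both sides are lowercased before comparison. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Multi-valued counterpart of string_ci. -->
<fieldType name="strings_ci" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field moved from string to string_ci. -->
<field name="title" type="string_ci" indexed="true" stored="true"/>
```

Because the stored tokens change, documents indexed under the old string types would need to be reindexed for existing values to match case insensitively.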

There are potential performance concerns with using solr.TextField rather than solr.StrField, because solr.TextField loses the docValues optimization.

This change requires a change to the Solr core data structure. I consider this a breaking change.

see: https://solr.apache.org/guide/7_7/field-types-included-with-solr.html#field-types-included-with-solr

see: https://solr.apache.org/guide/7_7/field-type-definitions-and-properties.html#field-type-definitions-and-properties

see: https://solr.apache.org/guide/7_7/field-properties-by-use-case.html#field-properties-by-use-case

see: https://solr.apache.org/guide/7_7/tokenizers.html#keyword-tokenizer

see: https://solr.apache.org/guide/7_7/docvalues.html

Fixes #481


coveralls commented 1 year ago

Coverage Status

Coverage: 45.24% (+0.03%) from 45.215% when pulling 753b83a3880f538eb0b6eb32f91f1b9a5c937ff7 on 481-case_sensitive into 850bc32a56746c6318594867a67475340d5b7b62 on staging.

ghost commented 1 year ago

https://github.com/TAMULib/SAGE/issues/481

The suggested approach was a TextField using KeywordTokenizerFactory.

Additionally, it was suggested to separate index-time and query-time analysis with two analyzers.

Such as

    <fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

This would simply change all fields of type whole_strings to allow case-insensitive matching by lowercasing at query time. For exact, case-sensitive searches, it is common to expect the search term to be wrapped in double quotes, enforcing an exact match on the search field.

The question is: what undesired changes in search behavior would result from allowing case-insensitive search on all fields? A minimal-change approach could measure "minimal" by changes to the versioned schema rather than by changes to search behavior. Adding field types is (obviously) not a minimal change to the versioned schema, and the search behavior changes may still be minimal in terms of anticipated or expected search terms.

ghost commented 1 year ago

Not sure we need the additional field types. What behavior changes are there without the additional field types?