Closed kaladay closed 1 year ago
https://github.com/TAMULib/SAGE/issues/481
The suggested approach was a TextField using KeywordTokenizerFactory.
Additionally, it was suggested to separate index-time and query-time analysis with two analyzers, such as:
<fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
This would simply change all fields of type whole_strings to match case insensitively by lowercasing at query time. For exact, case sensitive searches, it is common to expect the search term to be wrapped in double quotes, enforcing an exact match on the search field.
The question is: what undesired search behavior changes would result from making all of these fields case insensitive? A minimal-change approach may take "minimal" to mean minimal changes to the versioned schema rather than minimal changes to search behavior. Adding field types is (obviously) not a minimal change to the versioned schema, and the search behavior changes may still be minimal in terms of anticipated or expected search terms.
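One caveat with the whole_strings snippet as written: only the query analyzer lowercases, so a value indexed as "Hello World" is stored with its original case and a lowercased query term would not match it. A minimal sketch, assuming the intent is case insensitive matching regardless of how the value was indexed, applies the same filter in both analyzers. The attribute values are carried over from the suggestion and are not verified against the SAGE schema.

```xml
<!-- Sketch only: same type as suggested, but lowercasing at index AND query time. -->
<fieldType name="whole_strings" class="solr.TextField" omitNorms="true" sortMissingLast="true" multiValued="true">
  <analyzer type="index">
    <!-- Emit the entire field value as one token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Store the single token in lowercase. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase the query term so it compares against the lowercased index term. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because both analyzers are then identical, they could also be collapsed into a single analyzer element without a type attribute. Existing documents would need to be reindexed for the index-time filter to take effect.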
Not sure we need the additional field types. What behavior changes are there without the additional field types?
Description
A filter cannot be added to a solr.StrField. According to the Solr documentation, a filter can only be added to something tokenized, and a solr.StrField does not allow tokenization. This uses a solr.TextField instead.
Several fields need to have case insensitive searches. New types called string_ci and strings_ci are added that use the KeywordTokenizer. The KeywordTokenizer is essentially a pretend tokenizer. It treats the whole string as a single token, which is effectively the same as not having a tokenizer. The documentation even references the KeywordTokenizer as the method of disabling the tokenizer.
Fields that should be case insensitive are moved from string to string_ci and from strings to strings_ci respectively.
There are potential performance concerns with using solr.TextField rather than solr.StrField due to the loss of the docValues optimization feature.
This change requires a change to the Solr core data structure. I consider this a breaking change.
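For reference, the string_ci and strings_ci types described above could be sketched roughly as follows. This is an illustration under assumptions, not the definition from this PR: the sortMissingLast and omitNorms attributes are guesses, and the field name example_title is hypothetical.

```xml
<!-- Hypothetical sketch of the case insensitive types; attribute values are assumptions. -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <!-- An analyzer with no type attribute is used at both index and query time. -->
  <analyzer>
    <!-- KeywordTokenizer emits the whole value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The filter that solr.StrField cannot accept. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Multi-valued counterpart, mirroring the string/strings pairing. -->
<fieldType name="strings_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true" multiValued="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- A field moved from string to string_ci; "example_title" is a made-up name. -->
<field name="example_title" type="string_ci" indexed="true" stored="true"/>
```

Using one shared analyzer keeps the indexed term and the query term in the same lowercase form, which is what makes the match case insensitive. Changing a field's type this way requires reindexing, consistent with treating this as a breaking change.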
see: https://solr.apache.org/guide/7_7/field-types-included-with-solr.html#field-types-included-with-solr
see: https://solr.apache.org/guide/7_7/field-type-definitions-and-properties.html#field-type-definitions-and-properties
see: https://solr.apache.org/guide/7_7/field-properties-by-use-case.html#field-properties-by-use-case
see: https://solr.apache.org/guide/7_7/tokenizers.html#keyword-tokenizer
see: https://solr.apache.org/guide/7_7/docvalues.html
Fixes #481
Type of change
How Has This Been Tested?
Checklist: