Closed Conal-Tuohy closed 3 years ago
So the title field is a text_general field, which is a data type in our schema whose tokenizer uses the regular expression: [\s|\p{Punct}]+. The regex is interpreted by Java, whose definition of the \p{Punct} character class does not include the ’ character.
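The behaviour can be reproduced outside Solr with plain java.util.regex. This is only a minimal sketch: it assumes the tokenizer does nothing more than split on the pattern, and the class and method names (PunctSplitDemo, split) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PunctSplitDemo {
    // The pattern quoted from the schema. Without extra flags, Java's
    // \p{Punct} is the ASCII-only POSIX punctuation class.
    static final Pattern SCHEMA_PATTERN = Pattern.compile("[\\s|\\p{Punct}]+");

    // Hypothetical helper: split text on the schema pattern.
    static List<String> split(String text) {
        return Arrays.asList(SCHEMA_PATTERN.split(text));
    }

    public static void main(String[] args) {
        // U+2019 is not a delimiter here, so "L’homme" stays one token.
        System.out.println(split("Victor Hugo:L\u2019homme Qui Rit."));
        // The ASCII apostrophe U+0027 is in \p{Punct}, so this splits in two.
        System.out.println(split("L'homme"));
    }
}
```

If the index really is built this way, the title contributes the single token L’homme, while a query for l'homme is broken into l and homme, neither of which was indexed. That would be consistent with the search results listed below.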
The essay entitled "Victor Hugo:L’homme Qui Rit." is found at http://localhost:8080/text/acs0000513-01-i001/
The title includes the ’ character U+2019 RIGHT SINGLE QUOTATION MARK
Successful searches:
http://localhost:8080/search/?title=victor+hugo
http://localhost:8080/search/?title=qui+rit
http://localhost:8080/search/?title=L%E2%80%99Homme
Unsuccessful searches:
http://localhost:8080/search/?title=homme
http://localhost:8080/search/?title=l%27homme
Tweak the Solr tokenizer in use? https://lucene.apache.org/solr/guide/6_6/tokenizers.html#tokenizers
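One possible tweak, sketched here in plain Java rather than in the Solr schema, is to widen the character class with \p{P}, the Unicode general category for punctuation, which includes U+2019. This is an untested assumption about what the fix could look like, not a confirmed change; the class and method names (WiderPunctSplitDemo, split) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WiderPunctSplitDemo {
    // Candidate pattern (an assumption, not the current schema):
    // whitespace, ASCII punctuation (\p{Punct}), plus any Unicode
    // punctuation (\p{P}), which covers U+2019.
    static final Pattern CANDIDATE_PATTERN =
            Pattern.compile("[\\s\\p{Punct}\\p{P}]+");

    // Hypothetical helper: split text on the candidate pattern.
    static List<String> split(String text) {
        return Arrays.asList(CANDIDATE_PATTERN.split(text));
    }

    public static void main(String[] args) {
        // Both apostrophe forms now act as delimiters.
        System.out.println(split("L\u2019homme Qui Rit."));
        System.out.println(split("l'homme"));
    }
}
```

With this pattern the curly and straight apostrophes tokenize the same way, so searches for homme and l'homme would both have a chance of matching. Any change to the field's analysis would presumably also need the existing documents to be reindexed before it shows up in search results.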