Conal-Tuohy / swinburne

Algernon Charles Swinburne website

search by title failing when contains ’ #5

Closed Conal-Tuohy closed 3 years ago

Conal-Tuohy commented 3 years ago

The essay entitled "Victor Hugo:L’homme Qui Rit." is found at http://localhost:8080/text/acs0000513-01-i001/

The title includes the ’ character U+2019 RIGHT SINGLE QUOTATION MARK

Successful searches:
http://localhost:8080/search/?title=victor+hugo
http://localhost:8080/search/?title=qui+rit
http://localhost:8080/search/?title=L%E2%80%99Homme

Unsuccessful searches:
http://localhost:8080/search/?title=homme
http://localhost:8080/search/?title=l%27homme

Tweak the Solr tokenizer in use? https://lucene.apache.org/solr/guide/6_6/tokenizers.html#tokenizers

Conal-Tuohy commented 3 years ago

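So the title field is a text_general field, a data type in our schema whose tokenizer splits on the regular expression [\s|\p{Punct}]+. The regex is interpreted by Java, whose \p{Punct} character class covers only ASCII punctuation by default and so does not include the ’ character (U+2019). The apostrophe is therefore not treated as a delimiter, and L’homme stays a single token, which matches neither homme nor l'homme.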
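
The behaviour can be reproduced outside Solr with a minimal Java sketch. It only applies the regex quoted above as a split pattern and lowercases the pieces; the class name, the lowercasing step, and the second pattern (which adds the Unicode general category \p{P}) are assumptions made for illustration, not the project's actual schema or a confirmed fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class TitleTokenizerDemo {
    // The pattern quoted in the issue. Inside [...] the '|' is a literal
    // character, not alternation (and it is already part of \p{Punct}).
    static final String ISSUE_PATTERN = "[\\s|\\p{Punct}]+";

    // Hypothetical alternative: also split on any character in the Unicode
    // general category P (punctuation), which includes U+2019.
    static final String UNICODE_PATTERN = "[\\s\\p{Punct}\\p{P}]+";

    // Split the title on the delimiter regex, drop empty pieces, and
    // lowercase each piece (lowercasing is an assumption about the field's
    // analysis chain, not something stated in the issue).
    static List<String> tokens(String title, String regex) {
        List<String> out = new ArrayList<>();
        for (String piece : Pattern.compile(regex).split(title)) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u2019 is RIGHT SINGLE QUOTATION MARK, as in the essay title.
        String title = "Victor Hugo:L\u2019homme Qui Rit.";
        System.out.println(tokens(title, ISSUE_PATTERN));
        System.out.println(tokens(title, UNICODE_PATTERN));
    }
}
```

With the issue's pattern the title yields the tokens victor, hugo, l’homme, qui, rit, which is consistent with the searches listed above. With the broader pattern, l and homme become separate tokens, so a search for homme would have something to match.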