Closed: simonhm closed this issue 1 year ago
Extra: A bounded search for "23th street" works fine and returns the content titled "My beautiful 23th street".
@rosiel can confirm, since we did have a quick testing session about this bounded search problem on Slack. This issue is maybe quite confusing, isn't it? Simon.
Another point that I think is key: while "My beautiful 23th street" will not retrieve a piece of content with that title, "My beautiful 23 th street" will retrieve it.
I'm not sure whether Advanced Search does any preprocessing of the query. We also didn't test whether this still happens with Advanced Search uninstalled. (I can do that, brb)
Sorry, I was wrong to blame Advanced Search. This exact problem happens also when using a solr view with the "Fulltext search" filter.
It seems to have more to do with how we index things in Solr, and how words are broken apart. Again:
search string: "my beautiful 23rd street" gives no results even when that's the actual title.
search string: "my beautiful 23 rd street" pulls up the result, even though its title does not have a space after the number.
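A rough sketch of why the extra space might matter, assuming the analyzer splits tokens at letter/digit boundaries (which is what the word delimiter filter does when splitOnNumerics is enabled). The helper names below are hypothetical, and this Python toy ignores preserveOriginal, catenation, stemming and token positions, so it only illustrates the splitting itself, not Solr's actual matching:

```python
import re


def split_on_numerics(token: str) -> list[str]:
    """Split a token into runs of ASCII digits and runs of letters."""
    return re.findall(r"[0-9]+|[^\W\d_]+", token)


def analyze(text: str, split_numerics: bool) -> list[str]:
    """Lowercase, split on whitespace, then optionally split letter/digit runs."""
    tokens = []
    for word in text.lower().split():
        if split_numerics:
            tokens.extend(split_on_numerics(word))
        else:
            tokens.append(word)
    return tokens


# With splitting enabled, "23rd" is broken into "23" and "rd",
# the same tokens you get from typing "23 rd" with a space.
print(analyze("My beautiful 23rd street", split_numerics=True))
# ['my', 'beautiful', '23', 'rd', 'street']

# With splitting disabled, "23rd" stays a single token.
print(analyze("My beautiful 23rd street", split_numerics=False))
# ['my', 'beautiful', '23rd', 'street']
```

If the indexed tokens look like the first list, a quoted search for "my beautiful 23 rd street" lines up with them, which would be consistent with what we saw.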
Oh, nice catch. Good one!
I tried out the solution Rosie sent. I added <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnNumerics="0" />
to the text_ws and text_general field types:
<!-- A text field that only splits on whitespace for exact matching of words -->
<fieldType name="text_ws" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnNumerics="0" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnNumerics="0" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
The issue still exists. Maybe I'm applying the fix to the wrong field types? Do you know which field type is used for full text, since the title is a full-text field? Thanks.
I made the following change to /var/solr/data/ISLANDORA/conf/schema_extra_types.xml
(I added the splitOnNumerics="0"):
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1"
            protected="protwords_en.txt" splitOnCaseChange="0" splitOnNumerics="0" generateWordParts="1"
            preserveOriginal="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
Then I restarted Solr and reindexed, and now the search works.
Does this look like a good fix?
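One way to double-check a change like this might be Solr's field analysis handler (/analysis/field), which shows the tokens a field type produces at index and query time. A minimal sketch that only builds the request URL, assuming Solr is reachable at http://localhost:8983 (an assumption, adjust for your setup) and using the ISLANDORA core and text_en field type mentioned above. The analysis_url helper is hypothetical:

```python
from urllib.parse import urlencode


def analysis_url(base_url: str, core: str, field_type: str,
                 indexed_text: str, query_text: str) -> str:
    """Build a URL for Solr's field analysis handler (hypothetical helper)."""
    params = {
        "analysis.fieldtype": field_type,     # field type to analyze with
        "analysis.fieldvalue": indexed_text,  # text run through the index analyzer
        "analysis.query": query_text,         # text run through the query analyzer
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/analysis/field?{urlencode(params)}"


# Host and port are assumptions; core and field type come from this thread.
url = analysis_url(
    "http://localhost:8983",
    "ISLANDORA",
    "text_en",
    "My beautiful 23rd street",
    "My beautiful 23rd street",
)
print(url)
```

Opening the printed URL in a browser (or with curl) should list the tokens after each filter, so you can see whether "23rd" is still being split. The Analysis screen in the Solr admin UI shows the same information.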
Hi @RodBruce, I tested it and it works. Thanks.
@rosiel @kylehuynh205 Should we close this issue? This issue isn't caused by this module. And thank you all for your quick response, investigation and finding out the solution. Simon.
I would move it but I don't have permission. Maybe it has to be closed and a new one created, though I'm not sure where!
There is a step where the solr config is set up in the islandora playbook and it's also somewhere in Isle-buildkit or isle-dc.
Perhaps islandora documentation would be the best place for this.
I tried to transfer the issue to https://github.com/Islandora/documentation/issues but I don't think it's possible, so I'm closing this issue. Please make a ticket for Solr and tag this ticket in https://github.com/Islandora/documentation/issues
35E or 23th is only my sample. For example, we have a piece of content titled "My beautiful 23th street". I don't know exactly what you want to call those "numbers", but you will get no results when you search for a title with quotes around it (a bounded search) if the title contains one of those "special" numbers, such as 123ABC or 2023whatever. So a bounded search for "My beautiful 23th street" returns no results, but it should. Bounded search still works fine with "normal" numbers. For example, searching for "My beautiful 23 street" retrieves the content if you have a piece of content titled "My beautiful 23 street". I will ask our catalogers to describe this problem again in their own words if needed. Hopefully you can reproduce the above problem. Simon.