Islandora / advanced_search

This module creates several blocks to support searching. It also enables the use of Ajax with search blocks, facets, and search results.
https://www.drupal.org/project/advanced_search
GNU General Public License v2.0

Search issue: quoted search does not retrieve titles that have 35E or 23th in the middle of the title #29

Closed simonhm closed 1 year ago

simonhm commented 1 year ago

35E and 23th are only samples. For example, we have a piece of content with the title "My beautiful 23th street". I don't know exactly what to call those "numbers", but you get no results when you search for a title with quotes around it (a bounded search) if the title contains one of those "special" numbers, such as 123ABC or 2023whatever. So a bounded search for "My beautiful 23th street" returns no results, but it should.

Bounded search still works fine with "normal" numbers. For example, searching "My beautiful 23 street" retrieves a piece of content with the title "My beautiful 23 street".

I will ask our catalogers to describe this problem again in their own words if needed. Hopefully you can reproduce the problem above. Simon.

simonhm commented 1 year ago

Extra: a bounded search for "23th street" works fine and returns the content with the title "My beautiful 23th street".

simonhm commented 1 year ago

@rosiel can confirm, since we did have a quick testing session about this bounded search problem on Slack. This issue is maybe quite confusing, isn't it? Simon.

rosiel commented 1 year ago

Another point that I think is key, is that while "My beautiful 23th street" will not retrieve a piece of content with that title, "My beautiful 23 th street" will retrieve that title.

I'm not sure if there's any preprocessing done to the query by advanced search. We also didn't test whether this still happens if we uninstall advanced search. (I can do that, brb)

rosiel commented 1 year ago

Sorry, I was wrong to blame Advanced Search. This exact problem happens also when using a solr view with the "Fulltext search" filter.

It seems to have more to do with how we index things in solr, and how words are broken apart. Again,

search string: "my beautiful 23rd street" gives no results even when that's the actual title.

search string: "my beautiful 23 rd street" pulls up the result that does not have a space after the numbers.
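To make "how words are broken apart" concrete, here is a minimal Python sketch of what splitting on letter/digit transitions does to a title. It is only a rough stand-in for Solr's word-delimiter filtering (the `splitOnNumerics` and `preserveOriginal` options): the function names are hypothetical, it models only the letter/digit split, and it ignores token positions, the other delimiter rules, and every other filter in the chain, so it does not by itself reproduce the phrase-matching failure.

```python
import re
from typing import List


def split_token(token: str, split_on_numerics: bool,
                preserve_original: bool = True) -> List[str]:
    """Rough illustration of splitting one whitespace token at letter/digit
    transitions. This is NOT Solr's implementation, only a simplification."""
    if not split_on_numerics:
        # With the split disabled, the token is kept as a single unit.
        return [token]
    # Runs of digits, or runs of letters (any script, no underscore).
    parts = re.findall(r"[0-9]+|[^\W\d_]+", token)
    if len(parts) <= 1:
        return [token]
    # Optionally keep the unsplit token next to its parts.
    return ([token] if preserve_original else []) + parts


def analyze(text: str, split_on_numerics: bool) -> List[List[str]]:
    """Lowercase, split on whitespace, then apply split_token to each token."""
    return [split_token(t, split_on_numerics) for t in text.lower().split()]


if __name__ == "__main__":
    title = "My beautiful 23th street"
    print(analyze(title, split_on_numerics=True))
    print(analyze(title, split_on_numerics=False))
```

With the split enabled, the third token becomes several tokens ("23th", "23", "th") where a plain whitespace split would yield one, which is the kind of difference the quoted-phrase search seems to be sensitive to.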

rosiel commented 1 year ago

aha! https://stackoverflow.com/questions/28994205/solr-query-with-words-containing-letters-and-numbers

simonhm commented 1 year ago

Oh, nice catch. Good one!

kylehuynh205 commented 1 year ago

I tried out the solution Rosie sent. I added <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnNumerics="0" /> to the text_ws and text_general field types:

    <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnNumerics="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnNumerics="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

The issue still exists. Maybe I'm applying the fix to the wrong field types? Do you know which field type is used for full text, since the title is a full-text field? Thanks.
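One way to find out which field type a given field uses might be to ask Solr's Schema API. The sketch below only builds the request URL and reads the type out of a response; the base URL, the core name `ISLANDORA`, and the field name `example_title_field` are placeholders (assumptions, not values taken from this thread), and the response shape is assumed to be a JSON object with a `field` entry containing a `type`.

```python
from typing import Optional
from urllib.parse import quote


def field_schema_url(base_url: str, core: str, field_name: str) -> str:
    """Build a Schema API URL for a single field.
    base_url, core and field_name are placeholders to adapt to your install."""
    return f"{base_url}/solr/{core}/schema/fields/{quote(field_name, safe='')}"


def field_type_from_response(payload: dict) -> Optional[str]:
    """Return the field's type from a parsed Schema API response, or None
    if the response does not have the assumed {"field": {"type": ...}} shape."""
    return payload.get("field", {}).get("type")


if __name__ == "__main__":
    # Hypothetical values; replace with your own host, core and field name.
    print(field_schema_url("http://localhost:8983", "ISLANDORA", "example_title_field"))
    # Hand-written sample response, used only to show the parsing step.
    sample = {"field": {"name": "example_title_field", "type": "text_en"}}
    print(field_type_from_response(sample))
```

Fetching the printed URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `field_type_from_response` should name the field type whose analyzer needs changing. Fields that exist only through dynamic-field rules may not be listed individually, in which case the dynamic field definitions would be the place to look.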

RodBruce commented 1 year ago

I made the following change to /var/solr/data/ISLANDORA/conf/schema_extra_types.xml (I added the splitOnNumerics="0"):

 <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <charFilter class="solr.MappingCharFilterFactory" mapping="accents_en.txt"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
     <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1"
             protected="protwords_en.txt" splitOnCaseChange="0" splitOnNumerics="0" generateWordParts="1"
             preserveOriginal="1" catenateAll="0" catenateWords="1"/>
     <filter class="solr.LengthFilterFactory" min="2" max="100"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>

Then I restarted solr and reindexed and now the search works.
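A change like this can also be checked without running a search, by asking Solr's field analysis handler how the field type tokenizes the title and the query. The sketch below only builds that request URL. The host and port, and the assumption that the core is named `ISLANDORA` (guessed from the config path above), are placeholders to adjust.

```python
from urllib.parse import urlencode


def analysis_url(base_url: str, core: str, field_type: str,
                 index_text: str, query_text: str) -> str:
    """Build a URL for Solr's /analysis/field handler.
    base_url and core are assumptions about the local install."""
    params = urlencode({
        "analysis.fieldtype": field_type,   # field type to analyze with
        "analysis.fieldvalue": index_text,  # text run through the index analyzer
        "analysis.query": query_text,       # text run through the query analyzer
        "wt": "json",
    })
    return f"{base_url}/solr/{core}/analysis/field?{params}"


if __name__ == "__main__":
    title = "My beautiful 23th street"
    # Hypothetical host/port and core name; adjust for your setup.
    print(analysis_url("http://localhost:8983", "ISLANDORA", "text_en", title, title))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should list the tokens produced at each analysis stage, so the effect of `splitOnNumerics="0"` on "23th" can be compared before and after the change.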

Does this look like a good fix?

kylehuynh205 commented 1 year ago

Hi @RodBruce, I tested it and it works. Thanks.

simonhm commented 1 year ago

@rosiel @kylehuynh205 Should we close this issue? This issue isn't caused by this module. And thank you all for your quick response, investigation and finding out the solution. Simon.

rosiel commented 1 year ago

I would move it but I don't have permission. Maybe it has to be closed and a new one created, though I'm not sure where!

There is a step where the solr config is set up in the islandora playbook and it's also somewhere in Isle-buildkit or isle-dc.

Perhaps islandora documentation would be the best place for this.

kylehuynh205 commented 1 year ago

I tried to transfer the issue to https://github.com/Islandora/documentation/issues but I don't think it's possible, so I'm closing this issue. Please make a ticket for Solr in https://github.com/Islandora/documentation/issues and tag this ticket in it.