AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/

catalogNumber search is case sensitive #676

Closed javier-molina closed 9 months ago

javier-molina commented 3 years ago

Background: Searching by catalog number is case sensitive; however, the catalog number is also added to the text field, which is case insensitive.

https://biocache.ala.org.au/occurrences/search?q=catalogue_number%3Am13929 - No results (case sensitive search)

https://biocache.ala.org.au/occurrences/search?q=catalogue_number%3AM13929 - Gets results after correct capitalisation is used

https://biocache.ala.org.au/occurrences/search?taxa=m13929 or https://biocache.ala.org.au/occurrences/search?taxa=M13929 - Both return the same results (case insensitive search)

Requirement: Enable case insensitive search for catalog number.

Related to https://support.ehelp.edu.au/a/tickets/112910

alexhuang091 commented 3 years ago

@djtfmartin

occurrences/search?taxa=m13929 and occurrences/search?taxa=M13929 are converted to q=text:"m13929" and q=text:"M13929" respectively.

In the Solr config, the text field, <field name="text" type="textgen" multiValued="true" indexed="true" stored="false" />, has a lower case filter applied, which is why it's case insensitive.

catalogNumber is just a normal String field so it's case-sensitive.
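
To illustrate the difference described above, here is a minimal Python sketch. It is not Solr code and does not reproduce the real textgen analysis chain (which likely does more than this); it only assumes that a string field is matched verbatim while a text field is tokenized and lower-cased at both index and query time. The function names are hypothetical.

```python
def string_field_matches(indexed_value, query):
    # Assumption: a string field is indexed verbatim, so only an exact,
    # case-sensitive match hits.
    return indexed_value == query


def analyze_lowercase(value):
    # Simplified stand-in for a text analyzer: split on whitespace,
    # then lower-case every token.
    return [token.lower() for token in value.split()]


def text_field_matches(indexed_value, query):
    # Phrase-style match: the analyzed query tokens must appear
    # contiguously and in order within the analyzed indexed tokens.
    query_tokens = analyze_lowercase(query)
    doc_tokens = analyze_lowercase(indexed_value)
    if not query_tokens:
        return False
    window = len(query_tokens)
    return any(
        doc_tokens[i:i + window] == query_tokens
        for i in range(len(doc_tokens) - window + 1)
    )


if __name__ == "__main__":
    print(string_field_matches("M13929", "m13929"))  # exact match only
    print(text_field_matches("M13929", "m13929"))    # case is ignored
```

Under these assumptions, the lower-cased query only finds the record through the text-style field, which matches the behaviour reported in this issue.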

nickdos commented 2 years ago

The question is whether we need to facet on catalogNumber. If yes, then it has to stay a string type field; if no, then it can be converted to a text based field (we'd want to use a derivative that does not do stemming etc. but does apply a case insensitive filter).

Also, if faceting is required, then it could be copied into a text field via copyField for this use-case.

adam-collins commented 9 months ago

SOLR managed-schema additions required

<field name="text_catalogNumber"        type="textgen" multiValued="true" indexed="true" stored="false" />
<copyField source="catalogNumber" dest="text_catalogNumber"/>

biocache-hubs search field change required: catalog_number to text_catalogNumber

adam-collins commented 9 months ago

pipelines pull request https://github.com/gbif/pipelines/pull/1001

adam-collins commented 9 months ago

in version 2.18.0-SNAPSHOT

peggynewman commented 7 months ago

@adam-collins could you please clear up what we should be testing here? Searching catalogNumber is still case sensitive, but we're happy with that. What did the change do? I'm not sure whether there is a requirement to facet on catalogNumber.

In prod:

The example in this issue:
catalogue_number:m13929 returns nothing
catalogue_number:M13929 returns 3 results ... all valid

perth example:
catalogue_number:perth 9639314
catalogue_number:PERTH 9639314
both return 2 matches: one for 9639314 (a birdlife record) and one for "PERTH 9639314" from WA herbarium

mel example:
catalogue_number:MEL%202526538A search returns over 100m records
catalogue_number:%22MEL%202526538A%22 in quotes, returns the correct result
catalogue_number:%22mel%202526538A%22 without quotes, returns nothing

In test:

The example in this issue:
catalogue_number:m13929 returns nothing
catalogue_number:M13929 returns 3 results ... all valid

perth example:
catalogue_number:perth 9639314
catalogue_number:PERTH 9639314
both return 2 matches: one for 9639314 (a birdlife record) and one for "PERTH 9639314" from WA herbarium

mel example:
catalogue_number:MEL%202526538A search returns over 100m records
catalogue_number:%22MEL%202526538A%22 in quotes, returns the correct result
catalogue_number:%22mel%202526538A%22 without quotes, returns nothing

Please clarify

adam-collins commented 7 months ago

  1. catalogue_number is the pre-pipelines name. Use catalogNumber instead.
  2. Use text_catalogNumber for case insensitive search. This is consistent with the other text_* fields, e.g. https://biocache-test.ala.org.au/fields?filter=text_
  3. Typically, once this field is in production, biocache-hubs is updated to use it, here https://biocache-test.ala.org.au/search#tab_catalogUpload and in the Catalogue number field here https://biocache-test.ala.org.au/search#tab_advanceSearch

adam-collins commented 6 months ago

text_catalogNumber is new and is for case insensitive searching. If you want a specific match, use double quotes.

The example in this issue:
text_catalogNumber:"m13929" 3 records
catalogNumber:m13929 returns 0 results
catalogNumber:M13929 returns 3 results

perth example: Above example is missing from test.

mel example: behaves the same as the example in this issue, i.e. use double quotes; catalogNumber is case sensitive and text_catalogNumber is case insensitive.
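
As a rough sketch of how the quoted queries above could be built, the Python below constructs a search URL without sending any request. It assumes the occurrences/search URL and q parameter shown earlier in this issue; the helper names are hypothetical, and whether text_catalogNumber is available on a given server depends on that server's schema.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the examples earlier in this issue.
BASE_URL = "https://biocache.ala.org.au/occurrences/search"


def catalog_number_query(value, case_sensitive=False):
    # Hypothetical helper: pick the field by the desired case behaviour
    # and wrap the value in double quotes for a specific (phrase) match.
    field = "catalogNumber" if case_sensitive else "text_catalogNumber"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def search_url(value, case_sensitive=False):
    # quote_via=quote encodes spaces as %20, as in the URLs in this thread.
    params = urlencode(
        {"q": catalog_number_query(value, case_sensitive)},
        quote_via=quote,
    )
    return f"{BASE_URL}?{params}"


if __name__ == "__main__":
    print(search_url("mel 2526538A"))
    print(search_url("MEL 2526538A", case_sensitive=True))
```

Quoting matters here because an unquoted value containing a space is not treated as a single phrase, which is consistent with the unquoted mel search above returning a very large result set.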

peggynewman commented 2 months ago

Ok, this is fine, happy to go ahead