AtlasOfLivingAustralia / la-pipelines

Living Atlas Pipelines extensions

Investigate stemming for query input string #485

Closed nickdos closed 3 years ago

nickdos commented 3 years ago

Related to #423.

Stemming does not appear to be working for me. E.g. these two queries should return the same result set but don't:

https://biocache-dq-test.ala.org.au/occurrences/search?q=text:parrots (0 results)
https://biocache-dq-test.ala.org.au/occurrences/search?q=text:parrot (4,319,873 results)

Edit: this could be explained if the query parser is not also applying stemming. Both the query (input) side and the index side need to stem terms for this to work as expected.
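To make the point about the two sides concrete, here is a minimal sketch in Python. It is not Solr or biocache code: `toy_stem` is a deliberately crude plural stripper standing in for a real stemming filter, and `DOCS`, `build_index`, `search` and `count` are hypothetical names invented for illustration.

```python
from collections import defaultdict


def toy_stem(token):
    # Crude plural stripping, for illustration only.
    # This is NOT the Porter algorithm or any Solr filter.
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text, stem):
    # Lowercase, split on whitespace, optionally stem each token.
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens


def build_index(docs, stem):
    # Inverted index: term -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in analyze(text, stem):
            index[term].add(doc_id)
    return index


def search(index, query, stem):
    # AND together the postings of every analyzed query term.
    hits = None
    for term in analyze(query, stem):
        postings = index.get(term, set())
        hits = postings if hits is None else hits & postings
    return hits or set()


# Hypothetical documents, made up for this sketch.
DOCS = ["Parrots", "red parrot", "parrots and cockatoos", "honeyeater"]


def count(query, stem_index, stem_query):
    return len(search(build_index(DOCS, stem_index), query, stem_query))


if __name__ == "__main__":
    for stem_index, stem_query in [(True, True), (True, False), (False, False)]:
        print(
            f"stem index={stem_index}, stem query={stem_query}:",
            "parrots ->", count("parrots", stem_index, stem_query),
            "| parrot ->", count("parrot", stem_index, stem_query),
        )
```

In this toy setup the two queries only agree when both sides stem. Stemming the index but not the query gives zero hits for `parrots`, which has the same shape as the 0 vs 4,319,873 result above, although that alone does not prove it is the cause here.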

djtfmartin commented 3 years ago

On test

/occurrences/search?text:parrots = 5,350,721
/occurrences/search?text:parrot = 762,782

I think this is because we now include speciesSubgroup among the fields indexed via the Solr copyField.

Related commit is: https://github.com/gbif/pipelines/commit/2282e7a2c1e5a921b859c67b5aac2b74a11cdaa3

nickdos commented 3 years ago

If text is being stemmed, then the index should only contain the term parrot, since any instance of parrots is stemmed (the trailing s removed) before being written to the index. The larger result count for parrots indicates that this is not happening on the indexing side.
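A rough Python sketch of that reasoning, under stated assumptions: the `text` field is modelled as the set of terms copied from several source fields, `toy_stem` is a crude stand-in for a real stemmer, and the record, the `vernacularName` source and the helper name `text_field_terms` are hypothetical. Only `speciesSubgroup` comes from the discussion above.

```python
def toy_stem(token):
    # Crude plural stripping, for illustration only (not a real stemmer).
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def text_field_terms(record, copy_sources, stem):
    # Collect the terms a catch-all text field would hold after copying
    # each source field into it and analyzing the result at index time.
    terms = set()
    for field in copy_sources:
        for token in record.get(field, "").lower().split():
            terms.add(toy_stem(token) if stem else token)
    return terms


# Hypothetical record and copy sources, invented for this sketch.
record = {"vernacularName": "Red-winged parrot", "speciesSubgroup": "Parrots"}
sources = ["vernacularName", "speciesSubgroup"]

stemmed = text_field_terms(record, sources, stem=True)
unstemmed = text_field_terms(record, sources, stem=False)

if __name__ == "__main__":
    print("stemmed index terms:  ", sorted(stemmed))
    print("unstemmed index terms:", sorted(unstemmed))
```

With index-time stemming, copying `speciesSubgroup` into the text field cannot introduce a separate `parrots` term, so differing counts for the two queries would point to the index not being stemmed, or to the query side being analyzed differently.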

javier-molina commented 3 years ago

Possibly related to #391

adam-collins commented 3 years ago

Both https://biocache.ala.org.au/ws/occurrences/search?q=text:parrots and https://biocache.ala.org.au/ws/occurrences/search?q=text:parrot match 5,672,543 occurrences today. I think this is fixed now.

djtfmartin commented 3 years ago

@nickdos can you test again? LGTM

nickdos commented 3 years ago

Yep, looks good - will close 👍