Figure out how the Solr Search Engine parses search terms

nickumia-reisys commented 1 year ago

User Story

In order to provide help to users, the Data.gov Search Team wants to understand how the search engine dissects search terms to return relevant results (e.g., why does searching for NGDAID72 return the combined results for NGDAID and 72, while searching for "NGDAID72" in quotes returns a single result?)

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

[ ] GIVEN research into Solr+CKAN searching is complete \ WHEN I visit this ticket \ THEN I see an outline of the nuances of Solr+CKAN searching

Background

Came out of discussions with Census. A number of weird things happen while searching; this ticket is supposed to start documenting at least some of the major ones.

Security Considerations (required)

While doing this research, (although HIGHLY UNLIKELY) vulnerabilities in Solr relating to data integrity might arise.

Sketch

(Someone with more vision, please update this haha..)

jbrown-xentity commented 1 year ago

https://solr.apache.org/guide/8_1/tokenizers.html

nickumia-reisys commented 1 year ago

In our catalog schema.xml, there are only two types of tokenizers referenced: the whitespace tokenizer and NGramTokenizerFactory.

(For Catalog) When Solr creates a search engine, it has three types of INDEXING methods:

Normal?
- Uses the whitespace tokenizer
- Adds a WordDelimiterFilterFactory filter which does the crazy thing of splitting a single token at letter/number boundaries (so NGDAID72 becomes NGDAID and 72).
- To fix this, we could implement a different type of indexing method... but I don't know how this will play with CKAN and everything else.
- Adds a LowerCaseFilterFactory filter which forces everything to compare as lowercase
- Adds a SnowballPorterFilterFactory filter which does stemming (reduces words to a common root form; this is rule-based stemming rather than true lemmatization).
- Adds an ASCIIFoldingFilterFactory filter which converts non-ASCII characters (e.g. accented letters) to their ASCII equivalents where one exists, so that é compares the same as e.
Default (for non-english?)
- Uses the whitespace tokenizer
- Adds a WordDelimiterFilterFactory filter
- Adds a LowerCaseFilterFactory filter
NGram method?
- Uses the NGramTokenizerFactory tokenizer.
- Adds a LowerCaseFilterFactory filter
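
To make the splitting behavior concrete, here is a minimal Python sketch of the first steps of the "Normal" chain (whitespace tokenizing, word-delimiter splitting, lowercasing). It is only a rough simulation and not Solr's actual code: the function names are made up for illustration, it only handles ASCII letters and digits, and it ignores the filter's configurable options (catenation, case-change splitting, etc.), whose values in our schema.xml I have not verified here.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only, the way a whitespace tokenizer would."""
    return text.split()


def split_word_delimiters(token):
    """Rough stand-in for word-delimiter splitting (assumption: simplified).

    Breaks a token on non-alphanumeric characters and on
    letter/digit boundaries, keeping runs of letters and runs of digits.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def analyze(text):
    """Apply tokenizer, then delimiter splitting, then lowercasing."""
    tokens = []
    for token in whitespace_tokenize(text):
        for part in split_word_delimiters(token):
            tokens.append(part.lower())
    return tokens


print(analyze("NGDAID72"))          # ['ngdaid', '72']
print(analyze("Census-2020 Data"))  # ['census', '2020', 'data']
```

If the real filter behaves like this sketch, an unquoted NGDAID72 is searched as two separate terms, which would line up with the "NGDAID and 72" results described in the user story.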

(For Catalog) When a query is sent to each of these Solr engines, it applies the following to the search terms:

Normal?
- Uses the whitespace tokenizer
- Adds a SynonymFilterFactory filter.
- Adds a WordDelimiterFilterFactory filter which does the crazy thing of splitting a single token at letter/number boundaries (so NGDAID72 becomes NGDAID and 72).
- To fix this, we could implement a different type of indexing method... but I don't know how this will play with CKAN and everything else.
- Adds a LowerCaseFilterFactory filter which forces everything to compare as lowercase
- Adds a SnowballPorterFilterFactory filter which does stemming (reduces words to a common root form; this is rule-based stemming rather than true lemmatization).
- Adds an ASCIIFoldingFilterFactory filter which converts non-ASCII characters (e.g. accented letters) to their ASCII equivalents where one exists, so that é compares the same as e.
Default (for non-english?)
- Uses the whitespace tokenizer
- Adds a SynonymFilterFactory filter.
- Adds a WordDelimiterFilterFactory filter
- Adds a LowerCaseFilterFactory filter
NGram method?
- Uses the NGramTokenizerFactory tokenizer.
- Adds a LowerCaseFilterFactory filter
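
Rather than reading the filter chains by hand, Solr's field analysis handler (`/analysis/field`) can show what each step does to a given value at index time and at query time. Below is a small sketch that only builds the request URL. The host, the core name `ckan`, and the field type name `text` are assumptions for illustration and need to be swapped for whatever our catalog Solr actually uses.

```python
from urllib.parse import urlencode


def build_analysis_url(base_url, core, field_type, index_value, query_value):
    """Build a URL for Solr's field analysis handler.

    analysis.fieldvalue is run through the index-time analyzer and
    analysis.query through the query-time analyzer of the field type.
    """
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": index_value,
        "analysis.query": query_value,
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/analysis/field?{urlencode(params)}"


# Hypothetical host, core, and field type names.
url = build_analysis_url(
    "http://localhost:8983", "ckan", "text", "NGDAID72", "NGDAID72"
)
print(url)
```

Opening the resulting URL against a running Solr (or using the Analysis screen in the Solr admin UI) should list the tokens produced after each tokenizer and filter, which is a quicker way to confirm the lists above.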

The other important part of the search definition is the q.op = "AND" line.
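
For reference, `q.op` sets the default operator placed between query terms: with AND every term must match a document, with OR any one term is enough. A toy Python sketch of the difference follows. The documents and tokens are made up, and real Solr matching also involves field selection and scoring that this ignores.

```python
def matches(doc_tokens, query_tokens, operator="AND"):
    """Return True if the document satisfies the query under the operator."""
    doc = set(doc_tokens)
    hits = [term in doc for term in query_tokens]
    return all(hits) if operator == "AND" else any(hits)


# Made-up documents, already reduced to analyzed tokens.
docs = {
    "a": ["ngdaid", "72", "roads"],
    "b": ["ngdaid", "15"],
    "c": ["72", "counties"],
}
query = ["ngdaid", "72"]

and_hits = [name for name, toks in docs.items() if matches(toks, query, "AND")]
or_hits = [name for name, toks in docs.items() if matches(toks, query, "OR")]

print(and_hits)  # ['a']
print(or_hits)   # ['a', 'b', 'c']
```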

This is just highlighting the parts that affect us and should guide further exploration into how searching works for catalog.

hkdctol commented 9 months ago

We do need to work on this one but moving this to icebox for now
