GSA / data.gov

Main repository for the data.gov service
https://data.gov

Figure out how the Solr Search Engine parses search terms #4166

Open nickumia-reisys opened 1 year ago

nickumia-reisys commented 1 year ago

User Story

In order to provide help to users, the Data.gov Search Team wants to understand how the search engine dissects search terms to return relevant results (e.g., why does searching for NGDAID72 return the additive results of NGDAID and 72, while searching for "NGDAID72" in quotes returns a single result?)

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

Came out of discussions with Census. There are a number of weird things that happen while searching; this is supposed to start documenting at least some of the major ones.

Security Considerations (required)

Although HIGHLY UNLIKELY, vulnerabilities in Solr relating to data integrity might come to light while doing this research.

Sketch

(Someone with more vision, please update this haha..)

jbrown-xentity commented 1 year ago

https://solr.apache.org/guide/8_1/tokenizers.html
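One way to see what the tokenizers and filters actually do to a term is Solr's field analysis handler (`/analysis/field`), which reports the tokens produced at index time and at query time. The snippet below is only a sketch that builds such a request URL; the host, the core name `ckan`, and the field type `text` are placeholders, not values taken from our deployment, and it assumes the handler is reachable on the instance.

```python
from urllib.parse import urlencode


def analysis_url(base_url, core, field_type, text):
    """Build a URL for Solr's field analysis handler (/analysis/field)."""
    params = {
        "analysis.fieldtype": field_type,  # field type whose analyzers to run
        "analysis.fieldvalue": text,       # run through the index-time chain
        "analysis.query": text,            # run through the query-time chain
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{urlencode(params)}"


# Placeholder host, core, and field type; substitute the real ones.
print(analysis_url("http://localhost:8983/solr", "ckan", "text", "NGDAID72"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) against a running Solr should list the tokens each analysis stage emits for the term.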

nickumia-reisys commented 1 year ago

In our catalog schema.xml, there are only two types of tokenizers referenced:

(For Catalog) When Solr creates a search engine, it has three types of INDEXING methods:

(For Catalog) When a query is sent to each of these Solr engines, it applies the following to the search terms:

The other important part of the search definition is the q.op = "AND" line.

This is just highlighting the parts that affect us and should guide further exploration into how searching works for catalog.
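To make the interaction between term splitting and the default operator concrete, here is a toy model in Python. It is not Solr's analysis chain: the letter/digit split rule, the three documents, and the helper names `split_alnum` and `search` are all made up for illustration. It only shows how the result set changes between AND and OR once a term like NGDAID72 has been split into two tokens.

```python
import re


def split_alnum(term):
    """Lowercase a term and split it at letter/digit boundaries (assumed rule)."""
    return re.findall(r"[a-z]+|[0-9]+", term.lower())


def search(index, terms, op="AND"):
    """Return ids of docs matching all terms (AND) or any term (OR)."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    if op == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical documents, not real catalog records.
docs = {1: "NGDAID72 boundaries", 2: "NGDAID overview", 3: "Route 72 traffic"}

# Build a tiny inverted index: token -> set of doc ids.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        for tok in split_alnum(word):
            index.setdefault(tok, set()).add(doc_id)

terms = split_alnum("NGDAID72")             # ['ngdaid', '72']
print(sorted(search(index, terms, "AND")))  # [1]
print(sorted(search(index, terms, "OR")))   # [1, 2, 3]
```

Which of these behaviors the catalog actually exhibits depends on the real analyzers and query parser settings, so this should be checked against Solr itself rather than inferred from the toy.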

hkdctol commented 9 months ago

We do need to work on this one, but we are moving it to the icebox for now.