TAMULib / SAGE

Search Aggregation Engine
MIT License

Searches are likely not using solr properties correctly. #515

Open kaladay opened 1 year ago

kaladay commented 1 year ago

Describe the bug: The search logic seems confusing and wrong. Searching for a word by itself either does not work at all or works only depending on things like the AND and OR operators. For example, searching for apple often does not work. Work-arounds include searching for * apple or * +apple.

Not only that but, seemingly at random, searches end up including results that are clearly not in the selected field.

Using q for searching while prepending the field, as in q=title:apple, appears to be a likely part of the problem. The df (default field) property is likely the cause of the seemingly random unrelated results.

The search may be improved by using df and q like this example: q=apple&df=title.
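To make the two request styles concrete, here is a minimal Python sketch that only builds the query strings so they can be compared side by side. The helper name build_select_query is hypothetical and is not part of SAGE; q and df are the Solr parameters discussed above.

```python
from urllib.parse import urlencode


def build_select_query(term, default_field=None):
    """Build the query-string portion of a Solr select request.

    Hypothetical helper for illustration only; it is not SAGE code.
    """
    params = {"q": term}
    if default_field:
        # df names the field used for any term in q that has no explicit field.
        params["df"] = default_field
    return urlencode(params)


# Current style: the field is prepended inside q.
fielded = build_select_query("title:apple")
# Proposed style: a bare term in q, with the field supplied through df.
with_df = build_select_query("apple", default_field="title")

print(fielded)  # q=title%3Aapple
print(with_df)  # q=apple&df=title
```

Either string can be appended to the core's select endpoint to compare the results the two styles return.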

It may be possible to still use *:apple in q.

This needs to be investigated and a solution needs to be provided.

Solving this may solve #514 because that issue may be a symptom of the problem observed in this issue.

To Reproduce: Steps to reproduce the behavior:

  1. Go to any discovery view.
  2. Search for a single word using a field, such as 'title'.
  3. Investigate the query created by looking at the service logs.

Expected behavior: Searching should make sense. A search for apple should find matches for apple if they exist and should not find matches where apple does not exist.

kaladay commented 1 year ago

All the df does is prepend the specified field onto each word. For example, with a search of "red apple" and a df of title, we get:

There are problems with this and we might need to have sow enabled. With sow=true, we instead get:

The wildcards also introduce a problem: they are not expanded the way we expect. A search for "red apple" actually searches for (when sow is false):

Using df is a step forward, but sow needs to be used as well. When df is not used, the default appears to be _text_, which is the field we copy everything into for the all_fields matches.
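One way to check how a given combination is actually parsed is to send q, df, and sow together with debugQuery=true and read the parsed query in the debug section of the response. The sketch below only assembles those parameters; the helper name build_debug_params is hypothetical, and how SAGE would pass these through to Solr is an assumption.

```python
from urllib.parse import urlencode


def build_debug_params(terms, default_field, split_on_whitespace=True):
    """Assemble select parameters for inspecting how Solr parses a search.

    Hypothetical helper for illustration only; it is not SAGE code.
    """
    params = {
        "q": terms,
        "df": default_field,
        # sow (split on whitespace) controls whether each whitespace-separated
        # term is analyzed separately.
        "sow": "true" if split_on_whitespace else "false",
        # debugQuery asks Solr to include the parsed query in the response.
        "debugQuery": "true",
    }
    return urlencode(params)


debug_query = build_debug_params("red apple", "title")
print(debug_query)  # q=red+apple&df=title&sow=true&debugQuery=true
```

Running the same search with split_on_whitespace=False and comparing the two parsed queries should show whether sow changes the outcome for our field types.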

There is also this important documentation note:

NOTE: If you want to be able to sort on a field whose contents you want to tokenize to facilitate searching, use a copyField directive in the Schema to clone the field. Then search on the field and sort on its clone.
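As a rough sketch of what that note could look like for us, the copy field could be declared through the Solr Schema API with an add-copy-field command. The field names title and title_sort are placeholders rather than fields from our schema, and the destination field would have to exist before the command is sent.

```python
import json


def copy_field_command(source, dest):
    """Build a Schema API command that clones one field into another.

    The field names passed in are placeholders for illustration.
    """
    return {"add-copy-field": {"source": source, "dest": [dest]}}


# Search on the tokenized "title" field, sort on its untokenized clone.
payload = json.dumps(copy_field_command("title", "title_sort"))
print(payload)
```

The resulting JSON would be sent as the body of a POST to the core's schema endpoint.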

I strongly suspect that the rest of the problems are in how we structure the solr core and use the properties.

see: https://solr.apache.org/guide/7_7/the-standard-query-parser.html

jcreel commented 1 year ago

All the fields in the Metadata Application Profile (http://oaktrust.library.tamu.edu/handle/1969.1/175368) and the new ones that we have accumulated will need to have exact-match facets, tokenizations, and search fields - potentially achieved with copy-fields.