CDLUC3 / ezid

Revisions to search page to use OpenSearch instead of database #593

Open sfisher opened 3 months ago

sfisher commented 3 months ago

I'm going to revise this ticket and add some detail based on additional discussions. It seems like no one is prepared to make decisions about how to change search, whether by simplifying it or making it different.

So this is turning into a re-implementation of the current search (more or less feature-for-feature) in OpenSearch. This may not have big performance or usability advantages, but I suppose at least it keeps queries that are too large or too general from making the database and the whole app unresponsive.

I also have a better idea of how the current elaborate search/reporting is implemented at this point and can make more detailed tickets.

Backend filter features that the current search implements and that are to be re-created in a utility class:

Filters to re-implement in OpenSearch based on the database functionality

Search from other places besides search?

From comments in code

Dashboard pages.

    search function is executed from the following areas, s_type determines search
    parameters:

        Public Search (default):     ui_search   "public"
        Manage:                      ui_manage   "manage"
        Dashboard - ID Issues:       ui_admin    "issues"
        Dashboard - Crossref Status: ui_admin    "crossref"

Some original notes below about how search could be better, but this is likely all deferred until there is a business case from the product manager.

Paging

How we handle paging is a bit of an open question. The UI currently shows a full, exact number of results, whereas many search systems (for reference: Google, Bing, DuckDuckGo) show only the first pages with a "See more results" link or something similar (or limit the results) rather than wasting time calculating the number of results when there are a lot of them.

Also, who is going to page through more than, say, 100 to 1,000 results?
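
As a rough illustration of the capped-count idea, here is a minimal sketch of a search body that pages with `from`/`size` and uses `track_total_hits` so OpenSearch stops counting past a threshold. The field name `searchable_text`, the function names, and the default limits are assumptions for illustration, not anything taken from the ezid code.

```python
def build_paged_query(keywords, page=1, page_size=20, max_counted_hits=1000):
    """Build an OpenSearch search body for one page of results.

    "searchable_text" is a hypothetical catch-all field name.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    return {
        "query": {"match": {"searchable_text": keywords}},
        "from": (page - 1) * page_size,  # zero-based offset of the first hit
        "size": page_size,
        # Count accurately only up to this many hits, then stop counting.
        "track_total_hits": max_counted_hits,
    }


def describe_total(response):
    """Turn the hits.total object of a search response into UI text."""
    total = response["hits"]["total"]
    if total["relation"] == "gte":
        # The cap was reached, so the value is only a lower bound.
        return f"more than {total['value']} results"
    return f"{total['value']} results"
```

With a body like this the UI could show "more than 1000 results" instead of an exact count. Note that `from` + `size` paging is also bounded by the index's `index.max_result_window` setting (10,000 by default), which fits the point that nobody pages that deep anyway.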

Sorting and relevance ranking mismatch

Strictly sorting results by column means items are not sorted by relevance, which is how most search engines rank things. Each returned document has a _score field, which I believe ranks the most relevant documents highest based on keywords. I believe the score comes from a tf-idf style algorithm, that is, term frequency weighted by inverse document frequency (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf ). OpenSearch's default similarity is BM25, a refinement of tf-idf. I think it's a better way to display results than returning everything and making people sort.

The page at https://www.linkedin.com/pulse/demystifying-elasticsearch-aws-opensearch-scoring-suraj-anand/ gives background on scoring and on what we can tweak, such as boosting certain field matches to score higher.
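
To make the boosting idea concrete, here is a minimal sketch of a `multi_match` body that weights some fields more heavily using the `field^boost` syntax. The field names and weights are hypothetical placeholders, not the actual ezid index fields.

```python
def build_relevance_query(keywords, boosts=None):
    """Build a relevance-ranked search body with per-field boosts.

    The default field names and weights are illustrative assumptions.
    """
    if boosts is None:
        boosts = {"title": 3, "creator": 2, "identifier": 1}
    # "title^3" means a match in title counts three times as much.
    fields = [f"{name}^{weight}" for name, weight in boosts.items()]
    return {
        "query": {"multi_match": {"query": keywords, "fields": fields}},
        # No "sort" key: hits come back ordered by _score, highest first.
    }
```

Leaving out `sort` is the point here: results are ordered by `_score` unless a sort is requested, so column sorting becomes an opt-in rather than the default.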

I'm not sure we even need sorting as long as people can narrow results by specific fields.

Move to relevance (instead of ANDing) as the default?

Making everything an AND query by default also doesn't allow relevance ranking to operate well.

With relevance ranking, something may still be returned if it contains most of the keywords. Because results are ranked, having less-exact matches at the bottom doesn't usually hurt and is sometimes beneficial.

OpenSearch can also use "stemming" and other search techniques so things may not be exact matches. (Stemming matches related "stem" words so different forms of a word can return results -- see https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html ).
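
For reference, stemming is configured at index time through the analyzer on each text field. A minimal sketch, assuming hypothetical field names and the built-in `english` analyzer (which includes an English stemmer):

```python
def build_index_body(text_fields=("title", "creator")):
    """Build an index-creation body whose text fields are stemmed.

    The field names are placeholders, not the real ezid schema.
    """
    return {
        "mappings": {
            "properties": {
                # The built-in "english" analyzer lowercases, removes
                # stop words, and stems tokens at index and search time.
                name: {"type": "text", "analyzer": "english"}
                for name in text_fields
            }
        }
    }
```

Changing the analyzer of an existing field requires reindexing, so this is worth deciding before the index is populated.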

If AND querying is required, perhaps it becomes a feature people activate with either a checkbox or a keyword (like "AND") in the query box.
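
A minimal sketch of how that toggle might look, assuming a hypothetical `searchable_text` field and an arbitrary 75% threshold for "most of the keywords". The `operator` and `minimum_should_match` parameters of the `match` query are real; everything else here is illustrative.

```python
def parse_query_box(raw):
    """Split the query box text into keywords and an AND flag.

    A standalone upper-case AND switches to strict matching.
    """
    words = raw.split()
    require_all = "AND" in words
    terms = [w for w in words if w != "AND"]
    return " ".join(terms), require_all


def build_keyword_query(text, require_all=False, field="searchable_text"):
    """Build a match query that is strict (AND) or relevance-ranked."""
    match = {"query": text}
    if require_all:
        # Every keyword must appear in the field.
        match["operator"] = "and"
    else:
        # Any keyword may match, but at least 75% of them must
        # (an assumed threshold); ranking then orders the hits.
        match["operator"] = "or"
        match["minimum_should_match"] = "75%"
    return {"query": {"match": {field: match}}}
```

For example, `build_keyword_query(*parse_query_box("dryad AND climate"))` yields a strict query, while plain `"dryad climate"` yields the relevance-ranked form. A checkbox in the UI could simply set `require_all` directly.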

Results display

The entire set of results is shown in columns to facilitate field sorting in the current UI. Maybe the display isn't a big priority to change but . . .

Most general search interfaces have moved to a display without strict columns: a few important fields are shown per record, followed by a snippet that shows bits of the document with matching words highlighted or in bold (see for example https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html ). Each result can still link to the full record for anyone who wants to examine more.
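
A minimal sketch of a request for that kind of display: it returns only a few summary fields per record via `_source` and asks for highlighted fragments via `highlight`. The field names, fragment sizes, and `<strong>` tags are assumptions for illustration.

```python
def build_snippet_query(keywords, summary_fields=("identifier", "title", "creator")):
    """Build a search body that returns summary fields plus snippets.

    All field names here are hypothetical placeholders.
    """
    return {
        "query": {"match": {"searchable_text": keywords}},
        # Return only the handful of fields shown in the result list.
        "_source": list(summary_fields),
        "highlight": {
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"],
            "fields": {
                "searchable_text": {
                    "fragment_size": 150,      # approx. characters per snippet
                    "number_of_fragments": 2,  # snippets per record
                }
            },
        },
    }


def snippets_for(hit):
    """Return the highlighted fragments for one hit, or an empty list."""
    return hit.get("highlight", {}).get("searchable_text", [])
```

Each hit in the response then carries a `highlight` section alongside `_source`, so the template can render the summary fields, the snippets, and a link to the full record.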