CDRH / orchid

Rails Engine for site integration with CDRH API
MIT License

faceting person.name by person.role and other nested aggregations #228

Open karindalziel opened 2 years ago

karindalziel commented 2 years ago

Elasticsearch 6.8 and up (at least) can create a nested aggregation based on another nested value. Details here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-nested-aggregation.html

In Orchid, this would be very useful for returning a list of facets (aggregations) restricted by a role, for instance person.role = attorney.

This would require changes to both the API, to handle the query, and Orchid, to handle the results (if they change) and the query setup in public.yml.

Thinking through this a bit, the API query could look something like this:

facet[]=person.name[by(person.role=attorney)]

From there, the API would probably return a list of facets much like the one it returns now. The only change would be the facet key, from

facets:
   person.name:

to something like

facets:
    person.name[by(person.role=attorney)]

Then in Orchid, we'd have to add matching support for detecting this syntax in public.yml, though it may already be handled by passing the term through to the API:

    person.name:
      label: People

becomes

    person.name[by(person.role=attorney)]:
      label: Attorneys
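
To make the proposed syntax concrete, here is a minimal Ruby sketch of how a facet parameter of the form person.name[by(person.role=attorney)] might be split into its parts. The names FACET_PATTERN and parse_facet are hypothetical and exist in neither Orchid nor the API; the syntax itself is only the proposal above, not an implemented feature.

```ruby
# Hypothetical parser for the proposed facet syntax; not part of Orchid or the API.
# Matches "field" or "field[by(filter_field=filter_value)]".
FACET_PATTERN = /\A(?<field>[\w.]+)(?:\[by\((?<filter_field>[\w.]+)=(?<filter_value>[^)]+)\)\])?\z/

# Returns a hash describing the facet, or nil if the parameter is malformed.
def parse_facet(param)
  match = FACET_PATTERN.match(param)
  return nil unless match

  parsed = { field: match[:field] }
  if match[:filter_field]
    parsed[:filter] = { field: match[:filter_field], value: match[:filter_value] }
  end
  parsed
end

p parse_facet("person.name")
p parse_facet("person.name[by(person.role=attorney)]")
```

A plain facet comes back with only a :field key, so existing public.yml entries such as person.name would not need to change under this sketch.
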
karindalziel commented 2 years ago

Note that this would probably come with updating the API to the latest version of ES

karindalziel commented 2 years ago

A bit more detail:

I want to add data that looks like this to the API:

"person": [
          {
            "id": "per.001",
            "name": "Smith, Emily",
            "role": "Sender"
          },
          {
            "id": "per.002",
            "name": "Thomas, Frank",
            "role": "Recipient"
          },
          {
            "id": "per.003",
            "name": "Franklin, Gina",
            "role": "Sender"
          },
          {
            "id": "per.004",
            "name": "Bell, James",
            "role": "Recipient"
          }
        ]

And then on the browse page or in the search facets, I would like to have the option to browse by "Sender" or "Recipient".

In Orchid currently, you can only select, for instance, person.name, which returns all the "name" keys from the person field. You can't select only the names of the people with role = X.

Will pointed out that you can choose one facet and that will limit the others. For instance, if you add person.role and then facet by it, the resulting name list contains only people with that person.role. That doesn't quite work, though, because ALL the facets would be limited, and I want an initial list, with no faceting applied, split into Sender and Recipient.
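
As a rough illustration of the grouping being asked for, the following Ruby sketch computes names per role directly from the sample person data above. It runs in memory only and does not touch Elasticsearch or the API; SAMPLE_ITEMS and names_by_role are hypothetical names, and wrapping the sample array in a single item is an assumption made for the example.

```ruby
# Sample data copied from the comment above, wrapped in one hypothetical item.
SAMPLE_ITEMS = [
  {
    "person" => [
      { "id" => "per.001", "name" => "Smith, Emily",   "role" => "Sender" },
      { "id" => "per.002", "name" => "Thomas, Frank",  "role" => "Recipient" },
      { "id" => "per.003", "name" => "Franklin, Gina", "role" => "Sender" },
      { "id" => "per.004", "name" => "Bell, James",    "role" => "Recipient" }
    ]
  }
].freeze

# Builds { role => { name => count } }, counting each person entry once.
def names_by_role(items)
  counts = Hash.new { |hash, role| hash[role] = Hash.new(0) }
  items.each do |item|
    item.fetch("person", []).each do |person|
      counts[person["role"]][person["name"]] += 1
    end
  end
  counts
end

names_by_role(SAMPLE_ITEMS).each do |role, names|
  puts "#{role}: #{names.keys.join('; ')}"
end
```

The goal of the issue is to get the equivalent of each inner hash back from the API as its own facet, without first filtering the whole result set by role.
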

As an example, here is an API facet return for person.role and person.name

{
  "req": {
    "query_string": "/collection/test/items?num=0&sort[]=title_sort|asc&facet_limit=20&facet_sort=count|desc&browse_sort=term|asc&hl_fl=annotations_text%2C+transcription_t%2C+text&hl_num=5&facet[]=person.name&facet[]=person.role"
  },
  "res": {
    "code": 200,
    "count": 4,
    "facets": {
      "person.role": {
        "": 4,
        "recipient": 3,
        "sender": 3
      },
      "person.name": {
        "Chesnutt, Charles W., (Charles Waddell)": 4,
        "Washington, Booker T., 1856-1915": 3,
        "Bruce, Blanche Kelso": 1,
        "Green, John Patterson": 1,
        "Smith, Harry C.": 1
      }
    },
    "items": [

    ]
  }
}

But what I want to show on the search facets is:

I don't think there is currently a way to get that info from the API.

NEXT STEP: Determine if the API can return facets as indicated

karindalziel commented 2 years ago

I created a test repository to post the kind of data I want to look at: https://github.com/CDRH/data_test

I also posted the data to the CDRH dev API; the call is here: https://cdrhdev1.unl.edu/api/v1/collection/test/items?num=0&sort[]=title_sort|asc&facet_limit=20&facet_sort=count|desc&browse_sort=term|asc&hl_fl=annotations_text%2C+transcription_t%2C+text&hl_num=5&facet[]=person.name&facet[]=person.role

wkdewey commented 2 years ago

Elasticsearch can handle the necessary queries, but we need to modify Orchid and the API to handle them as facets and as query strings in the API GET request. To find subcategory="manuscripts", use either

{"aggs": {"marginalia": {"terms": {"field":"subcategory", "include":"manuscripts"}}}}

or

{"query": {"term":{"subcategory":"marginalia"}},"aggs": {"subcategory": {"terms": {"field":"subcategory"}}}}

The former method creates a new aggregation; the latter stores the result in "hits". For matching a nested value, e.g. creator.name="Walt Whitman":

{"aggs": {"creator.name":{"nested":{"path":"creator"}, "aggs":{"creator.name":{"terms":{"field":"creator.name", "order":{"_count":"desc"}, "size":"20","include":"Walt Whitman"} }}}}}

Though I realize your query above is more complicated. (But can the above queries be faceted with the current API?)

wkdewey commented 2 years ago

This may also be useful: https://discuss.elastic.co/t/nested-filter-aggregation/82639

wkdewey commented 2 years ago

The encode_param method in app/services/api_bridge/query.rb cannot handle an equal sign in a facet name (i.e. facet[]=person.name[person.role=judge]), since it splits on '='. I think equal signs are confusing in the request URI anyway, so we should come up with another symbol. Maybe the pipe symbol, like person.name[person.role|judge]?
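
The splitting problem can be shown in isolation with a short Ruby sketch. This is not the actual encode_param code, whose implementation is not shown in this thread; the constants and helper methods below are hypothetical and only demonstrate how String#split behaves on such a parameter.

```ruby
# Hypothetical inputs: the "=" variant and the pipe variant of the same facet.
FACET_PARAM = "facet[]=person.name[person.role=judge]"
PIPED_PARAM = "facet[]=person.name[person.role|judge]"

# Splitting on every "=" breaks the facet value into two pieces.
def split_on_every_equals(param)
  param.split("=")
end

# Limiting the split to two pieces keeps everything after the first "=" intact.
def split_on_first_equals(param)
  param.split("=", 2)
end

p split_on_every_equals(FACET_PARAM)
p split_on_first_equals(FACET_PARAM)
p split_on_every_equals(PIPED_PARAM)
```

Either a limited split or the pipe variant would keep the key and value as exactly two pieces. The pipe already appears in existing parameters such as sort[]=title_sort|asc, so it would be consistent with the current request style.
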

wkdewey commented 2 years ago

Here is a bigger problem: a facet name like that gives the following elasticsearch error:

Invalid aggregation name [person.name[person.role|judge]]. Aggregation names must be alpha-numeric and can only contain '_' and '-'
wkdewey commented 2 years ago

One possible solution: create "alternate" keys in the YAML file allowing elasticsearch to store the aggregation under a different name. Then Orchid will have to be changed to use this alternate key to display the facets.
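
One way to picture the alternate-key idea is the Ruby sketch below. It is a variation on the suggestion above: instead of reading the alternate key from the YAML file, it derives one from the facet name using only the characters the error message allows. All three method names are hypothetical, and the aggregation result passed to restore_facet_names is made-up illustration data.

```ruby
# Hypothetical helper: derive an aggregation name containing only letters,
# digits, "_" and "-", as required by the error message quoted above.
def aggregation_alias(facet)
  facet.gsub(/[^A-Za-z0-9_-]+/, "_").gsub(/\A_+|_+\z/, "")
end

# Maps each safe alias back to the facet name the client asked for.
def build_alias_map(facets)
  facets.to_h { |facet| [aggregation_alias(facet), facet] }
end

# Renames aggregation results so they are keyed by the original facet names.
def restore_facet_names(aggregations, alias_map)
  aggregations.transform_keys { |name| alias_map.fetch(name, name) }
end

facets = ["person.name", "person.name[person.role|judge]"]
alias_map = build_alias_map(facets)
p alias_map
p restore_facet_names({ "person_name_person_role_judge" => { "Smith, Harry C." => 1 } }, alias_map)
```

Derived aliases can collide (for example, person.name and person_name map to the same alias), which is one argument for declaring the alternate keys explicitly in the YAML file instead.
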

wkdewey commented 2 years ago

Here is a query that apparently works:

{"aggs":{"people":{"nested":{"path":"person"},"aggs":{"includes_judge":{"filter": {"term": {"person.role": "judge"}}, "aggs": { "judges": {"terms": {"field": "person.name"}}}}}}}}

Do we want to restrict this to nested fields? I think the pattern of filter aggregations is broader.
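
For reference, the working query above could be generated from its parts rather than written by hand. The Ruby sketch below is a hypothetical builder, not existing API code; the method name, its keyword arguments, and the way the inner aggregation names are derived from the filter value are all assumptions chosen to reproduce the judge query exactly.

```ruby
require "json"

# Hypothetical builder for a nested aggregation that filters on one nested
# field and then collects terms from another field on the same path.
def nested_filter_aggregation(path:, filter_field:, filter_value:, terms_field:)
  {
    "aggs" => {
      "people" => {
        "nested" => { "path" => path },
        "aggs" => {
          "includes_#{filter_value}" => {
            "filter" => { "term" => { filter_field => filter_value } },
            "aggs" => {
              "#{filter_value}s" => { "terms" => { "field" => terms_field } }
            }
          }
        }
      }
    }
  }
end

JUDGE_QUERY = nested_filter_aggregation(
  path: "person",
  filter_field: "person.role",
  filter_value: "judge",
  terms_field: "person.name"
)

puts JSON.generate(JUDGE_QUERY)
```

Dropping the outer "nested" wrapper would give a plain filter aggregation, which is the broader pattern the question above refers to.
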