CDRH / api

Codenamed "Apium": An API to access all public Center for Digital Research in the Humanities resources
https://cdrhdev1.unl.edu/api_frontend
MIT License

Issue with multivalued keyword_normalized field display #109

Open · jduss4 opened this issue 4 years ago

jduss4 commented 4 years ago

While working on #96, in which my goal was to ignore markup (<em>), some Unicode characters (Á, ø, etc.), and unimportant characters at the beginning of titles (", [), I was 99% of the way there when I ran into an interesting problem with fields that had multiple values pushed to them.

When using a top_hits aggregation and asking for the _source field back, on single-valued fields I got something along the lines of the following (pseudocode):

"facets": {
  "author":{
    "aaa" : {
       "num" : 4,
       "source": "Áaa"
    },
    "ben benjamin" : {
      "num" : 10,
      "source": "[Ben] Benjamin"
    }
  }
}

HOWEVER, if the document determined to be the "top hit" had multiple values in the field, then this happened:

"facets": {
  "title":{
    "my antonia" : {
       "num" : 40,
       "source": [ "Death Comes for the Archbishop", "My Ántonia", "The Professor's House" ]
    }
  }
}

I looked into the idea of using a "scripted" field instead to try to return only the SINGLE most relevant result, but I kind of got bogged down there trying to figure it out. Also, the documentation for script fields says (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-script-fields):

It’s important to understand the difference between doc['my_field'].value and params['_source']['my_field']. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can’t return a json object from it) and makes sense only for non-analyzed or single term based fields. However, using doc is still the recommended way to access values from the document, if at all possible, because _source must be loaded and parsed every time it’s used. Using _source is very slow.
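
To make that distinction concrete, a script_fields clause might look something like the sketch below (illustrative only: work_docvalue and work_source are made-up names, and it uses the works field from the schema shown later in this issue). The body is written as a Ruby hash to match the request-building code below:

search_body = {
  "query" => { "match_all" => {} },
  "script_fields" => {
    # doc values: fast, but for a keyword field with a normalizer the value that
    # comes back is the already-normalized term, not the original text
    "work_docvalue" => {
      "script" => {
        "lang" => "painless",
        "source" => "doc['works'].value"
      }
    },
    # _source: returns the original (possibly multivalued) text, but the whole
    # source document has to be loaded and parsed, which is slow
    "work_source" => {
      "script" => {
        "lang" => "painless",
        "source" => "params['_source']['works']"
      }
    }
  }
}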

For now, I am normalizing the "source" fields coming back only if they are an array, and then attempting to match them against the already-normalized version in order to figure out which one to display. It is not a good solution, and I would like to investigate this more in the future.
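
The matching involved is roughly the sketch below (illustrative only: normalize_like_es and display_value are hypothetical helper names, and the normalization is a loose Ruby imitation of the escapes char_filter + asciifolding + lowercase chain shown in the schema later in this issue):

# hypothetical helper imitating the keyword_normalized normalizer in Ruby:
# strip markup and punctuation (the "escapes" char_filter), fold accents, lowercase
def normalize_like_es(value)
  value
    .gsub(%r{</?(em|u|strong)>}, "")   # markup removed by the char_filter
    .gsub(/[-&:;,.$@~"'\[\]]/, "")     # punctuation removed by the char_filter
    .unicode_normalize(:nfkd)          # rough stand-in for asciifolding
    .gsub(/\p{Mn}/, "")                # drop combining marks left by NFKD
    .downcase
end

# bucket_key is the normalized facet key, e.g. "my antonia"
# sources is the _source value from top_hits, which may be a string or an array
def display_value(bucket_key, sources)
  return sources unless sources.is_a?(Array)
  sources.find { |s| normalize_like_es(s) == bucket_key } || sources.first
end

display_value("my antonia", ["Death Comes for the Archbishop", "My Ántonia"])
# => "My Ántonia"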

jduss4 commented 4 years ago

Some documents that might be important while solving this problem.

Schema Setup: Small Example

settings:
  analysis:
    char_filter:
      escapes:
        type: mapping
        mappings:
          - "<em> => "
          - "</em> => "
          - "<u> => "
          - "</u> => "
          - "<strong> => "
          - "</strong> => "
          - "- => "
          - "& => "
          - ": => "
          - "; => "
          - ", => "
          - ". => "
          - "$ => "
          - "@ => "
          - "~ => "
          - "\" => "
          - "' => "
          - "[ => "
          - "] => "
    normalizer:
      keyword_normalized:
        type: custom
        char_filter:
          - escapes
        filter:
          - asciifolding
          - lowercase
mappings:
  properties:
    works:
      type: keyword
      normalizer: keyword_normalized
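
As a sanity check, the normalizer can be run directly against sample text with the _analyze API (request body shown as a Ruby hash for consistency with the code below; POST it to the _analyze endpoint of whichever index holds this mapping):

# POST /<index>/_analyze
{
  "normalizer" => "keyword_normalized",
  "text"       => ["My Ántonia", "[Ben] Benjamin", "<em>The Professor's House</em>"]
}
# expected tokens, per the char_filter + asciifolding + lowercase chain:
#   "my antonia", "ben benjamin", "the professors house"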

Crude format of Elasticsearch request

      # f is the facet field currently being processed; aggs accumulates one
      # aggregation per field
      # if nested (field name contains "."), the terms aggregation has extra
      # syntax: it must be wrapped in a "nested" aggregation pointed at the path
      elsif f.include?(".")
        path = f.split(".").first
        aggs[f] = {
          "nested" => {
            "path" => path
          },
          "aggs" => {
            f => {
              "terms" => {
                "field" => f,
                "order" => { type => dir },
                "size" => size
              },
              "aggs" => {
                "top_matches" => {
                  "top_hits" => {
                    "_source" => {
                      "includes" => [ f ]
                    },
                    "size" => 1
                  }
                }
              }
            }
          }
        }
      else
        # non-nested field: plain terms aggregation with the same top_hits
        # sub-aggregation to recover the original display value
        aggs[f] = {
          "terms" => {
            "field" => f,
            "order" => { type => dir },
            "size" => size
          },
          "aggs" => {
            "top_matches" => {
              "top_hits" => {
                "_source" => {
                  "includes" => [ f ]
                },
                "size" => 1
              }
            }
          }
        }
      end
    end

Ends up looking like this:

[screenshot of the generated aggregation request]
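
For a non-nested field such as works, the generated aggregation body would be roughly the following (a sketch reconstructed from the Ruby above, assuming type = "_count", dir = "desc", and size = 20; the real values come from the request parameters):

{
  "aggs" => {
    "works" => {
      "terms" => {
        "field" => "works",                 # buckets keyed by the normalized value
        "order" => { "_count" => "desc" },
        "size"  => 20
      },
      "aggs" => {
        "top_matches" => {
          "top_hits" => {
            "_source" => { "includes" => ["works"] },   # original, un-normalized text
            "size"    => 1
          }
        }
      }
    }
  }
}
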
techgique commented 4 years ago

Do you think your current solution will really be a big performance problem? It seems like pretty straightforward code that won't be operating over huge sets of data.

I assume the lack of enthusiasm is just that you haven't figured out a way to get what you want back directly from Elasticsearch without further massaging it in the Rails app. Am I missing anything? :thinking:

jduss4 commented 4 years ago

Yes, that's essentially the source of my lack of enthusiasm. Also, I'm just not that excited that I have to imitate the normalization logic we are already applying when things are ingested into Elasticsearch, but I don't know a better way around it for this particular task. Sigh.