Open jduss4 opened 4 years ago
Some documents that might be important while solving this problem.
Schema Setup: Small Example
settings:
  analysis:
    char_filter:
      escapes:
        type: mapping
        mappings:
          - "<em> => "
          - "</em> => "
          - "<u> => "
          - "</u> => "
          - "<strong> => "
          - "</strong> => "
          - "- => "
          - "& => "
          - ": => "
          - "; => "
          - ", => "
          - ". => "
          - "$ => "
          - "@ => "
          - "~ => "
          - "\" => "
          - "' => "
          - "[ => "
          - "] => "
    normalizer:
      keyword_normalized:
        type: custom
        char_filter:
          - escapes
        filter:
          - asciifolding
          - lowercase
mappings:
  properties:
    works:
      type: keyword
      normalizer: keyword_normalized
Crude format of Elasticsearch request
  # if nested, has extra syntax
  elsif f.include?(".")
    path = f.split(".").first
    aggs[f] = {
      "nested" => {
        "path" => path
      },
      "aggs" => {
        f => {
          "terms" => {
            "field" => f,
            "order" => { type => dir },
            "size" => size
          },
          "aggs" => {
            "top_matches" => {
              "top_hits" => {
                "_source" => {
                  "includes" => [ f ]
                },
                "size" => 1
              }
            }
          }
        }
      }
    }
  else
    aggs[f] = {
      "terms" => {
        "field" => f,
        "order" => { type => dir },
        "size" => size
      },
      "aggs" => {
        "top_matches" => {
          "top_hits" => {
            "_source" => {
              "includes" => [ f ]
            },
            "size" => 1
          }
        }
      }
    }
  end
end
Ends up looking like
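As a rough sketch of the shape this produces, both branches of the snippet above can be wrapped in a small helper and dumped as JSON. The helper name `build_agg`, the default values for `type`, `dir`, and `size`, and the nested field name `person.name` are assumptions made for illustration; only the hash structure comes from the snippet.

```ruby
require "json"

# Hypothetical helper wrapping both branches of the snippet above.
# `type`, `dir`, and `size` mirror the variables used there; the
# defaults are illustrative assumptions, not the app's real values.
def build_agg(f, type: "_count", dir: "desc", size: 20)
  terms = {
    "terms" => {
      "field" => f,
      "order" => { type => dir },
      "size" => size
    },
    "aggs" => {
      "top_matches" => {
        "top_hits" => {
          "_source" => { "includes" => [f] },
          "size" => 1
        }
      }
    }
  }
  return terms unless f.include?(".")

  # Nested fields wrap the same terms aggregation in a "nested" aggregation
  # whose path is the part of the field name before the first dot.
  {
    "nested" => { "path" => f.split(".").first },
    "aggs" => { f => terms }
  }
end

# Print the request fragment for the non-nested "works" field.
puts JSON.pretty_generate({ "aggs" => { "works" => build_agg("works") } })
```

Calling `build_agg("person.name")` instead would return the nested form, with `"path" => "person"` and the terms aggregation keyed by the full field name.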
Do you think your current solution will really be a big performance problem? It seems like pretty straightforward code that won't be operating over huge sets of data.
I assume the lack of enthusiasm is just that you haven't figured out a way to get what you want back directly from Elasticsearch without further massaging it in the Rails app. Am I missing anything? :thinking:
Yes, that's essentially the source of my lack of enthusiasm. Also, I'm just not that excited about having to imitate the normalization logic that we already apply when things are ingested into Elasticsearch, but I don't know a better way around it for this particular task. Sigh.
While working on #96, in which my goal was to ignore markup (<em>), some unicode chars (Á, ø, etc), and unimportant characters at the beginning of titles (", [), I was 99% of the way there when I ran into an interesting problem with fields that had multiple values pushed to them.
When using a top_hits aggregation and asking for the _source field back, on single valued fields I got something along the lines of (pseudo code):
HOWEVER if there were multiple values from specific documents which were determined to be the "top hit", then this happened:
I looked into the idea of using a "scripted" field instead to try to return only the SINGLE most relevant result, but I kind of got bogged down there trying to figure it out. Also, the documentation for script fields says (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-script-fields):
For now, I am normalizing the "source" fields coming back only if they are an array and then attempting to match them against the already normalized version in order to figure out which one to display. It is not a good solution, and I would like to investigate this more in the future.
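A minimal sketch of that workaround, assuming the bucket key returned by the terms aggregation is the already normalized value. The names `normalize` and `display_value`, the sample titles, and the fallback to the first array element are all hypothetical. The folding step only approximates Elasticsearch's asciifolding filter, which covers far more characters than this.

```ruby
# Strings removed by the "escapes" char_filter in the schema above.
ESCAPES = ["<em>", "</em>", "<u>", "</u>", "<strong>", "</strong>",
           "-", "&", ":", ";", ",", ".", "$", "@", "~", "\"", "'", "[", "]"].freeze
ESCAPE_RE = Regexp.union(ESCAPES)

# Characters that NFD decomposition does not split into base + accent.
# Assumption: a tiny hand-picked table, not the full asciifolding set.
FOLD_EXTRA = { "ø" => "o", "Ø" => "O" }.freeze

# Approximate the keyword_normalized normalizer:
# char_filter first, then asciifolding, then lowercase.
def normalize(value)
  stripped = value.gsub(ESCAPE_RE, "")
  folded = stripped.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  folded = folded.gsub(/[øØ]/, FOLD_EXTRA)
  folded.downcase
end

# Pick the value to display for a bucket. Single values pass through;
# for arrays, choose the element whose normalized form equals the bucket key.
# Falling back to the first element is an assumption for illustration.
def display_value(source_value, bucket_key)
  return source_value unless source_value.is_a?(Array)

  source_value.find { |v| normalize(v) == bucket_key } || source_value.first
end

# Hypothetical multi-valued _source for one top hit.
source = ["<em>Á</em>ntonia", "\"O Pioneers!\""]
puts display_value(source, "o pioneers!")
```

Keeping the escape list in one constant at least makes it easier to compare against the schema by eye, though it still duplicates the index settings in the Rails app.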