MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

ElasticSearch performance/memory for field analysis #206

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

When clicking a single field name in the field analysis table, the resulting query is prohibitively slow and large, e.g.:

http://192.168.45.10/combine/analysis/es/index/j147/field_analysis?field_name=dc_title

This routes to field_analysis, which runs the following:

# get field name
field_name = request.GET.get('field_name')

# get ESIndex
esi = models.ESIndex(es_index)

# get analysis for field
field_metrics = esi.field_analysis(field_name, metrics_only=True)

Which, specifically, runs field_analysis from core.models.

A similar method, count_indexed_fields, runs quickly and fairly effortlessly for the Job as a whole, and provides similar statistical metrics for each field (coverage, uniqueness, etc.).

Need to investigate what the breaking/expensive difference between the two is.

By contrast, clicking the link in the "Documents with Field" column returns quickly, e.g.:

http://192.168.45.10/combine/analysis/es/index/j147/field_analysis/docs/exists?field_name=dc_title&exists=true

ghukill commented 6 years ago

The offending bit is adding an aggregation for field terms:

# add agg bucket for field values
self.query.aggs.bucket(self.field, A('terms', field='%s.keyword' % self.field, size=terms_limit))

This happens in both ESIndex.field_analysis and DTElasticFieldSearch.values_per_field (which is used for the JSON response for Datatables). ESIndex.field_analysis is run with the metrics_only flag, which ensures the terms aren't fetched twice, but this does not address the core problem.

Formerly, terms_limit was set to 1,000,000, which was clearly for testing / dev purposes and is not sustainable. Requesting via DTElasticFieldSearch helps somewhat, because the results are paginated, but the entire set of terms is still returned from Elasticsearch before being truncated and sent off as JSON.
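
For reference, the DSL call above serializes to roughly the following request body. This is a minimal sketch built as a plain dict; the helper name build_terms_agg_body and the top-level "size": 0 (which suppresses document hits) are assumptions for illustration, not Combine's actual code.

```python
import json


def build_terms_agg_body(field, terms_limit):
    """Build a search body with one terms aggregation on `<field>.keyword`.

    Hypothetical helper for illustration only; Combine builds the
    equivalent structure through elasticsearch_dsl.
    """
    return {
        "size": 0,  # assumption: skip document hits, return only the aggregation
        "aggs": {
            field: {
                "terms": {
                    "field": "%s.keyword" % field,
                    "size": terms_limit,  # maximum number of buckets returned
                }
            }
        },
    }


# The former limit from this thread, applied to the dc_title field.
body = build_terms_agg_body("dc_title", 1000000)
print(json.dumps(body, indent=2))
```

The size inside the terms block is the number of buckets Elasticsearch is asked to return, so on a high-cardinality field a limit of 1,000,000 can mean up to a million buckets built and serialized per request.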

What to do here? It would be trivial to limit the number of terms, but it was a defining feature that a user could look for a specific term and see whether it exists, and how many instances there are.

Viewing all values is very cheap (the second link in the first comment), but this does not provide a count.

ghukill commented 6 years ago

This looks promising: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html

ghukill commented 6 years ago

But it appears to be available only in 6.x, as it does not show up in the 5.5 documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-terms-aggregation.html
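
For the record, a paged request with the composite aggregation might look something like the sketch below, assuming a 6.x cluster. The helper name composite_page_body, the aggregation name field_values, and the source name value are all hypothetical; nothing like this exists in Combine.

```python
def composite_page_body(field, page_size, after_key=None):
    """Build one page of a composite aggregation over `<field>.keyword`.

    Hypothetical helper; requires an Elasticsearch version that supports
    the composite aggregation (6.x), which 5.5 does not.
    """
    composite = {
        "size": page_size,  # buckets per page
        "sources": [
            {"value": {"terms": {"field": "%s.keyword" % field}}}
        ],
    }
    if after_key is not None:
        # Resume after the key of the last bucket from the previous page.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"field_values": {"composite": composite}}}


first_page = composite_page_body("dc_title", 1000)
# "horse" stands in for the key of the last bucket on the first page.
next_page = composite_page_body("dc_title", 1000, after_key={"value": "horse"})
```

One caveat: composite buckets come back ordered by key rather than by document count, so this would page through every value with its count, but would not by itself give a "most frequent terms first" ordering.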

ghukill commented 6 years ago

The fix for now has been to limit this to 10k terms per request, with the understanding that filtering (in the DT search box) will still filter across all results and return the top 10k that match.

So if you had term aggs like:

horse (10)
goober (9)
tronic (8)
foo (2)
bar (1)

and limited the size to 4, you would get the following without bar:

horse (10)
goober (9)
tronic (8)
foo (2)

but a search for bar would return:

bar (1)
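
The same behavior can be simulated in plain Python. This is only a sketch of the logic, not what Elasticsearch or Combine runs: the function name top_terms is made up, and substring matching is an assumption about how the search box filters.

```python
from collections import Counter


def top_terms(counts, size, search=None):
    """Return up to `size` (term, count) pairs, highest count first.

    If `search` is given, terms are filtered BEFORE the size limit is
    applied, so rare terms remain reachable through search.
    """
    items = counts.items()
    if search:
        # Assumption: simple substring match on the term.
        items = [(term, n) for term, n in items if search in term]
    # Sort by descending count, breaking ties alphabetically.
    return sorted(items, key=lambda pair: (-pair[1], pair[0]))[:size]


counts = Counter({"horse": 10, "goober": 9, "tronic": 8, "foo": 2, "bar": 1})

print(top_terms(counts, 4))                # bar is cut off by the size limit
print(top_terms(counts, 4, search="bar"))  # bar is found when searched for
```

The point is the order of operations: because the filter runs before the limit, the limit only hides terms when nothing narrower has been asked for.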

Closing.