Closed gjost closed 3 years ago
ddrstage (stage) is running v5.0.4 pub: 6df729f49444453c72d16ed14f4177dcf3b5716d (HEAD -> master, origin/master, origin/HEAD) 2020-08-05 12:54:39 -0700
ddr (production) is running v5.0.3 pub: a3f2a796d6651cfe9e0966639df2eb040ecfd5b3 (HEAD -> master, origin/master, origin/HEAD) 2020-07-23 10:21:18 -0700
ES stage cluster is running v7.9.0 ES production cluster is running v7.8.0
rgstage (stage) and resourceguide (production) appear to be functioning correctly as well. Both are running v3.3.1 of encyc-rg
Note that the Elasticsearch production cluster has not yet been updated to 7.9.0.
For the time being apt
upgrades for the elasticsearch
package have been disabled in the production cluster (rosie,faker,butch) using sudo apt-mark hold elasticsearch
. Holds may be listed using apt-mark showhold
and removed using sudo apt-mark unhold PACKAGE
.
It looks like there's a way to do this using Ansible but it uses dpkg
which seems to have a different hold mechanism from apt
's. For the time being I've added a note that breaks the elastic7.yml
playbook just to make things obvious and record this state in the ansible-colo
repository.
I found a backup deb deb10es
(local development Elasticsearch server VM) that still had 7.8.1 and loaded it. It has the same issue with Search but NamesDB is fine. I guess Elasticsearch may not be the actual problem. However, I rolled my local ddr-public
back to v5.0.3 (a3f2a79
) and it has the same error.
I really want this to be Elasticsearch's fault and not my own code's, but even if it is we're not going back to 7.8.1. Sigh.
Tracebacks for ddrpublic/us/search.py:298
.
In both production and stage the Elasticsearch query
looks like this:
{
"_source": [
"id", "model", "links_html", "..."
],
"aggs": {
"facility": {
"aggs": {
"facility_ids": {"terms": {"field": "facility.id", "size": 1000}}
},
"nested": {"path": "facility"}
},
"format": {"terms": {"field": "format"}},
"genre": {"terms": {"field": "genre"}},
"rights": {"terms": {"field": "rights"}},
"topics": {
"aggs": {
"topics_ids": {"terms": {"field": "topics.id", "size": 1000}}
},
"nested": {"path": "topics"}
}
},
"query": {
"query_string": {
"allow_leading_wildcard": False,
"analyze_wildcard": False,
"default_operator": "AND",
"fields": [
"id", "model", "links_html", "..."
],
"query": "minidoka"}
}
}
The only difference is that v5.0.4 (stage) adds the record_lastmod
field to query['_source']
and query['query_string']['fields']
.
On the (working) production ddr.densho.org, pointing at the production Elasticsearch cluster, results.aggregations.to_dict()
looks like this:
{
"topics": {
"doc_count": 1542,
"topics_ids": {
"buckets": [
{"doc_count": 221, "key": "416"},
{"doc_count": 183, "key": "173"},
"..."
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0}
},
"facility": {
"doc_count": 1352,
"facility_ids": {
"buckets": [
{"doc_count": 1217, "key": "8"},
{"doc_count": 31, "key": "55"},
"..."
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0}
},
"format": {
"buckets": [
{"doc_count": 612, "key": "doc"},
{"doc_count": 532, "key": "img"},
"..."
],
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0
},
"genre": {
"..."
},
"rights": {
"..."
}
}
On the (broken) ddrstage.densho.org, also pointing at the production Elasticsearch cluster, results.aggregations.to_dict()
looks like this:
{
'topics': {
'doc_count': 0
},
'facility': {
'doc_count': 0
},
'format': {
'buckets': [],
'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0
},
'genre': {
'buckets': [],
'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0
},
'rights': {
'buckets': [],
'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0
}
}
Using git diff a3f2a79..6df729f
to compare the two versions (a3f2a79
is v5.0.3
and 6df729f
is v5.0.4
), the only differences in the code (aside from edits to Makefile
and other non-code files) are:
ddrpublic/ddrpublic/settings.py
adds django.contrib.sitemaps
and robots
to INSTALLED_APPS
.ddrpublic/ui/search.py
adds record_lastmod
to SEARCH_INCLUDE_FIELDS
.ddrpublic/ui/sitemaps.py
added.ddrpublic/ui/templates/ui/base.html
links to favicons.ddrpublic/ui/urls.pt
adds URL paths to robots.txt
and sitemap.xml
.Looks like the elasticsearch-py
is different between the two installs:
(ddrpublic) ansible@PRODUCTION:/opt/ddr-public$ pip freeze | grep elastic
elasticsearch==7.7.1
elasticsearch-dsl==7.2.1
(ddrpublic) ansible@STAGE:/opt/ddr-public$ pip freeze|grep elastic
elasticsearch==7.8.1
elasticsearch-dsl==7.2.1
I rolled elasticsearch-py
on stage back to version 7.7.1
, restarted stage ddrpublic
, cleared the Redis cache, and rand the search again, and got the same error as before.
elasticsearch-py
and elasticsearch-dsl-py
.Just for laughs I will now roll ddrpublic
back to v.5.0.3 on stage and see what happens...
Rolled stage ddrpublic
back to v.5.0.3 and it works with the production Elasticsearch 7.8.0.
Pointed stage at the stage Elasticsearch 7.9.0 cluster and... it works.
The more I look at this the more surprised I am that production ddrpublic search is working at all.
This is a Python shell distillation of what's happening in the search view:
from django.conf import settings
from ui import search
params = {'fulltext': 'minidoka'}
searcher = search.Searcher()
searcher.prepare(
params=params,
params_whitelist=search.SEARCH_PARAM_WHITELIST,
search_models=search.SEARCH_MODELS,
fields=search.SEARCH_INCLUDE_FIELDS,
fields_nested=search.SEARCH_NESTED_FIELDS,
fields_agg=search.SEARCH_AGG_FIELDS,
)
limit,offset = settings.RESULTS_PER_PAGE, 0
start = int(offset)
stop = (start + int(limit))
response = searcher.s[start:stop].execute()
ui.search
prepares an elasticsearch_dsl.search.Search
object from params in request
. None of this code has changed since v5.0.3 with the exception of adding a few fields to those returned from the function.
In the last line, searcher.s
is the elasticsearch_dsl.search.Search
object.
The .execute()
function is elasticsearch-dsl
code not ddr-public
code.
Basically all this is leading up to running elasticsearch_dsl.search.Search.execute
. This returns response
which is an elasticsearch_dsl.response.Response
object.
The results.aggregations
mentioned above is this elasticsearch_dsl.response.Response
object (please forgive the naming - I guess a response from the server contains search results for the user?).
This works when running in the production ddrpublic
virtualenv but does not work in the stage virtualenv, even though they both use the same version of elasticsearch-py
and elasticsearch-dsl-py
and they are both pointed at the same Elasticsearch cluster.
ddrpublic | 5.0.3 | 5.0.3 | 5.0.3 | 5.0.4 | 5.0.4 | 5.0.4 |
---|---|---|---|---|---|---|
elastic-py | 7.7.1 | 7.7.1 | 7.7.1 | 7.7.1 | 7.9.1 | 7.9.1 |
elastic-dsl-py | 7.2.1 | 7.2.1 | 7.2.1 | 7.2.1 | 7.2.1 | 7.2.1 |
elasticsearch | 7.8.0 | 7.9.0 | 7.8.0 | 7.8.0 | 7.9.0 | 7.9.0 |
record_lastmod | yes | yes | no | yes | ||
OK | OK | FAIL | FAIL | OK | FAIL |
Looks like it was the presence of record_lastmod
in SEARCH_INCLUDED_FIELDS that broke things. WTF Elasticsearch?
Answer: SEARCH_INCLUDED_FIELDS
is the list of fields on which fulltext search is performed, NOT (as I thought) the list of fields that are returned by the search. record_lastmod
is a datetime so you can't perform a fulltext search on it [thanks @GeoffFroh!].
That was a hard one. I think it was a combination of factors:
elasticsearch-py
and elasticsearch-dsl-py
updatesAll that prejudiced me to focus on Elasticsearch and libs as the problem instead of looking at my own code.
I need to go through my search code and look for try/except clauses. I might have hidden an error that would have surfaced this more easily
Fixed as of ddr-public
commit fc0c958
for upcoming package ddrpublic-master_5.0.5~deb10
.
On Friday I ran the usual
apt-get update / upgrade
in the colo. I upgraded the stage cluster and it looks like it broke search and possibly other parts of the DDR site. I'm taking stock of what's broken.Looks like the Encyclopedia (encycstage) is OK.
ddrpublic search is throwing a
KeyError
looking for'topics_ids'
inddrpublic/ui/search.py
line 305:ddrpublic-161-KeyError_at search.zip