During solr restart, web UI caches error page, API returns 409 and 502

FuhuXia commented 1 year ago

During a solr restart, solr may briefly send bad responses to CKAN, and CKAN renders a search error page from them. CloudFront may cache that page, so users keep seeing the error page for quite a while until the cache expires.

How to reproduce

Hard to reproduce on demand; it only shows up in the short window during a solr restart.

Expected behavior

No error page

Actual behavior

Search Error page cached.

Sketch

For now, the manual fix is to clear the CloudFront cache for /*. A quick fix would be to shorten the cache time for the most visited pages (e.g. /dataset) to 1 minute so that the error won't last long.
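The manual cache clear could be scripted. A minimal sketch, assuming boto3 is available and the caller passes in a client created with boto3.client("cloudfront"); the function name and the reference prefix are placeholders, not anything from our deployment:

```python
import time
from typing import Any, Optional


def invalidate_all_paths(client: Any, distribution_id: str,
                         caller_reference: Optional[str] = None) -> str:
    """Ask CloudFront to drop every cached object and return the invalidation ID.

    `client` is expected to be boto3.client("cloudfront"); it is passed in so
    the function can be exercised with a stub instead of real AWS credentials.
    """
    # CallerReference must be unique per request; a timestamp is enough here.
    reference = caller_reference or f"solr-restart-{time.time_ns()}"
    response = client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/*"]},
            "CallerReference": reference,
        },
    )
    return response["Invalidation"]["Id"]
```

Running this as the last step of a solr restart job would replace the manual clear, at the cost of flushing the whole cache each time.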

Long term, I can think of four ways to resolve or alleviate the issue:

1. Tell CKAN to respond with a 5xx code instead of 200 on a solr error, so that the error page is not cached, or
2. Optimize the solr restart process to shorten or eliminate the window of bad responses, or
3. Automate the process so that every solr restart is followed by a CF cache clear, or
4. Make the catalog-web to solr connection sticky, so each ckan app is talking to one solr follower. We can terminate the ckan app once we detect it is talking to a bad solr, and after restart it will connect to a randomly chosen solr (hopefully a good one).
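To make the first option concrete, here is a rough sketch of the idea, not CKAN's actual code. SearchBackendError and render_search are hypothetical names standing in for whatever CKAN raises and renders on a bad solr reply; the point is only that the failure path returns 503 with headers that discourage caching instead of a 200 error page:

```python
from typing import Callable, Dict, Tuple


class SearchBackendError(Exception):
    """Hypothetical stand-in for the exception raised on a bad solr response."""


def render_search(run_query: Callable[[], str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a search page request."""
    try:
        body = run_query()
    except SearchBackendError:
        # Signal a temporary failure and tell caches not to store the page.
        headers = {"Cache-Control": "no-store", "Retry-After": "30"}
        return 503, headers, "Search is temporarily unavailable."
    # Success path: cache headers are left to the existing configuration.
    return 200, {}, body
```

How long CloudFront holds on to a 5xx response still depends on the distribution's error caching settings, so those would need checking as well.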

nickumia-reisys commented 1 year ago

For Option 2 above, we could add functionality to our solr docker image that runs a more comprehensive health check of solr and publishes the result at a new endpoint. For example, we could add an nginx service to the solr docker image that checks solr health and updates the /healthcheck route, so that the AWS target group is better informed about the state of the solr service.
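A rough sketch of what the deeper check might look like, written in Python rather than nginx config. It assumes a core named "ckan" (a placeholder) and combines solr's ping handler with a zero-row query; whether these two checks actually catch the bad-response window is something we would have to verify:

```python
import json
import urllib.parse
import urllib.request
from typing import Callable


def fetch_json(url: str, timeout: float = 2.0) -> dict:
    """GET a URL and decode the JSON body; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def solr_is_healthy(base_url: str, core: str = "ckan",
                    fetch: Callable[[str], dict] = fetch_json) -> bool:
    """Return True only if the core answers a ping and a real query."""
    query = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    try:
        ping = fetch(f"{base_url}/solr/{core}/admin/ping?wt=json")
        if ping.get("status") != "OK":
            return False
        result = fetch(f"{base_url}/solr/{core}/select?{query}")
        # A well-formed search reply carries an integer hit count.
        return isinstance(result["response"]["numFound"], int)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Network failure, non-2xx status, bad JSON, or an unexpected shape.
        return False
```

The /healthcheck route would then return 200 only when solr_is_healthy(...) is true, so the target group stops routing to a follower that is up but not yet answering queries.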

FuhuXia commented 1 year ago

A user reported package_search API calls returning 409 and then 502 errors during solr restarts.

GSA / data.gov