GSA / data.gov

Main repository for the data.gov service
https://data.gov

Harvester causes SOLR cloud to crash #3783

Open jbrown-xentity opened 2 years ago

jbrown-xentity commented 2 years ago

When a production CKAN catalog is implemented and indexed fully, the harvesters crash. Not sure yet why, need to investigate further.

Probably related to #3784.

Part of #1342

How to reproduce

  1. Restore the catalog database from a prod DB backup
  2. Spin up and reindex catalog
  3. Turn on harvesting

Expected behavior

Harvesting succeeds, solr lives on

Actual behavior

SOLR crashes/fails

Sketch

Need to isolate what is causing the problem, check the following use cases:

jbrown-xentity commented 2 years ago

In our clean use case, https://catalog-fxia-datagov.app.cloud.gov/dataset/, 5 harvest sources worked OK (including the largest DCAT-US source and several WAFs). When a 6th was added, https://catalog.data.gov/harvest/about/ioos, SOLR crashed. We confirmed the SOLR state is the same as before (no leader available). If this holds up, we have our test case for reproducing and debugging. It harvested > 200 datasets (but probably < 5K) before SOLR crashed.
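
To make the "no leader available" check repeatable while a harvest is running, something like the following could be polled against the Solr Collections API (`CLUSTERSTATUS` action). This is only a sketch: the base URL, the collection name `ckan`, and the helper names are assumptions for illustration, not what our deployment actually uses.

```python
import json
from urllib.request import urlopen


def shards_without_leader(cluster_status):
    """Return (collection, shard) pairs that have no active leader replica.

    `cluster_status` is the parsed JSON body of a CLUSTERSTATUS response.
    """
    missing = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {}).values()
            # Solr reports the leader flag as the string "true"; accept a bool too.
            has_leader = any(
                str(r.get("leader", "")).lower() == "true"
                and r.get("state") == "active"
                for r in replicas
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return missing


def fetch_cluster_status(solr_base_url):
    """Fetch CLUSTERSTATUS from a SolrCloud node (base URL is an assumption)."""
    url = solr_base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hypothetical usage against a local node:
#   status = fetch_cluster_status("http://localhost:8983/solr")
#   print(shards_without_leader(status))
```

Running this every few seconds during the harvest and logging the first time the list is non-empty should narrow down how many datasets in the crash happens.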

nickumia-reisys commented 2 years ago

The above findings lead me to want to test the following:

**Note: the CKAN 2.8/2.9 differentiator answers the question of whether it's a weird encoding issue with the database dump taken from PY2/CKAN2.8 on FCS. If the same harvest source can be reliably harvested from scratch in a PY3/CKAN2.9 context, then the issue is that the data is not compatible between PY2/PY3.

Apparently, the error was not happening with CKAN/Solr 5 (Standalone Mode). We're not sure if this is a byproduct of,

Without testing all of these cases in order (until the error no longer happens), we won't know what a good configuration is for our application and how to properly choose the right solution. Obviously, we can't go live with anything but Solr 8, but if the problem doesn't exist prior to Solr 8, then we have a bigger Solr problem than if our setup is the issue.