Closed hkdctol closed 1 month ago
catalog-fetch is constantly crashing due to a Solr error.
2024-05-20T09:19:17.10-0400 [APP/PROC/WEB/3] ERR ckan.lib.search.common.SearchIndexError: Solr returned an error: Solr responded with an error (HTTP 400): [Reason: Error:[doc=0e4a1d3078ff068641f1cea20231812bc4d66124] Unknown operation for the an atomic update: fn]
There were 7 stuck jobs in total. We stopped the DOJ one; see the following error in the catalog-fetch log:
2024-05-20T10:14:51.12-0400 [APP/PROC/WEB/0] ERR raise ParentNotHarvestedException('Unable to find parent dataset. Raising error to allow re-run later')
2024-05-20T10:14:51.12-0400 [APP/PROC/WEB/0] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset. Raising error to allow re-run later
The above error crashes the catalog-fetch process. It is a daily job, @hkdctol FYI. For now we have changed the frequency of this harvest source, hhs-cas-json, to Manual to avoid blocking other harvest jobs.
In ticket https://github.com/GSA/data.gov/issues/4223 we listed ParentNotHarvestedException as one of the scenarios that cause catalog-fetch to crash. This error has become a bigger issue over the past few weeks, blocking system-wide harvest jobs. We need to either tell agencies to fix their sources or fix the code so that catalog-fetch does not crash on this error.
This morning catalog-fetch is crashing again, with errors similar to those reported yesterday. We also found a ParentNotHarvestedException error for the dot-socrata-data-json harvest source:
2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR raise ParentNotHarvestedException('Unable to find parent dataset. Raising error to allow re-run later')
2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset. Raising error to allow re-run later
This is a daily job too. We may need to fix the code so that catalog-fetch does not crash on this error.
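One way to keep a single source's missing parent from taking down the whole worker is to catch the exception per item, set the item aside for a later run, and move on. The sketch below is only an illustration of that idea, not the actual ckanext-harvest or ckanext-datajson code: `ParentNotHarvestedException` is redefined locally as a stand-in, and `process_items`, `fake_import`, and the item names are hypothetical.

```python
class ParentNotHarvestedException(Exception):
    """Local stand-in for ckanext.datajson.exceptions.ParentNotHarvestedException."""


def process_items(items, import_one):
    """Run import_one on every item; defer items whose parent is missing.

    Returns (done, deferred) so the caller can re-queue the deferred items
    later instead of letting the exception end the process.
    """
    done, deferred = [], []
    for item in items:
        try:
            import_one(item)
            done.append(item)
        except ParentNotHarvestedException:
            # Hypothetical handling: a real worker would also log the error here.
            deferred.append(item)
    return done, deferred


def fake_import(item):
    # Hypothetical importer: pretend child datasets cannot find their parent yet.
    if item.startswith("child"):
        raise ParentNotHarvestedException("Unable to find parent dataset.")


done, deferred = process_items(["parent-1", "child-1", "dataset-2"], fake_import)
```

Whether deferred items should be retried in the same run or left for the next scheduled harvest is a design decision this sketch does not settle.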
catalog-fetch is constantly crashing due to Solr errors. Restarting the Solr leader did not help.
Rebuilding an individual index works without issue.
Manually stopped 13 running jobs and scaled catalog-fetch to 1, then re-harvested the sources one by one to try to reproduce this issue.
So far the doj-json harvest has reported the same Solr error. Continuing to test to narrow down which resource could cause this issue.
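The message "Unknown operation for the an atomic update: fn" is what Solr reports when a field value arrives as a JSON object whose key is not an atomic-update operation. One way to narrow down the offending resource is therefore to look for dict-valued fields in the document being indexed. The sketch below assumes that is the cause here, which this ticket has not confirmed. `find_suspect_fields` and the sample document are hypothetical, and the operation list should be checked against the Solr version in use.

```python
# Atomic-update operations documented by Solr; verify against the deployed version.
ATOMIC_OPS = {"set", "add", "add-distinct", "remove", "removeregex", "inc"}


def find_suspect_fields(doc):
    """Return {field: [unknown keys]} for dict values Solr would likely reject."""
    suspects = {}
    for field, value in doc.items():
        if isinstance(value, dict):
            unknown = sorted(key for key in value if key not in ATOMIC_OPS)
            if unknown:
                suspects[field] = unknown
    return suspects


# Hypothetical index document: only the doc id and the "fn" key come from the
# log above; the field names and values are made up for illustration.
sample = {
    "id": "0e4a1d3078ff068641f1cea20231812bc4d66124",
    "title": "Example dataset",
    "extras_contact": {"fn": "Jane Doe"},
}
suspects = find_suspect_fields(sample)
```

Running a check like this over the datasets of a failing source would point at the fields worth inspecting first.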
https://github.com/GSA/data.gov/
Check Harvesting Emails
[x] Catalog:
[x] DB-Solr Sync:
4 packages need to be removed from Solr
0 packages need to be updated/added to Solr
974 packages without harvest_object need to be manually deleted
Finished 528s
The catalog-fetch service is frequently crashing due to the following issues:
doj-json harvesting:
ERR ckan.lib.search.common.SearchIndexError: Solr returned an error: Solr responded with an error (HTTP 400): [Reason: Error:[doc=0e4a1d3078ff068641f1cea20231812bc4d66124] Unknown operation for the an atomic update: fn]
dot-socrata-data-json and hhs-cas-json harvesting:
2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR raise ParentNotHarvestedException('Unable to find parent dataset. Raising error to allow re-run later')
2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset. Raising error to allow re-run later
We have set the frequency of dot-socrata-data-json and hhs-cas-json to Manual for now to avoid blocking other harvest jobs.
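To see which exceptions are behind the crashes, log lines in the format quoted in this ticket can be tallied by exception class. This is a rough sketch that assumes every relevant line follows the `timestamp [source] ERR module.path.SomeError: message` shape seen above. `tally_exceptions` is a hypothetical helper, and the log messages in the sample are shortened.

```python
import re
from collections import Counter

# Matches lines such as:
# 2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR some.module.SomeError: message
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<source>[^\]]+)\] ERR "
    r"(?P<exc>[\w.]+(?:Error|Exception)): "
)


def tally_exceptions(lines):
    """Count ERR lines by the short name of the exception class."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # Keep only the class name, dropping the module path.
            counts[match.group("exc").rsplit(".", 1)[-1]] += 1
    return counts


log_lines = [
    "2024-05-20T09:19:17.10-0400 [APP/PROC/WEB/3] ERR ckan.lib.search.common.SearchIndexError: Solr returned an error",
    "2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR raise ParentNotHarvestedException('Unable to find parent dataset.')",
    "2024-05-21T09:56:07.47-0400 [APP/PROC/WEB/1] ERR ckanext.datajson.exceptions.ParentNotHarvestedException: Unable to find parent dataset.",
]
counts = tally_exceptions(log_lines)
```

The `raise ...` traceback line is deliberately not counted, so each exception is tallied once per occurrence.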
Prod catalog-gather
A ProxyError was reported for multiple harvest sources. We went through some troubleshooting steps, including restarting the app proxy-gsa-datagov-prod-catalog-gather in space prod-egress. Eventually the issue was resolved, most likely thanks to the app restart.
As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time.
Check the O&M Rotation Schedule for future planning.
Acceptance criteria
You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.
Daily Checklist
Weekly Checklist
Monthly Checklist
Ad-hoc Checklist
Reference