Open nickumia-reisys opened 1 year ago
@FuhuXia will review and create issue.
Not sure why `db-solr-sync` errored out on 6/22, but maybe cloud.gov needed to terminate the task for some reason? I think it was mostly finished doing what it needed to do. It's just weird..
5306 packages without harvest_object need to be mannually deleted
hmmm... `mannually` -> `manually` typo.
This error means 5306 packages have discrepancies between the db and solr, but the `db-solr-sync` script does not know how to handle them. Even if you manually index them, CKAN will send a bogus `harvest_object_id` to solr, and you end up with discrepancies again.
The count (5306) will not go away until we do two steps.
1. Run the command `ckan geodatagov harvest-object-relink`. This will fix packages that have a good but not current `harvest_object_id`. After this, another run of `db-solr-sync` or a manual reindex will fix the package. We want to run this command manually when catalog-fetch is idling. An ongoing harvest job does not like this relink script.
2. Run a batch delete via the API. This will purge the packages that have no `harvest_object_id`. All of our datasets are harvested, so all should have a `harvest_object_id`. A package without one is bad: the catalog has no way to manage it, and we should not hesitate to eliminate it. One scenario that causes this kind of package: the source of a duplicated package was removed, the good copy of the duplicate was deleted on harvesting, and the bad copy stays behind, becoming a package without `harvest_object_id`. I use this script.
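For anyone following along, here is a rough sketch of what a batch purge via the API could look like. This is not the script referenced above. It assumes the package ids have already been collected somewhere (for example from the `db-solr-sync` output), that the key belongs to a sysadmin, and that CKAN's standard `dataset_purge` action is used. The names `purge_packages` and `_http_post` are made up for illustration.

```python
import json
import urllib.request


def _http_post(url, payload, api_key):
    """POST a JSON payload to a CKAN action endpoint and return the decoded reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def purge_packages(package_ids, api_url, api_key, post=None):
    """Purge each package id via dataset_purge; return the ids that failed.

    `post` is injectable so the loop can be dry-run or tested without a live CKAN.
    """
    post = post or _http_post
    failed = []
    for pkg_id in package_ids:
        try:
            result = post(f"{api_url}/api/3/action/dataset_purge", {"id": pkg_id}, api_key)
            if not result.get("success"):
                failed.append(pkg_id)
        except Exception:
            # HTTP errors (e.g. 404 for an already-purged package) land here.
            failed.append(pkg_id)
    return failed
```

Collecting failures instead of stopping at the first error means one bad id does not block the remaining packages, and the returned list can be retried or inspected afterwards.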
After the two steps run, the count should be 0, but it is expected to become hundreds then thousands again in a couple weeks.
> After the two steps run, the count should be 0, but it is expected to become hundreds then thousands again in a couple weeks.
"expected" 🤨 .... suuuuurrre.
Workflow with Issue: 4 - Automated CKAN Jobs
Job being audited: ckan-auto-command
CKAN Command (in question): `ckan geodatagov db-solr-sync`
CKAN Command Schedule: 0 3 *
Cloud.gov Environment: prod
Total Execution Time: 496
Last Commit: de963a357068574c8da0580434779d8db7076d03
Number of times run: 1
Last run by: nickumia-reisys
Github Action Run: https://github.com/GSA/catalog.data.gov/actions/runs/12002518348