GSA / data.gov

Main repository for the data.gov service
https://data.gov

ParentNotHarvestedException error crashes catalog-fetch #4775

Closed FuhuXia closed 4 days ago

FuhuXia commented 3 weeks ago

When a parent dataset is deleted from a datajson source, each of its child datasets makes the catalog-fetch process crash. This slows down the harvest process and often leaves other harvest jobs stuck for days.

How to reproduce

https://github.com/GSA/data.gov/issues/4755#issuecomment-2122747761 https://github.com/GSA/data.gov/issues/4772#issuecomment-2147608839

Expected behavior

Logs an error and continues to the next item in the fetch queue.
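
To illustrate the expected behavior, here is a minimal sketch of a fetch loop that logs the error and moves on. It is not the harvester's actual code: `drain_fetch_queue`, `fetch_one`, and the locally defined `ParentNotHarvestedException` are stand-ins made up for this example.

```python
import logging

log = logging.getLogger("fetch_consumer_sketch")


class ParentNotHarvestedException(Exception):
    """Stand-in for the harvester's exception, defined here so the sketch runs."""


def drain_fetch_queue(items, fetch_one):
    """Call fetch_one on every item; log and skip items whose parent is missing."""
    processed, skipped = [], []
    for item in items:
        try:
            fetch_one(item)
        except ParentNotHarvestedException as exc:
            # Log the failure and keep going instead of letting the process die.
            log.error("Skipping %s: %s", item, exc)
            skipped.append(item)
            continue
        processed.append(item)
    return processed, skipped
```

With a loop like this, one orphaned child dataset costs a log line rather than the whole fetch process.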

Actual behavior

crashes catalog-fetch

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

FuhuXia commented 4 days ago

It is ok to crash the process, as long as the app stays up.

Crashing the fetch-consumer process on ParentNotHarvestedException is tied to how child datasets are processed before the parent dataset is created. Rather than changing that behaviour, we can simply restart the process right after a crash so that the catalog-fetch app stays up, as done in the PR above.
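
For readers who want to see the restart-on-crash idea in code, a rough sketch follows. It is not what the PR does (the PR may well use a shell loop or a process supervisor); `run_with_restart` and its parameters are hypothetical names for illustration only.

```python
import subprocess
import time


def run_with_restart(cmd, max_restarts=None, delay_seconds=1.0):
    """Run cmd; each time it exits non-zero, start it again.

    Returns the number of restarts performed. With max_restarts=None the
    loop only ends when the command exits cleanly (exit code 0).
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # clean exit, nothing to restart
        if max_restarts is not None and restarts >= max_restarts:
            return restarts  # give up after the configured number of restarts
        restarts += 1
        time.sleep(delay_seconds)  # short pause so a crash loop does not spin
```

The wrapper process stays alive across crashes of the child command, which is the property that keeps the app up; `cmd` would be whatever command line starts the fetch consumer in a given deployment.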

FuhuXia commented 4 days ago

Deployed to prod. Reran some datajson sources that had previously been switched to manual. Still seeing ParentNotHarvestedException, but no more app crashes.