ParentNotHarvestedException raised when parent dataset is present

FuhuXia commented 5 months ago

When harvesting a data.json source on catalog.data.gov, a ParentNotHarvestedException can be raised even though the parent dataset is present in the file. This happens when the parent dataset is listed after its children in the data.json and multiple fetch-consumer processes are running: any child dataset handled by a fetch-consumer process that does not also handle the parent dataset fails.

How to reproduce

Run 4 catalog-fetch instances. Harvest a data.json file that contains multiple child datasets and one parent dataset, with the parent dataset listed last. The sample file (data.json) can be used; it has 8 child datasets (c1 - c8) and a parent dataset (p1).
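For anyone without the sample file at hand, a catalog with the same shape could be generated roughly as below. This is only a sketch, not the attached file: it assumes children point at their parent through the `isPartOf` field of the data.json schema, the function name `build_sample_catalog` and the titles are made up for illustration, and the other fields a real data.json entry requires are left out.

```python
import json


def build_sample_catalog(num_children=8):
    """Build a trimmed data.json-style catalog with the parent listed last.

    Hypothetical helper: only the fields relevant to the parent-child
    ordering are included, so this is not a schema-valid catalog.
    """
    children = [
        {
            "identifier": f"c{i}",
            "title": f"Child dataset {i}",
            "isPartOf": "p1",  # refers to the parent's identifier
        }
        for i in range(1, num_children + 1)
    ]
    parent = {"identifier": "p1", "title": "Parent dataset"}
    # Children come first and the parent last, matching the reproduction steps.
    return {"dataset": children + [parent]}


catalog = build_sample_catalog()
print(json.dumps(catalog, indent=2))
```

The missing required fields would need to be filled in before pointing a harvest source at the output.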

Expected behavior

All 9 datasets are harvested.

Actual behavior

6 datasets are harvested. The other 3 report a "Parent identifier not found" error.

Context

The parent-child relationship is handled correctly within a single fetch-consumer process.
The next harvest job harvests the 3 remaining datasets.
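The effect can be illustrated with a toy model. It is not the harvester's code. It assumes a child succeeds only if the same worker also handles its parent, and the worker assignment is hypothetical, picked so the counts match the ones reported above. The function name `split_by_parent_visibility` is also made up.

```python
def split_by_parent_visibility(datasets, assignment):
    """Return (harvested, failed) identifiers under the toy model.

    assignment maps each dataset identifier to the worker that handles it.
    A child is harvested only if its own worker also handles its parent.
    """
    # Collect the datasets without a parent that each worker handles itself.
    parents_by_worker = {}
    for ds in datasets:
        if "isPartOf" not in ds:
            worker = assignment[ds["identifier"]]
            parents_by_worker.setdefault(worker, set()).add(ds["identifier"])

    harvested, failed = [], []
    for ds in datasets:
        worker = assignment[ds["identifier"]]
        parent_id = ds.get("isPartOf")
        known_parents = parents_by_worker.get(worker, set())
        if parent_id is None or parent_id in known_parents:
            harvested.append(ds["identifier"])
        else:
            failed.append(ds["identifier"])
    return harvested, failed


# 8 children (c1 - c8) followed by the parent (p1), as in the sample file.
datasets = [{"identifier": f"c{i}", "isPartOf": "p1"} for i in range(1, 9)]
datasets.append({"identifier": "p1"})

# Hypothetical assignment: worker 0 handles p1 and c1 - c5,
# while workers 1 - 3 each handle one remaining child.
assignment = {f"c{i}": 0 for i in range(1, 6)}
assignment.update({"c6": 1, "c7": 2, "c8": 3, "p1": 0})

harvested, failed = split_by_parent_visibility(datasets, assignment)
print(len(harvested), "harvested;", len(failed), "failed:", failed)
```

In a real run the split depends on which fetch-consumer happens to pick up each queue item, so the number of failures can differ from one job to the next.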
Sketch

jbrown-xentity commented 5 months ago

We should add an automated test covering this case in H2.0, and make sure this won't be an issue in the future.

FuhuXia commented 5 months ago

There are quite a few things to consider to get consistent behavior for parent-child datasets. For example, if we allow a parent dataset to be deleted afterward, leaving orphaned child datasets behind, then we should also allow child datasets to be harvested without a parent.

btylerburton commented 1 month ago

Resolved by https://github.com/GSA/data.gov/issues/4847

btylerburton commented 4 weeks ago

Left in the queue to confirm this isn't happening with changes from #4847

FuhuXia commented 3 weeks ago

This still happens in current catalog harvesting, but it is rarely noticeable because a re-harvest job corrects it. It will not happen in H2.0.

GSA / data.gov