Open FuhuXia opened 5 months ago
We should add an automated test covering this case in H2.0, and make sure this won't be an issue in the future.
There are quite a few things to consider to get consistent behavior for parent-child datasets. For example, if we allow a parent dataset to be deleted afterward, leaving orphaned child datasets behind, then we should also allow child datasets to be harvested without a parent.
Resolved by https://github.com/GSA/data.gov/issues/4847
Left in the queue to confirm this isn't happening with changes from #4847
This happens in the current catalog harvesting but is rarely noticeable, since a re-harvest job corrects it. It will not happen in H2.0.
When harvesting a data.json source in catalog.data.gov, if the parent dataset is listed after its child datasets in the data.json, we can see a ParentNotHarvestedException in an environment with multiple fetch-consumer processes, when some child datasets are processed by a fetch-consumer process that does not process the parent dataset.
How to reproduce
Have 4 catalog-fetch instances. Harvest a data.json file with multiple child datasets and a parent dataset, with the parent dataset listed last in the file. The sample file data.json can be used. It has 8 child datasets (c1 - c8) and a parent dataset (p1).
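If the sample file is not at hand, a file with the same shape can be generated. This is a minimal sketch, assuming the children reference the parent through the DCAT-US isPartOf field; the function name build_sample_catalog and the titles are made up for illustration, and the other fields a real data.json entry requires are left out.

```python
import json


def build_sample_catalog(num_children=8):
    """Build a data.json-like catalog with the parent listed last.

    Hypothetical helper for reproducing the issue; real entries need
    more required fields than the ones shown here.
    """
    parent_id = "p1"
    # Children c1..cN point at the parent through isPartOf.
    children = [
        {
            "identifier": f"c{i}",
            "title": f"Child dataset c{i}",
            "isPartOf": parent_id,
        }
        for i in range(1, num_children + 1)
    ]
    parent = {"identifier": parent_id, "title": "Parent dataset p1"}
    return {
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        # The parent goes last on purpose, to match the failing order.
        "dataset": children + [parent],
    }


if __name__ == "__main__":
    print(json.dumps(build_sample_catalog(), indent=2))
```

Redirecting the printed output to a file and serving it as the harvest source should give the same ordering as the case described above.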
Expected behavior
All 9 datasets are harvested.
Actual behavior
6 datasets are harvested. 3 report a "Parent identifier not found" error.
Context
Sketch
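One possible direction, not confirmed as the approach taken in #4847, is to reorder the datasets so that parents come before their children before any of them are queued. The sketch below assumes children reference parents through isPartOf; order_parents_first is a hypothetical helper, not part of the existing harvester. Reordering alone may not be enough with multiple fetch-consumer processes: the parent would still need to be fully harvested before its children are picked up.

```python
def order_parents_first(datasets):
    """Return datasets with every referenced parent ahead of the rest.

    Hypothetical helper. Handles a single level of nesting only: a
    dataset counts as a parent if any other dataset names its
    identifier in isPartOf. Relative order within each group is kept.
    """
    parent_ids = {d["isPartOf"] for d in datasets if d.get("isPartOf")}
    parents = [d for d in datasets if d.get("identifier") in parent_ids]
    others = [d for d in datasets if d.get("identifier") not in parent_ids]
    return parents + others
```

A check along these lines could also serve as the starting point for the automated test mentioned in the comments: feed in c1 - c8 followed by p1 and assert that p1 comes out first.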