GSA / data.gov

Main repository for the data.gov service
https://data.gov

ParentNotHarvestedException raised when parent dataset is present #4801

Open FuhuXia opened 3 days ago

FuhuXia commented 3 days ago

When harvesting a data.json source in catalog.data.gov, a ParentNotHarvestedException can be raised if the parent dataset is listed after its child datasets in the data.json. This happens in an environment with multiple fetch-consumer processes, when some child datasets are processed by a fetch-consumer process that does not process the parent dataset.

How to reproduce

Run 4 catalog-fetch instances. Harvest a data.json file that contains multiple child datasets and one parent dataset, with the parent dataset listed last. The sample file data.json can be used; it has 8 child datasets (c1 - c8) and one parent dataset (p1).
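For anyone without the sample file at hand, a catalog with the same shape can be generated with a few lines of Python. This is only a sketch: it assumes children reference the parent through the DCAT-US `isPartOf` field, the `build_catalog` helper and the titles are made up for illustration, and the other fields a valid data.json requires are left out.

```python
import json


def build_catalog(n_children=8, parent_id="p1"):
    """Build a minimal catalog dict with the parent listed AFTER its children.

    Hypothetical helper: only the fields relevant to the parent/child
    relationship are included, so this is not a schema-complete data.json.
    """
    children = [
        {
            "identifier": f"c{i}",
            "title": f"Child dataset {i}",
            # Assumption: isPartOf holds the parent's identifier.
            "isPartOf": parent_id,
        }
        for i in range(1, n_children + 1)
    ]
    parent = {"identifier": parent_id, "title": "Parent dataset"}
    # Parent goes last on purpose, to match the ordering in this report.
    return {"dataset": children + [parent]}


catalog = build_catalog()
print(json.dumps(catalog, indent=2))
```

Moving `parent` to the front of the list gives the control case for comparison.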

Expected behavior

All 9 datasets are harvested.

Actual behavior

6 datasets are harvested. 3 report a "Parent identifier not found" error.

Context

jbrown-xentity commented 3 days ago

We should add an automated test covering this case in H2.0, and make sure this won't be an issue in the future.
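A rough sketch of what such a test could assert, not tied to the actual H2.0 code: `find_missing_parents` is a hypothetical helper that checks each child's `isPartOf` against every identifier in the source, rather than against whatever happens to have been processed already, so the result cannot depend on ordering.

```python
def find_missing_parents(datasets):
    """Return identifiers of children whose parent is absent from the source.

    Hypothetical helper for illustration; it looks at the whole source up
    front, so the position of the parent in the list does not matter.
    """
    known = {d["identifier"] for d in datasets}
    return [
        d["identifier"]
        for d in datasets
        if d.get("isPartOf") and d["isPartOf"] not in known
    ]


def test_parent_order_does_not_matter():
    children = [{"identifier": f"c{i}", "isPartOf": "p1"} for i in range(1, 9)]
    parent = {"identifier": "p1"}

    # Parent listed last (the case in this issue) and parent listed first
    # should both resolve every child.
    assert find_missing_parents(children + [parent]) == []
    assert find_missing_parents([parent] + children) == []

    # With no parent in the source at all, every child is reported.
    assert find_missing_parents(children) == [f"c{i}" for i in range(1, 9)]


test_parent_order_does_not_matter()
```

A real test would also need to run through the multi-consumer fetch path, since that is where the failure shows up.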

FuhuXia commented 3 days ago

There are quite a few things to consider to get consistent behavior for parent-child datasets. For example, if we allow a parent dataset to be deleted afterward, leaving orphaned child datasets behind, then we should also allow child datasets to be harvested without a parent.