gbif / crawler

The crawling pieces - ws, cli, coordinator
Apache License 2.0

Sequential ingestion attempts can run in parallel #37

Closed: MattBlissett closed 1 year ago

MattBlissett commented 4 years ago

Message-based ingestion ensured that only one attempted crawl–process–index of a dataset could run at a time.

If crawling or ingestion was in progress, a subsequent crawl request would be denied. This was detected by the presence of a crawl record from crawler-ws (i.e. entries in ZooKeeper); that record would later be removed by the Cleanup Coordinator based on processState.occurrence and similar status entries.
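
A minimal sketch of that presence check, assuming crawl records live under a /crawls/<datasetKey> node in ZooKeeper and a Curator client is used; the actual node layout and client wiring in crawler-ws may differ:

```java
import java.util.UUID;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

/**
 * Illustrative "deny while crawling" check: a crawl request is refused
 * while a crawl record for the dataset still exists in ZooKeeper.
 * The /crawls/<datasetKey> path here is an assumption, not the
 * confirmed crawler-ws layout.
 */
public class CrawlLockCheck {

  private final CuratorFramework curator;

  public CrawlLockCheck(String zkConnectString) {
    this.curator = CuratorFrameworkFactory.newClient(
        zkConnectString, new ExponentialBackoffRetry(1000, 3));
    this.curator.start();
  }

  /** Returns true if a crawl record already exists, i.e. a new request should be denied. */
  public boolean isCrawlInProgress(UUID datasetKey) throws Exception {
    String path = "/crawls/" + datasetKey;
    return curator.checkExists().forPath(path) != null;
  }
}
```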

Crawl requests during processing are very common, as a metadata update (e.g. new EML from a DWCA) triggers a registry change, which triggers a crawl request after a minute or two.

Now that message-based processing is no longer used, the Cleanup Coordinator removes the ZK entries as soon as the dataset has been crawled (downloaded) and any checklist processing has completed, although the dataset is sometimes still being processed by Pipelines.

The Cleanup Coordinator needs to wait until pipeline processing has completed before cleaning up a crawl record in ZK. Either it needs to look in the current Pipelines location, or (preferably, since it uses what the API specifies) Pipelines should set the original processState.occurrence entry in ZK for a dataset. If Pipelines makes that change, the cleanup coordinator here will only need small changes to wait for the FINISHED occurrence/sampling status; a sketch of that check follows below.
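
A sketch of the proposed cleanup condition, assuming the occurrence process state is written to a processState/occurrence child node as a plain string such as "FINISHED". The node names and the string value are illustrative assumptions, not the confirmed layout:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.curator.framework.CuratorFramework;

/**
 * Illustrative cleanup guard: only delete a dataset's crawl record
 * once occurrence processing reports FINISHED, so a crawl that is
 * still being processed by Pipelines keeps blocking new attempts.
 */
public class CleanupGuard {

  private final CuratorFramework curator;

  public CleanupGuard(CuratorFramework curator) {
    this.curator = curator;
  }

  /** Returns true if the crawl record was safe to remove and was deleted. */
  public boolean maybeCleanup(UUID datasetKey) throws Exception {
    // Assumed path; the real crawler may store this state elsewhere.
    String statePath = "/crawls/" + datasetKey + "/processState/occurrence";

    if (curator.checkExists().forPath(statePath) == null) {
      return false; // Pipelines has not reported yet; keep the crawl record.
    }

    String state = new String(curator.getData().forPath(statePath), StandardCharsets.UTF_8);
    if (!"FINISHED".equals(state)) {
      return false; // Still processing; do not clean up.
    }

    // Occurrence processing is done; remove the whole crawl record.
    curator.delete().deletingChildrenIfNeeded().forPath("/crawls/" + datasetKey);
    return true;
  }
}
```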

This is probably also necessary when a step is rerun manually, but that is less likely to happen, and much less likely to happen outside working hours.

MattBlissett commented 1 year ago

Shouldn't be a problem any more.