gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Reprocessing old event datasets causes registry state update issues #938

Closed by timrobertson100 1 year ago

timrobertson100 commented 1 year ago

Reprocessing an event dataset that had been unchanged for 2 years hung in "running ingestions". After forcing it to finish in the ingestion history and retrying, the result appeared in the history as shown in this screenshot.

[Screenshot: ingestion history, page 8 (attempt 24)]

The suspicion is that this dataset was last processed before the Event pipelines were deployed, and that there is some kind of mismatch in the messages being passed around.

We could fix up the code, or perhaps it is simpler to touch all the DwC-As to an earlier date and force-crawl all the old event datasets to avoid this situation?

muttcg commented 1 year ago

Fixed the history data for all related datasets. See page 8, attempt 24: https://registry.gbif.org/dataset/c1e31227-6595-4797-b75a-d9d9f75e4cca/ingestion-history

timrobertson100 commented 1 year ago

Thanks @muttcg

MattBlissett commented 1 year ago

NB: mass force-crawling is generally not a great solution, as some datasets will have gone offline (temporarily or permanently).

muttcg commented 1 year ago

@MattBlissett
The reinterpretation scripts start from verbatim-to-interpreted and then run the additional steps. I suggest using the same approach here. Starting from dwca-avro only makes sense if the schema for extended-record.avro has been modified.
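
As a rough illustration of that rule (a sketch only, not the actual reinterpretation scripts): the function names below are hypothetical, the schemas are assumed to be available as parsed JSON, and the fingerprint is a simple hash of sorted JSON rather than Avro's Parsing Canonical Form.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a parsed Avro schema (JSON) in a key-order-independent way.

    Assumption: a plain sorted-JSON hash is good enough for this sketch.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def choose_start_step(schema_at_last_run: dict, current_schema: dict) -> str:
    """Pick the step to restart from (hypothetical helper).

    Restart from dwca-avro only if the extended-record.avro schema changed
    since the dataset was last processed; otherwise reuse the existing
    verbatim output and start from verbatim-to-interpreted.
    """
    if schema_fingerprint(schema_at_last_run) != schema_fingerprint(current_schema):
        return "dwca-avro"
    return "verbatim-to-interpreted"
```

With this approach an unchanged schema never requires the source archive to be reachable, which sidesteps the offline-dataset problem for most reprocessing runs.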