gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Reprocessing from UI doesn't run all steps #951

Closed (timrobertson100 closed 9 months ago)

timrobertson100 commented 10 months ago

Reprocessing a dataset using the UI and stage VERBATIM_TO_INTERPRETED ran only the GRSciColl step (see interpretTypes below).

"message": "{\"datasetUuid\":\"e17bfff0-4cf4-4130-98ea-f9f053eaef4f\",\"attempt\":227,\"interpretTypes\":[\"GRSCICOLL\"],\"pipelineSteps\":[\"HDFS_VIEW\",\"INTERPRETED_TO_INDEX\",\"VERBATIM_TO_INTERPRETED\"],\"runner\":\"STANDALONE\",\"endpointType\":\"DWC_ARCHIVE\",\"extraPath\":null,\"validationResult\":{\"tripletValid\":false,\"occurrenceIdValid\":true,\"useExtendedRecordId\":null,\"numberOfRecords\":13527,\"numberOfEventRecords\":null},\"resetPrefix\":\"202309062059\",\"executionId\":3265940,\"datasetType\":null,\"routingKey\":\"occurrence.pipelines.verbatim.finished.standalone\",\"datasetInfo\":{\"datasetType\":null,\"containsOccurrences\":true,\"containsEvents\":false}}",

It should run all the steps, or give the user the choice.
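To check which interpretation types a message like the one above requested, the nested JSON can be decoded twice. This is only a sketch: the log entry is abbreviated to a few fields copied from the message above, and wrapping the `"message"` fragment in braces to make it a complete JSON object is an assumption about the surrounding log format.

```python
import json

# Abbreviated copy of the logged entry (assumption: the fragment sits inside a JSON object).
log_entry = r'{"message": "{\"datasetUuid\":\"e17bfff0-4cf4-4130-98ea-f9f053eaef4f\",\"attempt\":227,\"interpretTypes\":[\"GRSCICOLL\"],\"pipelineSteps\":[\"HDFS_VIEW\",\"INTERPRETED_TO_INDEX\",\"VERBATIM_TO_INTERPRETED\"]}"}'

# First decode the outer object, then decode the escaped JSON string it carries.
outer = json.loads(log_entry)
message = json.loads(outer["message"])

print(message["interpretTypes"])  # the interpretation types that were requested
print(message["pipelineSteps"])   # the pipeline steps scheduled for the run
```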

timrobertson100 commented 10 months ago

The UI calls this API, e.g. https://api.gbif.org/v1/pipelines/history/run?steps=DWCA_TO_VERBATIM&useLastSuccessful=true&reason=scientificNameID

One can add the following parameters to instruct all steps to run: &interpretTypes=LOCATION,GRSCICOLL,TAXONOMY,METADATA,BASIC,TEMPORAL,CLUSTERING
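For anyone scripting this, a minimal sketch of assembling such a URL might look as follows. The base URL and parameter names are taken from the example above; the helper name `build_run_url` is hypothetical, and joining multiple `steps` with commas is an assumption that mirrors how `interpretTypes` is written. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gbif.org/v1/pipelines/history/run"

# Interpretation types listed in this issue; newer ones may exist.
ALL_INTERPRET_TYPES = [
    "LOCATION", "GRSCICOLL", "TAXONOMY", "METADATA",
    "BASIC", "TEMPORAL", "CLUSTERING",
]


def build_run_url(steps, reason, interpret_types=None, use_last_successful=True):
    """Build the reprocessing URL (hypothetical helper, no request is sent)."""
    params = {
        "steps": ",".join(steps),  # assumption: comma-separated, like interpretTypes
        "useLastSuccessful": str(use_last_successful).lower(),
        "reason": reason,
    }
    if interpret_types:
        params["interpretTypes"] = ",".join(interpret_types)
    # Keep commas unescaped so the URL matches the form quoted in the issue.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_run_url(["DWCA_TO_VERBATIM"], "scientificNameID", ALL_INTERPRET_TYPES)
print(url)
```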

However, rather than adding those to the UI, I suggest changing the API so that if no interpretTypes are given, ALL steps run by default, and leaving the UI as it is. This would be more robust, since new steps may be added in the future and would otherwise silently make scripts people have written out of date.

muttcg commented 9 months ago

Fixed in Registry

timrobertson100 commented 9 months ago

Awesome - thanks @muttcg

If we're using scripts, do we need to pass all the steps, or can we omit the parameter and have it default to all steps, please?

muttcg commented 9 months ago

@timrobertson100 when you don't pass steps in the query, the registry adds all possible steps to the existing message list. So passing them is no longer necessary if you want to run all steps.
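The defaulting behaviour described here can be illustrated with a small sketch. This is not the registry's actual code: the function name `resolve_interpret_types` is hypothetical, and the list of types is simply the one quoted earlier in this thread.

```python
# Interpretation types mentioned in this issue (assumed complete at the time).
ALL_INTERPRET_TYPES = (
    "LOCATION", "GRSCICOLL", "TAXONOMY", "METADATA",
    "BASIC", "TEMPORAL", "CLUSTERING",
)


def resolve_interpret_types(requested=None):
    """Return the requested types, or every known type when none are given."""
    if not requested:
        # No parameter (or an empty one) means "run everything".
        return list(ALL_INTERPRET_TYPES)
    return list(requested)


print(resolve_interpret_types())               # all seven types
print(resolve_interpret_types(["GRSCICOLL"]))  # only the requested type
```

Defaulting on the server side means scripts that omit the parameter pick up any steps added later without being edited.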

timrobertson100 commented 9 months ago

Thanks @muttcg