gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

VERBATIM_TO_IDENTIFIER runs small datasets on Spark #953

Closed by timrobertson100 1 year ago

timrobertson100 commented 1 year ago

The VERBATIM_TO_IDENTIFIER stage runs everything on Spark (YARN), even for tiny datasets such as this one.

We should either set the config threshold to something reasonable (e.g. use Spark only above roughly 1M records or >1 GB uncompressed size) or rework this stage so that it doesn't require distributed computing.
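
For illustration, the first option could look roughly like the sketch below. It is not taken from the pipelines codebase: `DatasetStats`, `choose_runner`, the runner labels and both threshold constants are hypothetical names, and the limits are just the example figures from this comment.

```python
from dataclasses import dataclass

# Hypothetical thresholds, using the example figures from this issue.
# They are not the project's actual configuration values.
MAX_STANDALONE_RECORDS = 1_000_000
MAX_STANDALONE_BYTES = 1024 ** 3  # 1 GB uncompressed


@dataclass(frozen=True)
class DatasetStats:
    """Size information assumed to be known before the stage starts."""
    record_count: int
    uncompressed_bytes: int


def choose_runner(stats: DatasetStats) -> str:
    """Return "DISTRIBUTED" for large datasets, otherwise "STANDALONE".

    A dataset counts as large if it reaches the record limit OR
    exceeds the uncompressed size limit.
    """
    if stats.record_count >= MAX_STANDALONE_RECORDS:
        return "DISTRIBUTED"
    if stats.uncompressed_bytes > MAX_STANDALONE_BYTES:
        return "DISTRIBUTED"
    return "STANDALONE"
```

With these assumed limits, a dataset of a few hundred records would stay off the cluster, while anything at or above 1M records or above 1 GB uncompressed would still go to Spark.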

muttcg commented 1 year ago

There is only one implementation of that workflow: YARN/Beam.

muttcg commented 1 year ago

Deployed to PROD