gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Schedule download tables build in K8 #308

Closed fmendezh closed 1 year ago

fmendezh commented 1 year ago

The download tables build (https://github.com/gbif/occurrence/tree/dev/occurrence-table-build-spark) must be scheduled to run periodically. At the moment this is done by Oozie, and an example Jenkins job exists that runs the build process using Spark: https://builds.gbif.org/job/dev-occurrence-table-build/. The same process must be ported to Spark 3 on K8/Stackable. The Jenkins job can be used as the starting point for migrating this scheduled job to something that runs periodically and submits the job to K8/Stackable.

zaultooz commented 1 year ago

The code changes that enable the map builder to run within the K8 cluster are located on the feature/stackable-hadoop-3-test branch of the Occurrence repository. The main changes revolved around dependencies and their scoping, to cope with class collisions at run time with the Stackable Spark image.

For scheduling, it uses Airflow, like the map builder, to schedule and run the job within the cluster. The DAG describing the job can be found here: https://github.com/gbif/stackable/blob/master/DAGs/dag-files/gbif-occurrence-table-builder-spark.py
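
For readers without access to the DAG, here is a rough sketch of the kind of payload such a scheduled job might submit to the cluster. It is not taken from the linked DAG file. The manifest layout is an assumption based on the Stackable Spark operator's `SparkApplication` resource, and field names may differ between operator versions. The application name, main class, jar path, Spark version and cron expression are all hypothetical placeholders.

```python
import json

# Hypothetical cron expression: run the table build once a day at 02:00.
SCHEDULE = "0 2 * * *"


def build_spark_application(name, main_class, jar_uri,
                            spark_version="3.5.0", executors=2):
    """Return a SparkApplication-style manifest as a plain dict.

    The field layout is an assumption modelled on the Stackable Spark
    operator; check it against the operator version actually deployed.
    """
    return {
        "apiVersion": "spark.stackable.tech/v1alpha1",
        "kind": "SparkApplication",
        "metadata": {"name": name},
        "spec": {
            "mode": "cluster",
            "sparkImage": {"productVersion": spark_version},
            "mainClass": main_class,            # hypothetical class name
            "mainApplicationFile": jar_uri,     # hypothetical jar location
            "executor": {"replicas": executors},
        },
    }


# All values below are placeholders, not the real GBIF configuration.
manifest = build_spark_application(
    name="occurrence-table-build",
    main_class="org.example.TableBuild",
    jar_uri="local:///jobs/occurrence-table-build-spark.jar",
)

if __name__ == "__main__":
    # Print the manifest so it can be inspected or handed to a scheduler task.
    print(json.dumps(manifest, indent=2))
```

In an Airflow setup, a task in the DAG would typically submit a manifest like this to the cluster on the given schedule and then wait for the application to finish. How the actual DAG does this is best checked in the linked file.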