gbif / checklistbank

GBIF Checklist Bank
Apache License 2.0

Migrate the checklistbank index builder to Spark 3 and K8 #311

Open fmendezh opened 1 week ago

fmendezh commented 1 week ago

Currently, an Oozie workflow is used to build an Elasticsearch index from scratch. The workflow has two main tasks:

  1. AvroExporterApp.java: This task reads from the NameUsage API and exports the data as Avro records that can later be easily imported into Elasticsearch.
  2. EsBackfill: This task reads the exported Avro records and creates a new Elasticsearch index. It also handles alias and index swapping.

This process needs to be migrated to Apache Airflow and Spark 3.5.1.

mdoering commented 1 week ago

It would be good to keep the alias swapping and the index build separate; we never run them as one job.
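
To illustrate what a standalone alias-swap step might involve, here is a minimal sketch that only builds the request body for the Elasticsearch `_aliases` endpoint, which applies all listed actions atomically. The function name and the index and alias names are hypothetical and are not taken from the existing EsBackfill code; sending the request is left out.

```python
import json


def build_alias_swap_actions(alias, new_index, old_indices):
    """Build a body for POST /_aliases that repoints `alias` to `new_index`.

    Hypothetical helper: removes the alias from every index in
    `old_indices`, then adds it to `new_index`, all in one request.
    """
    actions = [{"remove": {"index": old, "alias": alias}} for old in old_indices]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


if __name__ == "__main__":
    # Index and alias names below are placeholders for illustration only.
    body = build_alias_swap_actions("species", "species_b", ["species_a"])
    print(json.dumps(body, indent=2))
```

Because the body is built independently of the index build, a step like this could run as its own task once a new index has been verified.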