
Test: Airflow launching Spark and Trino #4

Closed: timrobertson100 closed this issue 1 year ago

timrobertson100 commented 1 year ago

Today GBIF uses Oozie for several key things:

  1. To run our data "downloads". These workflows run a Java stage that gets a count from Elastic, and then either run a process that calls ES directly or run a Hive job to produce a set of files on HDFS (see the branching sketch after this list). A final Java stage zips this up and moves it into the HDFS folder that is surfaced through HTTP and NFS to the web layer.
  2. To run map builds on a coordinated schedule. These run a Java setup stage, a Spark stage (same as #1), and then a Java stage to load the results into HBase, run some cleanup of old tables, and register them in ZK.
  3. To run our nightly "big table" build. This merges the Avro files produced from crawling into a Hive-registered table (with some locking involved).
  4. Other batch processing jobs (Spark-based) that populate HBase tables (GRSciColl) or (I think) POST to REST APIs (gridded datasets).
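
The branching in the download workflow (point 1) is the part that differs most from a linear chain of Oozie actions, so it is worth confirming Airflow handles it cleanly. A minimal sketch, assuming Airflow 2.4+; the DAG ID, task IDs, threshold, and count stub are hypothetical placeholders, not our real download logic:

```python
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
import pendulum

SMALL_DOWNLOAD_LIMIT = 100_000  # hypothetical cut-off between the two routes


def count_records():
    # Placeholder for the Java stage that asks Elastic for a record count;
    # the return value is pushed to XCom automatically.
    return 42_000


def choose_route(ti):
    # Pull the count from XCom and pick the downstream task to follow.
    count = ti.xcom_pull(task_ids="count_records")
    return "es_download" if count < SMALL_DOWNLOAD_LIMIT else "hive_download"


with DAG(
    dag_id="occurrence_download",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
):
    count = PythonOperator(task_id="count_records", python_callable=count_records)
    branch = BranchPythonOperator(task_id="branch_on_count", python_callable=choose_route)
    es_download = EmptyOperator(task_id="es_download")      # small downloads: call ES directly
    hive_download = EmptyOperator(task_id="hive_download")  # large downloads: Hive job on HDFS
    count >> branch >> [es_download, hive_download]
```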

We should test that Airflow can achieve these tasks in a representative workflow (a sketch of such a DAG follows the list). Perhaps something along the lines of:

  1. A Java stage that takes an Avro file on HDFS and copies it onto a new location in HDFS
  2. A Spark stage that registers an external table in Hive, and then runs Spark SQL to transform the table (e.g. `CREATE TABLE t2 STORED AS parquet AS SELECT * FROM t1`)
  3. A Trino stage that uses Hive (e.g. `CREATE TABLE t3 WITH (format = 'orc') AS SELECT * FROM t2 WHERE year > 2000`)
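
A minimal sketch of the test DAG itself, assuming Airflow 2.4+ with the apache-airflow-providers-apache-spark and apache-airflow-providers-trino packages installed. The connection IDs, HDFS paths, jar, and class names are placeholder assumptions, and on Stackable the Spark stage could equally be submitted as a SparkApplication resource rather than via spark-submit:

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.trino.operators.trino import TrinoOperator
import pendulum

with DAG(
    dag_id="airflow_spark_trino_test",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
):
    # 1. Stand-in for the Java stage: copy an Avro file to a new HDFS location.
    copy_avro = BashOperator(
        task_id="copy_avro",
        bash_command="hdfs dfs -cp /data/source/records.avro /data/staging/records.avro",
    )

    # 2. Spark stage: a jar that registers the external table in Hive and runs
    #    the CREATE TABLE ... AS SELECT transform (jar and class are placeholders).
    spark_transform = SparkSubmitOperator(
        task_id="spark_transform",
        application="/opt/jobs/transform.jar",
        java_class="org.gbif.test.RegisterAndTransform",
        conn_id="spark_default",
    )

    # 3. Trino stage: CTAS against the Hive catalog, as in the example above.
    trino_ctas = TrinoOperator(
        task_id="trino_ctas",
        trino_conn_id="trino_default",
        sql="CREATE TABLE t3 WITH (format = 'orc') AS SELECT * FROM t2 WHERE year > 2000",
    )

    copy_avro >> spark_transform >> trino_ctas
```

The three tasks mirror stages 1-3 above; getting this to run end to end would already exercise the Hive Metastore, HDFS, and Trino connection config that the Oozie ports will need.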

I think this would flush out the main issues and provide the kind of skeleton needed (i.e. solve all the classpath hell and config needs) for developers to port our Oozie work.

We can skip the Elastic needs, as that is external to both Stackable and Cloudera anyway.