gbif / stackable

GBIF Stackable Infrastructure
Apache License 2.0

Test: Trino for a CTAS #2

Closed timrobertson100 closed 1 year ago

timrobertson100 commented 1 year ago

We currently run a lot of Hive Create Table As Select (CTAS) jobs.

In this test we should run Trino and the Hive Metastore to:

  1. Register an Occurrence table in Hive, backed by a file (Parquet?) exported from production
  2. Use Trino and command-line tools to run CREATE TABLE sample AS SELECT * FROM occurrence WHERE ... (some filter), with the result:
    1. Stored as a Parquet file
    2. Stored as a CSV file
    3. Bonus if we can store it as a CSV file using Deflate2 compression
  3. Explore how Trino scales workers up and down
    1. Possibly using something like https://keda.sh/ to scale up/down based on some observable source (e.g. RabbitMQ)
    2. In practice we may need to resize Trino based on the backlog seen in Oozie/Airflow or similar
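
As a rough sketch of step 2, the helper below builds the CTAS statements for the Parquet and CSV variants. It assumes the Trino Hive connector, where the output format is chosen with the `format` table property and where CSV tables accept only `varchar` columns (hence the casts). The function names, the catalog name `hive`, the column names `gbifid` and `countrycode`, and the filter are hypothetical. The `compression_codec` session property and the `GZIP` value are shown only as an example of how compression might be switched; whether the compression named above is available would need to be checked as part of the test.

```python
from typing import Optional, Sequence


def build_ctas(target: str, source: str, where: str, fmt: str = "PARQUET",
               columns: Optional[Sequence[str]] = None) -> str:
    """Build a Trino CTAS statement for the given output format."""
    fmt = fmt.upper()
    if fmt == "CSV":
        # Assumption: the Hive connector's CSV format only supports varchar
        # columns, so every selected column is cast explicitly.
        if not columns:
            raise ValueError("CSV output needs an explicit column list")
        select_list = ", ".join(f"CAST({c} AS varchar) AS {c}" for c in columns)
    else:
        select_list = ", ".join(columns) if columns else "*"
    return (
        f"CREATE TABLE {target} WITH (format = '{fmt}') AS "
        f"SELECT {select_list} FROM {source} WHERE {where}"
    )


def compression_statement(codec: str, catalog: str = "hive") -> str:
    """Build the session statement that selects a write compression codec.

    Assumption: the catalog is named 'hive' and exposes 'compression_codec'.
    """
    return f"SET SESSION {catalog}.compression_codec = '{codec}'"


if __name__ == "__main__":
    # Hypothetical filter and columns, for illustration only.
    print(build_ctas("sample", "occurrence", "countrycode = 'DK'"))
    print(compression_statement("GZIP"))
    print(build_ctas("sample_csv", "occurrence", "countrycode = 'DK'",
                     fmt="CSV", columns=["gbifid", "countrycode"]))
```

The printed statements could then be passed to the Trino CLI (for example through its `--execute` option) against the catalog that points at the Hive Metastore.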
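
For step 3, the arithmetic behind backlog-driven scaling can be sketched as follows. This is not KEDA's API, only the kind of calculation a queue-length scaler would apply: divide the backlog by a target number of jobs per worker, round up, and clamp to configured bounds. The function name, the `jobs_per_worker` target, and the default limits are all hypothetical.

```python
import math


def desired_workers(backlog: int, jobs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 10) -> int:
    """Return how many Trino workers a given job backlog would call for."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # Round up so a partially filled worker still counts as a whole worker.
    wanted = math.ceil(backlog / jobs_per_worker)
    # Clamp to the configured floor and ceiling.
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    # Hypothetical backlog sizes, e.g. queued jobs seen in RabbitMQ or Airflow.
    for backlog in (0, 12, 500):
        print(backlog, desired_workers(backlog, jobs_per_worker=5))
```

The backlog itself could come from any observable source, such as a RabbitMQ queue length or the number of pending Oozie/Airflow jobs.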