gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

K8s: Getting rid of Apache Beam #1048

Open muttcg opened 3 months ago

muttcg commented 3 months ago

Apache Beam appears to be a redundant abstraction layer for pipelines that provides no significant benefits. Moreover, combining the latest Beam version with the pipelines-k8s version introduces bugs, which complicates diagnosing the actual issues.

Avro: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#load-and-save-functions
Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
Metrics: https://kb.databricks.com/metrics/spark-metrics
Options/CLI arguments: Picocli/Args4j/Apache Commons CLI/JCommander/etc
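
For illustration only, here is a minimal sketch of what a Beam-free step could look like using plain Spark for the Avro and Elasticsearch parts listed above. It is not taken from the pipelines code base: the class name, the four positional arguments and the job name are hypothetical, and it assumes spark-sql, spark-avro and the elasticsearch-spark connector are on the classpath. A real job would parse its arguments with one of the CLI libraries mentioned above rather than reading them positionally.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hypothetical example class, not part of gbif/pipelines.
public class AvroToEsSketch {

  public static void main(String[] args) {
    // Hypothetical positional arguments, used here only to keep the sketch short.
    String inputPath = args[0];   // directory or file with Avro records
    String outputPath = args[1];  // where to write the Avro copy
    String esNodes = args[2];     // Elasticsearch host(s), e.g. "localhost:9200"
    String esIndex = args[3];     // target index name

    SparkSession spark =
        SparkSession.builder().appName("avro-to-es-sketch").getOrCreate();

    // Read Avro with the built-in Spark SQL data source (needs spark-avro).
    Dataset<Row> records = spark.read().format("avro").load(inputPath);

    // Write the same records back out as Avro.
    records.write().mode(SaveMode.Overwrite).format("avro").save(outputPath);

    // Index the records with the elasticsearch-spark SQL data source.
    records.write()
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", esNodes)
        .mode(SaveMode.Append)
        .save(esIndex);

    spark.stop();
  }
}
```

Such a job would be submitted with spark-submit like any other Spark application; any transformation logic between the read and the writes is omitted here.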