gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

K8s: Spark jobs logging/history #1018

Closed muttcg closed 4 months ago

muttcg commented 8 months ago

Some logs are available through Airflow, but not all. We need comprehensive logging or a detailed job history, perhaps using ELK or another approach common on K8s.

fmendezh commented 8 months ago

To take into consideration for this task:

Filters could be implemented as new Elasticsearch fields produced by the Vector transformations for each service.
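A sketch of what such a transformation could look like in a Vector config, using Vector's `remap` transform to add filterable fields before shipping events to Elasticsearch (all names here — the transform id, source id, and endpoint — are illustrative, not taken from the GBIF configuration):

```yaml
# Hypothetical Vector pipeline: enrich K8s log events with fields
# that become filterable Elasticsearch fields downstream.
transforms:
  add_service_fields:
    type: remap
    inputs: ["kubernetes_logs"]          # assumed kubernetes_logs source
    source: |
      # Derive a service name from pod labels; fall back if missing.
      .service   = .kubernetes.pod_labels."app.kubernetes.io/name" ?? "unknown"
      .namespace = .kubernetes.pod_namespace

sinks:
  elasticsearch_out:
    type: elasticsearch
    inputs: ["add_service_fields"]
    endpoints: ["http://elasticsearch:9200"]  # placeholder endpoint
```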

zaultooz commented 7 months ago

Logging for HBase, Hadoop, Trino, ZooKeeper, and Spark (outside of Downloads) has been added to the Stackable charts and Airflow.

https://github.com/gbif/gbif-configuration/pull/15 and https://github.com/gbif/stackable-spark/pull/2 contain the changes that add logging to the pipelines Spark jobs. Before they can be merged, they require the 23.11 version of the operator.
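For context, enabling logging on a Stackable-managed Spark job is typically done through the `logging` section of the `SparkApplication` resource, which injects a Vector agent sidecar. A minimal sketch following the Stackable logging framework (the resource name, ConfigMap name, and log levels here are illustrative, not the actual values from the linked PRs):

```yaml
# Sketch of Stackable Spark logging config; values are examples only.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-pipeline-job        # hypothetical job name
spec:
  # Discovery ConfigMap for the Vector aggregator that collects the logs.
  vectorAggregatorConfigMapName: vector-aggregator-discovery
  driver:
    logging:
      enableVectorAgent: true       # runs the Vector sidecar on the driver
      containers:
        spark:
          console:
            level: INFO
  executor:
    logging:
      enableVectorAgent: true       # same sidecar on each executor
```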

muttcg commented 7 months ago

There is a conflict between YuniKorn and running a logging sidecar container. When the template uses the YuniKorn driver and executor pod overrides to add both the yunikorn.apache.org/task-groups annotation and a logging section, the job gets stuck in Pending. Removing either the pod override or the logging part resolves the issue.
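For reference, YuniKorn's gang scheduling is driven by pod annotations like the one mentioned above: the driver pod declares task groups with minimum member counts and resources, and the scheduler holds the job until those can all be placed. A sketch of the annotation shape (group names, counts, and resource figures are examples, not the values from the actual DAG templates):

```yaml
# Illustrative YuniKorn gang-scheduling annotations on a Spark driver pod.
metadata:
  annotations:
    yunikorn.apache.org/task-group-name: spark-driver
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "2Gi"}
      },
      {
        "name": "spark-executor",
        "minMember": 4,
        "minResource": {"cpu": "2", "memory": "4Gi"}
      }]
```

Because the declared `minResource` must match what the pods actually request, adding a logging sidecar to the pod without reflecting it in the task-group definition is a plausible way for such a job to stall in Pending, though the thread does not state the root cause explicitly.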

zaultooz commented 6 months ago

The conflict is resolved in PR: https://github.com/gbif/gbif-airflow-dags/pull/9