Description
There are three main ways of storing logs for Airflow on k8s: a shared persistent volume, S3 object storage, and Elasticsearch.
Putting the shared volume aside, the Elasticsearch way hasn't been implemented nicely when using the KubernetesExecutor as your executor in Airflow.
The current flow is something like this:
1. For each task, the KubernetesExecutor creates a pod that acts as the worker hosting that task.
2. Task logs are persisted either on the pod's local disk, or written to the container log via `elasticsearch.write_json=true` in the `airflow.cfg` file.
3. From there, a log aggregator service (in my case Filebeat + Logstash) reads the log files and ships them to Elasticsearch.
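For reference, a minimal sketch of the configuration involved, assuming the `[elasticsearch]` options `write_stdout` and `json_format` (option names vary across Airflow versions, and the host is a placeholder):

```ini
[logging]
# enables remote log handling (lives under [core] in older Airflow versions)
remote_logging = True

[elasticsearch]
# placeholder address; point this at your Elasticsearch instance
host = elasticsearch.example.com:9200
# write task logs to the container's stdout as JSON so k8s captures them
write_stdout = True
json_format = True
```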
The setup above can be improved, because as of now:
- either logs are persisted on the container's local disk, so you need a Filebeat/Logstash sidecar container next to each worker container, with a shared volume mounted on the log directory (see the pod sketch below);
- or you have set `elasticsearch.write_json=true`, which lets a Filebeat DaemonSet, responsible for log aggregation on each node, read the log files generated by k8s (usually in `/var/log/containers/*.log`).

Although both of these solutions work, both waste resources.
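To make the sidecar overhead concrete, here is a rough sketch of that first variant; all names, images, and paths are placeholders, with a shared `emptyDir` mounted on the log directory:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker          # hypothetical worker pod
spec:
  containers:
    - name: worker
      image: apache/airflow     # placeholder image/tag
      volumeMounts:
        - name: task-logs
          mountPath: /opt/airflow/logs   # Airflow's base_log_folder
    - name: filebeat            # the extra container paid per worker pod
      image: docker.elastic.co/beats/filebeat:7.10.0
      volumeMounts:
        - name: task-logs
          mountPath: /opt/airflow/logs
          readOnly: true
  volumes:
    - name: task-logs
      emptyDir: {}
```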
Meanwhile, we have the KubernetesPodOperator, which does the fascinating job of retrieving logs from the task pod's stdout via the Kubernetes API by default!
I was thinking that by combining these two features (`elasticsearch.write_stdout=true` and the KubernetesPodOperator's default behavior), we could send logs from the worker pods directly to the scheduler and have them stored on the scheduler pod instead.
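As a minimal sketch of that mechanism, here is how a pod's stdout can be pulled through the Kubernetes API with the official `kubernetes` Python client; the pod and namespace names are placeholders, and this is essentially what the KubernetesPodOperator does when `get_logs=True`:

```python
from kubernetes import client, config

# Assumes we run inside the cluster (e.g. on the scheduler pod).
config.load_incluster_config()
v1 = client.CoreV1Api()

# With elasticsearch.write_stdout=true, the container's stdout *is* the task log,
# so reading the pod log over the API retrieves the task's log lines.
task_log = v1.read_namespaced_pod_log(
    name="example-task-pod",   # placeholder worker pod name
    namespace="airflow",       # placeholder namespace
    follow=False,              # follow=True would stream the log as the task runs
)
print(task_log)
```

The scheduler could tail this stream for each worker pod it launches and write it to its own log directory, which is where the resource saving described below comes from.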
Use case / motivation
Well, first of all, if you are deploying your scheduler and webserver outside of your k8s cluster, that's the end of the road for you, since your logs end up on a disk that is visible to both the webserver and the scheduler.
If you are deploying your scheduler and webserver on k8s (which is a common practice), then you still need the Logstash/Filebeat service to send logs to your Elasticsearch instance, but you won't need a whole DaemonSet or one instance per worker pod; one per scheduler pod would suffice, which is much less resource usage (in my case I have only one scheduler pod, so it's only one!).
What do you want to happen?
The whole process of remote logging to Elasticsearch is hard compared to other parts of deploying Airflow when using the KubernetesExecutor, and I am trying to ease that process.
Also, I feel like it's the more k8s-ish way to do it!
Are you willing to submit a PR?
If pointed in the right direction, yes!