Description
There are three main ways of storing logs for Airflow on k8s: a shared persistent volume, S3 object storage, and Elasticsearch.
Putting the shared volume aside, the Elasticsearch way hasn't been implemented nicely when using the KubernetesExecutor as your executor in Airflow.
The current flow is something like this:
1. For each task, the KubernetesExecutor creates a pod that acts as the worker hosting that task.
2. Task logs are persisted either on the pod's local disk, or written to the container log via `elasticsearch.write_json=true` in the `airflow.cfg` file.
3. From there, a log aggregator service (in my case Filebeat + Logstash) reads the log files and ships them to Elasticsearch.
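For reference, a minimal sketch of the configuration involved, assuming the `[elasticsearch]` options `write_stdout` and `json_format` (option names vary across Airflow versions, and the host is a placeholder):

```ini
[logging]
# enables remote log handling (lives under [core] in older Airflow versions)
remote_logging = True

[elasticsearch]
# placeholder address; point this at your Elasticsearch instance
host = elasticsearch.example.com:9200
# write task logs to the container's stdout as JSON so k8s captures them
write_stdout = True
json_format = True
```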
The setup above can be improved, because as of now:
- either logs are persisted on the container's local disk, so you need a Filebeat/Logstash sidecar container next to each worker container, with a shared volume mounted on the log directory (see the pod sketch below);
- or you have set `elasticsearch.write_json=true`, which lets a Filebeat DaemonSet, responsible for log aggregation on each node, read the log files generated by k8s (usually in `/var/log/containers/*.log`).

Although both of these solutions work, both waste resources.
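To make the sidecar overhead concrete, here is a rough sketch of that first variant; all names, images, and paths are placeholders, with a shared `emptyDir` mounted on the log directory:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker          # hypothetical worker pod
spec:
  containers:
    - name: worker
      image: apache/airflow     # placeholder image/tag
      volumeMounts:
        - name: task-logs
          mountPath: /opt/airflow/logs   # Airflow's base_log_folder
    - name: filebeat            # the extra container paid per worker pod
      image: docker.elastic.co/beats/filebeat:7.10.0
      volumeMounts:
        - name: task-logs
          mountPath: /opt/airflow/logs
          readOnly: true
  volumes:
    - name: task-logs
      emptyDir: {}
```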
Meanwhile, we have the KubernetesPodOperator, which does the fascinating job of retrieving logs from the task pod's stdout via the Kubernetes API by default!
I was thinking that by combining these two features (`elasticsearch.write_stdout=true` and the KubernetesPodOperator's default behavior), we could send logs from the worker pods directly to the scheduler and have them stored on the scheduler pod instead.
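As a minimal sketch of that mechanism, here is how a pod's stdout can be pulled through the Kubernetes API with the official `kubernetes` Python client; the pod and namespace names are placeholders, and this is essentially what the KubernetesPodOperator does when `get_logs=True`:

```python
from kubernetes import client, config

# Assumes we run inside the cluster (e.g. on the scheduler pod).
config.load_incluster_config()
v1 = client.CoreV1Api()

# With elasticsearch.write_stdout=true, the container's stdout *is* the task log,
# so reading the pod log over the API retrieves the task's log lines.
task_log = v1.read_namespaced_pod_log(
    name="example-task-pod",   # placeholder worker pod name
    namespace="airflow",       # placeholder namespace
    follow=False,              # follow=True would stream the log as the task runs
)
print(task_log)
```

The scheduler could tail this stream for each worker pod it launches and write it to its own log directory, which is where the resource saving described below comes from.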
Use case / motivation
Well, first of all, if you are deploying your scheduler and webserver outside of your k8s cluster, that's the end of the road for you, since your logs end up on a disk that is visible to both the webserver and the scheduler.
If you are deploying your scheduler and webserver on k8s (which is a common practice), then you still need the Logstash/Filebeat service to send logs to your Elasticsearch instance, but you won't need a whole DaemonSet or one instance per worker pod; one per scheduler pod would suffice, which is much less resource usage (in my case I have only one scheduler pod, so it's only one!).
What do you want to happen?
The whole process of remote logging to Elasticsearch is hard compared to other parts of deploying Airflow when using the KubernetesExecutor, and I am trying to ease that process.
Also, I feel like it's the more k8s-ish way to do it!
Are you willing to submit a PR?
If pointed in the right direction, yes!