thesuperzapper opened this issue 3 years ago
Hey guys,
Until we have this as a native solution, I created a sidecar container for syncing DAGs from AWS S3. Take a look :)
Hi @thesuperzapper,
I started working on this, implementing something similar to the git-sync approach you mentioned. My idea is to use `rclone sync` (running as a k8s Job) to fetch the DAGs from the S3 bucket and store them on a mounted volume, which is also mounted into the AF scheduler pod. Should I continue implementing that? A rough sketch of the Job is below.
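For illustration, such a Job might look roughly like this — the remote name (`s3`), bucket, and PVC name are placeholders, and rclone is configured entirely through `RCLONE_CONFIG_*` environment variables (here assuming IAM/environment credentials via `env_auth`):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: rclone-dag-sync
  namespace: airflow
spec:
  template:
    spec:
      containers:
        - name: rclone
          image: rclone/rclone
          env:
            # defines an rclone remote named "s3" without needing a config file
            - name: RCLONE_CONFIG_S3_TYPE
              value: s3
            - name: RCLONE_CONFIG_S3_PROVIDER
              value: AWS
            - name: RCLONE_CONFIG_S3_ENV_AUTH
              value: "true"
          # the image's entrypoint is `rclone`, so this runs:
          # rclone sync s3:bucket-name /opt/airflow/dags
          args: ["sync", "s3:bucket-name", "/opt/airflow/dags"]
          volumeMounts:
            - name: dags-data
              mountPath: /opt/airflow/dags
      volumes:
        - name: dags-data
          persistentVolumeClaim:
            claimName: airflow-dags
      restartPolicy: OnFailure
```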
Best Regards,
I have a better solution, but you have to configure a PVC for the DAG bag folder (/opt/airflow/dags).
Once the PVC is ready, you just need to create a CronJob that runs every X minutes and syncs the DAGs from S3:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-sync
  namespace: airflow
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      # clean up finished Jobs after 2 days
      ttlSecondsAfterFinished: 172800
      template:
        spec:
          containers:
            - name: aws-cli
              image: amazon/aws-cli
              env:
                - name: AWS_REGION
                  value: us-east-1
              # the image's entrypoint is `aws`, so this runs:
              # aws s3 sync --no-progress --delete s3://bucket-name /opt/airflow/dags/
              args:
                - s3
                - sync
                - --no-progress
                - --delete
                - s3://bucket-name
                - /opt/airflow/dags/
              volumeMounts:
                - name: dags-data
                  mountPath: /opt/airflow/dags/
          volumes:
            - name: dags-data
              persistentVolumeClaim:
                claimName: airflow-dags
          restartPolicy: OnFailure
```
Not a bad idea. I'd also add that if you want the GitOps approach, you can disable the schedule via `suspend: true`, then create an ad-hoc s3-sync Job/Pod from the CronJob (used as a template) from your CI/CD via `kubectl create job <name> --from=cronjob/s3-sync`:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-job-em-
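For illustration, the only change needed to the CronJob manifest above is the `suspend` field (everything else stays the same):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-sync
  namespace: airflow
spec:
  suspend: true          # the schedule never fires on its own
  schedule: "* * * * *"  # kept only so the CronJob remains a valid template
  # ... jobTemplate as in the manifest above ...
```

Your pipeline can then trigger a sync on demand with something like `kubectl create job s3-sync-manual --from=cronjob/s3-sync -n airflow` (the Job name `s3-sync-manual` is arbitrary).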
I just want to say that while baked-in support for `s3-sync` did NOT make it into version `8.9.0` of the chart, you can use the `extraInitContainers` and `extraContainers` values that were added in https://github.com/airflow-helm/charts/pull/856.

Now you can effectively do what was proposed in https://github.com/airflow-helm/charts/pull/828, by using the following values.

For Scheduler/Webserver/Workers (but not KubernetesExecutor):

- `airflow.extraContainers` (looping sidecar to sync into the dags folder)
- `airflow.extraInitContainers` (initial clone of the S3 bucket into the dags folder)
- `airflow.extraVolumeMounts` (mount the emptyDir)
- `airflow.extraVolumes` (define an emptyDir volume)

For the KubernetesExecutor Pod template:

- ~`airflow.kubernetesPodTemplate.extraContainers`~ (you don't need the sidecar for transient Pods)
- `airflow.kubernetesPodTemplate.extraInitContainers`
- `airflow.kubernetesPodTemplate.extraVolumeMounts`
- `airflow.kubernetesPodTemplate.extraVolumes`

If someone wants to share their values and report how well it works, I am sure that would help others — a rough starting point is sketched below.
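The sketch below is only a template, not a tested configuration: the bucket name, image, and 60-second loop interval are placeholders, and the sync containers are my own composition rather than anything shipped with the chart.

```yaml
airflow:
  ## define a shared emptyDir that both airflow and the sync containers mount
  extraVolumes:
    - name: dags-data
      emptyDir: {}

  ## mount the emptyDir into the airflow containers (scheduler/webserver/workers)
  extraVolumeMounts:
    - name: dags-data
      mountPath: /opt/airflow/dags

  ## one-shot sync so the dags folder is populated before airflow starts
  extraInitContainers:
    - name: s3-sync-init
      image: amazon/aws-cli
      args: ["s3", "sync", "--delete", "s3://bucket-name", "/opt/airflow/dags"]
      volumeMounts:
        - name: dags-data
          mountPath: /opt/airflow/dags

  ## looping sidecar that re-syncs every 60 seconds
  extraContainers:
    - name: s3-sync
      image: amazon/aws-cli
      command: ["/bin/sh", "-c"]
      args:
        - "while true; do aws s3 sync --delete s3://bucket-name /opt/airflow/dags; sleep 60; done"
      volumeMounts:
        - name: dags-data
          mountPath: /opt/airflow/dags
```

The same emptyDir + init-container pair (minus the looping sidecar) would go under the `airflow.kubernetesPodTemplate.*` values for KubernetesExecutor Pods.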
PS: You can still use a PVC-based approach, where you have a Deployment (or CronJob) that syncs your S3 bucket into that PVC as described in https://github.com/airflow-helm/charts/issues/249#issuecomment-1196360490
Hi, I'm using KubernetesExecutor and my extra container gets stuck and doesn't let the executor pod finish. Any tips on what to do?
https://janetvn.medium.com/s3-sync-sidecar-to-continuously-deploy-dags-for-airflow-running-on-kubernetes-ab4d417dd8e6
I'm trying to do the same thing by following this link. I'd like to see Airflow support this officially.
Currently, we support git-sync with the `dags.gitSync.*` values, but we can probably do something similar for S3 buckets; that is, let people store their DAGs in a folder on an S3 bucket.

Possibly we should generalise this to include GCS and ABS too, but those probably need different libraries to do the sync (so they might become separate features/containers). However, S3 is clearly the best place to start, as it's the most popular.
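For context, the existing git-sync configuration looks roughly like this (values paraphrased from the chart docs), and the `s3Sync` block is purely hypothetical — a sketch of the shape such a feature might take, not anything the chart currently supports:

```yaml
dags:
  ## existing feature: sync dags from a git repo
  gitSync:
    enabled: true
    repo: https://github.com/example/my-dags.git  # placeholder repo
    branch: main
    syncWait: 60  # seconds between syncs

  ## HYPOTHETICAL: not implemented -- a possible S3 equivalent mirroring gitSync
  s3Sync:
    enabled: true
    bucket: s3://bucket-name/dags
    syncWait: 60
```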