airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

support loading DAG definitions from S3 buckets #249

Open thesuperzapper opened 3 years ago

thesuperzapper commented 3 years ago

Currently we support git-sync with the dags.gitSync.* values, but we can probably do something similar for S3 buckets. That is, let people store their dags in a folder on an S3 bucket.

Possibly we should generalise this to include GCS and ABS, but these probably have different libraries needed to do the sync (so might need to be separate features/containers). However, clearly S3 is the best place to start, as it's the most popular.
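For illustration only, a hypothetical values block mirroring the existing dags.gitSync.* structure could look something like this (these dags.s3Sync.* keys do not exist in the chart; the names and fields are purely illustrative of the proposed interface):

dags:
  s3Sync:
    enabled: true
    bucket: my-airflow-bucket   # hypothetical key: bucket holding the DAG files
    prefix: dags/               # hypothetical key: folder within the bucket
    syncWait: 60                # hypothetical key: seconds between syncs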

yossisht9876 commented 2 years ago

Hey guys,

Until we have this as a native solution, I created a sidecar container for syncing DAGs from AWS S3. Take a look :)

https://github.com/yossisht9876/airflow-s3-dag-sync

tarekabouzeid commented 2 years ago

Hi @thesuperzapper ,

I started working on this, implementing it along the same lines as the git DAG sync you mentioned. My approach is to run rclone sync as a Kubernetes Job that fetches the DAGs from the S3 bucket and stores them in a mounted volume, which is also mounted into the Airflow scheduler pod. Should I continue implementing that?

Best Regards,
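For illustration, a minimal rclone-based Job along those lines might look like the following (untested sketch; the bucket path, namespace, and claim name are placeholders, and AWS credentials are assumed to be available to the pod via env vars or IRSA):

apiVersion: batch/v1
kind: Job
metadata:
  name: dags-rclone-sync
  namespace: airflow
spec:
  template:
    spec:
      containers:
        - name: rclone
          image: rclone/rclone
          # one-way sync from the bucket into the shared dags volume;
          # env_auth=true tells rclone to pick up AWS credentials from the environment
          args:
            - sync
            - ":s3,provider=AWS,env_auth=true:my-bucket/dags"
            - /opt/airflow/dags
          volumeMounts:
            - name: dags-data
              mountPath: /opt/airflow/dags
      volumes:
        - name: dags-data
          persistentVolumeClaim:
            claimName: airflow-dags
      restartPolicy: OnFailure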

yossisht9876 commented 2 years ago

I have a better solution, but you have to configure a PVC for the DAG bag folder /opt/airflow/dags.

After the PVC is ready, you just need to create a CronJob that runs every X minutes and syncs the folder from S3:


apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-sync
  namespace: airflow
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: aws-cli
              image: amazon/aws-cli
              env:
                - name: AWS_REGION
                  value: us-east-1
              # the image entrypoint is `aws`, so this runs:
              #   aws s3 sync s3://bucket-name /opt/airflow/dags/ --no-progress --delete
              # (AWS credentials are assumed to come from IRSA, the node role, or extra env vars)
              args:
                - s3
                - sync
                - s3://bucket-name
                - /opt/airflow/dags/
                - --no-progress
                - --delete
              volumeMounts:
                - name: dags-data
                  mountPath: /opt/airflow/dags/
          volumes:
            - name: dags-data
              persistentVolumeClaim:
                claimName: airflow-dags
          restartPolicy: OnFailure
      ttlSecondsAfterFinished: 172800
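To make the synced PVC visible to Airflow, the chart's dags.* values can point at the same claim; something along these lines (claim name and access mode are examples; check the chart's values.yaml for the exact keys):

dags:
  path: /opt/airflow/dags
  persistence:
    enabled: true
    existingClaim: airflow-dags   # the PVC the CronJob above syncs into
    accessMode: ReadWriteMany     # needed if scheduler/workers run on different nodes
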
darren-recentive commented 12 months ago

(quoting the PVC + CronJob approach from the previous comment)

Not a bad idea. I'd also add that if you want a GitOps approach, you can disable the schedule via suspend: true, then create an ad-hoc s3-sync Job/Pod from the CronJob template in your CI/CD via kubectl create job <name> --from=cronjob/s3-sync (https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-job-em-).
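As a sketch of that tweak, the only change to the CronJob manifest above would be adding suspend (the job name passed to kubectl is arbitrary):

spec:
  schedule: "* * * * *"   # still required by the API, but ignored while suspended
  suspend: true
  # trigger ad-hoc runs from CI/CD, e.g.:
  #   kubectl -n airflow create job s3-sync-manual --from=cronjob/s3-sync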

thesuperzapper commented 5 months ago

I just want to say that while baked-in support for s3-sync did NOT make it into version 8.9.0 of the chart, you can use the extraInitContainers and extraContainers values that were added in https://github.com/airflow-helm/charts/pull/856.

Now you can effectively do what was proposed in https://github.com/airflow-helm/charts/pull/828, by using the following values:

  • For Scheduler/Webserver/Workers (but not KubernetesExecutor):

    • airflow.extraContainers (looping sidecar to sync into dags folder)
    • airflow.extraInitContainers (initial clone of S3 bucket into dags folder)
    • airflow.extraVolumeMounts (mount the emptyDir)
    • airflow.extraVolumes (define an emptyDir volume)
  • For KubernetesExecutor Pod template:

    • ~airflow.kubernetesPodTemplate.extraContainers~ (you don't need the sidecar for transient Pods)
    • airflow.kubernetesPodTemplate.extraInitContainers
    • airflow.kubernetesPodTemplate.extraVolumeMounts
    • airflow.kubernetesPodTemplate.extraVolumes

If someone wants to share their values and report how well it works, I am sure that would help others.

PS: You can still use a PVC-based approach, where you have a Deployment (or CronJob) that syncs your S3 bucket into that PVC as described in https://github.com/airflow-helm/charts/issues/249#issuecomment-1196360490
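For anyone looking for a starting point, a rough sketch of those values might look like the following (untested; the bucket name, sync interval, and image choice are placeholders, and AWS credentials are assumed to come from IRSA, the node role, or extra env vars):

airflow:
  extraVolumes:
    - name: dags-s3
      emptyDir: {}
  extraVolumeMounts:
    - name: dags-s3
      mountPath: /opt/airflow/dags
  extraInitContainers:
    - name: dags-s3-seed
      image: amazon/aws-cli
      # initial clone of the bucket so the pod starts with a populated dags folder
      args: ["s3", "sync", "s3://YOUR_BUCKET/dags", "/opt/airflow/dags", "--delete", "--no-progress"]
      volumeMounts:
        - name: dags-s3
          mountPath: /opt/airflow/dags
  extraContainers:
    - name: dags-s3-sync
      image: amazon/aws-cli
      # looping sidecar that keeps the dags folder in sync with the bucket
      command: ["/bin/sh", "-c"]
      args:
        - while true; do aws s3 sync s3://YOUR_BUCKET/dags /opt/airflow/dags --delete --no-progress; sleep 60; done
      volumeMounts:
        - name: dags-s3
          mountPath: /opt/airflow/dags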

dantonbertuol commented 3 months ago

(quoting the extraInitContainers / extraContainers guidance from the previous comment)

Hi, I'm using KubernetesExecutor and my extra container gets stuck and doesn't let the executor pod finish. Any tips on what to do?

nicolasge commented 3 weeks ago

I'm trying to do the same thing, but following this guide instead: https://janetvn.medium.com/s3-sync-sidecar-to-continuously-deploy-dags-for-airflow-running-on-kubernetes-ab4d417dd8e6. I'd like to see Airflow support this officially.