airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0
630 stars 474 forks source link

feat: add s3-sync sidecar #828

Open chirichidi opened 4 months ago

chirichidi commented 4 months ago

What issues does your PR fix?

What does your PR do?

Overview

This Pull Request introduces a new feature, s3Sync, designed to enhance our application's ability to synchronize data with AWS S3. This addition aims to provide a more robust and flexible solution for managing cloud storage synchronization tasks.

Details

No Changes to Existing gitSync Functionality

Testing and Validation

image 스크린샷 2024-02-17 오전 1 27 10

Conclusion

This enhancement is a step forward in our ongoing efforts to provide a seamless and powerful toolset for our users. By introducing s3Sync, we are expanding our capabilities while ensuring the integrity and performance of our existing features remain intact.

I look forward to your feedback and any discussions regarding this PR. Thank you for considering these enhancements.

Checklist

For all Pull Requests

For releasing ONLY

rsotogar commented 3 months ago

I am curious to know how syncing DAGs from S3 work? do we need to create a kubectl secret with our AWS key and secret key, and how often will it poll for new DAG files/ folders?

wikitops commented 2 months ago

This would be useful to have in the Helm chart. Git sync is sometime not the best option.

pedorro commented 2 months ago

just another bump on this PR. This seems like the best option for AirFlow deployments in EKS.

And just to clarify about AWS credentials, in general we would be using IAM roles rather than user credentials, so there should be no need for additional k8s secrets, or anything like that.

thesuperzapper commented 2 months ago

@chirichidi thanks for the very interesting PR, I would love to get "s3-sync" as a concept into the chart (as it will help users migrate from MWAA).

The main thing we need to finalize is the "reconciliation loop", everything else is secondary and can be updated later.

If I understand your PR correctly, you have done the following:

My main concerns are:

  1. What happens when a sync is halfway, but airflow starts refreshing the DAGs (so we have some old and some new)?
    • This is avoided in git-sync by using symbolic link switching.
    • It's likley rare for something seriously bad to happen when this occurs, so it might not matter.
  2. Wouldn't it make more sense to also use aws s3 sync for the init-container?
    • Because init-containers can sometimes run again (when the pod restarts), so we can save a bit of time by not re-downloading everything.
  3. I wonder if we might want to use some or all of the following aws sync parameters:
    • --quiet
    • --only-show-errors

Are there any other things I have missed?

PS: if/when we merge this, I will update the values/docs in your PR to match the style of the chart.