feat: add s3-sync sidecar

chirichidi commented 4 months ago

What issues does your PR fix?

fixes #249

What does your PR do?

Overview

This Pull Request introduces a new feature, s3Sync, designed to enhance our application's ability to synchronize data with AWS S3. This addition aims to provide a more robust and flexible solution for managing cloud storage synchronization tasks.

Details

New Feature: Implemented the s3Sync functionality using the official aws-cli and straightforward logic, focusing on core behavior such as stability and automatic detection of changes.
Ensured that the new s3Sync feature is fully compatible with our existing infrastructure and does not introduce any breaking changes or dependencies.

No Changes to Existing `gitSync` Functionality

It's crucial to note that while developing the s3Sync feature, special care was taken not to modify or affect the existing gitSync functionality. Our commitment was to add value without disrupting current operations or workflows.
Comprehensive testing has been conducted to confirm that gitSync remains unaffected and operates as expected.

Testing and Validation

Ran basic checks such as ct lint and helm template, and performed operational tests on an actual Kubernetes cluster, confirming that existing features, especially gitSync, continue to operate without any issues.

Conclusion

This enhancement is a step forward in our ongoing efforts to provide a seamless and powerful toolset for our users. By introducing s3Sync, we are expanding our capabilities while ensuring the integrity and performance of our existing features remain intact.

I look forward to your feedback and any discussions regarding this PR. Thank you for considering these enhancements.

Checklist

For all Pull Requests

[x] Commits are signed off
[x] Commits have semantic messages
[x] Documentation updated
[x] Passes ct linting

For releasing ONLY

[ ] Chart.yaml version bumped
[ ] CHANGELOG.md updated

rsotogar commented 3 months ago

I am curious to know how syncing DAGs from S3 works. Do we need to create a Kubernetes secret with our AWS access key and secret key, and how often will it poll for new DAG files/folders?

wikitops commented 2 months ago

This would be useful to have in the Helm chart. Git sync is sometimes not the best option.

pedorro commented 2 months ago

Just another bump on this PR. This seems like the best option for Airflow deployments in EKS.

And just to clarify about AWS credentials, in general we would be using IAM roles rather than user credentials, so there should be no need for additional k8s secrets, or anything like that.

thesuperzapper commented 2 months ago

@chirichidi thanks for the very interesting PR, I would love to get "s3-sync" as a concept into the chart (as it will help users migrate from MWAA).

The main thing we need to finalize is the "reconciliation loop", everything else is secondary and can be updated later.

If I understand your PR correctly, you have done the following:

You have implemented a "sidecar" pattern similar to our gitsync sidecar
An init-container which runs the following command (to populate the dags folder as the pod starts)
- aws s3 cp --recursive s3://<BUCKET>/<PATH> ./path/to/dags
A sidecar container which runs the following command on loop (to keep the dags folder up to date):
- aws s3 sync --delete s3://<BUCKET>/<PATH> ./path/to/dags
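The two commands above could be wired together roughly as follows. This is a minimal sketch, not the PR's actual script: the function names, the `SYNC_INTERVAL` variable, and its 60-second default are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the init-container and sidecar commands described above.
# Function names and SYNC_INTERVAL are illustrative, not values from the chart.

# Init-container step: copy everything once so the dags folder exists at startup.
s3_initial_copy() {
  local s3_uri="$1" dags_dir="$2"
  aws s3 cp --recursive "$s3_uri" "$dags_dir"
}

# Sidecar step: a single pass of the reconciliation loop. --delete removes
# local files that no longer exist in the bucket.
s3_sync_once() {
  local s3_uri="$1" dags_dir="$2"
  aws s3 sync --delete "$s3_uri" "$dags_dir"
}

# Sidecar loop: keep syncing forever, logging failures instead of exiting.
s3_sync_loop() {
  local s3_uri="$1" dags_dir="$2" interval="${SYNC_INTERVAL:-60}"
  while true; do
    s3_sync_once "$s3_uri" "$dags_dir" \
      || echo "sync failed, retrying in ${interval}s" >&2
    sleep "$interval"
  done
}
```

The sidecar would then run something like `s3_sync_loop "s3://<BUCKET>/<PATH>" ./path/to/dags`, while the init-container calls `s3_initial_copy` with the same arguments.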

My main concerns are:

What happens when a sync is halfway, but airflow starts refreshing the DAGs (so we have some old and some new)?
- This is avoided in git-sync by using symbolic link switching.
- It's likely rare for something seriously bad to happen when this occurs, so it might not matter.
Wouldn't it make more sense to also use aws s3 sync for the init-container?
- Init-containers can sometimes run again (when the pod restarts), so we can save a bit of time by not re-downloading everything.
I wonder if we might want to use some or all of the following aws s3 sync parameters:
- --quiet
- --only-show-errors
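To illustrate the first concern, the symbolic-link switching that git-sync relies on could be approximated for S3 along these lines. This is a sketch under stated assumptions, not what the PR does: each sync lands in a fresh snapshot directory, and a `current` symlink is then repointed with an atomic rename. The function name, the directory layout, and the reliance on GNU `mv -T` are all assumptions, and the dags folder would need to point at `current`.

```shell
# Hypothetical sketch: sync into a new snapshot directory, then atomically
# repoint a "current" symlink so readers never see a half-finished sync.
# Assumes GNU coreutils (mv -T); names and layout are illustrative only.
s3_sync_atomic() {
  local s3_uri="$1" root="$2"
  local snapshot_path snapshot previous

  # Create a uniquely named snapshot directory under the root.
  snapshot_path="$(mktemp -d "$root/snapshot-XXXXXX")" || return 1
  snapshot="${snapshot_path##*/}"
  chmod 755 "$snapshot_path"   # mktemp defaults to 0700

  # Seed the snapshot from the live copy (preserving timestamps) so that
  # sync should only need to download what changed.
  if [ -d "$root/current" ]; then
    cp -a "$root/current/." "$snapshot_path/" || return 1
  fi

  # On failure, discard the partial snapshot and leave "current" untouched.
  if ! aws s3 sync --delete --only-show-errors "$s3_uri" "$snapshot_path"; then
    rm -rf "$snapshot_path"
    return 1
  fi

  # Build the new link next to the old one, then rename over it. The rename
  # is atomic, so "current" always resolves to a complete tree.
  previous="$(readlink "$root/current" 2>/dev/null || true)"
  ln -sfn "$snapshot" "$root/current.tmp" || return 1
  mv -T "$root/current.tmp" "$root/current" || return 1

  # Remove the snapshot that was live before the switch.
  if [ -n "$previous" ] && [ "$previous" != "$snapshot" ]; then
    rm -rf "${root:?}/$previous"
  fi
}
```

The trade-off is extra disk usage and a local copy on every pass, which may not be worth it if partially synced DAGs rarely cause real problems.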

Are there any other things I have missed?

PS: if/when we merge this, I will update the values/docs in your PR to match the style of the chart.
