alexakra opened this issue 2 years ago
cc: @ephraimbuddy
cc @jedcunningham. I'm a little surprised that the test cases didn't catch this
@ephraimbuddy
One reason, I would think, is that this case makes very little sense.
Maybe it is a bit unrelated - but I personally think that having both git-sync AND persistence at the same time makes very little sense in general. I have yet to hear a good argument why just git-sync would not cut it. No-one has yet been able to sufficiently explain to me why they want to use git-sync to atomically sync DAGs to one machine and then let the "network file system" distribute them further to all the other components. IMHO it makes no sense: it lacks the atomicity guarantees that git-sync provides, while incurring much more traffic, instability and (usually) a lot of cost connected with Airflow's scheduler continuously accessing remote filesystems. Because this is what effectively happens when you use persistence. There is no magic - the DAGs need to be distributed somehow, and when you use persistence you pay the cost for it (it's just a bit hidden).
It is so much better to use Git as the synchronisation protocol - it's far better designed and optimized for sharing even a huge number of files that are source code (actually, the more files, the better Git's optimisations work), and it has everything needed to synchronise incremental changes (i.e. commits) atomically (git-sync does it) and only when needed (when a new commit is there).
Plus it has this great capability: run it as an init container and it will fully synchronise the whole folder first, not allowing a component of Airflow to start before all DAGs are present in the folder. All those properties are basically gone when you add persistence to git-sync. And IMHO you get no benefits when you add persistence. Only problems.
For all practical purposes, having just git-sync and local filesystems completely separated from each other for each component is way better IMHO.
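To be concrete, a minimal sketch of that recommended setup in the chart's values.yaml (the repo URL here is hypothetical; option names follow the chart's documented dags section):

    dags:
      gitSync:
        enabled: true
        repo: https://github.com/example/dags.git   # hypothetical repo
        branch: main
      persistence:
        enabled: false   # no shared volume - the git-sync side/init containers distribute the DAGs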
This is explained in detail in this blog post https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
@alexakra @chuxiangfeng - maybe you can explain to me what the reason is that you need both git-sync and persistence? What do you miss if you just use git-sync? Why do you REALLY need persistence - i.e. what do you THINK you will get by using it?
It puzzles me what the reasoning is - I have asked many times already and never got a satisfactory answer.
In the article you mentioned, there is not even a single reference to Kubernetes. In the Kubernetes ecosystem this is the way to share files between pods (git-sync runs on the scheduler pod, webserver pod and worker pods).
In the article you mentioned, there is not even a single reference to Kubernetes. In the Kubernetes ecosystem this is the way to share files between pods (git-sync runs on the scheduler pod, webserver pod and worker pods).
Using persistent volumes is one of the ways to synchronise files between pods. But it is by far not the only one. This is why git-sync was invented - to provide an alternative. And the article IS all about K8S. Git-sync is in fact a K8s project https://github.com/kubernetes/git-sync that allows you to sync files to all K8s components without persistence.
Persistence is really only useful when you do not use any other mechanism. And if you are running git-sync for all the components, there is absolutely no need to again synchronize them via persistent storage.
Why do you want to do that?
What does it give you in addition to git-synced files?
And if you are running git-sync for all the components, there is absolutely no need to again synchronize them via persistent storage.
What do you mean "running git-sync for all the components"? Again, git-sync runs inside the scheduler pod. How do you provide DAGs to the webserver pod and worker pods? Using git-sync in all of them doesn't solve the "network issue". https://airflow.apache.org/docs/apache-airflow/stable/executor/kubernetes.html
What do you mean "running git-sync for all the components"?
When persistence is disabled, git-sync runs as a side container for workers and schedulers - all of them. They also run as init containers first, to make sure that you have a fully synced DAG folder before the component starts. (When you use persistence, this depends on how much distribution you have, and your remote filesystem can introduce significant latency in distributing files - there is no guarantee about what your worker sees locally, for example. It could see no DAGs at all if you hit a race condition where git-sync has not yet completed the update on the first of your components.)
Again, git-sync runs inside the scheduler pod. How do you provide DAGs to the webserver pod and worker pods?
No. Without persistence, git-sync runs as a side container wherever it is needed. Just note that your statement is not accurate - you do not need DAGs in the webserver in Airflow 2 (and the chart does not map a persistent volume to it either); only the workers, scheduler, triggerer and dag file processor (if it is run standalone) need them. In the latter case (when dag file processor(s) run standalone), the scheduler does not need the DAG folder either.
Using git-sync in all of them doesn't solve the "network issue".
It does - it's far less networking (likely several orders of magnitude less, in fact), and you do not pay for it. The problem with shared/persistent volumes is that the scheduler continuously reads and re-reads the files when scheduling. Continuously. All the time. No matter if they changed. This generates a lot of traffic when you have a lot of files mounted on a persistent (i.e. networked) volume - even if the files have not changed, just accessing them takes roundtrips to a server, and when you have a lot of files this is very costly (in terms of money but also in terms of latency - a side effect is that picking up new files is severely delayed in not-even-that-extreme cases, and the stability of the connections suffers). Many of our users paid a lot of money for EFS IOPS to overcome the latency.

Git-sync has none of that. The only traffic happens when a) a single HTTP call runs to check if anything changed, every minute or so, b) files are downloaded when they changed. All the scanning by the scheduler is done locally. This generates orders of magnitude less traffic - simply because reading files by the scheduler happens thousands of times more frequently than changing the DAGs by the authors. Git-sync simply optimises the Airflow use case far better. This is all explained in detail in my article.
Alright, it is clear now, thank you. But one point: we also use git-sync and persistence to provide plugins from git to the webserver. The outcome of this issue could be like the following:
@alexakra - absolutely, feel free to add those docs. That would be a great help for people like you in the future.
This is a good idea - now that you have all the explanation, you are completely free to make a PR with such documentation improvements. Airflow is created by more than 2200 contributors - mostly people like you, who wanted to do something, got help from maintainers, and decided to contribute back what they learned in exchange for the software and help they got absolutely for free. And with all the discussion and explanations you just got, you are likely the best person in the world to add such documentation - simply because it will be easy for you to write it in a way that is easy to grasp for people like you. Maintainers have far too many assumptions in their heads to write such documentation in a way that is good for the users. You, on the other hand, seem to know already what kind of documentation we are missing.
And it's rather easy to add new documentation - it's almost as easy as writing comments or posting issues. Just go to the page where you want to add docs and click the "Suggest a change on this page" button - you will get a Pull Request with the page to edit, and you can use the GitHub UI to write the docs - literally the same UI you used for posting the issue and the whole discussion above.
I will be super happy to review and guide you while you work on such a documentation PR.
Regarding the plugins and the failure/warning - I also heartily welcome you to make PRs for those. That is a bit more involved; unlike the documentation, it requires more experience and writing tests. But if you feel like editing the chart and adding new options, it is a great opportunity: you can test it in your local copy (charts are very easily modifiable), and once you get it working you can contribute the PR back. That would be a very cool contribution, and it is also the fastest way to get it implemented. Alternatively, you can create feature requests for those. I can mark them as "good first issue" - then there is a chance someone will pick them up and implement them. But there is very little influence on that; someone must simply take an interest and implement it, and we as maintainers are happy to help with reviewing and merging.
Just wanted to let you know that this is how open-source works (just in case you are not aware).
First, the initial idea was to use an external DAG files volume as the repository for the code, but we later discovered that git-sync might be a better option. So git-sync was turned on, but dags.persistence was not set to false - both were set to true, and the official documentation showed that it was possible to do this. Second, in the development phase there are not many resources in the locally built Airflow. When testing or debugging code, we do not want to commit temporary code to the git repository; we want to modify local files so the changes take effect immediately, and only commit everything to git once debugging is complete, rather than committing useless intermediate debugging code.
First, the initial idea was to use an external DAG files volume as the repository for the code, but we later discovered that git-sync might be a better option. So git-sync was turned on, but dags.persistence was not set to false - both were set to true, and the official documentation showed that it was possible to do this.
Yep. I understand that. I am gathering ideas and evidence to make a stronger case for strongly discouraging turning on both persistence and git-sync at the same time. As mentioned in https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca - this is my (strong) opinion, but otherwise it is a controversial subject. It's very difficult to convince people who think that "shared volumes" are just K8S magic that works.
If you do not dig deeply into understanding that git-sync and shared volumes are actually doing the same thing in parallel, and that git-sync's performance optimisations are way better for the Airflow case, it's very difficult to see that this is the case (see the discussion above).
But - as discussed above - I have not seen a single argument yet that convinces me that there is even a single case where git-sync + persistence is better in anything than just git-sync (and it is worse in a number of aspects). So far I have seen precisely 0 arguments and cases that could convince me.
I hope I will eventually manage to convince the other committers and PMC members that we should strongly discourage it in the docs, and that rather than "fixing" this case we should at least warn users that they should not do it (or maybe even outright disable this possibility in our chart).
But we are not there yet.
Second, in the development phase there are not many resources in the locally built Airflow. When testing or debugging code, we do not want to commit temporary code to the git repository; we want to modify local files so the changes take effect immediately, and only commit everything to git once debugging is complete, rather than committing useless intermediate debugging code.
Yeah. As I mentioned in the article - shared volumes are perfect for such an "easy start" and quick iterations. But having git-sync + dag persistence in this case makes no sense at all (git-sync will clash with manually overwritten files, so it is a very bad idea to have both if you want to manually iterate on those files without git). This is the relevant excerpt (from the "good" part):
And if your users are mostly data scientists, who are used to iterate and change their files locally and experiment and quickly deploy stuff by just copy pasting they do not need to learn any new tools, nor follow any rigorous deployment workflow — hey we just copy file here and … it works.
Sounds cool? Yeah. Because it is cool.
When you have a small-ish installation and a handful of DAGs that are mostly accessed by one user, this is a perfect solution. And yeah, in such a case I’d heartily recommend deploying Airflow with shared volumes.
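For completeness, the "easy start" setup from that excerpt is roughly just the reverse in values.yaml - a sketch (the size is a placeholder):

    dags:
      gitSync:
        enabled: false
      persistence:
        enabled: true   # one shared volume; users copy DAG files onto it directly
        size: 1Gi       # placeholder size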
@potiuk, the biggest gap with git-sync is that we effectively have no way to poll less frequently. This isn't a concern at small scale, but there is a point with many instances using a monorepo where it becomes problematic.
There are probably better solutions to that problem, but polling less frequently when you aren't on 1 LocalExecutor is asking for heartache eventually. And "polling less frequently" is a natural knob to reach for, unfortunately, if you hit that case or think you might, try and get ahead of the game, etc.
Relying on (presumably) syncing from github also introduces risk in your production system, persistence is at least "local" to some degree. I'm generally pro gitsync without persistence, but there are definitely scenarios where it isn't enough.
@potiuk, the biggest gap with git-sync is that we effectively have no way to poll less frequently. This isn't a concern at small scale, but there is a point with many instances using a monorepo where it becomes problematic.
Hmm. Unless you mean something else, this is entirely possible. You could even set a different period for workers and a different one for the scheduler, and even disable it completely for K8s pods (they only need the init container to run the sync once when they are started): https://github.com/kubernetes/git-sync
--period <duration>, $GIT_SYNC_PERIOD
How long to wait between sync attempts. This must be at least
10ms. This flag obsoletes --wait, but if --wait is specified, it
will take precedence. If not specified, this defaults to 10
seconds ("10s").
And let's not forget the fact that running scheduler(s) and even workers on persistent volumes results in far more frequent polling. Polling by git-sync (if no files are modified) is just a single HTTP call with the latest commit hash in the branch - and if the files were not updated, it just returns "no change". Polling of the persistence server by the scheduler and workers is far worse, because both the scheduler (continuously) and the workers (every time a new task is picked up) scan the remote directory listing of such a persistent volume. For example, when it is NFS - which is the vast majority of persistent volume implementations - the scheduler is basically continuously polling and flooding the NFS server, non-stop, sometimes hundreds or even thousands of roundtrips per second (see my article).

Git-sync, on the other hand (unless changes are coming in), is pretty much one very small HTTP request per 10 seconds by default, and you can make the period longer. And even when there are changes, Git does a fantastic job of compressing and sending only the incremental changes. This is a stark contrast with persistent volumes: they are far less optimised, and in the vast majority of implementations, if a single line changes in your DAG file, the whole DAG file has to be sent. It's very different when you use the Git protocol. I'd argue that because of those optimisations, and the fact that git-sync has a super-efficient way of checking whether "anything" changed, polling with persistence is a few orders of magnitude worse than polling with Git.
Question @jedcunningham: Have you ever had a case or seen a single user complaining about "efficiency" of their git server? I have not seen any. But I saw many, many, many users complaining how EFS (which is the NFS-based persistence used in Amazon) was unstable and inefficient for them, and how they had to pay a lot for IOPS to make it stable. This is because the Airflow use case is "a slow rate of DAG changes, a huge amount of polling of the DAG volume even if no DAG changed". This makes Git a far superior solution. It's not "generally" better, but it's IMHO always better for Airflow.
There are probably better solutions to that problem, but polling less frequently when you aren't on 1 LocalExecutor is asking for heartache eventually. And "polling less frequently" is a natural knob to reach for, unfortunately, if you hit that case or think you might, try and get ahead of the game, etc.
Relying on (presumably) syncing from github also introduces risk in your production system, persistence is at least "local" to some degree. I'm generally pro gitsync without persistence, but there are definitely scenarios where it isn't enough.
This can be very easily solved by polling a local Git server. Git has a fantastic, super-efficient synchronisation protocol, and if users are keeping the files in GitHub anyway, it's extremely simple to have a local mirror. Similarly to persistent volumes, all major cloud providers allow you to set up not only "persistence" but also a "GitHub local mirror" with a single click - simply because this is all built into the Git protocol. In case you need to achieve GitHub independence, you just set up a mirror and use that mirror as your Git repo source. Case solved. Works.
Setting up a mirror is also an easy way to distribute the "polling" workload if you actually hit the limits of your git server. And it is much better and more controllable than shared, persistent volumes. With git-sync you might set up mirrors and use them if you do not want to use - and "overload" - the main repo. With EFS, for example, the only knob you can turn when you start hitting the limits is to start paying A LOT for IOPS provisioning - and many of our users complained that they could not get stability without paying a lot for pre-provisioned IOPS once they reached a certain size. It was a completely unexpected and huge expense that they had not foreseen, but because they were already invested in persistence, they were simply "forced" to pay it. With a pure git-sync approach you have many more knobs to turn - and they are super easy, both on premises and in the cloud.
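In chart terms, pointing at such a mirror is nothing more than changing the repo URL - a sketch with a hypothetical internal mirror host (keeping the mirror up to date is handled by your Git hosting or a periodic fetch, outside the chart):

    dags:
      gitSync:
        enabled: true
        repo: ssh://git@git-mirror.internal.example.com/dags.git   # hypothetical local mirror
        branch: main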
I am still not convinced at all of any benefit of persistence + git-sync.
Have you ever had a case or seen a single user complaining about "efficiency" of their git server?
Yes, though not directly "efficiency" - GitHub does rate limiting. And yes, a local mirror is an excellent solution to that.
HTTP request per 10 seconds by default and you can make the period longer.
We default to 60s (which is a mistake imo, PR incoming!). You can't go beyond "0" without some risk of task failures from being out of sync. Acceptable for some, not others. Going to a high interval, say hourly, is really asking for trouble. That's what I mean when I say that, unless you are on a single LocalExecutor, you really don't have a safe way to naively turn down the sync frequency (e.g. without a mirror - and even then, Airflow needs to poll the mirror very frequently, with only the mirror's sync interval turned down).
Ultimately, Airflow expects the DAGs to be consistent across the environment. All of these different options introduce risk that that won't be the case. There are tradeoffs all over the place here.
But when you already use Git to store your files and you want to use GitSync, IMHO adding persistence only makes things worse, never better.
Generally, yes, but I'm not ready to say "never". There are definitely places where NFS is better supported/funded by multiple orders of magnitude over the git service - git just may not be built to be hit in real time from everywhere across the org. Letting NFS take the brunt of it and syncing from git infrequently can make sense.
Don't get me wrong though, I'm not going to die on this hill. I'm more of a "bake it into the image" + KubernetesExecutor guy anyways 🤷♂️.
We default to 60s (which is a mistake imo, PR incoming!). You can't go beyond "0" without some risk of task failures from being out of sync
There is no magic, again. Shared volumes provide no sync guarantees either, because there are no atomicity guarantees. Actually, it is FAR worse (and our users have suffered from that).
In most cases of shared volumes (as far as I know) this risk is far greater. You can have partially synchronized directories, and partially synchronized files in them - when you have a biggish DAG folder on Amazon EFS with low IOPS, consistency of your local view of the DAG folder is simply impossible, because NFS synchronization and flushes are handled for each file separately. With any sizeable shared volume this problem is FAR worse: you can locally have a new DAG importing an old version of a library, or a new library imported by an old DAG - and absolutely no control over that. You can get arbitrary snapshots of arbitrary versions of the files that were copied to the shared volume at the other end. This is a major source of instability for users with EFS and low IOPS, and it likely causes occasional untraceable errors when race conditions happen, even if your IOPS are high.
I think the effect of that is way worse than "slight" sync delays. Especially if you are not aware of the effect.
In the case of GitSync, because of the atomic replacement (a symbolic link swap for the whole DAG folder), we at least have a guarantee of consistency.
And it gets FAR worse when you combine GitSync + shared volumes. Far, far, far worse - precisely because GitSync does this atomic replacement. The way GitSync works, there are points in time where it has exactly TWO FULL COPIES of the DAG folder: one old and one new. When git-sync retrieves a new commit, a copy of the FULL DAG FOLDER is created, and when the git pull is completed, it swaps a symbolic link to that new copy. That git-sync process is extremely heavy on shared folders. Assume NFS:

1) New files are synced (and NFS starts syncing them while git-sync works).
2) Once finished (NFS is still syncing), git-sync swaps the symbolic link to the new directory.
3) Now NFS effectively has to delete all the files from the old link and replace them with the new files from the new link - they are effectively different files, with no relation whatsoever, so NFS has to synchronise all of them again.
4) All this while the files are continuously being read at multiple ends.
The effect is that git-sync, even with a single changed line, causes an avalanche of changes in an NFS-based system. Basically, the whole DAG folder is deleted and recreated on the remote ends from scratch with every single commit (and multiple times at that).
Of course, various versions of NFS and other shared volumes have some optimisations, but none of them is prepared for a scenario where suddenly the whole, possibly huge, DAG folder is replaced by another one (which is what git-sync does with every single change coming in).
Generally, yes, but I'm not ready to say "never".
Well, considering the way the Airflow scheduler accesses the files, how GitSync's atomic commit + a shared volume cause an avalanche of traffic and communication, and that shared volumes do not provide ANY consistency guarantee - I am quite ready to say "never" to "GitSync + shared volumes".
Don't get me wrong though, I'm not going to die on this hill. I'm more of a "bake it into the image" + KubernetesExecutor guy anyways 🤷♂️.
I am also not dying on that hill. I just very strongly advocate for it.
Related: #27545
Came to report the same bug with the helm chart version 1.8.0 - quite a discussion going on here 😅.
For anyone wishing to get a very quick workaround, add the below override to create the missing volume:
scheduler:
  extraVolumes:
    - name: git-sync-ssh-key
      secret:
        secretName: airflow-ssh-secret
I am using GitHub Enterprise. Personally, I am willing to use both git-sync and a persistent volume for redundancy, as relying solely on git-sync would make GitHub a single point of failure for all my DAGs.
I think a slightly more involved way to go would be to create a git-sync K8s Job in CI/CD. That way, instead of constantly polling GitHub, git-sync becomes PR-event-driven - see the sketch below.
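A rough sketch of what such a one-shot Job could look like - all names, the image tag and the PVC name are hypothetical, and it relies on git-sync's --one-time flag (sync once, then exit):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: dags-git-sync-once          # hypothetical name, created by the CI/CD pipeline
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: git-sync
              image: registry.k8s.io/git-sync/git-sync:v3.6.3   # pin a tag available in your registry
              args:
                - --repo=https://github.com/example/dags.git    # hypothetical repo
                - --branch=main
                - --root=/dags
                - --one-time                                    # sync once and exit
              volumeMounts:
                - name: dags
                  mountPath: /dags
          volumes:
            - name: dags
              persistentVolumeClaim:
                claimName: airflow-dags                         # hypothetical PVC shared with Airflow pods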
So if it's not expected that we use git-sync with shared volumes, should it be deleted from the documentation to stop misleading people?
So if it's not expected that we use git-sync with shared volumes, should it be deleted from the documentation to stop misleading people?
We already heavily discourage it in the docs: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html#notes-for-combining-git-sync-and-persistence
because for some people and some setups it might still work.
However, yes - I would heartily welcome any PRs (to the documentation, or maybe even finding a mechanism to warn users while they are installing the helm chart with this combination). Do you have ideas and suggestions for how to do it, @tutunak?
If so, I'd love to see a PR to the documentation or maybe the chart coming from people like you who got confused. Airflow (and its documentation) is contributed to by more than 2700 contributors - this is free, open source software that thousands of users contribute to, and you have the same possibility to improve it as anyone else. And precisely because you were confused and now know better, you can come up with good ideas - if not how to warn, then at least which parts of the documentation confused you - and fix them.
It is super easy - just click "Suggest a change on this page" at the bottom right of the document you want to improve, and a PR will be opened for you; you will be able to contribute a proposal for the doc update directly from the UI.
Can we count on it? Can you help to identify confusing places and propose a fix? I am happy to review and merge such proposals.
I was just hit by this issue as well. While I see there's a lot of discussion on whether it is a good or bad idea to use git-sync together with persistence, I believe the main point is missed here: The documentation promotes a setup that renders invalid manifests: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html#mounting-dags-using-git-sync-sidecar-with-persistence-enabled
When looking for git sync in the docs, this is the first option mentioned and it can never work. So it would be great to either fix the helm chart so the example from docs works, or remove it from the docs altogether. Otherwise newcomers like me spend hours troubleshooting helm chart ...
When looking for git sync in the docs, this is the first option mentioned and it can never work. So it would be great to either fix the helm chart so the example from docs works, or remove it from the docs altogether. Otherwise newcomers like me spend hours troubleshooting helm chart ...
Oh, absolutely. If you could propose an update to the example docs with the way you think it should be done, or add an appropriate warning, that would be fantastic. It's super easy: click on "Suggest a change on this page" and a PR will be opened for you - and you can submit a proposal to make it clearer (basically this is what I proposed in my previous comment as well).
Official Helm Chart version
1.7.0 (latest released)
Apache Airflow version
2.3.4
Kubernetes Version
1.22.6
Helm Chart configuration
Docker Image customisations
No response
What happened
This commit: https://github.com/apache/airflow/commit/3fc895b9dfe8e7b77538bd80754fb17ccf92db49 causes the following error because volumeMount is created but the volume is missing:
upgrade.go:369: [debug] warning: Upgrade "airflow" failed: cannot patch "airflow-scheduler" with kind Deployment: Deployment.apps "airflow-scheduler" is invalid: spec.template.spec.containers[2].volumeMounts[0].name: Not found: "dags"
What you think should happen instead
I don't see any relation between dag persistency and git-sync configuration. The fact is that the same configuration works when credentialsSecret is defined but stopped working with sshKeySecret.
How to reproduce
Follow commit https://github.com/apache/airflow/commit/3fc895b9dfe8e7b77538bd80754fb17ccf92db49
Anything else
No response
Are you willing to submit PR?
Code of Conduct