astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0
573 stars · 142 forks

Support caching virtualenvs created when using ExecutionMode.VIRTUALENV #610

Closed · tatiana closed this 1 week ago

tatiana commented 10 months ago

As of Cosmos 1.2, when using ExecutionMode.VIRTUALENV, each task will create a new Python virtual environment, in a temporary directory: https://github.com/astronomer/astronomer-cosmos/blob/a433f15d6fd20f5fd713268040eba09727db4a2c/cosmos/operators/virtualenv.py#L68

This can cause delays, as discussed in the Slack thread: https://apache-airflow.slack.com/archives/C059CC42E9W/p1697614400289939

This could be improved, assuming the Airflow worker node is reused to run multiple tasks.

Proposal: Persist the virtual environment directory if users set a new configuration option, ExecutionConfig.virtualenv_path.
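A minimal sketch of the persist-or-create behavior this proposal describes. The helper name and signature are illustrative, not Cosmos API; it assumes a POSIX venv layout (`bin/python`):

```python
import subprocess
import sys
from pathlib import Path


def get_or_create_virtualenv(venv_path: str, py_requirements: list[str]) -> str:
    """Reuse the virtualenv at venv_path if it exists; otherwise create it.

    Illustrative helper, not part of the Cosmos API: if users point the
    proposed ExecutionConfig.virtualenv_path at a persistent directory,
    only the first task on a worker pays the creation cost.
    """
    venv = Path(venv_path)
    python_bin = venv / "bin" / "python"  # assumes POSIX layout
    if not python_bin.exists():
        subprocess.run([sys.executable, "-m", "venv", str(venv)], check=True)
        if py_requirements:
            subprocess.run(
                [str(python_bin), "-m", "pip", "install", *py_requirements],
                check=True,
            )
    return str(python_bin)
```

Subsequent tasks that receive the same path skip straight past the creation branch.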

LennartKloppenburg commented 10 months ago

As discussed with @tatiana I'd be happy to take a first stab at this :)

tatiana commented 10 months ago

Thank you very much for taking the lead on this, @LennartKloppenburg ! Please, let us know if you'd like any support

LennartKloppenburg commented 10 months ago

Thinking about whether it'd make sense to add a clean-up callback to delete the virtual env that should be executed after the whole DAG has completed? Or should we defer this responsibility/flexibility to the end-user?

LennartKloppenburg commented 10 months ago

More things to consider:

  1. It could be useful to also centralize and cache the installation of dbt deps, as this currently still happens in every task with this PR. Implementing that would invalidate the install_deps flag at the operator level (on DbtLocalBaseOperator, for instance); it would make more sense on the ExecutionConfig instead, at least when using VirtualEnv as the ExecutionMode.

  2. If we let each operator create a virtual env and cache it, it can cause concurrency issues with other operators simultaneously doing the same thing. My changes work when starting with one task in isolation and then running the rest, but not when tasks start concurrently... Perhaps the most elegant solution would be a "set-up" operator at the start of a VirtualEnv-based DAG that creates the env and installs the Python requirements, dbt deps, and everything else, plus another operator that tears it down at the end. That would remove a lot of the complexity we'd otherwise have to implement at the operator level.

The current PR also splits the operator's responsibility into "do whatever dbt thing it needs to do" and "optionally set up a virtual env if necessary", which feels like an anti-pattern.
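Point 1 above, centralizing and caching the dbt deps installation, could look roughly like this. The function and its callback parameter are hypothetical, shown only to make the caching idea concrete:

```python
from pathlib import Path
from typing import Callable


def install_dbt_deps_once(project_dir: str, run_deps: Callable[[str], None]) -> bool:
    """Run dbt deps only if the packages are not already installed.

    `dbt deps` writes installed packages to dbt_packages/ inside the
    project (dbt_modules/ on dbt < 1.0), so its presence is used here
    as the cache marker. In a real implementation `run_deps` would
    shell out to `dbt deps`. Returns True if deps were installed now.
    """
    if (Path(project_dir) / "dbt_packages").exists():
        return False  # cache hit: another task already installed deps
    run_deps(project_dir)
    return True
```

With the check living in one shared helper rather than in each operator, the per-task install_deps flag becomes unnecessary, which is the point made in item 1.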

tatiana commented 10 months ago

@LennartKloppenburg, we discussed this last week, and I'll add some thoughts here as well:

The challenge with having the setup in a separate operator is that, due to the distributed nature of Airflow, there is no guarantee this operator would run on the same node as the other dbt/Cosmos tasks. And if users run Airflow with the Kubernetes executor, a new pod is created each time a worker picks up a task, so the setup has to happen at the operator level.

If the setup was in some remote service, we could try to leverage the Airflow 2.7 DAG-level setup/teardown: https://airflow.apache.org/docs/apache-airflow/stable/howto/setup-and-teardown.html

If Airflow had worker-node-level setup/teardown, it would be optimal for our use case. Still, say different tasks had different dependencies: we may face concurrency challenges if we use asynchronous operators (it's been discussed to have Cosmos support dbt Cloud, for instance, and in that case we'd be leveraging Airflow deferrable operators).

For this reason, I believe we'll need some solution to the concurrency issue at the task level for now. I'll add more thoughts on the PR itself.
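One task-level approach to the concurrency issue is a file lock next to the cached virtualenv: the first task to acquire it builds the environment, and concurrent tasks on the same worker block until it is ready. A POSIX-only sketch under those assumptions; the helper and its `create_fn` callback are illustrative, not Cosmos code:

```python
import fcntl
from pathlib import Path
from typing import Callable


def with_virtualenv_lock(venv_dir: str, create_fn: Callable[[Path], None]) -> Path:
    """Serialize virtualenv creation across concurrent tasks on one worker.

    Takes an exclusive flock on a sibling .lock file; the first holder
    creates the environment, and later tasks block until it is released,
    then find the venv already built. POSIX-only (fcntl); `create_fn`
    stands in for the actual environment build.
    """
    venv = Path(venv_dir)
    venv.parent.mkdir(parents=True, exist_ok=True)
    lock_path = venv.parent / (venv.name + ".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until exclusive
        try:
            if not (venv / "bin" / "python").exists():
                create_fn(venv)  # only one task actually builds the env
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return venv
```

Because the lock is keyed to the venv path, tasks using different cached environments never contend with each other.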

LennartKloppenburg commented 10 months ago

Yep, you're totally right! Sorry for duplicating the question here, but your info is excellent as a paper-trail so let's leave it :D

dosubot[bot] commented 6 months ago

Hi, @tatiana,

I'm helping the Cosmos team manage their backlog and am marking this issue as stale. From what I understand, the issue proposes adding support for caching virtual environments created with ExecutionMode.VIRTUALENV in Cosmos to improve performance. There's a volunteer considering adding a clean-up callback and centralizing and caching the installation of dbt deps, and there's a discussion about potential concurrency issues and the need to address them at a task level due to the distributed nature of Airflow. Tatiana has provided insights on the challenges of having setup in a separate operator and the need to consider concurrency issues at a task level.

Could you please confirm if this issue is still relevant to the latest version of the Cosmos repository? If it is, please let the Cosmos team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!