apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Hash Virtual Environment Cache Based on Actual Package Versions in `PythonVirtualEnvOperator` #41328

Open pedro-cf opened 3 months ago

pedro-cf commented 3 months ago

Description

Update the PythonVirtualenvOperator to key the virtual environment cache on the versions of the packages that actually get installed, rather than only on the checksum of the requirements. The cache would then reflect the true state of the environment, avoiding stale environments when a requirement is left unpinned (effectively "latest") or uses other dynamic versioning.

Use case/motivation

Currently, when using the PythonVirtualenvOperator, if dependencies in the requirements are unpinned (effectively "latest"), the checksum used for caching stays the same even after new package versions are released. As a result, outdated package versions can keep being served from the cache, causing inconsistencies in workflows. If the cache were hashed on the actual versions of the installed packages, the virtual environment would be refreshed whenever package versions change, so the most current versions would be used.

Related issues

No response

potiuk commented 3 months ago

You can't do it. You do not know which versions will be installed until you install them, at which point calculating the hash is already too late, because you have already installed the venv.

Technically speaking, if you do not specify == in all requirements (which you should in this case if you want reproducibility), using the last installed venv snapshot is fully correct: it still follows the specification you gave it.

If you want full reproducibility, just pin all your requirements. That's really the only way.
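As a rough illustration of the pinning advice, the sketch below flags requirement lines that are not pinned with an exact == version. The helper name unpinned_requirements is hypothetical and not part of Airflow or pip, and the regular expression is a simplification of the real requirement syntax. It only inspects the lines it is given, so transitive dependencies are not covered.

```python
import re

# Simplified pattern: "name==version", optional extras, optional environment marker.
# Wildcard pins such as "==1.*" are deliberately treated as unpinned.
_PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;*]+(\s*;.*)?$"
)


def unpinned_requirements(requirements: list[str]) -> list[str]:
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for line in requirements:
        spec = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not spec or spec.startswith("-"):  # skip blanks and pip options (-r, -c, ...)
            continue
        if not _PINNED.match(spec):
            unpinned.append(spec)
    return unpinned


if __name__ == "__main__":
    # With the requirements.txt used later in this thread, only pandas is reported.
    print(unpinned_requirements(["pandas", "colormap==1.0.4"]))
```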

pedro-cf commented 3 months ago

It is possible to run pip download -r requirements.txt, which resolves the versions and downloads the packages if they are missing from the download location.

Example:

requirements.txt

pandas
colormap==1.0.4

pip download -r requirements.txt

Collecting pandas (from -r requirements.txt (line 1))
  Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting colormap==1.0.4 (from -r requirements.txt (line 2))
  Using cached colormap-1.0.4.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting numpy>=1.22.4 (from pandas->-r requirements.txt (line 1))
  Using cached numpy-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting python-dateutil>=2.8.2 (from pandas->-r requirements.txt (line 1))
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas->-r requirements.txt (line 1))
  Using cached pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas->-r requirements.txt (line 1))
  Using cached tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas->-r requirements.txt (line 1))
  Using cached six-1.16.0-py2.py3-none-any.whl.metadata (1.8 kB)

tree .

.
├── colormap-1.0.4.tar.gz
├── numpy-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
├── pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
├── python_dateutil-2.9.0.post0-py2.py3-none-any.whl
├── pytz-2024.1-py2.py3-none-any.whl
├── requirements.txt
├── six-1.16.0-py2.py3-none-any.whl
└── tzdata-2024.1-py2.py3-none-any.whl

Running pip download -r requirements.txt again when the packages are already downloaded:

Collecting pandas (from -r requirements.txt (line 1))
  File was already downloaded /mnt/c/git/tst/tst/pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting colormap==1.0.4 (from -r requirements.txt (line 2))
  File was already downloaded /mnt/c/git/tst/tst/colormap-1.0.4.tar.gz
  Preparing metadata (setup.py) ... done
Collecting numpy>=1.22.4 (from pandas->-r requirements.txt (line 1))
  File was already downloaded /mnt/c/git/tst/tst/numpy-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting python-dateutil>=2.8.2 (from pandas->-r requirements.txt (line 1))
  File was already downloaded /mnt/c/git/tst/tst/python_dateutil-2.9.0.post0-py2.py3-none-any.whl
Collecting pytz>=2020.1 (from pandas->-r requirements.txt (line 1))
  File was already downloaded /mnt/c/git/tst/tst/pytz-2024.1-py2.py3-none-any.whl
Collecting tzdata>=2022.7 (from pandas->-r requirements.txt (line 1))
  File was already downloaded /mnt/c/git/tst/tst/tzdata-2024.1-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas->-r requirements.txt (line 1))
  File was already downloaded /mnt/c/git/tst/tst/six-1.16.0-py2.py3-none-any.whl
Successfully downloaded colormap pandas numpy python-dateutil pytz tzdata six

We could possibly parse this output to generate the hash?
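A minimal sketch of what such parsing might look like, assuming only the two line shapes shown in the output above ("Using cached ..." and "File was already downloaded ..."). The function names are hypothetical, and pip's human-readable output is not a stable interface, so this is an illustration rather than something to rely on.

```python
import hashlib
import re

# The two line shapes that appear in the pip download output above.
_ARTIFACT = re.compile(r"^\s*(?:Using cached|File was already downloaded)\s+(\S+)")


def artifacts_from_pip_output(output: str) -> list[str]:
    """Extract the sorted, de-duplicated artifact file names from pip download output."""
    names = set()
    for line in output.splitlines():
        match = _ARTIFACT.match(line)
        if not match:
            continue
        name = match.group(1).rsplit("/", 1)[-1]  # keep the file name, drop any directory
        if name.endswith(".metadata"):  # first run reports "<wheel>.metadata"
            name = name[: -len(".metadata")]
        names.add(name)
    return sorted(names)


def hash_artifacts(names: list[str]) -> str:
    """Hash the resolved artifact names into a cache key."""
    return hashlib.sha256("\n".join(names).encode("utf-8")).hexdigest()
```

Because the file names embed the resolved versions, the first and second runs above would produce the same key, and a newly released pandas would change it.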

potiuk commented 3 months ago

Yes, but why use the cache at all in this case if every attempt to use the venv is going to take about as long as running pip download?

It negates the cache savings.

pedro-cf commented 3 months ago

If the .whl files are already downloaded, they will not be re-downloaded, and pip download does not install anything.

potiuk commented 3 months ago

Sure, you can attempt to make a PR if you think it's worth it. You seem to have a good idea of what to do in this case. I think it complicates things a little too much, but if you would like to spend time on it, write tests, etc., feel free. However, please make it an optional flag so that it does not happen by default, as even resolution and downloading wheels will add quite an overhead in a number of cases.

pedro-cf commented 3 months ago

There's also the option to use pip index versions <package_name>, for example:

pip index versions colormap

colormap (1.1.0)
Available versions: 1.1.0, 1.0.6, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0, 0.9.10, 0.9.9, 0.9.8, 0.9.7, 0.9.6, 0.9.5, 0.9.4, 0.9.3, 0.9.2, 0.9.1, 0.9.0

potiuk commented 3 months ago

Just one important comment: with this change virtualenvs will stop being immutable. This means you will have to handle the case where a venv is being reinstalled while it is in use (for example by another Celery process on the same machine), so reinstallation will likely have to use symbolic links and atomic renames of the changed venv. It will also likely need some way of disposing of the old venvs.
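The symbolic-link and atomic-rename idea could be sketched roughly as follows. This assumes a POSIX filesystem and a hypothetical layout in which each venv lives in its own directory and a "current" symlink points at the active one. The name activate_venv is illustrative only, not an Airflow API.

```python
import os


def activate_venv(link_path: str, new_venv_dir: str) -> None:
    """Atomically repoint the symlink `link_path` at `new_venv_dir` (POSIX only)."""
    # Build the new symlink under a temporary name next to the final one...
    tmp_link = f"{link_path}.tmp-{os.getpid()}"
    if os.path.lexists(tmp_link):  # clean up a leftover from an interrupted run
        os.unlink(tmp_link)
    os.symlink(new_venv_dir, tmp_link, target_is_directory=True)
    # ...then rename it over the old link. rename() replaces the link itself,
    # so readers see either the old target or the new one, never a missing link.
    os.replace(tmp_link, link_path)
```

Processes that already resolved the old directory keep using it, which is why the old venv cannot simply be deleted at the moment of the switch.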

pedro-cf commented 3 months ago

If the "latest" version of a package changes, the hash for the respective venv would change too. The only thing I wanted to achieve was to resolve dynamic versions into absolute versions; for example, colormap>=1.0.0 or colormap would be converted into colormap==1.1.0.

potiuk commented 3 months ago

> If the "latest" version of a package changes the hash for the respective venv would change too. The only thing I wanted to achieve was to parse dynamic versions into absolute version like f.e.: colormap>=1.0.0 or colormap are converted into colormap==1.1.0

Sure. The problem with that approach is that it can balloon the number of venvs. For example, boto3 releases a new version every day or so, which means that if you dynamically recreate the venv based on the latest version available on PyPI and require boto3>x.y.z, a new copy of the venv will be created every day. Previously this happened only when you actually changed the dependency requirements.

But yes, if the hashes are different and the venvs are stored separately, each venv will be essentially immutable. There will potentially be many more of them, though, and a strategy needs to be worked out for disposing of the old ones, because their number could grow in a totally uncontrollable way without any action by the DAG author. I wonder what the proposal for that would be, because even then some tasks could still be using the old version of the venv (with a different hash) while the new one is being installed and used by a different task.
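One possible disposal strategy, sketched under the assumption that every cached venv is a separate subdirectory of a single cache root, is to keep only the most recently modified ones. The function prune_old_venvs is hypothetical, and it deliberately ignores the harder problem raised above: it does not check whether a task is still running from a directory it is about to delete.

```python
import os
import shutil


def prune_old_venvs(cache_root: str, keep: int = 3) -> list[str]:
    """Delete all but the `keep` most recently modified subdirectories of `cache_root`."""
    venv_dirs = [
        entry.path
        for entry in os.scandir(cache_root)
        if entry.is_dir(follow_symlinks=False)  # ignore files and symlinks
    ]
    venv_dirs.sort(key=os.path.getmtime, reverse=True)  # newest first
    removed = []
    for path in venv_dirs[keep:]:
        shutil.rmtree(path)
        removed.append(path)
    return removed
```

A real implementation would probably need some form of locking or reference counting before removal, which is beyond this sketch.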

> The only thing I wanted to achieve was to parse dynamic versions into absolute version like f.e.: colormap>=1.0.0 or colormap are converted into colormap==1.1.0
> pip index versions colormap

Pip index is not nearly enough. You have to run the resolver to resolve the dependencies, because new versions of requirements might have different constraints. You actually have to perform a full pip install resolution; you cannot just take the "latest" of all dependencies that are specified with a lower bound. For example, if colormap==1.1.0 requires foobar<3.2 and you already have foobar 3.2 installed (because colormap==1.0.0 did not have that limit), pip has to resolve the dependencies and decide whether to downgrade foobar or simply not upgrade to the newer colormap (otherwise it would end up with a conflict). So any time you want to check for the "non-conflicting" dependencies, you basically have to do a full dependency resolution with the eager upgrade strategy (--upgrade-strategy eager), or perform a completely new installation (and dependency resolution) without looking at what is already installed in the target venv.

This is the overhead that will have to be paid on every single run of a task with such a venv definition, regardless of whether the cache is there, because you need to do that resolution in order to calculate the new hash and compare it with the existing one. It is an overhead because such resolution might sometimes involve backtracking and downloading multiple versions of the same package, even if you already have the current version of the dependency in the local cache. It can sometimes take minutes (and this was the main reason we wanted to implement caching: to save time on dependency resolution and downloading).
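To make the cost concrete, here is a sketch of how the resolved versions could be obtained and hashed without installing anything. It assumes a pip recent enough to support --dry-run and --report (pip 22.2 or later); the function names are hypothetical. Every call to resolve_with_pip runs the full resolver, which is exactly the per-run overhead described above.

```python
import hashlib
import json
import subprocess
import sys


def resolve_with_pip(requirements_file: str) -> dict:
    """Run pip's resolver without installing and return its JSON installation report."""
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--dry-run", "--ignore-installed", "--quiet",
        "--report", "-",  # write the JSON report to stdout
        "-r", requirements_file,
    ]
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)


def pinned_from_report(report: dict) -> list[str]:
    """Turn the report's "install" entries into sorted name==version pins."""
    pins = {
        f'{item["metadata"]["name"]}=={item["metadata"]["version"]}'
        for item in report["install"]
    }
    return sorted(pins, key=str.lower)


def hash_resolved(report: dict) -> str:
    """Hash the fully resolved pins; the result changes when any resolved version changes."""
    return hashlib.sha256("\n".join(pinned_from_report(report)).encode("utf-8")).hexdigest()
```

Usage would be along the lines of hash_resolved(resolve_with_pip("requirements.txt")), with the result compared against the hash stored for the cached venv.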

Dependency resolution against PyPI can be (and often is) quite time- and network-consuming.

Basically you have two options now:

1) No cache: you always get the latest versions (at the expense of dependency resolution and downloading packages). Often slow and not predictable.

2) Cache: you always get the "first matching requirements installed" on that machine, which makes it potentially inconsistent between runs on different machines (but with very little overhead: resolution and installation happen only the first time).

Essentially, what you want is option 3)

3) Cache, but check whether the cache needs to be invalidated because new versions of some dependencies have been released since the last time the task ran. This is something in between: if nothing changed, you only pay the price of performing the resolution, which might or might not be slow and is somewhat unpredictable (it depends on packages released by third parties). It also has the drawback of potentially leaving behind many versions of venvs, which can grow in an uncontrollable way over time, so we need to find a solution for managing those.

But yes, if you want to pursue that and propose a PR, feel free.