astral-sh / uv

An extremely fast Python package and project manager, written in Rust.
https://docs.astral.sh/uv
Apache License 2.0

Cache is hitting race conditions when sharing the cache directory across multiple running Docker containers #7434

Open henryborchers opened 4 days ago

henryborchers commented 4 days ago

While building wheels on my CI machine, I've been running into an issue where several uv processes try to access and write to the same files in the cache directory. This only happens when two or more containers are running with the same mounted path as UV_CACHE_DIR (the cache path). It doesn't happen every time, only when uv is trying to access the same files in the cache at the same moment, so the CI pipeline can pass or fail depending on which stages happen to be running at the same time.

It looks like a possible race condition that wasn't a problem when I was using pip. Is there a "process locking file" located outside of UV_CACHE_DIR that I should also be mounting so that the cache isn't written to at the same time?

uv==2.12 build==1.2.1

Note: In this example, I'm using Python 3.8 on Linux, but it happens with all Python versions in both Linux and Windows Docker containers.

Docker Command used:

docker run -ti -v uvcache:/.cache/uv -e UV_CACHE_DIR=/.cache/uv some_docker_image_that_has_uv_installed

uv command used inside the container: python3.8 -m build --wheel --installer=uv

Output

+ UV_INDEX_STRATEGY=unsafe-best-match
+ python3.8 -m build --wheel --installer=uv
* Creating isolated environment: venv+uv...
* Using external uv from /usr/local/bin/uv
* Installing packages in isolated environment:
  - conan>=1.50.0,<2.0
  - setuptools>=40.8.0
  - uiucprescon.build @
  git+https://github.com/UIUCLibrary/uiucprescon_build.git@v0.2.1
  - wheel
> /usr/local/bin/uv pip install setuptools>=40.8.0 wheel "uiucprescon.build @
  git+https://github.com/UIUCLibrary/uiucprescon_build.git@v0.2.1"
  conan>=1.50.0,<2.0
< error: failed to open file `/.cache/uv/CACHEDIR.TAG`
<   Caused by: Permission denied (os error 13)

Traceback (most recent call last):
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/__main__.py", line 178, in _handle_build_error
    yield
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/__main__.py", line 429, in main
    built = build_call(
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/__main__.py", line 238, in build_package
    out = _build(isolation, srcdir, outdir, distribution, config_settings, skip_dependency_check, installer)
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/__main__.py", line 170, in _build
    return _build_in_isolated_env(srcdir, outdir, distribution, config_settings, installer)
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/__main__.py", line 135, in _build_in_isolated_env
    env.install(builder.build_system_requires)
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/env.py", line 136, in install
    self._env_backend.install_requirements(requirements)
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/env.py", line 301, in install_requirements
    run_subprocess([*cmd, 'install', *requirements], env={**os.environ, 'VIRTUAL_ENV': self._env_path})
  File "/opt/python/cp38-cp38/lib/python3.8/site-packages/build/_ctx.py", line 71, in run_subprocess
    subprocess.run(cmd, capture_output=True, check=True, env=env)
  File "/opt/python/cp38-cp38/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/bin/uv', 'pip', 'install', 'setuptools>=40.8.0', 'wheel', 'uiucprescon.build @ git+https://github.com/UIUCLibrary/uiucprescon_build.git@v0.2.1', 'conan>=1.50.0,<2.0']' returned non-zero exit status 2.

ERROR Command '['/usr/local/bin/uv', 'pip', 'install', 'setuptools>=40.8.0', 'wheel', 'uiucprescon.build @ git+https://github.com/UIUCLibrary/uiucprescon_build.git@v0.2.1', 'conan>=1.50.0,<2.0']' returned non-zero exit status 2.
charliermarsh commented 4 days ago

We very intentionally do not lock the cache; it's designed to be concurrency-safe. My guess is that the error has something to do with the file being created by a different user...?
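For readers unfamiliar with lock-free designs, here is a minimal Python sketch of one general way a shared marker file can be created safely by several workers without a lock. It is an illustration only and not uv's actual implementation: the function names `ensure_cache_tag` and `simulate_concurrent_workers` are hypothetical, and threads in a temporary directory stand in for separate containers. The only detail taken from outside this thread is the signature line from the Cache Directory Tagging Specification.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Signature line defined by the Cache Directory Tagging Specification.
TAG_CONTENTS = "Signature: 8a477f597d28d172789f06886806bc55\n"


def ensure_cache_tag(cache_dir: str) -> bool:
    """Create CACHEDIR.TAG if it is missing.

    Returns True if this call created the file, False if it already existed.
    (Hypothetical helper; not how uv is known to do it.)
    """
    tag_path = os.path.join(cache_dir, "CACHEDIR.TAG")
    try:
        # Mode "x" is an exclusive create: it raises FileExistsError instead
        # of truncating a file that another worker has already created.
        with open(tag_path, "x", encoding="utf-8") as f:
            f.write(TAG_CONTENTS)
        return True
    except FileExistsError:
        return False


def simulate_concurrent_workers(workers: int = 8) -> int:
    """Run several workers against one directory; return how many created the tag."""
    with tempfile.TemporaryDirectory() as cache_dir:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda _: ensure_cache_tag(cache_dir), range(workers)))
    return sum(results)
```

With an exclusive create, exactly one worker writes the file and the others simply observe that it exists. Note that this pattern does nothing about ownership: if the file was created by a different user with restrictive permissions, a later open can still fail with "Permission denied".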

charliermarsh commented 4 days ago

Are both containers running as the same user?
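One way to answer this from inside each container is to compare the current user ID with the owner and mode of the file that failed to open. The sketch below is a hedged example, not a uv feature: `describe_access` is a hypothetical helper, and it relies on `os.getuid`, so it applies to Linux containers only.

```python
import os
import stat


def describe_access(path: str) -> dict:
    """Report who owns `path` and whether the current user can open it.

    Hypothetical diagnostic helper; Unix-only because it uses os.getuid().
    """
    st = os.stat(path)
    return {
        "path": path,
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        # Human-readable permission string, e.g. "-rw-r--r--".
        "mode": stat.filemode(st.st_mode),
        "current_uid": os.getuid(),
        "same_owner": st.st_uid == os.getuid(),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }
```

Calling `describe_access("/.cache/uv/CACHEDIR.TAG")` (the path from the error above) in each container and comparing the results would show whether the containers really see the same owner and whether the file is readable by the user running uv.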

henryborchers commented 4 days ago

They should be running as the same user. They are based on the same image.

henryborchers commented 3 days ago

To give you a little more context, these Docker containers are spun up by Jenkins as part of the pipeline. When building a wheel with an extension, I build the wheel for each supported version of Python in their own Docker container (based on the same image that has all the supported versions of Python installed) running in parallel. When these containers run, they run with the exact same settings, same environment variables, same user, and same path to cache mounted on the host system.

When the containers run one at a time, everything runs fine. It uses the cache without an issue. However, when multiple containers are accessing the same cache directory, I get the permission errors.

zanieb commented 3 days ago

Is the error always when trying to write the CACHEDIR.TAG or do you see this for other files too?
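If the CI logs are kept, the affected paths can be tallied rather than recalled from memory. The following is a rough sketch that assumes every failure is reported in the same form as the message quoted above (error: failed to open file followed by a backtick-quoted path); other uv errors may be worded differently and would not be counted. The names `FAILED_OPEN` and `count_failed_paths` are hypothetical.

```python
import re
from collections import Counter

# Matches the message format seen in the report above; this format is an
# assumption for any other uv error.
FAILED_OPEN = re.compile(r"error: failed to open file `([^`]+)`")


def count_failed_paths(log_text: str) -> Counter:
    """Count how often each path appears in a 'failed to open file' error."""
    return Counter(FAILED_OPEN.findall(log_text))
```

Running this over the combined output of all parallel stages would show whether CACHEDIR.TAG is the only file involved.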

henryborchers commented 2 days ago

Other files as well. If you want, I can post the stack traces for those here too.

henryborchers commented 1 day ago

Actually, I think CACHEDIR.TAG might be the only file I'm running into this with, at least on Linux.

zanieb commented 1 day ago

Let's see if https://github.com/astral-sh/uv/pull/7550 resolves this then