Closed: yeswalrus closed this issue 4 months ago.
Can you share more information, e.g., a trace with --verbose, to demonstrate what's failing? In general, we do already retry.
This may be an issue with our Artifactory server having an artificially low rate limit, but I'm not entirely sure. Logs here:
Using Python 3.10.12 interpreter at: /usr/bin/python3
Creating virtualenv at: test_venvs/6c38fa92-edb8-4dc3-8c34-6a395504bd9d/
Activate with: source test_venvs/6c38fa92-edb8-4dc3-8c34-6a395504bd9d/bin/activate
Built 6 editables in 1.74s
Resolved 114 packages in 10.33s
error: Failed to download distributions
Caused by: Failed to fetch wheel: plotly==4.12.0
Caused by: Failed to extract archive
Caused by: error decoding response body
Caused by: request or response body error
Caused by: error reading a body from connection
Caused by: Connection reset by peer (os error 104)
Potentially related to #3456
(worth noting that requests are already retried 3 times by default)
We don't retry at the TCP layer (e.g., connection reset errors), as far as I know. I think we'll need to add custom middleware to retry these.
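For illustration only, here's a minimal sketch of what such a middleware could look like on top of reqwest-middleware, assuming reqwest-retry's RetryableStrategy trait and its default_on_request_success / default_on_request_failure helpers are available in the version in use. This is not uv's actual implementation; it just shows one way to classify connection resets as transient so the retry policy kicks in:

```rust
use reqwest_middleware::ClientBuilder;
use reqwest_retry::{
    default_on_request_failure, default_on_request_success, policies::ExponentialBackoff,
    Retryable, RetryableStrategy, RetryTransientMiddleware,
};

// Hypothetical strategy: walk the error chain and treat an
// io::ErrorKind::ConnectionReset as a transient (retryable) failure,
// deferring to reqwest-retry's default classification otherwise.
struct RetryConnectionReset;

impl RetryableStrategy for RetryConnectionReset {
    fn handle(
        &self,
        res: &Result<reqwest::Response, reqwest_middleware::Error>,
    ) -> Option<Retryable> {
        match res {
            // Successful responses: use the default status-code handling.
            Ok(response) => default_on_request_success(response),
            Err(err) => {
                // Look for a connection reset anywhere in the error chain.
                let mut source = Some(err as &(dyn std::error::Error + 'static));
                while let Some(e) = source {
                    if let Some(io_err) = e.downcast_ref::<std::io::Error>() {
                        if io_err.kind() == std::io::ErrorKind::ConnectionReset {
                            return Some(Retryable::Transient);
                        }
                    }
                    source = e.source();
                }
                // Fall back to the default request-failure handling.
                default_on_request_failure(err)
            }
        }
    }
}

fn build_client() -> reqwest_middleware::ClientWithMiddleware {
    let policy = ExponentialBackoff::builder().build_with_max_retries(3);
    ClientBuilder::new(reqwest::Client::new())
        .with(RetryTransientMiddleware::new_with_policy_and_strategy(
            policy,
            RetryConnectionReset,
        ))
        .build()
}
```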
If someone could link or summarize the relevant details about what exactly is retried today, that'd be really helpful.
Fair, I only did a quick look and it seemed like it would retry based on reqwest-retry, but is_connect is likely not what I thought it was, and io::ErrorKind::ConnectionReset would need to be checked separately.
Relevant uv code is here https://github.com/astral-sh/uv/blob/main/crates/uv-client/src/base_client.rs#L169
Looking at it again, it does seem like it would retry on resets, see retryable_strategy.rs#L192
Here's a complete (slightly obfuscated) log of one such instance, captured with --verbose enabled, in case it helps with debugging. It's worth noting that this is a less serious issue: I only start seeing it reliably when I have ~40 simultaneous venvs being created and --no-cache set, at least with my artificial benchmark, so this is something of an edge case. Still, having some way to throttle requests or specify longer timeout windows would be helpful.
Thanks for the links @samypr100; cc @konstin?
FYI, I'm also seeing this same error semi-often in my work environment:
⠙ defusedxml==0.7.1 error: Failed to download `defusedxml==0.7.1`
Caused by: request or response body error
Caused by: error reading a body from connection
Caused by: Connection reset by peer (os error 104)
I'm not sure of the details, but we have some kind of Palo Alto firewall that is almost certainly causing the issue; running the uv command again almost always succeeds.
I'm going to test setting the new UV_CONCURRENT_DOWNLOADS (https://github.com/astral-sh/uv/pull/3493) to 1 and see if that makes a difference in how often I see this error. Though pip also has issues in my work environment from time to time, so it may just reduce the frequency of failures.
I assume those retries don't apply to streaming operations. Like, they apply to the initial request itself, but probably not the entirety of reading from the stream? We can fix this by adding our own retry around download operations.
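As a rough illustration of that idea (not uv's actual code), a retry wrapper around the whole body-consuming operation might look like the sketch below; the operation closure, retry count, and backoff are hypothetical placeholders:

```rust
use std::time::Duration;

// Hypothetical helper: retry an entire download-and-extract operation.
// Errors hit while reading the response body surface only after the
// retry middleware has already returned, so the caller has to loop itself.
async fn with_retries<T, E, F, Fut>(max_retries: u32, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_retries => {
                attempt += 1;
                eprintln!("download failed ({err}); retrying {attempt}/{max_retries}");
                // Simple linear backoff; a real implementation would likely use
                // exponential backoff with jitter and only retry errors it
                // classifies as transient (e.g., connection resets).
                tokio::time::sleep(Duration::from_millis(500 * u64::from(attempt))).await;
            }
            Err(err) => return Err(err),
        }
    }
}
```

The download itself (say, a hypothetical fetch_and_extract(url)) would then be invoked as with_retries(3, || fetch_and_extract(url)).await.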
@charliermarsh is there any work planned around this issue? We reconfigured UV_CONCURRENT_DOWNLOADS to avoid running into this issue frequently. While it helped, our CI builds still run into connection failures here and there.
Hi @aybidi — we think this is important to address but haven't started working on a fix yet. If anyone is willing to investigate and contribute in the meantime I'm happy to review.
I'm going to test setting the new UV_CONCURRENT_DOWNLOADS (#3493) to 1 and see if that makes a difference in how often I see this error.
FWIW, for my use case setting this did not help, but one day the firewall just started behaving better and I've not seen the error again.
Here's an example of a similar class of error, observed in a CI build, that it would be nice to have uv retry automatically:
error: Failed to download distributions
Caused by: Failed to fetch wheel: torch==2.1.0
Caused by: Failed to extract archive
Caused by: request or response body error: error reading a body from connection: stream error received: unspecific protocol error detected
Caused by: error reading a body from connection: stream error received: unspecific protocol error detected
Caused by: stream error received: unspecific protocol error detected
I'm also hitting this issue. It's rather painful when going through a corporate proxy that may have concurrent connection limits, and retrying manually doesn't always work in CI.
@messense With the latest release, what's the error message? Does it fail with retries or without?
I'm using uv 0.2.23; the error message is:
error: Failed to prepare distributions
Caused by: Failed to fetch wheel: jaxlib==0.4.16+cuda12.cudnn89
Caused by: Failed to extract archive
Caused by: error decoding response body
Caused by: request or response body error
Caused by: error reading a body from connection
Caused by: Connection reset by peer (os error 104)
Judging from the code, reqwest-retry simply won't retry body errors.
Thanks, good catch! Let's fix that.
Sounds like the fix would need to be on uv's end: https://github.com/TrueLayer/reqwest-middleware/issues/47#issuecomment-1170955570
I put up a patch at #4960 if anyone wants to give it a try.
I'll take this build for a spin on my testing farm. The networking issues I was suffering were infrequent, so I won't be able to say for certain, but if it shows up I'll let you know.
We'll probably release it soon too! Thanks though :)
That build consistently gets stack overflow errors for me. I'm on Windows.
> uv.exe pip install --verbose cowsay
DEBUG uv 0.2.23
DEBUG Searching for Python interpreter in system path or `py` launcher
DEBUG Found cpython 3.12.3 at `D:\CosmosAnalytics\private\cosmos\Scope\pyscope\.venv\Scripts\python.exe` (virtual environment)
DEBUG Using Python 3.12.3 environment at .venv\Scripts\python.exe
DEBUG Acquired lock for `.venv`
DEBUG At least one requirement is not satisfied: cowsay
DEBUG Using request timeout of 30s
DEBUG Solving with installed Python version: 3.12.3
DEBUG Adding direct dependency: cowsay*
DEBUG No cache entry for: https://pypi.org/simple/cowsay/
DEBUG Searching for a compatible version of cowsay (*)
DEBUG Selecting: cowsay==6.1 (cowsay-6.1-py3-none-any.whl)
DEBUG No cache entry for: https://files.pythonhosted.org/packages/f1/13/63c0a02c44024ee16f664e0b36eefeb22d54e93531630bd99e237986f534/cowsay-6.1-py3-none-any.whl.metadata
thread 'main' has overflowed its stack
Are you using a release build?
I just grabbed the artifact from the CI build; is that not what I was meant to do?
Found this, will try with that env var
@zanieb #4960 does not work for me; it still gives the same error:
installing uv...
done! ✨ 🌟 ✨
installed package uv 0.2.24, installed using Python 3.10.12
...
...
Resolved 258 packages in 34.54s
error: Failed to prepare distributions
Caused by: Failed to fetch wheel: nvidia-cudnn-cu12==8.9.2.26
Caused by: Failed to extract archive
Caused by: error decoding response body
Caused by: request or response body error
Caused by: error reading a body from connection
Caused by: Connection reset by peer (os error 104)
That's so tragic 😭 okay I'll dig deeper. I'll need to reproduce it with a fake server or something. If anyone has time to poke at a reproduction, let me know!
If I add a panic!() inside https://github.com/astral-sh/uv/blob/23c6cd774b466924d02de23add7101bcaa7b7c3e/crates/uv-client/src/base_client.rs#L291 and then inject a connection reset error using sudo iptables -A OUTPUT -p tcp --dport 3128 -j REJECT --reject-with tcp-reset (our proxy is running on port 3128), the panic never happens; only the Connection reset by peer error message is printed. So my guess is that reqwest-retry or reqwest-middleware can't handle this kind of retry at the moment.
^ Yeah, I was able to achieve similar results using @hauntsaninja's https://github.com/hauntsaninja/nginx_pypi_cache and @messense's approach.
I also tried the tc Linux utility, dropping/corrupting packets, but that gave me other types of unrelated errors, such as BufErrors 😂
I think the issue is with stream_wheel: if I change https://github.com/astral-sh/uv/blob/9a44bc1d3567e0a2ba31675bc35c50392fc2f5ad/crates/uv-distribution/src/distribution_database.rs#L211 to remove the if and unconditionally try download_wheel when stream_wheel fails with an Extract error, uv correctly retries.
error: Failed to prepare distributions
Caused by: Failed to fetch wheel: beautifulsoup4==4.12.3
Caused by: Request failed after 3 retries
Caused by: error sending request for url (https://files.pythonhosted.org/packages/b1/fe/e8c672695b37eecc5cbf43e1d0638d88d66ba3a44c4d321c796f4e59167f/beautifulsoup4-4.12.3-py3-none-any.whl)
Caused by: client error (Connect)
Caused by: tcp connect error: Connection refused (os error 111)
Caused by: Connection refused (os error 111)
So my guess is that reqwest-retry does not support retrying streaming responses? Could we have a config/option/env var to force download_wheel?
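To make the suggested fallback concrete, here is a minimal sketch of the pattern (stream-and-unzip first, fall back to a plain full download when extraction fails mid-stream). The types and functions below (Wheel, Error, stream_wheel, download_wheel) are hypothetical stand-ins for uv's internals, not its real API:

```rust
// Hypothetical stand-ins for uv's internals, kept minimal so the
// fallback pattern itself is visible.
struct Wheel;

#[derive(Debug)]
enum Error {
    // Unzipping the response body failed partway through (e.g., the
    // connection was reset while streaming).
    Extract(String),
    // Any other failure (resolution, I/O, ...).
    Other(String),
}

// Hypothetical: unzip the wheel as the response body streams in.
async fn stream_wheel(_url: &str) -> Result<Wheel, Error> {
    Err(Error::Extract("connection reset by peer".into()))
}

// Hypothetical: download the whole wheel to disk, then unzip it.
// This plain request path is the one the retry middleware already covers.
async fn download_wheel(_url: &str) -> Result<Wheel, Error> {
    Ok(Wheel)
}

// The fallback itself: if streaming extraction fails, retry with a
// full download instead of surfacing the extract error immediately.
async fn fetch_wheel(url: &str) -> Result<Wheel, Error> {
    match stream_wheel(url).await {
        Ok(wheel) => Ok(wheel),
        Err(Error::Extract(reason)) => {
            eprintln!("streaming unzip failed ({reason}); falling back to a full download");
            download_wheel(url).await
        }
        Err(other) => Err(other),
    }
}
```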
That's interesting, the actual download code is really similar for those two methods. They both use a bytes_stream, etc... I guess I don't see what's different between those methods. The latter just streams the wheel as-is to disk then unzips it; the former unzips as it streams.
Maybe because the unzip error causes an early error return?
Happy to report that uv v0.2.26 runs smoothly for me; no more failures when downloading wheels.
I'm unable to reproduce on my end either; curious if this resolves others' issues as well.
I'll close this while I'm here — I suspect we've fixed it. Feel free to chime in if you encounter this still!
Thanks @messense !
@zanieb I am seeing similar errors in our CI, e.g.:
https://github.com/pola-rs/polars/actions/runs/10096006345/job/27917486333?pr=17870
error: Failed to download `torch==2.4.0+cpu`
Caused by: Failed to unzip wheel: torch-2.4.0+cpu-cp312-cp312-win_amd64.whl
Caused by: an upstream reader returned an error: an error occurred during transport: error decoding response body
Caused by: an error occurred during transport: error decoding response body
Caused by: error decoding response body
Caused by: request or response body error
Caused by: error reading a body from connection
Caused by: end of file before message length reached
This pops up sometimes; rerunning the workflow fixes it. It's probably something to do with the custom index (https://download.pytorch.org/whl/cpu) having some stability issues, but in this case I would expect a retry to fix the problem. However, retry behavior doesn't seem to be configurable for uv.
Any pointers are appreciated!
@stinodego -- Just confirming that you're on the most recent version of uv?
@charliermarsh These failures happened with uv 0.2.29 (you can check the link to our GitHub Actions to see some non-verbose logs). I'm pretty sure I've seen it on earlier versions as well.
Haven't seen it yet with 0.2.30, but I can post here if I do see it. That said, I don't believe 0.2.30 contains any fixes related to this issue.
If it helps, I can set our CI to verbose to get better logs on this issue?
@stinodego -- Yeah I wouldn't expect any change in 0.2.30. Verbose could be helpful because I'm trying to understand if we're retrying the download or not.
Interesting, that error indicates that we tried to download the wheel during resolution, which is also slightly confusing. That would mean we failed to fetch the metadata from the index and had to fall back to downloading the wheel itself.
Verbose could be helpful because I'm trying to understand if we're retrying the download or not.
I set our CI to verbose mode - will report back if I spot the error again.
I think https://github.com/astral-sh/uv/pull/5555 should fix this.
I'm running into transient network issues when installing packages via git. Looking at the logs, there don't seem to be any retries for failures of git clone operations. This is on v0.3.1 of uv. Is it possible to add retries here as well?
I've seen a regression in this recently. In the last few days, using the latest version of uv, I've started seeing:
⠼ defusedxml==0.7.1 error: Failed to download `defusedxml==0.7.1`
Caused by: request or response body error
Caused by: error reading a body from connection
Caused by: Connection reset by peer (os error 104)
I run it again and it's fine, but I thought uv was now retrying these low-level network errors?
It could of course just be my corporate network environment getting worse. Is it worth reporting a new issue?
Do you have more details on which phase and which index this happens with? I simulated some connection errors but could only trigger cases where the requests were retried.
All the errors were related to running uv pip compile, so it was only trying to collect metadata? I don't have any additional output beyond what I posted.
The index is https://pypi.org/, but the network involves a Palo Alto firewall that will be decrypting and encrypting traffic, and it seems occasionally this will just fail (either kill the connection or send an empty body).
I've seen a regression in this recently, in the last few days, using the latest version of uv,
I think I can confirm. Just opened https://github.com/astral-sh/uv/issues/8144 before I found this issue here.
Tested with uv 0.1.41
While looking for workarounds to https://github.com/astral-sh/uv/issues/3512, I experimented with using --no-cache. In our case, this actually increased the failure rate, as we use a local Artifactory-based PyPI mirror which appears to be rate limited. Using the same bash script as in the linked issue, but with --no-cache added and with significantly more packages, we began observing instances where the Artifactory server would be overwhelmed and uv would fail with connection reset by peer. Under pip, this was also reproducible, though it required a larger number of simultaneous processes (~80); but rather than failing outright, pip would retry connection failures. Please add (ideally configurable) support for retrying connection failures when downloading or querying indexes.