astral-sh / uv

An extremely fast Python package and project manager, written in Rust.
https://docs.astral.sh/uv
Apache License 2.0

Add support for retrying connection failures #3514

Closed yeswalrus closed 4 months ago

yeswalrus commented 6 months ago

Tested with uv 0.1.41

While looking for workarounds to https://github.com/astral-sh/uv/issues/3512, I experimented with using --no-cache. In our case, this actually increased the failure rate, as we use a local Artifactory-based PyPI mirror which appears to be rate limited. Using the same bash script as in the linked issue but with --no-cache added, and with significantly more packages, we began observing instances where the Artifactory server would be overwhelmed and uv would fail with connection reset by peer.

Under pip, this was also reproducible, though it required a larger number of simultaneous processes (~80); but rather than failing outright, pip would retry connection failures. Please add (ideally configurable) support for retrying connection failures when downloading or querying indexes.

charliermarsh commented 6 months ago

Can you share more information, e.g., a trace with --verbose to demonstrate what's failing? In general, we do already retry.

yeswalrus commented 6 months ago

This may be an issue with our Artifactory server having an artificially low rate limit or something, but I'm not entirely sure. Logs here:

Using Python 3.10.12 interpreter at: /usr/bin/python3
Creating virtualenv at: test_venvs/6c38fa92-edb8-4dc3-8c34-6a395504bd9d/
Activate with: source test_venvs/6c38fa92-edb8-4dc3-8c34-6a395504bd9d/bin/activate
Built 6 editables in 1.74s
Resolved 114 packages in 10.33s
error: Failed to download distributions
  Caused by: Failed to fetch wheel: plotly==4.12.0
  Caused by: Failed to extract archive
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

samypr100 commented 6 months ago

Potentially related to #3456

(worth noting that requests are already retried about 3 times by default)
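For context, a reqwest-retry setup is typically wired up roughly like the sketch below (illustrative only, not uv's actual base_client.rs; the function name here is made up):

```rust
use reqwest_middleware::ClientBuilder;
use reqwest_retry::{policies::ExponentialBackoff, RetryTransientMiddleware};

// Hypothetical helper showing the usual reqwest-retry wiring.
fn build_client() -> reqwest_middleware::ClientWithMiddleware {
    // Request-phase failures that the policy classifies as transient
    // (e.g. timeouts, connect errors, 5xx responses) are retried with
    // exponential backoff. This layer only wraps sending the request and
    // receiving the response headers; an error that happens later, while
    // the response body is being streamed, never reaches it.
    let policy = ExponentialBackoff::builder().build_with_max_retries(3);
    ClientBuilder::new(reqwest::Client::new())
        .with(RetryTransientMiddleware::new_with_policy(policy))
        .build()
}
```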

zanieb commented 6 months ago

We don't retry at the TCP layer, e.g. connection reset errors, afaik. I think we'll need to add custom middleware to retry these.

If someone could link or summarize the relevant details about what exactly is retried today that'd be really helpful.

samypr100 commented 6 months ago

Fair, I only did a quick look and it seemed it would retry based on reqwest-retry, but `is_connect` is likely not what I thought it was, and `io::ErrorKind::ConnectionReset` would need to be checked separately.

Relevant uv code is here https://github.com/astral-sh/uv/blob/main/crates/uv-client/src/base_client.rs#L169
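If the default classification really misses these, one option would be a custom strategy along these lines (a sketch against reqwest-retry's `RetryableStrategy` trait, not a claim about what uv ships today) that treats a connection reset found anywhere in the error's source chain as transient:

```rust
use reqwest_retry::{
    default_on_request_failure, default_on_request_success, Retryable, RetryableStrategy,
};

struct RetryConnectionResets;

impl RetryableStrategy for RetryConnectionResets {
    fn handle(
        &self,
        res: &Result<reqwest::Response, reqwest_middleware::Error>,
    ) -> Option<Retryable> {
        match res {
            Ok(response) => default_on_request_success(response),
            Err(err) => {
                // Walk the error's source chain looking for an io::Error with
                // kind ConnectionReset; fall back to the default otherwise.
                let mut source = std::error::Error::source(err);
                while let Some(cause) = source {
                    if let Some(io_err) = cause.downcast_ref::<std::io::Error>() {
                        if io_err.kind() == std::io::ErrorKind::ConnectionReset {
                            return Some(Retryable::Transient);
                        }
                    }
                    source = cause.source();
                }
                default_on_request_failure(err)
            }
        }
    }
}
```

Assuming a recent reqwest-retry, the strategy would then be attached with `RetryTransientMiddleware::new_with_policy_and_strategy` in place of the plain policy constructor.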

samypr100 commented 6 months ago

Looking at it again, it does seem like it would retry on resets, see retryable_strategy.rs#L192

yeswalrus commented 6 months ago

Complete (slightly obfuscated) log of one such instance captured with --verbose enabled, in case it helps with debugging. It's worth noting that this is a less serious issue: I only start seeing it reliably when I've got ~40 simultaneous venvs being created and --no-cache set, at least with my artificial benchmark, so this is something of an edge case. Still, having some ability to throttle or specify longer timeout windows might be helpful.

uv_err.log

zanieb commented 6 months ago

Thanks for the links @samypr100, cc @konstin ?

notatallshaw commented 5 months ago

FYI, I'm also seeing this same error semi-often in my work environment:

error: Failed to download `defusedxml==0.7.1`
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

I'm not sure of the details, but we have some kind of Palo Alto firewall that is almost certainly causing the issue; running the uv command again almost always succeeds.

I'm going to test setting the new UV_CONCURRENT_DOWNLOADS (https://github.com/astral-sh/uv/pull/3493) to 1 and see if that makes a difference in how often I see this error. Though pip also has issues in my work environment from time to time, so it may just reduce the frequency of issues.

charliermarsh commented 5 months ago

I assume those retries don't apply to streaming operations. Like, they apply to the initial request itself, but probably not the entirety of reading from the stream? We can fix this by adding our own retry around download operations.
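Conceptually, something like this sketch (hypothetical helper names, and simplified to a buffered `bytes()` download rather than uv's actual streaming extraction):

```rust
use std::time::Duration;

// Hypothetical helper, not uv's code: reading the body happens here, so a
// mid-stream "connection reset by peer" is reported by `bytes()`, not `send()`.
async fn try_download(client: &reqwest::Client, url: &str) -> reqwest::Result<bytes::Bytes> {
    client.get(url).send().await?.error_for_status()?.bytes().await
}

// Wrap the entire download, including reading the body, in its own retry loop,
// since the middleware only covers the request/response-header phase.
async fn download_with_retries(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> reqwest::Result<bytes::Bytes> {
    let mut attempt = 0;
    loop {
        match try_download(client, url).await {
            Ok(body) => return Ok(body),
            Err(_) if attempt < max_retries => {
                attempt += 1;
                // Simple exponential backoff between attempts.
                tokio::time::sleep(Duration::from_millis(250 * 2u64.pow(attempt))).await;
            }
            Err(err) => return Err(err),
        }
    }
}
```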

aybidi commented 4 months ago

@charliermarsh is there any work planned around this issue? We reconfigured UV_CONCURRENT_DOWNLOADS to avoid running into this issue as frequently. While it helped, our CI builds still run into connection failures here and there.

zanieb commented 4 months ago

Hi @aybidi — we think this is important to address but haven't started working on a fix yet. If anyone is willing to investigate and contribute in the meantime I'm happy to review.

notatallshaw commented 4 months ago

I'm going to test setting the new UV_CONCURRENT_DOWNLOADS (#3493) to 1 and see if that makes a difference in how often I see this error.

FWIW, for my use case setting this did not help, but one day the firewall just started behaving better and I've not seen the error again.

kujenga commented 4 months ago

Here's an example of a similar class of error that I observed in a CI build that would be nice to have uv retry automatically:

error: Failed to download distributions
  Caused by: Failed to fetch wheel: torch==2.1.0
  Caused by: Failed to extract archive
  Caused by: request or response body error: error reading a body from connection: stream error received: unspecific protocol error detected
  Caused by: error reading a body from connection: stream error received: unspecific protocol error detected
  Caused by: stream error received: unspecific protocol error detected

messense commented 4 months ago

I'm also hitting this issue. It's rather painful when going through a corporate proxy that may have concurrent connection limits, and retrying manually doesn't always work in CI.

konstin commented 4 months ago

@messense With the latest release, what's the error message? Does it fail with retries or without?

messense commented 4 months ago

I'm using uv 0.2.23; the error message is:

error: Failed to prepare distributions
  Caused by: Failed to fetch wheel: jaxlib==0.4.16+cuda12.cudnn89
  Caused by: Failed to extract archive
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)
messense commented 4 months ago

Judging from the code, reqwest-retry simply won't retry body errors.

charliermarsh commented 4 months ago

Thanks, good catch! Let's fix that.

benjamin-hodgson commented 4 months ago

Sounds like the fix would need to be on uv's end: https://github.com/TrueLayer/reqwest-middleware/issues/47#issuecomment-1170955570

zanieb commented 4 months ago

I put up a patch at #4960 if anyone wants to give it a try.

benjamin-hodgson commented 4 months ago

I'll take this build for a spin on my testing farm. The networking issues I was suffering from were infrequent, so I won't be able to say for certain, but if it shows up I'll let you know.

zanieb commented 4 months ago

We'll probably release it soon too! Thanks though :)

benjamin-hodgson commented 4 months ago

That build consistently gets stack overflow errors for me. I'm on Windows.

> uv.exe pip install --verbose cowsay

DEBUG uv 0.2.23
DEBUG Searching for Python interpreter in system path or `py` launcher
DEBUG Found cpython 3.12.3 at `D:\CosmosAnalytics\private\cosmos\Scope\pyscope\.venv\Scripts\python.exe` (virtual environment)
DEBUG Using Python 3.12.3 environment at .venv\Scripts\python.exe
DEBUG Acquired lock for `.venv`
DEBUG At least one requirement is not satisfied: cowsay
DEBUG Using request timeout of 30s
DEBUG Solving with installed Python version: 3.12.3
DEBUG Adding direct dependency: cowsay*
DEBUG No cache entry for: https://pypi.org/simple/cowsay/
DEBUG Searching for a compatible version of cowsay (*)
DEBUG Selecting: cowsay==6.1 (cowsay-6.1-py3-none-any.whl)
DEBUG No cache entry for: https://files.pythonhosted.org/packages/f1/13/63c0a02c44024ee16f664e0b36eefeb22d54e93531630bd99e237986f534/cowsay-6.1-py3-none-any.whl.metadata

thread 'main' has overflowed its stack

charliermarsh commented 4 months ago

Are you using a release build?

benjamin-hodgson commented 4 months ago

I just grabbed the artifact from the CI build; is that not what I was meant to do?

benjamin-hodgson commented 4 months ago

Found this, will try with that env var

messense commented 4 months ago

@zanieb #4960 does not work for me, it still gives the same error:

installing uv...
done! ✨ 🌟 ✨
  installed package uv 0.2.24, installed using Python 3.10.12
...
...
Resolved 258 packages in 34.54s
error: Failed to prepare distributions
  Caused by: Failed to fetch wheel: nvidia-cudnn-cu12==8.9.2.26
  Caused by: Failed to extract archive
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

zanieb commented 4 months ago

That's so tragic 😭 okay I'll dig deeper. I'll need to reproduce it with a fake server or something. If anyone has time to poke at a reproduction, let me know!

messense commented 4 months ago

If I add a `panic!()` inside https://github.com/astral-sh/uv/blob/23c6cd774b466924d02de23add7101bcaa7b7c3e/crates/uv-client/src/base_client.rs#L291

and then inject a connection reset error using `sudo iptables -A OUTPUT -p tcp --dport 3128 -j REJECT --reject-with tcp-reset` (our proxy is running on port 3128), the panic never happens; only the Connection reset by peer error message is printed. So my guess is that reqwest-retry or reqwest-middleware can't handle this kind of retry strategy at the moment.

samypr100 commented 4 months ago

^ Yeah, I was able to achieve similar results using @hauntsaninja's https://github.com/hauntsaninja/nginx_pypi_cache and @messense's approach. I also tried the tc Linux utility, dropping/corrupting packets, which gave me other types of unrelated errors such as BufErrors 😂

messense commented 4 months ago

I think the issue is with stream_wheel: if I change https://github.com/astral-sh/uv/blob/9a44bc1d3567e0a2ba31675bc35c50392fc2f5ad/crates/uv-distribution/src/distribution_database.rs#L211 to remove the if and unconditionally try download_wheel when stream_wheel fails with an Extract error, uv correctly retries:

error: Failed to prepare distributions
  Caused by: Failed to fetch wheel: beautifulsoup4==4.12.3
  Caused by: Request failed after 3 retries
  Caused by: error sending request for url (https://files.pythonhosted.org/packages/b1/fe/e8c672695b37eecc5cbf43e1d0638d88d66ba3a44c4d321c796f4e59167f/beautifulsoup4-4.12.3-py3-none-any.whl)
  Caused by: client error (Connect)
  Caused by: tcp connect error: Connection refused (os error 111)
  Caused by: Connection refused (os error 111)

So my guess is that reqwest-retry does not support retrying streaming responses? Can we have a config option / env var to force download_wheel?

charliermarsh commented 4 months ago

That's interesting; the actual download code is really similar for those two methods. They both use a bytes_stream, etc.

charliermarsh commented 4 months ago

I guess I don't see what's different between those methods. The latter just streams the wheel as-is to disk then unzips it; the former unzips as it streams.

messense commented 4 months ago

Maybe because the unzip error causes an early error return?
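That would explain it: once the mid-stream network failure is wrapped in an extract error, a retry check that only matches the top-level error variant never fires. A hand-wavy illustration with made-up types (not uv's actual error enums):

```rust
use std::io;

// Made-up error type for illustration only.
enum FetchError {
    // A failure reported directly by the HTTP layer.
    Network(io::Error),
    // A failure reported by the unzip-while-streaming path, which may be
    // wrapping the very same network failure.
    Extract(Box<dyn std::error::Error + Send + Sync>),
}

fn is_retryable(err: &FetchError) -> bool {
    match err {
        // The bare network failure is recognized as transient...
        FetchError::Network(e) => e.kind() == io::ErrorKind::ConnectionReset,
        // ...but the same reset hidden inside an Extract error is not, unless
        // the check also walks the wrapped error's source chain.
        FetchError::Extract(_) => false,
    }
}
```

That would also be consistent with the retries you saw once the fallback to download_wheel kicked in.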

messense commented 4 months ago

Happy to report that uv v0.2.26 runs smoothly for me; no more failures when downloading wheels.

samypr100 commented 4 months ago

I'm unable to reproduce on my end either; curious whether this resolves others' issues as well.

zanieb commented 4 months ago

I'll close this while I'm here — I suspect we've fixed it. Feel free to chime in if you encounter this still!

Thanks @messense !

stinodego commented 3 months ago

@zanieb I am seeing similar errors in our CI, e.g.:

https://github.com/pola-rs/polars/actions/runs/10096006345/job/27917486333?pr=17870

error: Failed to download `torch==2.4.0+cpu`
  Caused by: Failed to unzip wheel: torch-2.4.0+cpu-cp312-cp312-win_amd64.whl
  Caused by: an upstream reader returned an error: an error occurred during transport: error decoding response body
  Caused by: an error occurred during transport: error decoding response body
  Caused by: error decoding response body
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: end of file before message length reached

This pops up sometimes; rerunning the workflow fixes it. It's probably something to do with the custom index (https://download.pytorch.org/whl/cpu) having some stability issues, but in this case I would expect a retry to fix it. However, retry behavior doesn't seem to be configurable for uv.

Any pointers are appreciated!

charliermarsh commented 3 months ago

@stinodego -- Just confirming that you're on the most recent version of uv?

stinodego commented 3 months ago

@charliermarsh These failures happened with uv 0.2.29 (you can check the link to our GitHub Actions to see some non-verbose logs). I'm pretty sure I've seen it on earlier versions as well.

Haven't seen it yet with 0.2.30, but I can post here if I do see it. But I don't believe 0.2.30 contains any fixes related to this issue.

If it helps, I can set our CI to verbose to get better logs on this issue?

charliermarsh commented 3 months ago

@stinodego -- Yeah I wouldn't expect any change in 0.2.30. Verbose could be helpful because I'm trying to understand if we're retrying the download or not.

charliermarsh commented 3 months ago

Interesting, that error indicates that we tried to download the wheel during resolution, which is also slightly confusing. That would mean we failed to fetch the metadata from the index and had to fall back to downloading the wheel itself.

stinodego commented 3 months ago

Verbose could be helpful because I'm trying to understand if we're retrying the download or not.

I set our CI to verbose mode - will report back if I spot the error again.

konstin commented 3 months ago

I think https://github.com/astral-sh/uv/pull/5555 should fix this.

laurence-kobold commented 2 months ago

I'm running into transient network issues when installing packages via git. Looking at the logs, there don't seem to be any retries for failures of git clone operations. This is on v0.3.1 of uv. Is it possible to add retries here as well?

notatallshaw-gts commented 2 months ago

I've seen a regression in this recently. In the last few days, using the latest version of uv, I've started seeing:

error: Failed to download `defusedxml==0.7.1`
  Caused by: request or response body error
  Caused by: error reading a body from connection
  Caused by: Connection reset by peer (os error 104)

I run it again and it's fine, but I thought uv was now retrying these low-level network errors?

It could of course just be my corporate network environment getting worse. Is it worth reporting a new issue?

konstin commented 2 months ago

Do you have more details on which phase and which index this happens with? I simulated some connection errors, but could only trigger cases where retries happened.

notatallshaw-gts commented 2 months ago

All the errors were related to running uv pip compile, so it was only trying to collect metadata? I don't have any additional output beyond what I posted.

The index is https://pypi.org/, but the network involves a Palo Alto firewall that decrypts and re-encrypts traffic, and it seems this occasionally just fails (either killing the connection or sending an empty body).

jgehrcke commented 1 month ago

I've seen a regression in this recently. In the last few days, using the latest version of uv,

I think I can confirm. Just opened https://github.com/astral-sh/uv/issues/8144 before I found this issue here.