cloudflare / workers-sdk

⛅️ Home to Wrangler, the CLI for Cloudflare Workers®
https://developers.cloudflare.com/workers/

🐛 BUG: recently added wrangler retries doesn't seem to help with flakiness #6913

Open jiri-prokop-pb opened 1 week ago

jiri-prokop-pb commented 1 week ago

Which Cloudflare product(s) does this pertain to?

Wrangler

What version(s) of the tool(s) are you using?

3.79.0 [Wrangler]

What version of Node are you using?

20.16.0

What operating system and version are you using?

macOS Sequoia 15.0 (24A335) & Ubuntu Jammy (22.04.5 LTS) on CI

Describe the Bug

Observed behavior

We had CI-level retries for wrangler-related tasks for a while, but we recently noticed that "native" retries were added in v3.79.0 with https://github.com/cloudflare/workers-sdk/pull/6801. We upgraded to that version and removed our CI retry logic, only to find that our CI is flaky again due to wrangler-related errors.

We reverted to our CI retry logic, but that's not ideal. We would welcome the built-in retries working, as it would simplify our configuration.
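For reference, our workaround is just a retry loop around the deploy command at the CI level. A rough sketch of the idea (the attempt count, delay, and `npx wrangler deploy` invocation are illustrative, not our exact setup):

```sh
# Hypothetical CI step: retry `wrangler deploy` a few times before failing the job.
# MAX_ATTEMPTS and SLEEP_SECONDS are illustrative values, not our real configuration.
MAX_ATTEMPTS=3
SLEEP_SECONDS=10

attempt=1
until npx wrangler deploy; do
  if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
    echo "wrangler deploy failed after $MAX_ATTEMPTS attempts" >&2
    exit 1
  fi
  echo "wrangler deploy failed (attempt $attempt), retrying in ${SLEEP_SECONDS}s..." >&2
  attempt=$((attempt + 1))
  sleep "$SLEEP_SECONDS"
done
```

Having wrangler handle this (and log it) internally would let us drop that wrapper entirely.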

Expected behavior

Flakiness should be low with the "native"/built-in retries. Ideally, we wouldn't get random wrangler-related failures at all when everything is correct on our side.

It would also be great to have some logging for the retries. Right now it's hard to tell whether the retry logic does anything at all; from our side it just looks like the command ran once and failed immediately.

Steps to reproduce

N/A

It's a random error that happens semi-regularly on our CI if we don't do retries on our side.

Please provide a link to a minimal reproduction

No response

Please provide any relevant error logs

  > wrangler deploy

   ⛅️ wrangler 3.79.0
  -------------------

  Total Upload: 296.15 KiB / gzip: 70.73 KiB
  Your worker has access to the following bindings:
  - Vars:
    - ENVIRONMENT: "dev"

  ✘ [ERROR] A request to the Cloudflare API (/accounts/xxxyyyzzz/workers/scripts/aaabbbccc/deployments) failed.

    workers.api.error.unknown [code: 10013]

emily-shen commented 1 week ago

Hiya, I don't think retries were actually added for the endpoint you've hit there 😅

In terms of general flakiness for deploys, we're definitely aware of it and working on it. Wrangler-level retries unfortunately can't be the whole solution, because we've seen some errors get worse when we spam the API with more retries.

Also agreed that the error message is quite unhelpful. If you set WRANGLER_LOG="debug" when running your command, you can see all the requests/responses and which call fails, but it probably won't give you a more detailed error.
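e.g. something like this (assuming wrangler is invoked via npx; adapt to however your CI runs it):

```sh
# Turn on debug logging for one deploy to see each API request/response wrangler makes.
WRANGLER_LOG="debug" npx wrangler deploy
```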

If you can share your account id that could help us identify the underlying api issue :)

jiri-prokop-pb commented 4 days ago

@emily-shen thanks, ok, got it. My main problem was that it's not clear whether any retries are actually happening, and I believe it would be helpful to print at least some basic info by default. Anyway, we can try with WRANGLER_LOG="debug" as well.

Regarding the account id, we don't want to share it publicly here, so my colleague will contact you through your official support channel.

Do you have any ETA for when we can expect more stable deploys? Once that lands we could plan some simplification on our side, as right now we have workarounds in place to keep CI stable.