cloudflare / workers-sdk

โ›…๏ธ Home to Wrangler, the CLI for Cloudflare Workersยฎ
https://developers.cloudflare.com/workers/
Apache License 2.0

๐Ÿ› BUG: "[ERROR] Error in ProxyController: Error inside ProxyWorker" #4562

Closed ekojsalim closed 9 months ago

ekojsalim commented 11 months ago

Which Cloudflare product(s) does this pertain to?

Wrangler core, Miniflare

What version(s) of the tool(s) are you using?

3.19.0 [Wrangler]

What version of Node are you using?

20.10.0

What operating system are you using?

Linux

Describe the Bug

Repeatedly calling an endpoint calling a Durable Object will result in this error every other request:

✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

  {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

I'm not sure if the cause is actually repeatedly calling a DO. In the DevTools, requests to the DO all appear to be successful.

Downgrading to 3.18.0 fixes this issue, so this is possibly a regression involving the startDevWorker refactor.

Please provide a link to a minimal reproduction

No response

Please provide any relevant error logs

No response

KrisBraun commented 11 months ago

Also seeing this error on ~50% of reloads using Remix dev.

KrisBraun commented 11 months ago

FYI @mrbbot (potentially related to startDevWorker)

beanow-at-crabnebula commented 11 months ago

I've started seeing test flakes in tests that perform lots of unstable_dev starts. No DO bindings involved (1 KV and 1 R2 binding, though).

Reverting to 3.18.0 indeed solves the issue.

Node 18, Linux, if that matters.

ShravanSunder commented 10 months ago

I'm also getting this error often; it started recently with the new version.

pvarsh commented 10 months ago

Getting this error ('Network connection lost') and { errno: -32, code: 'EPIPE', syscall: 'write' } on a test that used to work. About half the time it's one error and half the time the other, but consecutive runs tend to have the same error. This started happening over the last few days.

irekjozwiak commented 10 months ago

I'm getting this on ~70% of reloads on certain workers on 3.22.1.

matthewjosephtaylor commented 10 months ago

I too am seeing this. I'm using DurableObjects with WebSockets.

I suspect it might be related to the DO going into hibernation and losing the WebSocket connection.

For me it isn't 'every other request' but rather when I leave the WebSocket open and connected without traffic for a period (at least that's what it 'feels' like).

Will revert to 3.18.0 to see if that fixes it.

For the record I am on 3.22.1 currently

FYI reverting to 3.18 came with other bugs and issues that were even worse. I note that 3.22.2 dropped this morning and I'm excited to see if it helps... 🤞

Sadly, still seeing the same issue after about 30s of inactivity on the websocket in 3.22.2 :(

I'm considering pushing some 'ping' messages across just to keep the DO from sleeping, which I still suspect is the cause, but in a way it's acting as a nice chaos monkey while developing...

bryanlatten commented 10 months ago

+1 seeing this semi-regularly with very small local load, 3-4 parallel requests

CameronWhiteside commented 10 months ago

+1

beanow-at-crabnebula commented 10 months ago

Just adding a note that there are now critical CVEs against the only workaround, downgrading to wrangler 3.18: https://github.com/cloudflare/workers-sdk/security

Upgrading to the latest wrangler also caused Jest to find leaking request handles. This makes using unstable_dev for integration tests, as documented, very brittle.
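
For reference, the unstable_dev test pattern in question looks roughly like the sketch below (a minimal example assuming Vitest and an entry point at src/index.ts; the file names are illustrative). The key detail is that stop() must always run, otherwise the dev server can keep handles open, which is presumably what Jest reports as leaked handles:

import { unstable_dev } from "wrangler";
import type { UnstableDevWorker } from "wrangler";
import { describe, it, expect, beforeAll, afterAll } from "vitest";

describe("worker", () => {
  let worker: UnstableDevWorker;

  beforeAll(async () => {
    // Start one local dev server for the whole suite; every extra
    // start/stop cycle is another chance to hit the flake described above.
    worker = await unstable_dev("src/index.ts", {
      experimental: { disableExperimentalWarning: true },
    });
  });

  afterAll(async () => {
    // Always stop the worker, even on failure, so no request handles leak.
    await worker.stop();
  });

  it("responds", async () => {
    const resp = await worker.fetch();
    expect(resp.status).toBe(200);
  });
});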

admah commented 10 months ago

@ekojsalim (and others) thanks for reporting this. Could you help provide us with a way to reproduce this error message?

This will help us better debug what actually may be going on here. There are a couple of changes that were made within Wrangler recently that could be a culprit here. Thanks!

bryanlatten commented 10 months ago

At the risk of being the wrong signal here: I see more failures when a CORS (OPTIONS) request is happening. The app has access to Queues and KV, not DO like others are using above.

[Screenshot: 2024-01-10 at 2:00:56 PM]

ethan-tqa commented 9 months ago

@admah even the Counter example in the DO docs has this issue, with any wrangler version above 3.18. It does not require any concurrent requests. I can reliably reproduce the issue using hand-triggered HTTP requests if they are less than, say, 3 seconds apart. And the behavior is very consistent: always 1 working request followed by 1 broken request, and repeat.
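
For anyone trying to reproduce this, the Counter example being referred to is roughly the following (paraphrased from the Cloudflare Durable Objects docs; the binding name COUNTER and the routing are illustrative, and the types assume @cloudflare/workers-types):

// Durable Object: keeps a single counter value in durable storage.
export class Counter {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    let value = (await this.state.storage.get<number>("value")) ?? 0;

    if (url.pathname === "/increment") value += 1;
    else if (url.pathname === "/decrement") value -= 1;

    await this.state.storage.put("value", value);
    return new Response(String(value));
  }
}

// Worker entry point: routes each request to a Counter instance by name.
export default {
  async fetch(request: Request, env: { COUNTER: DurableObjectNamespace }): Promise<Response> {
    const name = new URL(request.url).searchParams.get("name") ?? "A";
    const id = env.COUNTER.idFromName(name);
    return env.COUNTER.get(id).fetch(request);
  },
};

Repeatedly curling /increment against a worker like this is the kind of request pattern the failures above describe.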

matthewjosephtaylor commented 9 months ago

Work-around:

I'm using DurableObjects with WebSockets.

As long as I have the client continually send a 'ping' message across the socket every 5 seconds, I'm no longer experiencing the issue.

I STRONGLY suspect the issue is with wrangler hibernating the worker. If I never let it rest, I don't get the error.
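
A minimal sketch of that client-side keepalive (the 'ping' message shape and the 5-second interval are arbitrary choices, not part of any API; the Durable Object just needs to ignore or answer the message):

// Open a WebSocket and send a small ping every 5 seconds so the local dev
// Durable Object never sits idle long enough to be torn down.
function connectWithKeepalive(url: string): WebSocket {
  const socket = new WebSocket(url);
  let timer: ReturnType<typeof setInterval> | undefined;

  socket.addEventListener("open", () => {
    timer = setInterval(() => {
      if (socket.readyState === WebSocket.OPEN) {
        socket.send(JSON.stringify({ type: "ping" }));
      }
    }, 5_000);
  });

  socket.addEventListener("close", () => {
    if (timer !== undefined) clearInterval(timer);
  });

  return socket;
}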

aroman commented 9 months ago

Work-around:

I'm using DurableObjects with WebSockets.

As long as I have the client continually send a 'ping' message across the socket every 5 seconds, I'm no longer experiencing the issue.

I STRONGLY suspect the issue is with wrangler hibernating the worker. If I never let it rest, I don't get the error.

If you're talking about the WebSocket Hibernation API, we don't use those APIs in our project (we use WebSockets + DOs but not the hibernation APIs), and we are still hit hard by this issue when developing locally and running tests.

Separately, for CF folks (cc: @lrapoport-cf), I'd like to underscore the point @beanow-at-crabnebula made above. At my company, we've been unable to upgrade wrangler since early November, due to one regression after another (3 by my count: this one, #4496, and one caused by #4535 that I can't seem to find). Of these, only the latter was documented in the release notes.

I appreciate how frequently wrangler releases new versions, but my feeling is that something has got to change process-wise to mitigate the frequency and severity of breaking regressions, or at least to document them in release notes so folks are aware of the tradeoffs when upgrading. I appreciate how active and interactive with customers your team is on the CF GitHub repos; that level of communication goes a long way.

admah commented 9 months ago

@aroman that is a good callout. We are constantly reviewing our processes to see what we can do to mitigate these types of incidents, because we do understand how disruptive they are.

For this issue (and any others related to the previous startDevWorker work), we have prioritized them and are working to have them resolved ASAP.

lrapoport-cf commented 9 months ago

hi @aroman and all, apologies for the delayed action on this issue. our team got pulled into high priority internal work over the last several weeks and we fell behind on our regular workers-sdk maintenance. i appreciate you calling out our engagement with the community as a positive -- we strive to keep you all informed as much as possible. this is a good reminder for us to continue to communicate out any internal discussion we have about particular issues so you all are always up to date on their status.

in terms of concrete next steps, we have prioritized this issue for this week and assigned someone to address it. and while we're also taking strides to reduce the number of regressions by increasing test coverage, going forward we'll be prioritizing fixing any regressions that do slip through as quickly as possible -- we have also just added a regression label so that items such as these get highlighted, please feel free to use it :)

thanks for raising this feedback!

filipStoklasa23 commented 9 months ago

We are experiencing this issue heavily in our Azure CI during e2e (parallel Cypress) test runs.

Downgrading to 3.18.0 does not help.

[Screenshot: 2024-01-24 at 10:52:25]

RamIdeas commented 9 months ago

Hi @ekojsalim! Could you please describe the specific reproduction steps? Also, since you mentioned that your "requests to the DO appear to be all successful", could you clarify whether this is indeed an error causing your local development environment to behave incorrectly, or whether it is a noisy log message in your terminal?

I tried this week to reproduce the issue on Ubuntu but couldn't. I used the Counter example from the Cloudflare docs, ran npx wrangler dev, and in another terminal ran curl http://localhost:8787/increment?name=A repeatedly. The requests were all successful and the output from wrangler had no extraneous logs, just the request logs with 200 status codes. I also waited 10+ seconds between some requests to test whether Durable Object hibernation was triggering the issue.

For others in the comments reporting similar but not exactly the same issues, could you please open new issues so that we can gather the specifics of your situation? Our issue template includes fields for your wrangler and OS versions. Please try to provide reproduction steps to help us fix the issue for you. I am committed to improving the local developer experience, so please provide as much detail as possible.

ekojsalim commented 9 months ago

@RamIdeas We actually migrated off Cloudflare Workers not long after this issue, so my memory on this issue may not be the best.

could you clarify whether this is indeed an error causing your local development environment to behave incorrectly or whether it is a noisy log message in your terminal?

It's not just a noisy log message. Every other HTTP request to the endpoint errors out with that error message.

I tried to reproduce this issue now using a stripped-down version of our old code (repeatedly calling a DO), but couldn't reproduce it. It's possible that the root issue lies in other interactions, but I can't recall what exactly.

JamesDelfini commented 9 months ago

At the risk of being the wrong signal here: I see more failures when a CORS (OPTIONS) request is happening. The app has access to Queues and KV, not DO like others are using above.

[Screenshot: 2024-01-10 at 2:00:56 PM]

Confirmed on my development environment when using CORS, but in the deployed environment it works fine.

matthewjosephtaylor commented 9 months ago

@RamIdeas Here is my env. I'm using DOs with WebSockets in tandem with normal REST calls, all handled via the DO.

Here is my system info (apologies for the franken-template):

What version(s) of the tool(s) are you using? "wrangler": "^3.24.0"

What version of Node are you using? v20.9.0

What operating system are you using? MacBook Pro, Darwin Kernel Version 23.0.0

You mentioned waiting 10s between requests. I suggest opening that up to a minute. I never timed it, but 10s feels a bit too short.

On the wild chance I'm 'doing it wrong', here is how I'm initializing the DO from my main index.ts file, since the docs around this are quite confusing/inconsistent.

// In order for the workers runtime to find the class that implements
// our Durable Object namespace, we must export it from the root module.
export { AppStateDurableObject } from "./state/durable/AppStateDurableObject";

export default {
  fetch(req: Request, env: Env, ctx: ExecutionContext) {
    const id = env.APP_STATE.idFromName("app-state");
    const stub = env.APP_STATE.get(id);
    return stub.fetch(req);
  },
};

RamIdeas commented 9 months ago

@matthewjosephtaylor that worker code looks totally fine, but I would need to see the Durable Object code, as it essentially handles the full request/response.

I used 10s because that is the timeout workerd enforces before hibernating an idle Durable Object (in reality I waited about 15s to be sure). If you're experiencing issues specifically related to waiting a minute between requests, can you please open a separate issue with repro steps so we can track it there? Specifically, can you please include:

A git repo I can clone and run a test suite on would be optimal, but even a git repo with a README of step-by-step instructions would be great. This should help us be able to reproduce the bug and diagnose and address the issue faster. Thanks!

matthewjosephtaylor commented 9 months ago

@RamIdeas I note you have a potential fix drafted 🤞 that it works :)

Will open a new bug if your fix doesn't work.

FYI I'd also be happy enough with some flag to disable hibernating the DO in dev. I like that wrangler dev does it as it exercises the DO well during development, but since it is effectively broken for me, it hurts more than it helps at the moment.

Cherry commented 9 months ago

We hit this today with one of our projects. It's a relatively simple Worker, no bindings other than the Workers Sites KV binding. Our wrangler.toml is as simple as this:

name = "xxx"
account_id = "xxx"
workers_dev = true
compatibility_date = "2023-06-29"
main = "./src/worker.ts"
compatibility_flags = ["nodejs_compat"]

rules = [
  { type = "Text", globs = ["**/*.svg"], fallthrough = true }
]

[site]
bucket = "./public-dist"

It wasn't happening for everyone on the team, but at least one person was experiencing this error when hitting an endpoint multiple times in some environments. It didn't happen consistently, so creating a full repro isn't viable unfortunately, but every so often they'd see it, and it would break their implementation, causing the request to not respond with anything.

On reverting to 3.18.0, the issues subsided and they were able to continue working without issue.

RamIdeas commented 9 months ago

Hi @matthewjosephtaylor and @Cherry! I think I managed to narrow this down to requests being fired off to the previous UserWorker just before a new UserWorker is started (after a reload caused by config or source file changes).

I have a potential fix in #4867 – please try out the prerelease build and let me know if you still experience issues.

@matthewjosephtaylor thanks for confirming the initial fix on that branch solved your issue. I've since made more changes to further address the issue. Would appreciate you trying again too.
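
Until a fix lands, one way for test suites to tolerate the brief window where a request races a worker reload is a small retry wrapper around fetch. This is only a client/test-side mitigation sketch, not part of wrangler, and it is only safe for requests whose bodies can be re-sent:

// Retry a request a few times when the local dev server drops the connection
// or answers 503 during a reload window. Attempt count and backoff are arbitrary.
async function fetchWithRetry(
  input: RequestInfo,
  init?: RequestInit,
  attempts = 3,
): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    try {
      const response = await fetch(input, init);
      if (response.status !== 503 || attempt >= attempts) return response;
    } catch (error) {
      if (attempt >= attempts) throw error;
    }
    // Back off briefly before retrying, giving the new UserWorker time to start.
    await new Promise((resolve) => setTimeout(resolve, 250 * attempt));
  }
}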

RamIdeas commented 9 months ago

FYI I'd also be happy enough with some flag to disable hibernating the DO in dev. I like that wrangler dev does it as it exercises the DO well during development, but since it is effectively broken for me, it hurts more than it helps at the moment.

Thanks for the feature request. We'll consider exposing it via wrangler in the current and upcoming work on improving the local dev servers.

Notably, hibernation occurs after 10s and, if this issue wasn't being seen until after about a minute as other comments imply, then I don't think hibernation was relevant to the issue in question. Also, the Durable Object is going to reload when the UserWorker reloads anyway, so you will definitely want to implement logic to recover from Durable Object reloads even in local dev.
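
As a sketch of what such recovery logic can look like on the client (assuming a browser-style WebSocket; the backoff numbers are arbitrary):

// Reconnect whenever the socket closes, e.g. because the local Durable Object
// was reloaded alongside the UserWorker, using a capped exponential backoff.
function connectWithReconnect(
  url: string,
  onMessage: (event: MessageEvent) => void,
  attempt = 0,
): void {
  const socket = new WebSocket(url);

  socket.addEventListener("open", () => {
    attempt = 0; // connected again, so reset the backoff
  });
  socket.addEventListener("message", onMessage);
  socket.addEventListener("close", () => {
    const delay = Math.min(10_000, 500 * 2 ** attempt);
    setTimeout(() => connectWithReconnect(url, onMessage, attempt + 1), delay);
  });
}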

nora-soderlund commented 9 months ago

@RamIdeas I just tested your PR; it does not fix it for me, unfortunately. I have the same issue as everyone else here (fetches failing during reloads, etc.), and downgrading to 3.18.0 fixes it for me.

▲ [WARNING] Failed to get worker definitions TypeError: fetch failed

      at fetch (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:17032:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async getRegisteredWorkers (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:127052:22)
      at async getBoundRegisteredWorkers (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:127071:29) {
    cause: ConnectTimeoutError: Connect Timeout Error
        at onConnectTimeout (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:5865:29)
        at <redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:5821:51
        at Immediate._onImmediate (<redacted>/node_modules/.pnpm/@prerelease-registry.devprod.cloudflare.dev+workers-sdk+runs+7766885793@npm-package-wrangler-4867/node_modules/wrangler/wrangler-dist/cli.js:5853:13)
        at process.processImmediate (node:internal/timers:476:21) {
      code: 'UND_ERR_CONNECT_TIMEOUT'
    }
  }

I don't know how much info I can give that will be helpful, other than that this project is running Remix. I will try, from my end, to narrow down the issue in workers-sdk.

I did some digging the other day but didn't get very far; I'll see if the changes in your PR guide me a bit better this time.

matthewjosephtaylor commented 9 months ago

@RamIdeas your fix #4867 improved the situation quite a bit, but now I eventually get:

✘ [ERROR] *** Received signal #11: Segmentation fault: 11

  stack:

(no stack trace in console)

After I get the segfault, wrangler is effectively dead and I have to restart it.

This happens after several minutes, which is a remarkable improvement on what it was before, when it would basically die as soon as hibernation set in.

For the record this is what I tested with:

npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/7766885793/npm-package-wrangler-4867 dev --port 8787 ./src/index.ts 

RamIdeas commented 9 months ago

Hi @nora-soderlund, that looks like an issue with the dev registry (used for local service bindings). Although it's probably related to worker reloads, can you open another issue so that we can track and fix it independently? (We may already have a PR up that fixes it.)

@matthewjosephtaylor thanks for confirming the PR makes a marked improvement. That segfault is going to be a tough one to debug. Can you open another issue so that we can debug it in isolation? I may need to provide you with a debug build of workerd and instructions to use it to get any useful leads, and it would be easier to follow if we had a separate issue. Please feel free to assign me or just tag me :)

Cherry commented 9 months ago

Sadly #4867 did not solve it for one of my team members.

It's a long-running worker, run with wrangler dev without any edits being made, but now, instead of seeing the ProxyController error, they just receive a 503. It's not consistent, and it's not reliably reproducible (it seems to happen mostly when hitting the same endpoint a lot), but reverting them to 3.18.0 is still our best resolution.

matthewjosephtaylor commented 9 months ago

@Cherry I opened #4995 in relation to the fix in #4867 helping but not solving my issues (a new segfault error was exposed). Perhaps the issues you are continuing to face are related as well, especially if the 503s become consistent once they start happening.

Cherry commented 9 months ago

Thanks @matthewjosephtaylor. They unfortunately aren't consistent for one of my team. They can send 100 requests, 98 are fine, and then 2 of them throw the new 503 introduced in #4867 😔

RamIdeas commented 9 months ago

@matthewjosephtaylor thanks, I'll follow up in that issue!

@Cherry it's helpful to know that they're now receiving these new 503s. That implies they're non-GET/HEAD requests (presumably because requests with bodies can't be transparently retried). Can you tell me more about the requests that are failing? Are they long-running requests? Do they have large bodies? Something is making them flaky to proxy to the UserWorker and, since they're workerd -> workers requests, that shouldn't be the case. Can you provide more info on their OS version/Node version, or anything else to go on?

Cherry commented 9 months ago

@RamIdeas thanks for the follow-up. While I can't find a reliable repro, here's about as much info as I've been able to gather:

aroman commented 9 months ago

We've been getting the occasional 503 error (very annoying, as it causes our tests to be flaky) for a long time, even on version 3.11.0. It seems to only occur under load and/or with a high degree of concurrency to the dev server, which is what our Playwright tests do. Reducing the total number of concurrent requests does reduce the frequency of this issue, but doesn't totally solve it.

StrandedKitty commented 9 months ago

I am getting this issue when doing several consecutive POST requests to something as simple as:

export async function onRequest({request}: EventContext<unknown, string, unknown>): Promise<Response> {
    return new Response("test", {status: 200});
}

Win 11, Node v19.9.0.

It seems to be gone after switching to Wrangler 3.18.0.

RamIdeas commented 9 months ago

@aroman the changes causing this issue were not in v3.11.0 – can you upgrade to the latest version and try again? 503s returned before 3.28.0 would be a completely different issue, and if you're still experiencing it on recent versions, please open a new issue with your repro steps.

@Cherry thanks for the additional info! I missed that in your previous comment you mentioned there are no reloads and you're receiving a 503. It shouldn't be possible to hit the code path to the 503 unless a request to a previous version of your worker failed and, upon catching that failed request, we noticed a new version of your worker. Can you provide me with the log file for that wrangler session? It would have more details about the state of the ProxyWorker before/during/after these failed requests.

@Cherry + @StrandedKitty I tried POSTing many concurrent requests with the following but couldn't reproduce the error. Could you try this repro on your setups and let me know if you experience the issue?

worker:

export default {
    async fetch(request, env, ctx) {
        await new Promise((r) => setTimeout(r, 5000));
        return new Response('HELLO');
    },
};

In one terminal, run:

$ npx wrangler@latest dev

In another terminal:

$ for i in {1..200}
do
  curl -X POST http://localhost:8787 -F 'fileX=@./path/to/any/file' &
done

StrandedKitty commented 8 months ago

@RamIdeas Hi, I've debugged it a little, and it seems that in order to reproduce it, it's required to:

On 3.19.0 the error looks like this:

✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

  {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

And on 3.28.4 it looks like this:

✘ [ERROR] Your worker created multiple branches of a single stream (for instance, by calling `response.clone()` or `request.clone()`) but did not read the body of both branches. This is wasteful, as it forces the system to buffer the entire stream of data in memory, rather than streaming it through. This may cause your worker to be unexpectedly terminated for going over the memory limit. If you only meant to copy the request or response headers and metadata (e.g. in order to be able to modify them), use the appropriate constructors instead (for instance, `new Response(response.body, response)`, `new Request(request)`, etc).

So it seems like this issue is related to https://github.com/cloudflare/workers-sdk/issues/3259 in some way.

Let me know if you need further help.

Cherry commented 8 months ago

I found a way to reliably reproduce this on Windows, @RamIdeas (and had some success on macOS). I filed https://github.com/cloudflare/workers-sdk/issues/5095 to track it separately.

amiranvarov commented 3 months ago

This issue still persists. Are any CF members planning to take a look at it?

RamIdeas commented 2 months ago

This issue still persists. Are any CF members planning to take a look at it?

@amiranvarov if you're experiencing a similar error in your wrangler dev session, please open a new issue with steps to reproduce.