cloudflare / workers-sdk

โ›…๏ธ Home to Wrangler, the CLI for Cloudflare Workersยฎ
https://developers.cloudflare.com/workers/
Apache License 2.0

๐Ÿ› BUG: `wrangler dev --experimental-local` slows down over time and leaks memory #2386

Closed: IgorMinar closed this issue 1 year ago

IgorMinar commented 1 year ago

What version of Wrangler are you using?

2.6.2

What operating system are you using?

Mac

Describe the Bug

If I test my simple worker with wrangler dev --experimental-local and monitor the browser devtools network tab, I see that after the initial (cold) requests, subsequent requests return in 2ms.

If I keep refreshing the browser, the time slowly starts to creep up: 4ms, 6ms, 8ms. If I keep refreshing the tab a lot, it slows down to 50-70ms and keeps slowing down further with additional HTTP requests.

At the same time, I observe memory usage increasing from 17MB to 34MB, 50MB, 75MB, all the way to over 1GB.
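
For anyone trying to reproduce, here is a rough sketch of a loop that watches both symptoms at once (the port is wrangler dev's default; the workerd process-name match and the polling interval are assumptions):

# Run npx wrangler dev --experimental-local in another terminal first.
while true; do
  # Time one request; this figure creeps upward while the bug is present.
  curl -s -o /dev/null -w "latency: %{time_total}s " http://127.0.0.1:8787/
  # Print workerd's resident set size in KB; it climbs steadily when leaking.
  echo "rss_kb: $(ps -Ao rss=,comm= | awk '/workerd/ {print $1; exit}')"
  sleep 0.2
done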

IgorMinar commented 1 year ago

The issue is very easy to repro when load testing the local server with hey:

$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:    0.8201 secs
  Slowest:  0.7063 secs
  Fastest:  0.0024 secs
  Average:  0.0334 secs
  Requests/sec: 1219.3387

$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:    5.1450 secs
  Slowest:  4.2399 secs
  Fastest:  0.0153 secs
  Average:  0.2053 secs
  Requests/sec: 194.3653

$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:    17.4745 secs
  Slowest:  16.9587 secs
  Fastest:  0.0227 secs
  Average:  0.8165 secs
  Requests/sec: 57.2262

$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:    32.9817 secs
  Slowest:  2.6376 secs
  Fastest:  0.0976 secs
  Average:  1.6490 secs
  Requests/sec: 30.3199
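
(A compact way to drive the same degradation, reusing the exact command above; the batch count of 4 just mirrors the runs shown:)

for i in 1 2 3 4; do
  hey -n 1000 http://127.0.0.1:8787/ | grep -E 'Total|Average|Requests/sec'
done

While the bug is present, each successive batch reports a sharply lower Requests/sec.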

Interestingly, I occasionally see workerd crash with the following:

workerd/server/server.c++:967: info: Inspector client attaching [core:user:workers-hot]
workerd/io/worker.c++:999: info: console warning; message = There was an error while reporting an unhandled promise rejection: RangeError: Maximum call stack size exceeded
*** Received signal #10: Bus error: 10
stack: 
penalosa commented 1 year ago

Could you post the script of your worker, and your wrangler.toml, if possible?

IgorMinar commented 1 year ago

Hey Samuel, apologies for the late response. This was a simple "hello world" worker with no bindings or special config.

mrbbot commented 1 year ago

Hey @IgorMinar! I've just tried to reproduce this with the latest wrangler@beta, and I'm not seeing the response slowdown over time anymore, even with large -n. Would you be able to try out npx wrangler@beta dev --experimental-local? Memory usage does creep up (not as high as you were seeing), but this might just be slow GC, as it usually comes down after a while.

That being said, I do still get *** Received signal #10: Bus error: 10 stack: (indicating a memory access outside the process address space) if I try -n 10000 a few times, so there's still something strange happening here. Notably, this also happens when you run workerd serve samples/helloworld/config.capnp in the workerd repo, so I don't think this is a Wrangler/Miniflare problem.
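
(For anyone who wants to try that workerd-only repro, a rough sketch, assuming a Bazel setup as described in the workerd README; the build target path and the sample's port 8080 are assumptions:)

git clone https://github.com/cloudflare/workerd && cd workerd
bazel build -c opt //src/workerd/server:workerd   # an optimized build; -c opt turns out to matter, per later comments
./bazel-bin/src/workerd/server/workerd serve samples/helloworld/config.capnp
# From another terminal, hammer the sample until it misbehaves:
hey -n 10000 http://localhost:8080/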

IgorMinar commented 1 year ago

I'll retest it, but it will likely take a few days before I can get around to this. Thanks for the heads-up though; it looks like good progress was made on this upstream already.

lrapoport-cf commented 1 year ago

@IgorMinar were you able to confirm whether this is still a problem?

IgorMinar commented 1 year ago

Oops, I dropped the ball on this one. I'll retest today. My apologies.

lrapoport-cf commented 1 year ago

hey @IgorMinar! :) just bumping this -- did you have a chance to retest?

IgorMinar commented 1 year ago

I was not able to verify/test because I got blocked on https://github.com/cloudflare/workers-sdk/issues/3007, which seems to suggest that --experimental-local is currently broken in the 2.14.0 release for some, if not all, users.

re-thc commented 1 year ago

Getting a similar issue on the latest wrangler. I don't know if it "slows down", but memory usage quickly climbs 10 → 20 → 30 GB and doesn't come back down unless the process is killed.

avaer commented 1 year ago

I can report this is still the case in wrangler 3.0.1 on macOS.

Note that this does not require the wrangler server to serve any requests at all; it leaks just from being started.

Command:

wrangler dev --port 2222

I also get this error after some time; I'm not sure if it's related to the leak.

Failed to get worker definitions TypeError: fetch failed
    at fetch (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:16584:17)
    at async getRegisteredWorkers (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:124280:22)
    at async getBoundRegisteredWorkers (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:124299:29) {
  cause: HeadersTimeoutError: Headers Timeout Error
      at Timeout.onParserTimeout [as callback] (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:8431:33)
      at Timeout.onTimeout [as _onTimeout] (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:681:17)
      at listOnTimeout (node:internal/timers:573:17)
      at process.processTimers (node:internal/timers:514:7) {
    code: 'UND_ERR_HEADERS_TIMEOUT'
  }
}
csabatuz-chess commented 1 year ago

I'd like to second this issue, as it's currently blocking our effort to use wrangler dev in a local development / integration-testing environment.

3.1.0 MacOS

For measurement I used:

hey -c 1 -n 10000 http://127.0.0.1:8787/

Bare-minimum worker, no deps, no bindings:

export default {
  async fetch(request: Request, env: any, executionCtx: ExecutionContext) {
    return new Response("Hello", {status: 200})
  },
}

Observed: a steadily progressing slowdown and, in the end, a crash:

[mf:inf] GET / 200 OK (18ms)
[mf:inf] GET / 200 OK (20ms)
[mf:inf] GET / 200 OK (20ms)
[mf:inf] GET / 200 OK (20ms)
*** Received signal #10: Bus error: 10
stack: 
╭────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [b] open a browser, [d] open Devtools, [l] turn off local mode, [c] clear console, [x] to exit │
╰────────────────────────────────────────────────────────────────────────────────────────────────╯

I did a control measurement comparing the Service Worker format with the Module Worker format:

addEventListener('fetch', (event) => {
  event.respondWith(new Response("Hello", {status: 200}))
})

_The worker performs a lot better with wrangler dev when in the Service Worker format, but crashes all the same with the same error, just later (after way more requests)_

I'll perform an experiment with an emulated amd64 arch a bit later.

Cragady commented 1 year ago

Hopefully this will help in some small way, but I've had issues with the slowdown and the bus 10 errors as well.

I started at 3.0.1 and am now on 3.1.0 on macOS.

Both gave me a Bus error: 10 message along with massive slowdowns out of the gate.

I used this command, softwareupdate --install "Command Line Tools for Xcode-14.3", and that seemed to get rid of the bus 10 error. The speed was still abysmal, though, taking a couple of minutes to load vs. the fraction of a second it takes on prod.

Command used:

wrangler dev --log-level debug --ip 0.0.0.0 --port 3000 --env development

I did, however, get a different error from the bus 10 error after the Xcode CLI update:

workerd/jsg/util.c++:278: info: exception = kj/async-io-unix.c++:778: disconnected: ::writev(fd, iov.begin(), iov.size()): Broken pipe; iovTotal = 1108; iov.size() = 1
 stack: 10210f62f 1021100d3 10124d3cc 1020e7df8 1020ec88c 1020e37a8 1020f2c54 1009f95e8 1020f16d0 100df1130
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (in response); exception = Error: Network connection lost.
 workerd/jsg/util.c++:278: info: exception = kj/async-io-unix.c++:778: disconnected: ::writev(fd, iov.begin(), iov.size()): Broken pipe; iovTotal = 1108; iov.size() = 1
 stack: 10210f62f 1021100d3 10124d3cc 1020e7df8 1020e802f 1020d349c 1020e7df8 1020e802f 1020d349c 1020e7df8 1020e802f 1020d349c 1020f9314 1020f16d0 100f77494 1009f95e8 100e9e7c0 100dd8868 100eaf900 100d40c08 100df01c8 100b074b4
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (async); exception = Error: Network connection lost.
 ✘ [ERROR] Uncaught (in promise) Error: Network connection lost.

 ✘ [ERROR] Uncaught (async) Error: Network connection lost.

For additional context, the application runs as expected on the miniflare 2.X versions.

Command used with miniflare:

miniflare --live-reload --debug --port 3000 --env .env --modules <script path>

I hope this isn't just adding static.

csabatuz-chess commented 1 year ago

I can confirm what @Cragady reported. However, IMO the main issue here is the slowdown and the outrageous memory pressure accompanying it.

ohodson commented 1 year ago

We were looking into the signal #10: Bus error in workerd yesterday.

This appears to be limited to workerd on OS X and only for optimized builds (bazel build -c opt). It hasn't failed this way with fastbuild or dbg builds.

xcodebuild -version reports Xcode 14.3.1 Build version 14E300c.

The good news is that we have a stack trace and can pinpoint where the failure occurs; the bad news is that it's an optimized build, so there are only partial symbols along with registers and memory. We'll figure something out :-)


With the slowdown and memory-pressure issue, you can capture heap snapshots with devtools by enabling the inspector in workerd, e.g. workerd serve -i0.0.0.0 . If you open https://devtools.devprod.cloudflare.dev/js_app?ws=localhost:9229/main (or chrome://inspect) in Chrome on the dev machine, you can then go to the Memory tab and take a snapshot. You can keep capturing heap snapshots between runs of hey to see what is growing.
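
(As a concrete sketch of that snapshot workflow, with an illustrative config file name and an assumed sample port of 8080; the pauses are arbitrary:)

workerd serve -i0.0.0.0 config.capnp &
# Open https://devtools.devprod.cloudflare.dev/js_app?ws=localhost:9229/main
# in Chrome and take a baseline heap snapshot from the Memory tab.
for i in 1 2 3; do
  hey -n 10000 http://127.0.0.1:8080/
  echo "batch $i done; capture another heap snapshot now"
  sleep 30   # time to take the snapshot, and for GC to settle
done
# Diff the snapshots in devtools to see which objects grow across batches.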

The heap looks stable across batches of 10000 requests from hey using the workerd helloworld_esm sample on OS X M1 and Linux x64 (opt and dbg), with both the last workerd release and tip-of-tree. The overall process size does not grow without bound. This check is probably too brief, though; the leak might be exposed by a config and scripts that exercise more functionality.

o-az commented 1 year ago

Having the same issue today.

ohodson commented 1 year ago

Hi @lrapoport-cf, the fix to workerd is in and will be available in the next workers-sdk release. We can close this ticket when that happens. (I unassigned the ticket while trying to re-assign it, but I don't appear to have permission to do that.)

lrapoport-cf commented 1 year ago

Thanks @ohodson! I've linked your fix to the ticket and reassigned :)

daniloteodoro commented 1 year ago

Looking at a merged PR in the comments above, it seems like there is a fix 🎉 Do we have an expected date for this to be released?

mrbbot commented 1 year ago

Hey everyone! 👋 We've just released a new beta version of Wrangler that includes a fix in workerd for this issue. Try it out with npx wrangler@beta dev. Please let us know if you encounter any issues. 🙂

BillBrower-Shopify commented 1 year ago

@mrbbot wrangler@beta seems to fix the signal #10: Bus error, but we still get 500s due to the maximum call stack size being exceeded after ~2.5 minutes of constant requests in our reproduction repo. The response times also climb from ~50ms at the start of the stress test to ~350ms by the time the 500s appear.

mrbbot commented 1 year ago

@BillBrower-Shopify thanks for letting us know. We'll see if we can translate this into a workerd-only reproduction. 👍

mrbbot commented 1 year ago

@BillBrower-Shopify I've been able to reproduce this using just workerd and handed it off to the runtime team. I'm OOO all of next week, so I won't be able to respond for a bit.

BillBrower-Shopify commented 1 year ago

Thanks @mrbbot! Is there a separate GitHub issue we can use to track progress on that work?

o-az commented 1 year ago

I can confirm wrangler@beta seems to fix it. Before the beta I was using ^3.1.1 and kept getting the same error.

ohodson commented 1 year ago

@BillBrower-Shopify I've filed https://github.com/cloudflare/workerd/issues/831 against workerd to investigate the 500 error that you are seeing. Thanks.

mrbbot commented 1 year ago

@BillBrower-Shopify the maximum call stack size exceeded error and the slowdowns should've been resolved by #3585 and released in wrangler@3.2.0. That seems to have fixed the issue with your reproduction, but let us know if you have any other problems. 👍