cloudflare / workers-rs

Write Cloudflare Workers in 100% Rust via WebAssembly
Apache License 2.0
2.53k stars 269 forks source link

[BUG] Failure mode when including non-wasm compatible stuff is poor #646

Closed joshka closed 6 days ago

joshka commented 6 days ago

Is there an existing issue for this?

What version of workers-rs are you using?

0.4.1

What version of wrangler are you using?

3.78.12

Describe the bug

Adding code that does non-wasm compatible things leads to errors that are difficult to diagnose. E.g. take the tracing example and remove features=['wasm-bindgen'] from the time dependency in Cargo.toml and the build succeeds fine, but fails at runtime with:

[wrangler:inf] Ready on http://localhost:8787
✘ [ERROR] A hanging Promise was canceled. This happens when the worker runtime is waiting for a Promise from JavaScript to resolve, but has detected that the Promise cannot possibly ever resolve because all code and events related to the Promise's I/O context have already finished.

[wrangler:inf] GET / 500 Internal Server Error (6ms)
✘ [ERROR] RuntimeError: unreachable RuntimeError: unreachable

      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at X (file:///Users/joshka/local/workers-rs/examples/tracing/build/worker/shim.mjs:2:492)
      at f (file:///Users/joshka/local/workers-rs/examples/tracing/build/worker/shim.mjs:2:331)

A similar thing goes when trying to add tower_http::trace::TraceLayer to an axum router (this is the crux of my actual problem - I want to trace requests and providing the appropriate context to the traces).

Add the following to cargo.toml:

tower = "0.5.1"
tower-http = { version = "0.6.1", features = ["trace", "util"] }

And then to your router:

.layer(ServiceBuilder::new().trace_for_http())

You get a web response of:

Error: The script will never generate a response.
    at async Object.fetch (file:///Users/joshka/local/cloudflare-webfinger-worker/node_modules/miniflare/dist/src/workers/core/entry.worker.js:1029:22)

and logs:

✘ [ERROR] A hanging Promise was canceled. This happens when the worker runtime is waiting for a Promise from JavaScript to resolve, but has detected that the Promise cannot possibly ever resolve because all code and events related to the Promise's I/O context have already finished.

✘ [ERROR] RuntimeError: unreachable RuntimeError: unreachable

      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at [object Object]
      at J (file:///Users/joshka/local/cloudflare-webfinger-worker/build/worker/shim.mjs:2:419)
      at a (file:///Users/joshka/local/cloudflare-webfinger-worker/build/worker/shim.mjs:2:258)

Looking at shim.mjs is useless as it's minified.

Expectations

These are the things that would help inspire confidence in the cloudflare workers platform:

  1. Fail at build time for unsupported things rather than runtime
  2. Clear actionable log messages when errors like this occur
  3. Document Instructions on how to diagnose / troubleshoot errors in shim.mjs
  4. Document how the shim / wasm stuff works and how cloudflare workers / rust uses that.

I'm assuming that these pain points / limitations probably come mostly from Wasm rather than cloudflare (but could be wrong there)

Steps To Reproduce

No response

kflansburg commented 6 days ago

IMO these crates should not be successfully compiling when they cannot support the target with the feature set. It looks like you are encountering a Rust unreachable panic, and I don't think there is much we can do as a platform there. Are you using console_error_panic_hook or following our coredump example. These may provide more insight into what crate is misconfigured.

joshka commented 6 days ago

IMO these crates should not be successfully compiling when they cannot support the target with the feature set.

I'd agree - that would be much better for end users.

The comment in the std lib source about why this is happening is informative here:

//! Currently all functions here are basically stubs that immediately return
//! errors. The hope is that with a portability lint we can turn actually just
//! remove all this and just omit parts of the standard library if we're
//! compiling for wasm. That way it's a compile time error for something that's
//! guaranteed to be a runtime error!

I don't think there is much we can do as a platform there.

Sure there is. Generally the idea is to help a user encountering the issue move forward at least one step. Instead of just displaying a stack trace, and treating that as good enough, provide information that would help a user encountering the failure diagnose it. That might be simply pointing at a troubleshooting guide url, or it could be better error messages. Either way right now encountering a failure is effectively a dead end state for cloudflare workers built on rust, especially for someone not intimately familiar with wasm.

Additionally, the shim.js file is generated by worker-build (I think). In dev mode perhaps it could default to not minifying the code. This might help make this a little less of a pain to move past to at least see what line of code is failing with proper method names. (Checking this idea using the NO_MINIFY env var suggests that I'm wrong here - it doesn't help much, as the location of the error in the stack trace comes from the logging of the error, not from the actual source of it, but at least it rules out a bit more of the unknowns)


Are you using console_error_panic_hook or following our coredump example. These may provide more insight into what crate is misconfigured.

Thanks for the pointers to troubleshooting further. I added the panic hook and saw that the failure for the servicebuilder / trace is also around not supporting time functionality and that the tracing code calls duration / instant methods internally.

✘ [ERROR] panicked at library/std/src/sys/pal/wasm/../unsupported/time.rs:13:9:

  time not implemented on this platform

Which correlates to https://github.com/rust-lang/rust/blob/master/library/std/src/sys/pal/unsupported/time.rs#L13

It seems like to support crates that use Instant / Duration right now on wasm32-unknown-unknown, all these crates would need to fall back to using something like https://docs.rs/web-time/latest/web_time/, which would be kinda painful to do for all the libraries that need it (This is enough of a blocker that it likely changes my focus on whether this is a problem worth solving for my purpose). Mainly I wanted a bit more info about the comings and goings deeper in the axum stack for diagnostic purposes.

The top level of the yak shave on this request was that I was seeing timeouts on workers that I'd expected to work, after changing the config from custom domain to lower level routs, This worked locally fine, so I thought it must have been something in axum playing weird with workers and wanted to add more tracing of that level (using tower_http::trace::TraceLayer). It actually turned out that I'd missed adding a star at the end of the route (combined with incorrect docs on the cloudflare site schemes being supported in the route patterns), so requests with query strings timed out while requests without query strings did not.


TL;DR, my main suggestions / feedback here are to think about the failure modes that can occur at runtime and make it easy for users to do the next thing when troubleshooting. Assume that the user is not a wasm / web expert (which is why they're using rust for running workers), and be mindful of how to present information about what to do next directly in the tooling / error messages / default templates, and not just in the docs.

kflansburg commented 6 days ago

I understand the additional information that you would like to have, I'm saying that the runtime does not do anything special for Rust WebAssembly modules and it will likely stay that way. In the workers-rs shim, I don't think its likely that we will be able to disambiguate an unreachable statement due to this problem from the many other ways that unreachable statements can be encountered.

We recommend using console_error_panic_hook in the README, all templates, and most examples. That allows the runtime to produce sensible error messages as you saw.

There are many options for time support, including the time crate and chrono, which both support WebAssembly. Unfortunately std::time is not one of them, and I can't speak to the percentage of crates in the ecosystem which choose std::time vs. the alternatives, but in my experience most do not use std::time, especially if their authors have attempted to support WebAssembly themselves.

In general, I do think this SDK assumes some familiarity with developing for WebAssembly. We do our best to abstract over JavaScript interop code, but WebAssembly is a restricted compile target with good (but not 100%) crate support. Fully solving that developer experience is not in scope for this project.

kflansburg commented 6 days ago

I will also note that the tracing layer is unlikely to be useful because we do not allow timing of CPU time on the platform for security reasons, see https://developers.cloudflare.com/workers/reference/security-model/#step-1-disallow-timers-and-multi-threading

Edit: If you could override some of that layer configuration to disable the collection of timestamps, it may work and still produce useful logs.

joshka commented 5 days ago

Yup, that makes a lot of sense. My project was generated using the pre- cargo-generate template (the wrangler experimental one), which didn't have the panic handler, so I missed the message on using this.