embassy-rs / embassy

Modern embedded framework, using Rust and async.
https://embassy.dev
Apache License 2.0
5.63k stars 781 forks source link

Using embassy primitives in webworkers #3313

Open 9SMTM6 opened 2 months ago

9SMTM6 commented 2 months ago

Hey there! Its amazing that you also offer an async runtime with synchronization primitives etc. for the web. And it works well enough, if you use it from the main thread.

However I did notice that some primitives don't seem to work if you use them from a webworker.

That is a rather big ask, and if you think that is out of scope, that's fine, but I thought I'd at least let you know of this usecase.

I made a reproduction here.

I'm very much aware that it might not be possible to make all primitives work in webworkers. However many WebAPIs are also available in webworkers, so I think that many primitives should be possible if one is a bit careful which concrete API they use.

Dirbaio commented 2 months ago

What web APIs is Embassy calling that aren't available in webworkers?

embassy-sync uses "just standard async Rust", so I'd say the issue is not there. It's probably in the executor implementation for wasm which does call JS apis to get itself polled from the JS event loop.

9SMTM6 commented 2 months ago

@Dirbaio This is only the first thing that made errors and in all likelyhood theres going to be more issues when we solve this. But the wasm example breaks because of embassy-time, concretely because that is using wasm_timer to get a replacement for std::time::Instant (which isn't available on web).

It took me a bit, because of the crazy build system you've got there - understandable but takes getting used to - , and because liker errors with wasm_bindgen are... perhaps more problematic than these on the native platform. But I did just patch embassy-time(-driver) to use web-time instead, which.... partially worked.

Ticks on the main thread work as before, and on the worker thread I didn't get a panic, but the loop never repeated.

IDK if web-time is even supposed to work on web-workers, that would be what I look at now.

Dirbaio commented 2 months ago

ah, it's embassy-time, makes sense :)

hope you figure it out, i'll gladly merge PRs that make things work in webworkers.

9SMTM6 commented 2 months ago

@Dirbaio thanks, I'll look around a bit. I've tried 2 other crates that implement std::time::Instant on the web, because that was the easiest check I came up with, all of them failed the same way web-time did.

My guess now, as to why they fail that way: They all use performance.now(). But that behaves differently on Web-Workers than it does on the main page. Their start time is different.

If that is the case, any solution would likely have to come with refactoring of perhaps not even explicitly wasm code.

9SMTM6 commented 2 months ago

NVM, instant fails differently. However that does also kindof support my theory, as it fails with 'earlier cannot be later than self.' , which is exactly what I'd expect with the API difference.

However, thinking about it I might after all be able to replace Instant minimally invasive in terms of common embassy code. However that would then require some synchronization of the start time of all instants, which might be a considerable overhead.

See: https://developer.mozilla.org/en-US/docs/Web/API/Performance/timeOrigin#synchronizing_time_between_contexts

However I'm still somewhat surprised by this, as that executor and the timer were both running on the webworker from the beginning.

Also sadly other stuff has to get done for now, just letting you know that I'm no longer actively working on this now.

9SMTM6 commented 2 months ago

didn't let me go ;).

I ended up basically vendoring in web_time, which in the case of enabled atomics (which is the case in my reproduction) does things as it should according to mdn to have correct absolute times. I vendored it it to access the underlying duration it saves in Instants, which I replaced the difference from zero_instant in driver_wasm for. Then I logged the start times in the poll method of the Timer future impl, along with some information whether its in a web worker or not.

The start times with that are roughly 40 ms apart, which fits with how they're logged. So the instant isnt the issue any longer, AFAICT.

If I were a betting man, I'd say that now the wakers don't work correctly now. Its probably doing something similar to the zero_instant, which would probably not work correctly in web workers.

9SMTM6 commented 2 months ago

I spend some more time on this.

AFAICT all the remaining code, incl. executors is fine. Current suspect is the browser implementation of performance.now() being buggy.

The pointer was when I actually read the documentation of web-time which pointed me to a section of MDN that describes that performance.now doesn't behave correctly on Linux when suspending.

I might at a later point test this on windows, to confirm it being the (only) issue (for sleeping at least).

As said, at least for the code I looked at, I also did not spot any other things that should not work on web workers, however with how complex this can evidently get I probably would not spot issues.

I would like to upstream some of the improvements, as that would get one at least closer to working executors on web workers. However, with the current behavior one at least gets a panic if one tries to use this on a web-worker, if I plainly add this it would fail silently. So I propose the following:

@Dirbaio would you find that acceptable?

Dirbaio commented 2 months ago

I will add a check whether we're in a web worker to Executor::new on wasm, that will panic if thats the case

i'm not sure I follow. If you fix embassy-time so it works in webworkers, what issue is left that warrants this?

9SMTM6 commented 2 months ago

It will probably work on Windows. But on all other browsers performance.now is broken, apparently this is caused by an unclear standard. These issues are ~3 years old, so there's hope that they might get fixed eventually, and if they do, then these changes will allow embassy to work in web workers on all platforms. Before that, I might put this check behind a feature flag, but enabling that flag by default would be a compatibility hazard, and in addition, these issues are not easy to diagnose.

Dirbaio commented 2 months ago

is the bug that performance.now() stops counting while suspended?

IMO that's acceptable, it's okay for time to "freeze" while suspended. I'd be OK with documenting in embassy-time that this behvaior is platform-dependent too.

I don't think we should actively sabotage the executor due to that. Besides, it's possible to use embassy-time without the executor, so making the executor do the check doesn't prevent the user from running into that.

9SMTM6 commented 2 months ago

IMO that's acceptable, it's okay for time to "freeze" while suspended. I'd be OK with documenting in embassy-time that this behvaior is platform-dependent too.

If you've got a web-worker that's not busy looping or something it will be stuck forever on timer.sleep.

Besides, it's possible to use embassy-time without the executor, so making the executor do the check doesn't prevent the user from running into that.

Thats true...

I think I've got another location that would warn both more completely and more specifically if you do something "wrong". I'll have a PR ready soon, you can then give feedback on there.

I would still like to panic by default. With the location I have in mind it should only trigger on the problematic functions, and on some of these there was already a panic, the one that originally caused this issue.

I will place it behind a default feature, so that people can disable that if they think their usecase is fine. But I would really like to be proactive in warning about this.

Documentation should also be added on top, thats right too.

9SMTM6 commented 2 months ago

is the bug that performance.now() stops counting while suspended?

@Dirbaio oh, and yes. The PR is around btw.

Dirbaio commented 2 months ago

If you've got a web-worker that's not busy looping or something it will be stuck forever on timer.sleep.

why? when you unsuspend the machine performance.now() would resume counting which would get things working again. Or am I misunderstanding the bug?

9SMTM6 commented 2 months ago

I've since tested this on Windows, and it still fails. So IDK, clearly this isn't the only issue, but I seem to be out out my depth.

@Dirbaio my theory was that this would lead to the future just re-suspending after it was woken. But on reconsideration, with the logs I had inserted, I should've been able to spot that behavior.

I'm not experienced with executors/implementing futures in rust at all, so IDK. whats really going on.

9SMTM6 commented 2 months ago

Anyone working on this might want to consider the notes here and might want to reference wasmtimer-rs, which is a maintained fork of wasm-timer (the current provider for Instant in embassy on wasm, though nothing else) that also has abstractions similar to embassy-timer, and also works on webworkers.