tlively closed this 1 year ago
@tlively, do you feel that the lack of tail call support is blocking important work? Notably, is it or will it be blocking the work on wasm GC in any significant way, or to put it another way, will the j2cl and dart pipelines depend on tail calls for performance or significant functionality?
Yes, spec, interpreter, and test suite have been done long ago. The second engine has been the main blocker for the last 2-3 years. I was hoping that JSC would be particularly motivated to implement this one. :)
@lars-t-hansen, I think this feature is independent of GC, and it has been requested from various sides. I of course understand the need for prioritisation, but it looks somewhat troublesome for the health of the Wasm process if a feature that is basically done and everybody agrees is wanted does not make any progress for years.
@rossberg, this is simply a prioritization issue for us, as you already observed - small team, competing pressures, and a question of utility relative to other proposals and ongoing infrastructural work.
In general - this has come up in a couple of threads recently - I feel that a number of wasm proposals could be more explicit about their expected utility. The tail call proposal (not worse than any other in this regard!) doesn't even name a single concrete language that would benefit from this proposal, nor does it try to quantify, even informally, the amount of pain that would result from not having tail calls in wasm, for each specific such language (or other use case). We could all name such languages, but how interested are we in trying to understand the alternatives for translators from those languages, and would any of them actually have translators absent some other feature X that is also missing (frequently X = the GC feature)? I think that often, utility is discussed in meetings during the early stages of the proposals, but by capturing the claims of utility directly and in some detail in the proposal it would be easier to understand later why a proposal should be prioritized for implementation.
Consider Scheme, which requires proper tail calls. Several successful translators from Scheme to C were written that circumvent the lack of tail calls in C in various ways, either by some sort of fairly efficient trampoline+longjmp system and/or by translating calls that implement loops to actual loops and translating the remaining tail calls as non-tail calls. Static compilers would also use the larger (whole-module) context to make such a translation more effective. At least Stalin, Bigloo, Gambit-C, and Petit Larceny are examples of translators in this class; there were others (but it's such a long time ago and I've forgotten).
So if Scheme is one of the languages named in a list of languages that would benefit from proper tail calls, it would be appropriate to mention these techniques as some that could be used by Scheme translators in the absence of tail calls, along with the problems that using these techniques might create for the translator and/or language user, and some assessment of how bad those problems are in practice. And it might also be appropriate to mention that a successful Scheme translator would require many other things (GC, call tags, a very powerful variant of stack switching, and reasonable solutions for generic arithmetic and variable-length argument lists) and is therefore not a powerful driver for tail calls at this time.
@tlively, I see you added this to tomorrow's meeting agenda. Unfortunately I'll miss that meeting, so a brief remark here. Our position is that the utility of tail calls is unclear enough that it is not a priority for us to implement it soon, nor would we want to speculate about when we might implement it.
Thanks, @lars-t-hansen! Your position that we should concretely motivate further work makes perfect sense. I started thinking about tail calls again because they came up in an Emscripten issue about supporting C++20 coroutines: https://github.com/emscripten-core/emscripten/issues/10991. I don't have firsthand knowledge of it yet, but according to that issue, Clang can use tail calls to make coroutines much more efficient.
Do you think that finding "X C++ workloads get faster by Y% with tail calls enabled" could be sufficient motivation (for some X and Y) to prioritize tail calls? Or would you want to see new languages, etc. rather than just performance improvement?
> @tlively, do you feel that the lack of tail call support is blocking important work? Notably, is it or will it be blocking the work on wasm GC in any significant way, or to put it another way, will the j2cl and dart pipelines depend on tail calls for performance or significant functionality?
No, as far as I know this is orthogonal to the GC work.
@tlively,
> Do you think that finding "X C++ workloads get faster by Y% with tail calls enabled" could be sufficient motivation (for some X and Y) to prioritize tail calls? Or would you want to see new languages, etc. rather than just performance improvement?
Every bit helps, IMO. As always, the more use cases and the more compelling they are the more it makes sense to prioritize work on a feature. If real C++ code that we might want to run on the web improves significantly then that's an important fact to consider.
Lumen, the AOT compiler we're building to support Erlang, Elixir, Gleam, and other languages that run on the BEAM, will be heavily dependent on tail calls, as all of those languages are written assuming that tail calls are cheap and do not extend the stack. They also assume that tail calls don't have to be recursive tail calls.
Lumen also would use the stack switching protocol to implement Erlang processes.
Asterius (an experimental Haskell-to-wasm compiler) is also heavily dependent on tail calls, since the Haskell stack is managed explicitly as heap data and we need the guarantee that control flow transfer has O(1) memory overhead. We emulate tail calls via trampolines by default for now; we have no concrete numbers on how much performance gain tail calls will bring, but it's surely an upstream feature we wholeheartedly welcome.
@TerrorJack, would you be able to measure the performance overhead of the trampoline approach? Binaryen (if you use it) supports tail calls with `--enable-tail-call`, and V8/Node supports them with `--experimental-wasm-return-call`. Chrome supports them (IIRC) if you enable `chrome://flags/#enable-experimental-webassembly-features`.
I tried to implement a compiler from CakeML (or rather, some much simpler intermediate language in the CakeML compiler) to WebAssembly. While a trampoline approach might be possible to implement, it's too hard to verify (CakeML is a verified compiler). Relying on tail calls would be easier.
Edit: So probably you should not only think in terms of improvements that tail calls will yield for existing code, but also what it would enable.
@lars-t-hansen looking at the performance benchmarks for Seastar and Redpanda, which are pretty breathtaking, they both rely heavily on co_await, which relies on tail calls. We could do some pretty incredible WebAssembly stuff. Even the chirp guys have been asking for tail calls in their efforts.
http://seastar.io/http-performance/
Until recently, I didn't realize what a massive impact tail calls could make when it comes to co_await and the compiler optimizations which come from it.
I'd suggest getting the change released on chrome asap. Once webassembly workloads start showing up which only run fast on chrome, others will follow.
@unicomp21 indeed we (Leaning Technologies) are interested in tail calls for CheerpX.
We use them to implement X86 indirect jumps.
The alternative is a trampoline-like solution, but it is significantly slower. We tried really hard to avoid the problem altogether in many cases (by "devirtualizing" the jumps), but of course it is not always possible (see here for some more info).
I don't have numbers at hand, but since we have a flag to toggle tail call support in our JIT, I could try to set up some benchmarks.
We are even considering implementing it in JSC or SpiderMonkey ourselves to move things forward.
@yuri91,
Thanks for the input. (Thanks also goes to everyone else on the thread.) Now we can hopefully count on @tlively to summarize these (and other) findings into the proposal's overview ;-)
> We are even considering implementing it in JSC or SpiderMonkey ourselves to move things forward.
If you're planning anything for SpiderMonkey we'd appreciate knowing about it beforehand, as we are implementing other large changes to the ABI and there would be an inevitable conflict with tail calls if these changes were to happen concurrently without synchronization.
@lars-t-hansen Good to know! We haven't seriously looked into doing this yet, but if/when we do we will get in touch for sure.
Some work has started on a compilation strategy from the OCaml language to WebAssembly. This work unfortunately had to be paused due to the pandemic, but hopefully it's not too far away from restarting. OCaml relies heavily on tail calls (both direct tail calls where the callee is known and indirect tail calls where it is not) for loops involving one or more functions. Within approximately the next year we should be in a position to have an optimiser that can compile some of these situations to loops, via contification, but it is likely that tail calls will remain important. As such it would be very welcome if there were to be further progress on the implementation of tail calls within the WebAssembly ecosystem.
> indirect tail calls where it is not

`apply(module, function, arguments)` and `apply(anonymous_function, arguments)` also occur in tail position in Erlang and Elixir.
@lars-t-hansen these might be useful ...
https://github.com/scylladb/seastar/blob/9cb2d51ed19b63cd81d25ba9efa512f5ca01f500/doc/tutorial.md
https://github.com/vectorizedio/redpanda/search?q=co_await
https://github.com/chriskohlhoff/asio/search?q=co_await
https://github.com/search?q=org%3Afacebook+co_await&type=code
my understanding is co_await leverages tail calls heavily.
Per the last link, perhaps we could get facebook to start asking for tail call support in webassembly? I'm guessing they would be interested in having them for all the meta stuff going on.
Just as a data point, for the Guile implementation of Scheme, tail calls would be useful but not necessary to enable Scheme on wasm, and not sufficient on their own for high performance. Because of varargs and multiple return values, we end up having to pass arguments and return values via tables anyway, in the general case. Because we want to capture multi-shot delimited continuations, we also have to CPS-convert and push return continuations on an explicitly managed stack, converting all calls and returns to tail calls, implemented via a trampoline loop. At least we can contify loops & control flow without calls.

Tail calls in wasm would do away with the need for the trampoline loop, but that is just one piece. See https://lists.gnu.org/archive/html/guile-devel/2021-06/msg00005.html for further discussion.
@wingo, I'm curious why multiple return values are an issue, given that Wasm has them, too. Does that have to do with the untyped nature of Scheme?
@rossberg yes it is the lack of types. same issue as varargs, really.
@tlively and @rossberg,
> Community Group has reached consensus in support of the feature.
>
> * Judging from the lack of open issues on this repo, this shouldn't be a problem.
The overview lists one open issue, all the way down at the bottom:
> Can tail calls across module boundaries guarantee tail behaviour?
There isn't an open or closed GitHub issue whose title speaks to that problem; possibly a resolution lurks in a discussion somewhere. In any case, can we either have that open issue removed from or clarified in the overview if it has been resolved, or a GitHub issue opened for discussion and eventual resolution otherwise? Thanks.
Aren't they already adding it to chrome? @lars-t-hansen
@lars-t-hansen, yes, good question. My recollection of the state of that debate is that multiple implementers were positive that this could be implemented just fine, e.g. using auxiliary stubs or pushing custom stack frames. But I don't know if that has actually been proven in practice, e.g., in V8.
@lukewagner, @titzer, @jakobkummerow
Cross-module tail-calls should work just fine in chrome. The only difference with regular tail-calls is that we pass the target instance to the callee instead of the current instance, like we do for regular import calls. However the implementation is still unstable and we don't have plans to enable it by default yet.
@lars-t-hansen Here's more information on the massive performance benefits you can get from using proper tail calls in C and C++:
"Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C" https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
My project is a functional language with proper tail calls. The interpreter is written in C++, and I hope to use C++ tail calls to make that interpreter efficient. Tail-threaded code has emerged as the fastest known interpreter-writing technique since clang 13 added the [[clang::musttail]] attribute; see above. I also want to write a compiler for my language that generates WASM code, and if I want efficient tail calls, I'll need WASM tail calls. Many languages support proper tail calls, and many languages have interpreters, so my requirements are hardly unique.
It would be interesting to see how those results translate to WebAssembly. The blog post doesn't say how much tail calls specifically increased performance, but it would be nice to measure. @doug-moen, if you get a chance to try using `[[clang::musttail]]`, compiling your interpreter with Emscripten, and running it with tail calls enabled, I'd be curious to hear how it goes.
Here's a simple explanation of how a tail-threaded Forth interpreter works: https://github.com/snej/tails/. If you want to understand how this tech works, this is a useful starting point to get oriented.

Next, here's a more complex explanation of how the tail-threaded Wasm3 interpreter works: https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md. This includes more information about why it works, and includes benchmarks showing large performance improvements over bytecode interpreters like Lua.
Each interpreter opcode is represented by a C function that tail-calls the next instruction in the instruction stream. The generated code for a typical opcode doesn't do any C stack manipulation. The interpreter VM's registers are kept in CPU registers. The tail-call to the next instruction is fast due to branch prediction on modern CPUs. The performance improvement is not as great for legacy 32 bit CPUs.
How well these performance improvements translate to WebAssembly would depend on the quality of the WASM implementation. If WASM code is compiled into machine code with the same characteristics as the machine code generated by `clang -O3 -fomit-frame-pointer -fno-stack-check -fno-stack-protector`, then the performance should be the same. I have no empirical results to share. My guess is that the performance of tail-threaded WASM code will improve once there are projects generating this kind of WASM code, giving WASM implementors an opportunity to tune performance for this use case.
Even without a performance boost, I will still benefit:
@doug-moen, thanks.
While I can currently see a path to fairly reasonable tail calls in SpiderMonkey, there's no indication that they will be faster than regular calls; i.e., the main benefit here would be the ability to express iteration as recursion without overflowing the stack. To be sure, we avoid the frame cleanup when the call returns, but on the other hand, since a chain of tail calls may eventually cross a module boundary and will depend on the initial tail caller to clean up any context restoration that we currently do at the caller site, there will be a little extra work at each call site. And at this time I am unwilling to penalize non-tail calls with even a single extra instruction to accommodate tail calls.
I'm concerned that the conversation on this particular proposal is getting reset to a state several years in the past, before there was a broad consensus to move this far forward. The many replies above reaffirm the clear need for and advantages of this feature. We shouldn't reset the conversation to before consensus unless there is new information.
@lars-t-hansen Your concerns about tail calls potentially impacting the non-tail call case are well understood, as that was raised very early in the design process. But V8 has managed to implement them without any overhead to regular calls (without any modifications to its already fairly straightforward, if internal, calling convention). I appreciate that different systems have different details. I feel like not accepting a single instruction is a hardline stance. In this case, and really in any case, the question of whether some overhead is acceptable should be settled by empirical measurement. Otherwise the conversation will just cycle again without any new information. (And let's be clear--it might be a little overhead now, but engines can and do make improvements over time; functional completeness is all that's needed to complete the process here).
Tail calls needs to ship so that we can move on. From my perspective, it's extremely important that WebAssembly be able to complete the standardization process for features that have this much momentum.
> I feel like not accepting a single instruction is a hardline stance.
Well, yeah. But no doubt you also noted the qualifier "at this time".
> To be sure, we avoid the frame cleanup when the call returns, but on the other hand, since a chain of tail calls may eventually cross a module boundary and will depend on the initial tail caller to clean up any context restoration that we currently do at the caller site, there will be a little extra work at each call site. And at this time I am unwilling to penalize non-tail calls with even a single extra instruction to accommodate tail calls.
On this I was thinking: the first call site in a chain of tail calls may be an arbitrary call, but the second one is guaranteed to be a tail call. So there are ways to leave code generation unchanged for regular calls and add overhead ONLY at tail-call locations. This would mean that the first tail call in a chain consumes a stack frame, but that's not a big deal, and I believe it to be in line with the specification. The point being, there is no need to have zero overhead per chain of tail calls; a single extra stack frame is fine (given that at worst this would double the size of the stack).
@carlopi, yes indeed, and that is the direction I'm pursuing, as it fits our internal ABI.
Just to follow up here, I've been prototyping some in SpiderMonkey.
> Can tail calls across module boundaries guarantee tail behaviour?
I now believe that SpiderMonkey can accommodate a requirement for cross-module tail calls to be proper tail calls within our current ABI. So that particular issue can be closed, from our point of view.
Re progress toward phase 4, due to resourcing constraints we're probably not going to be in a position to have a production-ready implementation of tail calls this year. Priorities may change, but tail calls are lower priority for us than several other proposed features.
Thanks for the update, @lars-t-hansen! I'm glad to hear the ABI won't be a problem.
@yuri91 given resource constraints listed by @lars-t-hansen, any chance Leaning Technologies might have spare cycles for this effort?
Just to be tidy about it: We are unlikely to accept outside help on this feature unless the persons doing the work already have deep expertise with SpiderMonkey and its compilers, and even then it's going to be a tough sell - tail calls really is not our highest priority at this time.
@john-livelyvideo my colleague is currently implementing it in Webkit: https://bugs.webkit.org/show_bug.cgi?id=215275
FWIW I discovered this morning that not only can tail calls make C++20 coroutines more efficient, but in fact clang is not able to compile C++20 coroutines without tail call support. Sooner or later we will have large users trying to use C++ coroutines and shipping this will become much more important.
@gornishanov ^, any way we can get help lighting a fire under this on the clang side? Maybe they can implement your workaround for webassembly in the meantime?
I am not sure. Companies that funded coroutines development were doing it because they needed coroutines to help with their internal projects.
On the other hand. Are you comfortable with llvm optimizer passes? I think the transformation you seek is possible by modifying CoroEarly and CoroSplit passes. Though, it is a lot of work.
https://llvm.org/doxygen/CoroEarly_8cpp_source.html
The CoroEarly pass lowers the `void @llvm.coro.resume(i8* <handle>)` intrinsic. For WebAssembly, that is where you would insert the loop. This should be an easy change.

The CoroSplit pass creates coroutine parts in the shape `void coro$part1$resume(void* coro-frame)`. For WebAssembly, the parts should be `void* coro$part1$resume(void* coro-frame)`, and any tail-call resumes need to be purged and replaced with `ret`.

The latter will be a more difficult change and will probably have a ripple effect on other coroutine passes, since the signature of the coroutine part will be changed. The LLVM coroutine documentation would need to be updated as well.
Could microsoft benefit?
> @john-livelyvideo my colleague is currently implementing it in Webkit: https://bugs.webkit.org/show_bug.cgi?id=215275
Update: the bug has migrated to GitHub, and is getting some reviews: https://github.com/WebKit/WebKit/pull/2065
I also wrote a post that gives a high level overview of the implementation: https://medium.com/leaningtech/fantastic-tail-calls-and-how-to-implement-them-8c6bd52fd271#6524
Awesome @tomoliv30, thanks for the update! Looking forward to the PR being landed.
I'm not sure exactly who's responsible for this particular corner of things, but as long as no WASM engine supports tail call optimization, it sure seems like EMCC should find a way to return 0 for __has_cpp_attribute(clang::musttail): https://godbolt.org/z/n3cPj3185
@brianosman, let's discuss that further on https://reviews.llvm.org/D131990.
Is there any progress on this? It's hard to find anything more recent than this past summer, and it's hard to tell how close the browsers are to supporting this or which feature stage we're in now (though I assume we're still at the stage of getting major browser support).
The phase 4 requirements are:

1. Two or more Web VMs implement the feature.
2. At least one toolchain implements the feature.
3. The formalization and the reference interpreter are usually updated (though these two can be done as part of step 3 at the Working Group chair's discretion).
4. The Community Group has reached consensus in support of the feature.
Is there anything left to do besides get another web engine implementation?