Please Support Arbitrary Labels and Gotos.

oridb commented 8 years ago

I'd like to point out that I haven't been involved in the web assembly effort, and I'm not maintaining any large or widely used compilers (just my own toy-ish language, minor contributions to the QBE compiler backend, and an internship on IBM's compiler team), but I ended up getting a bit ranty, and was encouraged to share more widely.

So, while I'm a bit uncomfortable jumping in and suggesting major changes to a project I haven't been working on... here goes:

My Complaints:

When I'm writing a compiler, the first thing that I'd do to with the high level structure -- loops, if statements, and so on -- is validate them for semantics, do type checking and so on. The second thing I do with them is just throw them out, and flatten to basic blocks, and possibly to SSA form. In some other parts of the compiler world, a popular format is continuation passing style. I'm not an expert on compiling with continuation passing style, but it neither seems to be a good fit for the loops and scoped blocks that web assembly seems to have embraced.

I'd like to argue that a flatter, goto based format would be far more useful as a target for compiler developers, and would not significantly hinder the writing of a usable polyfill.

Personally, also I'm not a big fan of nested complex expressions. They're a bit clunkier to consume, especially if inner nodes can have side effects, but I don't strongly object to them as a compiler implementer -- The web assembly JIT can consume them, I can ignore them and generate the instructions that map to my IR. They don't make me want to flip tables.

The bigger problem comes down to loops, blocks, and other syntactic elements that, as an optimizing compiler writer, you try very hard to represent as a graph with branches representing edges; The explicit control flow constructs are a hindrance. Reconstructing them from the graph once you've actually done the optimizations you want is certainly possible, but it's quite a bit of complexity to work around a more complex format. And that annoys me: Both the producer and the consumer are working around entirely invented problems which would be avoided by simply dropping complex control flow constructs from web assembly.

In addition, the insistence of higher level constructs leads to some pathological cases. For example, Duff's Device ends up with horrible web assembly output, as seen by messing around in The Wasm Explorer. However, the inverse is not true: Everything that can be expressed in web assembler can be trivially converted to an equivalent in some unstructured, goto based format.

So, at the very least, I'd like to suggest that the web assembly team add support for arbitrary labels and gotos. If they choose to keep the higher level constructs, it would be a bit of wasteful complexity, but at least compiler writers like me wold be able to ignore them and generate output directly.

Polyfilling:

One of the concerns I have heard when discussing this is that the loop and block based structure allows for easier polyfilling of web assembly. While this isn't entirely false, I think that a simple polyfill solution for labels and gotos is possible. Whiie it might not be quite as optimal, I think that it's worth a little bit of ugliness in the bytecode in order to avoid starting a new tool with built in technical debt.

If we assume an LLVM (or QBE) like syntax for web assmembly, then some code that looks like:

int f(int x) {
    if (x == 42)
        return 123;
    else
        return 666;
}

might compile to:

 func @f(%x : i32) {
    %1 = test %x 42
jmp %1 iftrue iffalse

 L0:
    %r =i 123
jmp LRet
 L1:
    %r =i 666
jmp LRet
 Lret:
    ret %r
 }

This could be polyfilled to Javascript that looks like:

function f(x) {
    var __label = L0;
    var __ret;

    while (__label != LRet) {
        switch (__label) {
        case L0:
            var _v1 = (x == 42)
            if (_v1) {__lablel = L1;} else {label = L2;}
            break;
        case L1:
            __ret = 123
            __label = LRet
            break;
        case L2;
            __ret = 666
            __label = LRet
            break;
        default:
            assert(false);
            break;
    }
}

Is it ugly? Yeah. Does it matter? Hopefuly, if web assembly takes off, not for long.

And if not:

Well, if I ever got around to targeting web assembly, I guess I'd generate code using the approach I mentioned in the polyfill, and do my best to ignore all of the high level constructs, hoping that the compilers would be smart enough to catch on to this pattern.

But it would be nice if we didn't need to have both sides of the code generation work around the format specified.

rossberg commented 6 years ago

@Heimdell, support for some form of delimited continuations (a.k.a. "stack switching") is on the road map, which should be enough for almost any interesting control abstraction. We cannot support undelimited continuations (i.e., full call/cc), though, since the Wasm call stack can be arbitrarily intermixed with other languages, including reentrant calls out to the embedder, and thus cannot assumed to be copyable or movable.

dcecile commented 5 years ago

Reading through this thread, I get the impression that arbitrary labels and gotos has a major hurdle before becoming a feature:

Unstructured control flow makes irreducible control flow graphs possible
Eliminating* any "fast, simple verification, easy, one pass conversion to SSA form"
Opening up the JIT compiler to nonlinear performance
People browsing webpages shouldn't have to suffer delays if the original language compiler can do the up-front work

*although there might be alternatives such as Braun et al's on-the-fly SSA construction algorithm which handles irreducible control flow

If we're still stuck there, and tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.

According to Scheme designer Guy Steele's 1977 paper, Lambda: The Ultimate GOTO, the transformation should be possible, and the performance of tail calls should be able to closely match gotos.

Thoughts?

eira-fransham commented 5 years ago

If we're still stuck there, and tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.

This is essentially what every compiler would do anyway, no-one that I know of is advocating for unmanaged gotos of the kind that cause so many problems in the JVM, just for a graph of typed EBBs. LLVM, GCC, Cranelift and the rest all have a (possibly-irreducible) SSA-form CFG as their internal representation and the compilers from Wasm to native have the same internal representation, so we want to preserve as much of that information as possible and reconstruct as little of that information as possible. Locals are lossy, since they're no longer SSA, and Wasm's control flow is lossy, since it's no longer an arbitrary CFG. AFAIK having Wasm be an infinite-register SSA register machine with embedded fine-grained register liveness information would probably be the best for codegen but code size would bloat, a stack machine with control flow modelled on an arbitrary CFG is probably the best middle-ground. I might be wrong about code size with a register machine though, it might be possible to encode it efficiently.

The thing is about irreducible control flow is that if it's irreducible on the front-end it's still irreducible in wasm, the relooper/stackifier conversion doesn't make the control flow reducible, it just converts the irreducibility to be dependent on runtime values. This gives the backend less information and so it can produce worse code, the only way to produce good code for irreducible CFGs right now is to detect the patterns emitted by relooper and stackifier and convert them back to an irreducible CFG. Unless you're developing V8, which AFAIK only supports reducible control flow, supporting irreducible control flow is purely a win - it makes both frontends and backends way simpler (frontends can just emit code in the same format they internally store it as, backends don't have to detect patterns) while producing better output in the case that the control flow is irreducible and output that is just as good or better in the usual case that control flow is reducible.

Plus it would allow GCC and Go to start producing WebAssembly.

I know V8 is an important component of the WebAssembly ecosystem but it seems to be the only part of that ecosystem that benefits from the current control flow situation, all the other backends that I'm aware of convert to a CFG anyway and are unaffected by whether WebAssembly can represent irreducible control flow or not.

neelance commented 5 years ago

kazimuth commented 5 years ago

Couldn't v8 just incorporate relooper in order to accept input CFGs? It seems like large chunks of the ecosystem are blocked on implementation details of v8.

graph commented 5 years ago

Just for reference, I noticed switch statements in c++ are very slow in wasm. When I profiled code I had to convert them to other forms which operated much more quickly to do image processing. And it was never a problem on other architectures. I really would like goto for performance reasons.

lars-t-hansen commented 5 years ago

@graph, could you provide more details about how "switch statements are slow"? Always looking for an opportunity to improve performance... (If you don't want to bog down this thread, email me directly, lhansen@mozilla.com.)

graph commented 5 years ago

I'll post here as this applies to all browsers. Simple statements like this when compiled with emscripten where faster when I converted to if statements.

for(y = ....) {
    for(x = ....) {
        switch(type){
        case IS_RGBA:....
         ....
        case IS_BGRA
        ....
        case IS_RGB
        ....
....

I assume the compiler was converting a jump table to whatever wasm supports. I didn't look into generated assembly so can't confirm.

I know a couple unrelated to wasm things that can be optimized for image processing on the web. I already submitted it via "feedback" button in firefox. If you are interested let me know and I'll email you the issues.

kripken commented 5 years ago

@graph A complete benchmark would be very helpful here. In general, a switch in C can turn into a very fast jump table in wasm, but there are corner cases that don't work well yet, that we may need to fix, either in LLVM or in browsers.

In emscripten specifically, how switches are handled changes a lot between the old fastcomp backend and the new upstream one, so if you saw this a while ago, or recently but using fastcomp, it would be good to check on upstream.

lars-t-hansen commented 5 years ago

@graph, If emscripten produces a br_table then the jit will sometimes generate a jump table and sometimes (if it thinks it will be faster) it will search the key space linearly or with an in-line binary search. What it does often depends on the size of the switch. It is of course possible that the selection policy is not optimal... I agree with @kripken, runnable code would be very helpful here if you have some to share.

(Don't know about v8 or jsc, but Firefox currently does not recognize an if-then-else chain as a possible switch, so it's usually not a good idea to open-code switches as long if-then-else chains. The break-even point is probably at no more than two or three comparisons.)

aardappel commented 5 years ago

@lars-t-hansen @kripken @graph it may well be that br_table is currently just very un-optimized as this exchange seems to show: https://twitter.com/battagline/status/1168310096515883008

lars-t-hansen commented 5 years ago

@aardappel, that's curious, benchmarks i ran yesterday did not show this, in firefox on my system the break-even point was at around 5 cases as i remember it and after that br_table was the winner. microbenchmark of course, and with some attempt at even distribution of the lookup keys. if the "if" nest is biased toward the most likely keys so that no more than a couple of tests are needed then the "if" nest will win.

If it can't do the range analysis on the switch value to avoid it then the br_table will also have to do at least one filtering test for the range of the switch, which also eats into its advantage.

aardappel commented 5 years ago

@lars-t-hansen Yes, we don't know his test case, may it had an outlier value. Either way, looks like Chrome has more work to do than Firefox.

graph commented 5 years ago

Im on vacation, hence my lack of replies. Thanks for understanding.

graph commented 4 years ago

@kripken @lars-t-hansen I have run some tests it seems yes it wasm is better now in firefox. There is still some cases where if-else out-performs switch. Here is a case:

Main.cpp

``` #include #include #include class Chronometer { public: Chronometer() { } void start() { mStart = std::chrono::steady_clock::now(); } double seconds() { std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now(); return std::chrono::duration_cast>(end - mStart).count(); } private: std::chrono::steady_clock::time_point mStart; }; int main() { printf("Starting tests!\n"); Chronometer timer; // we want to prevent optimizations based on known size as most applications // do not know the size in advance. std::random_device rd; //Will be used to obtain a seed for the random number engine std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd() std::uniform_int_distribution<> dis(100000000, 1000000000); std::uniform_int_distribution<> opKind(0, 3); int maxFrames = dis(gen); int switchSelect = 0; constexpr int SW1 = 1; constexpr int SW2 = 8; constexpr int SW3 = 32; constexpr int SW4 = 38; switch(opKind(gen)) { case 0: switchSelect = SW1; break; case 1: switchSelect = SW2; break; case 2: switchSelect = SW3; break; case 4: switchSelect = SW4; break; } printf("timing with SW = %d\n", switchSelect); timer.start(); int accumulator = 0; for(int i = 0; i < maxFrames; ++i) { switch(switchSelect) { case SW1: accumulator = accumulator*3 + i; break; case SW2: accumulator = (accumulator < 3)*i; break; case SW3: accumulator = (accumulator&0xFF)*i + accumulator; break; case SW4: accumulator = (accumulator*accumulator) - accumulator + i; break; } } printf("switch time = %lf seconds\n", timer.seconds()); printf("accumulated value: %d\n", accumulator); timer.start(); accumulator = 0; for(int i = 0; i < maxFrames; ++i) { if(switchSelect == SW1) accumulator = accumulator*3 + i; else if(switchSelect == SW2) accumulator = (accumulator < 3)*i; else if(switchSelect == SW3) accumulator = (accumulator&0xFF)*i + accumulator; else if(switchSelect == SW4) accumulator = (accumulator*accumulator) - accumulator + i; } printf("if-else time = %lf seconds\n", timer.seconds()); printf("accumulated value: %d\n", accumulator); return 0; } ```

Depending on the value of switchSelect. if-else outperforms. Example output:

Starting tests!
timing with SW = 32
switch time = 2.049000 seconds
accumulated value: 0
if-else time = 0.401000 seconds
accumulated value: 0

As you can see for switchSelect = 32 if-else is way faster. For the other cases if-else is a bit faster. For the case switchSelect = 1 & 0, switch statement is faster.

Test in Firefox 69.0.3 (64-bit)
compiled using: emcc -O3 -std=c++17 main.cpp -o main.html
emcc version: emcc (Emscripten gcc/clang-like replacement) 1.39.0 (commit e047fe4c1ecfae6ba471ca43f2f630b79516706b)

Using latest stable emscripen as of Oct 20 2019. Fresh install ./emcc activate latest.

graph commented 4 years ago

I noticed above there is a typo, but it shouldn't effect the matter that the if-else is faster SW3 case as they are executing same instructions.

again with this going beyond break even point of 5: Interesting that for switchSelect=32 for this case it is similar in speed as if-else. As you can see for 1003 if-else is slightly faster. Switch should win in this case.

Starting tests!
timing with SW = 1003
switch time = 2.253000 seconds
accumulated value: 1903939380
if-else time = 2.197000 seconds
accumulated value: 1903939380

main.cpp

``` #include #include #include class Chronometer { public: Chronometer() { } void start() { mStart = std::chrono::steady_clock::now(); } double seconds() { std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now(); return std::chrono::duration_cast>(end - mStart).count(); } private: std::chrono::steady_clock::time_point mStart; }; int main() { printf("Starting tests!\n"); Chronometer timer; // we want to prevent optimizations based on known size as most applications // do not know the size in advance. std::random_device rd; //Will be used to obtain a seed for the random number engine std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd() std::uniform_int_distribution<> dis(100000000, 1000000000); std::uniform_int_distribution<> opKind(0, 8); int maxFrames = dis(gen); int switchSelect = 0; constexpr int SW1 = 1; constexpr int SW2 = 8; constexpr int SW3 = 32; constexpr int SW4 = 38; constexpr int SW5 = 64; constexpr int SW6 = 67; constexpr int SW7 = 1003; constexpr int SW8 = 256; switch(opKind(gen)) { case 0: switchSelect = SW1; break; case 1: switchSelect = SW2; break; case 2: switchSelect = SW3; break; case 3: switchSelect = SW4; break; case 4: switchSelect = SW5; break; case 5: switchSelect = SW6; break; case 6: switchSelect = SW7; break; case 7: switchSelect = SW8; break; } printf("timing with SW = %d\n", switchSelect); timer.start(); int accumulator = 0; for(int i = 0; i < maxFrames; ++i) { switch(switchSelect) { case SW1: accumulator = accumulator*3 + i; break; case SW2: accumulator = (accumulator < 3)*i; break; case SW3: accumulator = (accumulator&0xFF)*i + accumulator; break; case SW4: accumulator = (accumulator*accumulator) - accumulator + i; break; case SW5: accumulator = (accumulator << 3) - accumulator + i; break; case SW6: accumulator = (i - accumulator) & 0xFF; break; case SW7: accumulator = i*i + accumulator; break; } } printf("switch time = %lf seconds\n", timer.seconds()); printf("accumulated value: %d\n", accumulator); timer.start(); accumulator = 0; for(int i = 0; i < maxFrames; ++i) { if(switchSelect == SW1) accumulator = accumulator*3 + i; else if(switchSelect == SW2) accumulator = (accumulator < 3)*i; else if(switchSelect == SW3) accumulator = (accumulator&0xFF)*i + accumulator; else if(switchSelect == SW4) accumulator = (accumulator*accumulator) - accumulator + i; else if(switchSelect == SW5) accumulator = (accumulator << 3) - accumulator + i; else if(switchSelect == SW6) accumulator = (i - accumulator) & 0xFF; else if(switchSelect == SW7) accumulator = i*i + accumulator; } printf("if-else time = %lf seconds\n", timer.seconds()); printf("accumulated value: %d\n", accumulator); return 0; } ```

Thank you guys for taking a look into these test cases.

aardappel commented 4 years ago

That is a very sparse switch though, which LLVM should convert to the equivalent of a set of if-then's anyway, but apparently it does so in a way that's less efficient than the manual if-thens. Have you tried running wasm2wat to see how these two loops differ in code?

This also depends strongly on this test using the same value on each iteration. This test would be better if it cycled thru all values, or better yet, randomly picked from them (if that can be done cheaply).

Better yet, the real reason people use switch for performance is with a dense range, so you can guarantee it is actually using br_table underneath. Seeing at how many cases br_table is faster than if would be the most useful thing to know.

graph commented 4 years ago

The switch in the tight loops were used because it was cleaner code over performance. But for wasm the performance impact was too large so it was converted to uglier if-statements. For image processing in alot of my use cases if I want more performance out of a switch, I would move the switch outside the loop, and simply have copies of the loop for each case. Usually the switch is just switching between some form of pixel format, color format, encoding and etc... And in many cases the constants are computed via defines or enums and not linear. I see now that my issue is not related to goto design. I just had an incomplete understanding about what was happening for my switch statements. I hope my notes is useful to browser devs reading this to optimize wasm for image processing in these cases. Thank you.

I never thought goto can be such a heated debate 😮 . I'm in the boat of every language should have a goto 😁 . Another reason to add goto is it reduces the complexity for the compiler to compile to wasm. I'm pretty sure that's mentioned above somewhere. Now I have nothing to complain about 😞 .

learnforpractice commented 4 years ago

Any further progress there?

graph commented 4 years ago

due to the heated debate I would assume some browser would add support for goto as a non-standard bytecode extension. Then maybe GCC can enter the game as supporting a non-standard version. Which I don't think is good overall but will allow more compiler competition. Has this been considered?

binji commented 4 years ago

There hasn't been much progress lately, but you may want to look at the funclets proposal.

vshymanskyy commented 4 years ago

@graph to me, your suggestion sounds like "let's break everything, and hope for the better". It doesn't work like that. There are MANY benefits from the current WebAssembly structure (that are not obvious, unfortunately). Try diving deeper into the philosophy of wasm.

Allowing "arbitrary Labels and Gotos" will bring us back to the (ancient) times of non-verifiable bytecode. All the compilers will just switch to a "lazy way" of doing things, instead of "doing it right".

It's clear that wasm in it's current state has some major omissions. People are working on filling the gaps (like the one mentioned by @binji ), but I don't think the "global wasm structure" needs to be reworked. Just my humble opinion.

sunfishcode commented 4 years ago

@vshymanskyy The funclets proposal, which provides functionality equivalent to arbitrary labels and gotos, is fully validatable, in linear time.

eira-fransham commented 4 years ago

I should also mention that in our linear-time Wasm compiler, we internally compile all Wasm control flow into a funclets-like representation, which I have some information about in this block post and the conversion from Wasm control flow to this internal representation is implemented here. The compiler gets all its type information from this funclets-like representation, so suffice to say it is trivial to validate the type-safety of it in linear time.

I think that this misconception that irreducible control flow cannot be validated in linear time stems from the JVM, where irreducible control flow must be executed using the interpreter instead of being compiled. This is because the JVM doesn't have any way to represent type metadata for irreducible control flow, and so it cannot do the conversion from stack machine to register machine. "Arbitrary gotos" (i.e. jump to byte/instruction X) is not verifiable at all, but separating a function into typed blocks, which can then be jumped between in an arbitrary order, is no harder to verify than separating a module into typed functions, which can then be jumped between in an arbitrary order. You do not need jump-to-byte-X-style untyped gotos to implement any useful patterns that would be emitted by compilers like GCC and LLVM.

cnlohr commented 4 years ago

I just love the process here. Side A explains why this is needed in specific applications. Side B says they're doing it wrong, but offers no support for that application. Side A explains how none of B's pragmatic arguments hold water. Side B doesn't want to deal with it because they think side A is doing it wrong. Side A is trying to accomplish a goal. Side B says that's the wrong goal, calling it lazy or brutish. The deeper philosophical meanings are lost on side A. The pragmatic is lost on side B, as they claim to have some sort of higher moral ground. Side A sees this as an amoral mechanistic operation. Ultimately, side B generally remains in control of the spec for better or for worse, and they've gotten incredible amount done with their relative purity.

Honestly, I just stuck my nose in here because years ago, I was trying to make a TinyCC port to WASM so I could run a development environment on an ESP8266 targeting the ESP8266. I only have ~4MB of storage, so including re-looper and switch to an AST, as well as many other changes is out of the question. (Side-note: How is relooper the only thing like relooper? It's sooo awful and no one's rewritten that sucker in In C!?) Even if it were possible at this point, I don't know if I would write a TinyCC target to WASM since it's just not as interesting to me any more.

This thread, though. Holy cow this thread has brought me so much existential joy. To watch a bifurcation in humanity run deeper than democrat or republican, or religion. I feel like if this can ever be resolved. If A can come to live in B's world, or B validate A's claim that procedural programming has its place... I feel like we could solve world peace.

neelance commented 4 years ago

Could someone in charge of V8 confirm in this thread that the opposition against irreducible control flow is not influenced by V8's current implementation?

I am asking because this is what bugs me most. To me it seems like this should be a discussion on the spec level about pros and cons of this feature. It should not at all be influenced by how a particular implementation is currently designed. However, there have been statements that make me believe that V8's implementation is influencing this. Maybe I am wrong. An open statement might help.

verbessern commented 4 years ago

Well, as much as its unfortunate, the current implementations existing so far are so important that the future (presumably longer then the past) is not so important. I was trying to explain that at #1202, that the consistency is more important then the few implementations, but it seems I'm delusional. Good luck of explaining that some development decisions somewhere in some project are not constituting a universal truth, or have to be, by default, assumed correct.

d4tocchini commented 3 years ago

This thread is one canary in the W3C coal mine. Though I have great respect for many W3C individuals, the decision to entrust JavaScript to Ecma International, and not the W3C, was not made without prejudice.

Like @cnlohr, I had hopes for a TCC wasm port, and for good reason;

"Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications." - webassembly.org

Sure, anyone can pontificate why goto is [INSERT JARGON], but how about we prefer standards over opinions. We can all agree POSIX C is a good baseline target, especially given today's langs are either wrought from or benchmarked against C and WASM's homepage headline touts itself a portable compilation target for langs. Sure, some features will be roadmapped like threads and simd. But, to wholly disregard something as fundamental as goto, to not even give it the decency of roadmapping, is not consistent with the WASM's stated purpose and such a stance from the standardization body that greenlit <marquee> is beyond the pale.

According to SEI CERT C Coding Standard Rec. MEM12-C titled, "Consider using a goto chain when leaving a function on error when using and releasing resources";

Many functions require the allocation of multiple resources. Failing and returning somewhere in the middle of this function without freeing all of the allocated resources could produce a memory leak. It is a common error to forget to free one (or all) of the resources in this manner, so a goto chain is the simplest and cleanest way to organize exits while preserving the order of freed resources.

The recommendation then offers an example with the preferred POSIX C solution using goto. The naysayers will point to the note that goto is still considered harmful. Interestingly, this opinion is not embodied in one of those particular coding standards, just a note. Which brings us to the canary, the "considered harmful".

Bottomline, a considering of "CSS Regions" or goto as harmful should only be weighed along with a proposed solution to the problem that such a feature is used for. If removing said "harmful" feature amounts to removing the reasonable use cases without alternative, that's not a solution, that's in fact harmful to the users of the language.

Functions are not zero cost, even in C. If someone offers a replacement to gotos & labels, canihaz please! If someone says I don't need it, how do they know that? When it comes to performance, goto can gives us that little extra, hard to argue to engineers that we don't need performant, easy-to-grok features that have existed since the dawn of the language.

Without a plan to support goto, WASM is a toy compilation target, and that's ok, maybe that's how the W3C sees the web. I hope WASM as a standard reaches higher, out of the 32bit address space, and enters the compilation race. I hope the engineering discourse can get away from "that's not possible..." to let's fast track GCC C extensions like Labels as Values because WASM should be AWESOME. Personally, TCC is considerably more impressive and more useful at this point, without all the wasted pontificating, without the hipster landing page and shiny logo.

rossberg commented 3 years ago

@d4tocchini:

According to SEI CERT C Coding Standard Rec. MEM12-C titled, "Consider using a goto chain when leaving a function on error when using and releasing resources";

Many functions require the allocation of multiple resources. Failing and returning somewhere in the middle of this function without freeing all of the allocated resources could produce a memory leak. It is a common error to forget to free one (or all) of the resources in this manner, so a goto chain is the simplest and cleanest way to organize exits while preserving the order of freed resources.

The recommendation then offers an example with the preferred POSIX C solution using goto. The naysayers will point to the note that goto is still considered harmful. Interestingly, this opinion is not embodied in one those particular coding standards, just a note. Which brings us to the canary, the "considered harmful".

The example given in that recommendation can be directly expressed with labelled breaks, which are available in Wasm. It does not need the extra power of arbitrary goto. (C does not provide labelled break and continue, so has to fall back to goto more often than necessary.)

d4tocchini commented 3 years ago

@rossberg, good point on labelled breaks in that example, but I disagree with your qualitative assumption that C must "fall back". goto is a richer construct than labelled breaks. If C is to be included among the portable compilation targets, and C does not support labelled breaks, that's rather a mute point. Java has labelled breaks/continues whereas Python rejected the proposed feature, and considering both the sun JVM and the default CPython are written in C, wouldn't you agree C as a supported language ought to be higher on the priority list?

If goto is to be so readily dropped from consideration, should the hundreds of uses of goto within emscripten's source be reconsidered as well?

Is there a language that can't be written in C? C as a language should be informing WASM's features. If POSIX C is not possible with today's WASM, then there's your proper roadmap.

pfalcon commented 3 years ago

Not really on the topic of the argument, but to not put shade that the random mistakes are lurking here and there in the argumentation in general:

Python have labelled breaks

Can you elaborate? (Aka: Python doesn't have labelled breaks.)

d4tocchini commented 3 years ago

@pfalcon, yes my bad, I edited my comment to clarify python proposed labelled breaks/continues and rejected it

J0eCool commented 3 years ago

If goto is to be so readily dropped from consideration, should the hundreds of uses of goto within emscripten's source be reconsidered as well?

1) Note how much of that is present in musl libc, not directly in emscripten. (Second most used is tests/third_party) 2) Source level constructs are not the same as bytecode instructions 3) Emscripten is not at the same level of abstraction as the wasm standard, so, no it shouldn't be reconsidered on that basis.

Specifically, it might be useful today to rewrite the gotos out of libc, because then we'd have more control over the resulting cfg than trusting relooper/cfgstackify to handle it well. We haven't because it's a nontrivial amount of work to wind up with wildly divergent code from upstream musl.

Emscripten developers (last I checked) tend to be of the opinion that a goto-like structure would be really nice, for those obvious reasons, so are unlikely to drop it from consideration, even if it takes years to reach an acceptable compromise.

such a stance from the standardization body that greenlit <marquee> is beyond the pale.

This is a particularly asinine statement.

1) We-the-broader-Internet are over a decade away from having made that decision 2) We-the-wasm-CG are an entirely (nearly?) separate group of people from that tag, and are probably individually annoyed by obvious past mistakes as well.

without all the wasted pontificating, without the hipster landing page and shiny logo.

This could have been reworded to "I am frustrated" without running into tone issues.

As this thread shows, these conversations are difficult enough as it is.

cnlohr commented 3 years ago

There's a new level of deeply concerning when you want to rewrite a deeply trusted and understood set of functions for all new just because an environment for their use has to go through extra steps to support it. (though I am still in the firmly please-add-goto camp because I hate being tied to only using one specific compiler)

penzn commented 3 years ago

I think this thread moved way past being productive - it has been running for over four years now and it looks like every possible argument for and against arbitrary gotos has been used here; it should also be noted that none of those arguments are particularly new ;)

There are managed runtimes that chose not to have arbitrary jump labels, which worked out fine for them. Also, there are programming systems where arbitrary jumps are permitted and they are doing well too. In the end, authors of a programming system make design choices and only time really shows whether those choices are successful or not.

Wasm design choices that forbid arbitrary jumps are core to its philosophy. It is unlikely it can support gotos without something like funclets, for the same reasons it does not support pure indirect jumps.

neelance commented 3 years ago

Wasm design choices that forbid arbitrary jumps are core to its philosophy. It is unlikely it can support gotos without something like funclets, for the same reasons it does not support pure indirect jumps.

@penzn Why is the funclets proposal stuck? It exists since October 2018 and it is still in phase 0.

d4tocchini commented 3 years ago

If we were discussing a run-of-the-mill open-source project, I'd fork & be done with it. We’re talking about a far-reaching monopoly standard here. Vigorous community response should be cultivated because we care.

@J0eCool

Note how much of that is present in musl libc, not directly in emscripten. (Second most used is tests/third_party)

Yes, the nod was to how much its used in C in general.

Source level constructs are not the same as bytecode instructions

Of course, what we’re discussing is an internal concern that impacts source level constructs. That's part of the frustration, the black box should not leak its concerns.

Emscripten is not at the same level of abstraction as the wasm standard, so, no it shouldn't be reconsidered on that basis.

The point was that you will find gotos in majority of sizeable C projects, even within WebAssembly toolchain at large. A portable compiler target for languages in general that is not expressive enough to target its own compilers is not exactly consistent with the nature of our enterprise.

Specifically, it might be useful today to rewrite the gotos out of libc, because then we'd have more control over the resulting cfg than trusting relooper/cfgstackify to handle it well.

This is circular. Many above have raised serious unanswered questions regarding the infallibility of such a requirement.

We haven't because it's a nontrivial amount of work to wind up with wildly divergent code from upstream musl.

It is possible to remove gotos, as you said, it’s nontrivial amount of work! Are you suggesting everyone else should wildy diverge code paths because gotos should not be supported?

Emscripten developers (last I checked) tend to be of the opinion that a goto-like structure would be really nice, for those obvious reasons, so are unlikely to drop it from consideration, even if it takes years to reach an acceptable compromise.

A glimmer of hope! I’d be satisfied if goto/label support were taken seriously with a roadmap item + official invitation to get the ball moving, even if years out.

This is a particularly asinine statement.

You’re right. Forgive the hyperbole, I am a bit frustrated. I love wasm, and use it often, but I ultimately see a road of pain in front of me if I want to do anything noteworthy with it, like port TCC. After reading all the comments and articles, I still can't figure if the opposition is technical, philosophical or political. As @neelance expressed,

“Could someone in charge of V8 confirm in this thread that the opposition against irreducible control flow is not influenced by V8's current implementation?

I am asking because this is what bugs me most. [...]

If you guys listen to any of use, take @neelance’s feedback regarding Go 1.11 to heart. That’s hard to argue with. Sure, we can all do the nontrivial dusting of goto, but even then, we take a serious perf hit that can only be fixed with a goto instruction.

Again, forgive my frustration, but if this issue is closed without proper address, then I’m afraid it will send the wrong kind of signal that will only exasperate these kind of community responses and is inappropriate for one of the greatest standards efforts of our field. It goes without saying that I am a huge fan and supporter of all on this team. Thank you!

neelance commented 3 years ago

Here is another real world issue that is caused by missing goto/funclets: https://github.com/golang/go/issues/42979

For this program, the Go compiler currently generates a wasm binary with 18,000 nested blocks. The wasm binary itself has a size of 2.7MB, but when I run it through wasm2wat I get a 4.7GB .wat file. 🤯

I could try to give the Go compiler some heuristic so instead of a single huge jump table it could create some kind of binary tree and then look at the jump target variable multiple times. But is this really how it is supposed to be with wasm?

darkuranium commented 3 years ago

I would like to add that I find it odd how people seem to think that it's perfectly fine if only a single compiler (Emscripten[1]) can realistically support WebAssembly. Reminds me somewhat of the libopus situation (a standard that normatively depends on copyrighted code).

I also find it odd how WebAssembly devs seem to be so vehemently against this, despite just about everyone from the compiler end of things telling them it's required. Remember: WebAssembly is a standard, not a manifesto. And fact is, most modern compilers use some form of SSA + basic blocks internally (or something nearly equivalent, with the same properties), which have no concept of explicit loops[2]. Even JITs use something similar, that's how common it is. The absolute requirement for relooping to happen with no escape hatch of "just use goto" is, to my knowledge[3], unprecedented outside of language-to-language translators --- and even then, only language-to-language translators that target goto-less languages. In particular, I have never heard of this having to be done for any sort of an IR or bytecode, other than WebAssembly.

Perhaps it's time to rename WebAssembly to WebEmscripten (WebScripten?).

As @d4tocchini said, if it wasn't for WebAssembly's (necessary, due to the standarization situation) monopolistic status, it would likely have been forked by now, into something that can reasonably support what the compiler developers already know it needs to support. And no, "just use emscripten" is not a valid counter-argument, because it makes the standard depend on a single compiler vendor. I hope I don't need to tell you why that's bad.

[1]: Talking about C compilers here; I'm not sure if Golang goes via LLVM/Emscripten, or if it does its own thing. [2]: Yes, there is loop detection done in compilers at some stage (to enable loop-based optimizations). But note that unstructured control flow is simply left alone; there is no need for complex reloopers or such. [3]: One could probably find some obscure machine from 1960s with such a requirement, but that's hardly a relevant target in 2020.

EDIT: I forgot to add one thing: You still haven't clarified on whether the issue is technical, philosophical, or political. I suspect the latter, but would gladly be proven wrong (because technical and philosophical issues can be fixed far more easily than political).

conrad-watt commented 3 years ago

Here is another real world issue that is caused by missing goto/funclets: golang/go#42979

For this program, the Go compiler currently generates a wasm binary with 18,000 nested blocks. The wasm binary itself has a size of 2.7MB, but when I run it through wasm2wat I get a 4.7GB .wat file. 🤯

I could try to give the Go compiler some heuristic so instead of a single huge jump table it could create some kind of binary tree and then look at the jump target variable multiple times. But is this really how it is supposed to be with wasm?

This example is really interesting. How does such a simple straight line program generate this code? What's the relationship between the number of array elements and the number of blocks? In particular, should I interpret this as meaning that each array element access requires multiple blocks to be compiled faithfully?

And no, "just use emscripten" is not a valid counter-argument

I think the real counter-argument in this vein would be that another compiler wanting to target Wasm can/must implement their own relooper-like algorithm. Personally, I do think Wasm should eventually have a multi-body loop (close to funclets) or something similar that's a natural target for goto.

neelance commented 3 years ago

@conrad-watt There are several factors that cause each assignment to use several basic blocks in the CFG. One of them is that there is a length check on the slice because the length is not known at compile time. Generally I would say that compilers consider basic blocks as a relatively cheap construct, but with wasm they are somewhat expensive, especially in this particular case.

conrad-watt commented 3 years ago

@neelance in the modified example where the code is split between several functions, the (runtime/compilation) memory overhead is shown to be much lower. Are fewer blocks generated in this case, or is it just that the separate functions mean that engine GC can be more granular?

neelance commented 3 years ago

@conrad-watt It is not even the Go code that is using the memory, but the WebAssembly host: When I instantiate the wasm binary with Chrome 86, my CPU goes to 100% for 2 minutes and the memory usage of the tab peaks at 11.3 GB. This is before the wasm binary / Go code gets executed. It is the shape of the wasm binary that is causing the issue.

conrad-watt commented 3 years ago

That was already my understanding. I'd expect a large number of blocks/type annotations to cause memory overhead specifically during compilation/instantiation.

To try and disambiguate my previous question - if the split version of the code compiles to Wasm with fewer blocks (because of some relooper quirk), that would be one explanation for the reduced memory overhead, and would be a good motivation for adding more general control flow to Wasm.

Alternatively, it may be that the split code results in (roughly) the same total number of blocks, but because each function is separately JIT compiled, the metadata/IR used to compile each function can be more eagerly GC'd by the Wasm engine. A similar issue occured in V8 years ago when parsing/compiling large asm.js functions. In this case, introducing more general control flow to Wasm would not solve the issue.

neelance commented 3 years ago

First I'd like to clarify: The Go compiler is not using the relooper algorithm, because it is inherently incompatible with the concept of switching goroutines. All basic blocks are expressed via a jump table with a bit of fall-through where possible.

I guess there is some exponential complexity growth in Chrome's wasm runtime with regards to the depth of nested blocks. The split version has the same number of blocks but a smaller maximum depth.

In this case, introducing more general control flow to Wasm would not solve the issue.

I agree that this complexity issue can probably be solved at Chrome's end. But I always like to ask the question "Why did this issue exist in the first place?". I would argue that with more general control flow, this issue would have never existed. Also, there is still the significant general performance overhead due to all basic blocks being expressed as jump tables, which I think is unlikely to go away by optimization.

conrad-watt commented 3 years ago

I guess there is some exponential complexity growth in Chrome's wasm runtime with regards to the depth of nested blocks. The split version has the same number of blocks but a smaller maximum depth.

Does this mean that in a straight line function with N array accesses, the final array access will be nested (some constant factor of) N blocks deep? If so, is there a way to reduce this by factoring error-handling code differently? I'd expect any compiler to chug if it has to analyze 3000 nested loops (very rough analogy) so if this is unavoidable for semantic reasons, that would also be an argument for more general control flow.

If the nesting difference is less stark than that, my hunch would be that V8 does almost no GC'ing of metadata during compilation of a single Wasm function, so even if we had something like a tweaked funclets proposal in the language right from the start, the same overheads would still be visible without them doing some interesting GC optimisation.

Also, there is still the significant general performance overhead due to all basic blocks being expressed as jump tables, which I think is unlikely to go away by optimization.

Agree that it's clearly preferable (from a purely technical standpoint) to have a more natural target here.

neelance commented 3 years ago

Does this mean that in a straight line function with N array accesses, the final array access will be nested (some constant factor of) N blocks deep? If so, is there a way to reduce this by factoring error-handling code differently? I'd expect any compiler to chug if it has to analyze 3000 nested loops (very rough analogy) so if this is unavoidable for semantic reasons, that would also be an argument for more general control flow.

The other way around: The first assignment is nested that deeply, not the last. Nested blocks and a single br_table at the top is how a traditional switch statement is expressed in wasm. This is the jump table I mentioned. There are no 3000 nested loops.

If the nesting difference is less stark than that, my hunch would be that V8 does almost no GC'ing of metadata during compilation of a single Wasm function, so even if we had something like a tweaked funclets proposal in the language right from the start, the same overheads would still be visible without them doing some interesting GC optimisation.

Yes, there may also be some implementation that has exponential complexity with regards to the number of basic blocks. But handling basic blocks (even in a large quantity) is what a lot of compilers do all day. For example the Go compiler itself handles this number of basic blocks easily during its compilation, even though they get processed by several optimization passes.

conrad-watt commented 3 years ago

Yes, there may also be some implementation that has exponential complexity with regards to the number of basic blocks. But handling basic blocks (even in a large quantity) is what a lot of compilers do all day. For example the Go compiler itself handles this number of basic blocks easily during its compilation, even though they get processed by several optimization passes.

Sure, but a performance issue here would be orthogonal to how control flow between those basic blocks is expressed in the original source language (i.e. not a motivation for more general control flow in Wasm). To see if V8 is particularly bad here, one could check whether FireFox/SpiderMonkey or Lucet/Cranelift exhibit the same compilation overheads.

neelance commented 3 years ago

I have done some more testing: Firefox and Safari show no issues at all. Interestingly, Chrome is even able to run the code before the intensive process has finished, so it seems like some task not strictly necessary for running the wasm binary is having the complexity issue.

Sure, but a performance issue here would be orthogonal to how control flow between those basic blocks is expressed in the original source language.

I see your point.

I still believe that representing basic blocks not via jump instructions but via a jump variable and a huge jump table / nested blocks is expressing the simple concept of basic blocks in a quite complex way. This leads to performance overhead and a risk of complexity issues such as the one we saw here. I believe that simpler systems are better and more robust than complex systems. I still haven't seen arguments that convince me that the simpler system is a bad choice. I have only heard that V8 would have a hard time implementing arbitrary control flow and my open question to tell me that this statement is wrong (https://github.com/WebAssembly/design/issues/796#issuecomment-623431527) has not been answered yet.

kripken commented 3 years ago

@neelance

Chrome is even able to run the code before the intensive process has finished

It sounds like the baseline compiler Liftoff is ok, and the problem is in the optimizing compiler TurboFan. Please file an issue, or please provide a testcase and I can file one if you prefer.

More generally: Do you think the wasm stack switching plans will be able to solve Go's goroutine implementation issues? That's the best link I can find, but it is quite active now, with a bi-weekly meeting, and several strong use cases that motivate the work. If Go can use wasm coroutines to avoid the big switch pattern then I think arbitrary gotos would not be necessary.

The Go compiler is not using the relooper algorithm, because it is inherently incompatible with the concept of switching goroutines.

It's true that it can't be applied by itself. However, we have good results with using wasm structured control flow + Asyncify. The idea there is to emit normal wasm control flow as much as possible - ifs, loops, etc., without a single big switch - and to add instrumentation on top of that pattern to handle unwinding and rewinding the stack. This leads to fairly small code size, and non-stack switching code can run at basically full speed, while an actual stack switch can be somewhat slower (so this is good for the case where stack switches are not constantly happening on each loop iteration etc.).

I'd be very happy to experiment with that on Go, if you're interested! This would obviously not be as good as built-in stack switching support in wasm, but it could be better than the big switch pattern already. And it would be easier to switch to the built-in stack switching support later. Concretely, how this experiment could work is to make Go emit normally structured code, without worrying about stack switching at all, and just emit a call to a special maybe_switch_goroutine function at appropriate points. The Asyncify transform would take care of all the rest basically.

WebAssembly / design