New example for Walt Explorer

jtiscione commented 6 years ago

I added an example in Walt Explorer that shows how array access works. You click and drag with the mouse, and it draws waves across the canvas. It animates with ordinary JS while you're holding down a key, so you see the (small) speed difference. I also fixed issue #127 with the canvas tab possibly not being mounted.

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 237

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 99.621%

Totals
Change from base Build 234:	0.0%
Covered Lines:	1925
Relevant Lines:	1925

💛 - Coveralls

ballercat commented 6 years ago

Just ran it locally, really cool. Thank you so much, this is a cool demo to have!

jtiscione commented 6 years ago

It's a good demo of how memory access works with ArrayBuffers and WebAssembly.Memory. (The memory is exported here; I couldn't figure out how to get importing memory to work. Maybe the docs for WebAssembly.Memory were last updated right before some massive security hole got patched.)
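For anyone following along, here is a minimal sketch of that memory-access idea in plain JavaScript. It uses only the standard WebAssembly.Memory constructor and typed-array views. No Walt module is involved, and the variable names are made up for illustration. The byte-level check assumes a little-endian host, which is the usual case.

```javascript
// Sketch only: standard WebAssembly.Memory API, no compiled module involved.
// Sizes are counted in 64 KiB pages.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 2 });

// Typed arrays are views over the same bytes a module would read and write.
const words = new Int32Array(memory.buffer);
words[0] = 42;

// A second view over the same buffer sees the same data.
const bytes = new Uint8Array(memory.buffer);
const lowByte = bytes[0]; // low byte of 42 on a little-endian host

// Growing the memory detaches the old ArrayBuffer, so views must be recreated.
const previousPages = memory.grow(1);
const staleLength = words.length; // the old view is now empty
const fresh = new Int32Array(memory.buffer);
const preserved = fresh[0]; // contents survive the grow
```

The detaching behaviour is the part that tends to bite: any view created before a grow has to be rebuilt from memory.buffer afterwards.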

The performance needs a look, though. It runs pure JS if you hold down a key while it's running, which you'd think would slow it down, but it doesn't, as you can see in the console output. (The AssemblyScript guys were having problems with performance too.)

Emscripten compiles the version I wrote in C down to WASM which somehow runs 50% faster than JS. This is just a bunch of loops over an array. I made sure they were the same across JS, Walt, and C. I'm squinting at them and trying to figure out how they could go faster. WASM isn't laced with sorcery like x86, so a lot of standard C compiler tricks aren't available to it. It's not using an autovectorizer for example, which would easily seize on this code and be the obvious explanation for a speedup. What is Emscripten doing differently? Is it possible to get that speed from modifying the Walt code?

Can any arbitrary WASM or WAT file be produced by feeding the appropriate source to the Walt compiler? Or is Walt capable of producing only a subset? If the reverse mapping from WAT files to Walt files is one-to-many (and not ever one-to-zero) then it should be possible to hand-optimize it.

ballercat commented 6 years ago

Thanks for bringing this to my attention. Looking at the loops alone, Emscripten has a better way of encoding the iterations over i.

        get_local $l10
        i32.const -1
        i32.add
        tee_local $l10
        br_if $L3

This bit is in just about every loop. From the looks of it, all the for-loops are really "while-loops". First, it counts backwards, so the condition is always "is this zero" instead of less-than/greater-than. It also uses one less instruction per loop thanks to the tee_local: tee sets and returns the local variable, whereas the current compiler always emits an extra get_local.

You could mimic the count-down-to-zero trick with a while loop, while(value) { ... ; value -= 1; }, but tee_local is not currently exposed in the syntax. It would need an assignment expression (not a statement), so something like this:

let value : i32 = width;
while((value -= 1)) {
   // logic
}

Not currently implemented, but trivial-ish.
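To make the loop shapes concrete, here is a sketch in plain JavaScript rather than Walt, since the assignment-expression form is not implemented there yet. The function names are hypothetical. One thing worth noticing: with the decrement inside the test, the counter is reduced before the body runs, so starting at the element count would skip one iteration. The third function starts one higher to compensate.

```javascript
// Conventional count-up loop: the exit test is a less-than comparison.
function sumCountingUp(values) {
  let total = 0;
  for (let i = 0; i < values.length; i += 1) {
    total += values[i];
  }
  return total;
}

// Count-down loop: the exit test is just "is the counter zero?".
function sumCountingDown(values) {
  let total = 0;
  let remaining = values.length;
  while (remaining) {
    remaining -= 1;
    total += values[remaining];
  }
  return total;
}

// Assignment used as the loop condition (the shape tee_local would enable).
// The decrement runs before the test, so start one past the element count.
function sumWithAssignmentTest(values) {
  let total = 0;
  let value = values.length + 1;
  while ((value -= 1)) {
    total += values[value - 1];
  }
  return total;
}
```

All three visit every element exactly once; only the iteration order and the shape of the exit test differ.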

There are other differences from looking at the wasm output, like the heavy use of selects instead of if/thens. Ternary statements in Walt compile to selects though so you could mimic that as well.

In general I would advise not calling into JS during step() at all (getCanvasWidth() etc.); I'd bet that would give you the most bang for your buck.
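A rough sketch of that advice, written in plain JavaScript with a stand-in host object instead of real Walt imports. makeHost and makeStep are hypothetical names, and the per-cell update is a placeholder. The point is only that the dimensions are read once up front, so the hot loop never crosses the boundary.

```javascript
// Hypothetical stand-in for the imported JS functions; it counts how often it is called.
function makeHost(width, height) {
  const host = {
    calls: 0,
    getCanvasWidth() { host.calls += 1; return width; },
    getCanvasHeight() { host.calls += 1; return height; },
  };
  return host;
}

// Reads the dimensions once, then returns a step() that never calls back into the host.
function makeStep(host, cells) {
  const width = host.getCanvasWidth();
  const height = host.getCanvasHeight();
  const count = width * height;
  return function step() {
    for (let i = 0; i < count; i += 1) {
      cells[i] += 1; // placeholder for the real per-cell update
    }
  };
}

const host = makeHost(4, 3);
const cells = new Int32Array(12);
const step = makeStep(host, cells);
step();
step();
```

The trade-off is that cached dimensions go stale if the canvas is resized, so a real version would need to rebuild the step function in that case.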

Thanks for bringing this up! There are sneaky improvements to be made here. The tee_local would definitely be useful; it's basically assignment as an expression. Too bad it's only available for locals, though.

jtiscione commented 6 years ago

This is sort of intriguing... I'll have to try some of these.

I wonder how Emscripten determines whether or not the loop ordering is relevant. First of all it would have to recognize that the arrays (or the portions being iterated over) don't overlap. If the "vel" array partially overlapped the "u" array in the heap, then it couldn't make that assumption. But the arrays aren't even carrying their lengths around. It sounds like a lot of work to avoid emitting one instruction.
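The overlap concern can be shown with a small JavaScript sketch. The helper names are hypothetical and this says nothing about what Emscripten actually checks. It only demonstrates that when a loop reads and writes overlapping regions of one buffer, reversing the iteration order changes the result.

```javascript
// True when [startA, startA + lengthA) and [startB, startB + lengthB) share any index.
function rangesOverlap(startA, lengthA, startB, lengthB) {
  return startA < startB + lengthB && startB < startA + lengthA;
}

// Copy each element one slot to the right, iterating forwards.
function shiftForward(heap, start, count) {
  for (let i = 0; i < count; i += 1) {
    heap[start + i + 1] = heap[start + i];
  }
}

// The same copy, iterating backwards.
function shiftBackward(heap, start, count) {
  for (let i = count - 1; i >= 0; i -= 1) {
    heap[start + i + 1] = heap[start + i];
  }
}

const forward = Int32Array.from([1, 2, 3, 4, 0]);
const backward = Int32Array.from([1, 2, 3, 4, 0]);
shiftForward(forward, 0, 4);   // smears the first element across the range
shiftBackward(backward, 0, 4); // keeps the original values, shifted by one
```

Here the source range (offset 0, length 4) and the destination range (offset 1, length 4) overlap, so the two loops disagree. With disjoint ranges they would produce the same buffer, and the order could be flipped freely.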

C-style languages don't demand that you tell them this stuff. High Performance Fortran wants to see compiler directives above loops and functions, telling it exactly how to distribute the work, or it won't optimize anything. It also has a FORALL statement that tells the compiler whether or not a function or a loop body is "pure". That's the only reason people are still using Fortran anywhere.

Maybe a goofy syntax like for(i in [0...32]) {...} could be introduced that specifically does not guarantee a loop ordering. But if it were me I'd probably just tell people to write loops that go backwards.

MaxGraey commented 6 years ago

@jtiscione Hmm, in my tests the AssemblyScript version has performance similar to C: approximately 2-3 ms per iteration.

AssemblyScript port: https://webassembly.studio/?f=6nwdazfd2le

EDIT wasm sizes: C (3 kB), AssemblyScript (~800 bytes)

jtiscione commented 6 years ago

The AssemblyScript guys wrote an article about their struggle to get it to go fast. That article is a year old now, but it looks like they were also reverse engineering Emscripten's output.

MaxGraey commented 6 years ago

Actually, this article isn't from the AssemblyScript core team; it was written by Apurva Raman. As I remember, he made just one PR for AS. Anyway, this article is pretty old =)

jtiscione commented 6 years ago

If the AssemblyScript version has similar performance to C and is smaller, it's probably better to reverse engineer that one instead.

In terms of WAT, AssemblyScript generated 352 lines, and Emscripten generated 445 lines.

JobLeonard commented 5 years ago

By the way, if you want touchscreen support you can add these lines:

    // fake mouse events when triggering touch events
    canvas.addEventListener("touchstart", function (e) {
      e.preventDefault();
      var touch = e.touches[0];
      var mouseEvent = new MouseEvent("mousedown", {
        clientX: touch.clientX,
        clientY: touch.clientY
      });
      canvas.dispatchEvent(mouseEvent);
    }, false);

    canvas.addEventListener("touchend", function (e) {
      e.preventDefault();
      var mouseEvent = new MouseEvent("mouseup", {});
      canvas.dispatchEvent(mouseEvent);
    }, false);

    canvas.addEventListener("touchmove", function (e) {
      e.preventDefault();
      var touch = e.touches[0];
      var mouseEvent = new MouseEvent("mousemove", {
        clientX: touch.clientX,
        clientY: touch.clientY
      });
      canvas.dispatchEvent(mouseEvent);
    }, false);

jtiscione commented 5 years ago

In case anyone is interested... @ballercat @JobLeonard @MaxGraey I have a version of this code at https://github.com/jtiscione/webassembly-wave that's comparing plain JS, Emscripten, Walt, and AssemblyScript (and has touchscreen support now) showing the frames per second. On this MacBook using Chrome, I'm seeing 115 FPS for JavaScript, 145 FPS for Emscripten-generated WebAssembly, and 100 FPS for Walt-generated WebAssembly. My AssemblyScript version is doing 50 FPS, which can't be right. I'm using code based on the code posted by @MaxGraey at https://webassembly.studio/?f=3v88jbgtnd2, and it's running at half the speed of Walt. The Pointer class used in that version appears to be hampering it: https://github.com/jtiscione/webassembly-wave/blob/master/as/assembly/index.ts. I'm still working on that one since I'm not as familiar with AssemblyScript.
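For reference, a frames-per-second figure like the ones above can be gathered with a small counter. This is a hedged sketch, not the code from the linked repository. createFpsCounter is a hypothetical helper that takes timestamps in milliseconds, such as the ones requestAnimationFrame passes to its callback.

```javascript
// Hypothetical helper: reports frames per second over windows of `windowMs` milliseconds.
function createFpsCounter(windowMs = 1000) {
  let windowStart = null;
  let frames = 0;
  let fps = 0;
  return function tick(nowMs) {
    if (windowStart === null) {
      windowStart = nowMs; // the first call only starts the clock
      return fps;
    }
    frames += 1;
    const elapsed = nowMs - windowStart;
    if (elapsed >= windowMs) {
      fps = (frames * 1000) / elapsed;
      frames = 0;
      windowStart = nowMs;
    }
    return fps; // last completed window, or 0 before the first one finishes
  };
}

// Simulated frames every 10 ms for one second.
const tick = createFpsCounter();
let measured = 0;
for (let t = 0; t <= 1000; t += 10) {
  measured = tick(t);
}
```

In a browser, tick would be called from inside the animation callback with the timestamp it receives. Averaging over a full window smooths out the jitter of individual frames.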

MaxGraey commented 5 years ago

@jtiscione I forked your repo and just changed the build script to:

"asbuild:optimized": "asc assembly/index.ts -b build/optimized.wasm -t build/optimized.wat --sourceMap --validate --noAssert -O3"

This changed the FPS from 55 to 140, and in Firefox AS is even faster than Emscripten :)

[image: bench]

MaxGraey commented 5 years ago

btw, in the next release of AS we won't need to use the builtin select anymore. The compiler will properly optimize ternary operators to select by itself.

jtiscione commented 5 years ago

Ah, that must be what WebAssembly Studio is doing. I used the scaffold generated by asinit and wasn't seeing a difference between untouched and optimized WebAssembly, IIRC. After merging your PR I'm getting 135 FPS from AssemblyScript now. So on Chrome, it's still not as fast as Emscripten (135 vs 145 FPS), but yeah, in Firefox it's actually beating Emscripten (165 to 155), which is pretty amazing. I'd use AS for anything "serious", but I like Walt because it's such a little project and looks easy to fork. Still, its optimizer needs a little attention. How does the optimizer in AS work: does it transform precompiled WASM, or is it in the guts of the compiler?

MaxGraey commented 5 years ago

What does "a little project and looks easy to fork" mean?) If you compare the installed node_modules, AS is approx. 7.7 MB and has fewer dependencies. But currently it installs from GitHub, which is always slower than installing from npm, if you're not using yarn of course.

jtiscione commented 5 years ago

It's "little" in terms of a feature set. If I wanted to make a special domain-specific language, I'd expect the AssemblyScript project to have a lot of infrastructure related to the full TypeScript specification that wouldn't be relevant for my brainfuck compiler, or whatever it is. (I never look in node_modules- I just watch for strange warnings about deprecated ciphers.)

ballercat / walt

New example for Walt Explorer #174
