gergoerdi / functional-mos6502-web-performance

https://unsafeperform.io/blog/2022-07-02-a_small_benchmark_for_functional_languages_targeting_web_browsers/

Roc Implementation #2

Open bhansconnect opened 2 years ago

bhansconnect commented 2 years ago

Not an actual issue, but I thought I would share this implementation with you. Roc is a new functional programming language that targets low-level performance. The language is so new that I had to fix a WebAssembly bug in the compiler to get this working.

Anyway, here is my version of the repo and an implementation in Roc, if you are interested in looking at it.

The performance results for Roc are weird. I have not figured out why yet, but when running the Chrome profiler, the Roc implementation is extremely fast. When I turn off the Chrome profiler, it is not nearly as fast. It feels like the browser is not running wasm at full speed by default. Not sure why.

Anyway, here are some timing results with the Chrome profiler running:

```
JavaScript: min: 1ms max: 1.7000000029802322ms avg: 1.276000000052154ms
Roc: min: 0.09999999776482582ms max: 0.30000000074505806ms avg: 0.18499999988824128ms
```

And here are the results without the profiler running:

```
JavaScript: min: 0.7999999970197678ms max: 1.300000000745058ms avg: 0.935ms
Roc: min: 4.699999999254942ms max: 5.100000001490116ms avg: 4.94600000012666ms
```

The same wasm file was used both times; it is 154K. Definitely not small, but I think a lot of it could be pruned, given that Roc is very new and the compiler was optimizing for speed instead of size.

So this is the new fastest version... at least when the browser runs the code at full speed.

bhansconnect commented 2 years ago

So I think I figured out why profiling mode makes it faster: https://stackoverflow.com/questions/56283368/difference-between-regular-and-performance-profiler-mode-in-google-chrome/56299691#56299691

It looks like the dev tools hooks in general slow down execution. If you set measureAll to run after setup while dev tools is closed, everything is high speed there as well. So I guess that is a general browser profiling tip: either run the code with dev tools closed, or with profiling on.

gergoerdi commented 2 years ago

The dev tools hooks slowing down execution is a great catch -- shows you how much I know about web dev! I'll definitely look into fixing that. I think the easiest fix is to change the main script to put the timing results into the DOM instead of logging to the console, so it can simply run when the page is opened.
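A rough sketch of what that change could look like: the timing results get appended to the page instead of being logged, so the benchmark runs cleanly with dev tools closed. The function names here are illustrative, not the benchmark's actual API; the output format mirrors the `min/max/avg` lines reported earlier in this thread.

```javascript
// Hypothetical sketch: report timings via the DOM instead of console.log,
// so the benchmark can run at full speed with dev tools closed.

// Pure formatting, matching the min/max/avg style used in this thread.
function formatResult(name, times) {
  const min = Math.min(...times);
  const max = Math.max(...times);
  const avg = times.reduce((a, b) => a + b, 0) / times.length;
  return `${name}: min: ${min}ms max: ${max}ms avg: ${avg}ms`;
}

// Append one result line to the page body.
function reportResult(name, times) {
  const line = document.createElement("pre");
  line.textContent = formatResult(name, times);
  document.body.appendChild(line);
}
```

Keeping the formatting separate from the DOM insertion also makes the result lines easy to test outside a browser.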

As for your Roc implementation: since I don't know Roc, can you tell me whether it supports a similar level of abstraction as the existing implementations? What I mean is that your linked code does quite a few things with more "elbow grease" from the programmer:

BTW, about the comment here: https://github.com/bhansconnect/functional-mos6502-web-performance/blob/5845f55bf6bfb109e8e94dc3abdb4a2388d46b9d/implementations/roc/emulator.roc#L120-L123 this is, in fact, exactly how it works on the 6502: the stack pointer wraps around as an 8-bit value, so you basically have a ring from 0x0100 to 0x01ff. It is up to the software to use it in a sensible way.

bhansconnect commented 2 years ago

Yeah, I didn't know about the dev tools overhead either. I only recognized it because my performance results didn't seem to make sense. It seems to affect wasm-based backends the most; it made Asterius 20% faster for me.

As for the Roc implementation, the reason it is not using any nice abstractions is that I am really a compiler engineer who happens to work on a functional programming language. In other words, I am not a very good functional programmer. As such, I more or less directly ported the JavaScript version to Roc. It was the easiest implementation for me to read and understand.

To answer your questions more specifically, I chatted with people who know functional programming a lot better than I do. Anyway, here are some answers:

> You don't use mutable references or lenses to access the components of CPU, instead, manually mirroring its fields in a sum type

Roc is 100% immutable, so there are no mutable references. I currently have to mirror the fields with a sum type because lenses are not fully implemented: currently they only work for getting fields, not for updating them. In the future, you could just use lenses for this.

> You aren't parameterized by the implementation of memory

I didn't parameterize on memory because I didn't see that as important. It just seemed kinda strange to parameterize on memory when you know memory is just an array, so I changed it to what I saw as more direct and simpler. That being said, it is totally doable. The normal way would probably be to use an Effect and have the platform completely control the memory implementation. We also have "Abilities", which can be seen as something similar to Rust traits or Haskell type classes. You could also just directly accept a lambda that takes an opaque memory type and call the lambda each time to read/write memory.

> You're writing out state propagation by hand instead of using something like a monad

So this again was totally due to me copying the JS example. In a more proper Roc program, you would probably use Effect, which can essentially be seen as IO in Haskell. Roc is definitely a simpler language than Haskell, with fewer mathematical concepts, but it still has a lot of overlap in some areas. It is highly influenced by the Elm programming language, which has been around much longer, if you have ever heard of it.

The start of your Haskell driver function could theoretically look like this if ported to Roc:

```roc
run : JsArrayBuffer -> Effect I64
run buf =
    arr <- Js.toUint8Array buf |> Effect.after
    bs = byteStringFromJSUint8Array arr
    mem <- V.new 0x10000 |> Effect.after
    # …etc
```

I might have to try rewriting this Roc version in a more properly functional way. It would be a good way for me to get better at pure functional programming and some of its normal patterns. That being said, a lot of the extra wrapping with ReaderT, MonadMachine, and MonadIO is quite confusing to me currently.

EDIT: This version is probably at least slightly closer to what you were looking for. It uses Effect to interact with the memory and the memory functions are completely defined by a few javascript functions. So a JS author can pick any implementation of their choosing. It runs a tiny bit slower than my version shared above, but only by about 0.02ms.

For example, why wrap every register in an IORef (`regA <- newIORef 0x00`) instead of just creating a new version of the struct with modified fields every time you update a register?

gergoerdi commented 2 years ago

I would expect IORefs or their equivalent to actually be easier to compile to efficient JS, since they can just be vars. Also, if I used a StateT CPU with an immutable CPU record, then writing the baseline JavaScript version of that by hand would have been hell...

I think to make it a fair comparison, making memory access parametrisable is a must. First of all, that feature is what would make this 6502 core usable in pretty much any real-world project, since without being able to implement custom memory-mapped IO, you can just see the CPU ticking without getting anything useful out of it. This also means the individual memory access implementations should be written in the high-level functional programming language itself, since they could contain more complex logic than just writing to an array (e.g. sound synthesis for a SID track player).
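To make the idea concrete, here is a rough JavaScript sketch of a core parameterised over memory access, so the host can plug in anything from a plain array to memory-mapped IO. All names (`makeCpu`, `readMem`, `writeMem`, `loadA`, `storeA`) are illustrative, not the benchmark's actual API:

```javascript
// Hypothetical sketch: a CPU core that never touches memory directly,
// only through caller-supplied read/write callbacks.
function makeCpu({ readMem, writeMem }) {
  return {
    // e.g. a load step, reading through the callback
    loadA(addr) { return readMem(addr) & 0xff; },
    // e.g. a store step, writing through the callback
    storeA(addr, value) { writeMem(addr, value & 0xff); },
  };
}

// One possible memory implementation: a plain 64K array. A real-world
// host could instead route certain addresses to IO devices here.
const ram = new Uint8Array(0x10000);
const cpu = makeCpu({
  readMem: (addr) => ram[addr & 0xffff],
  writeMem: (addr, v) => { ram[addr & 0xffff] = v; },
});
```

When the memory implementation is statically known, a specialising compiler can inline these callbacks away, which is exactly the optimisation the benchmark is meant to expose.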

And second, specialisation, i.e. resolving this kind of parametrisation at compile time, for cases when the actual memory implementation used in the exported functions is statically known, is a very important compiler technique for performance optimisation, so I want to make sure this feature (or its lack of!) is exposed by the benchmarks.

bhansconnect commented 2 years ago

I totally understand the memory parameterization and specialization now. They are definitely quite important once you think about expanding this basic test into something more usable.

As for the mutability of the fields, I guess I forgot that the other languages are compiling to JS and don't get LLVM levels of optimization. Because Roc runs more passes and uses the LLVM stack, most of the struct updates should be converted to in-place updates, due to there being one unique reference to the CPU struct at any given time. Thus creating a copy with one changed value should be unnecessary.

Another interesting browser API detail I learned from testing this more is that performance.now() is not actually accurate enough to measure the speed of the Roc version of the code. It seems that the resolution on my browser is about 0.1 ms, which is basically the same as the runtime of the Roc version. As such, the Roc version basically always records one of two times for me: either 0 ms or slightly over 0.1 ms. So measuring something this fast in the browser requires running it multiple times within a single time reading. That may be something you want to change in your general testing.
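The batching idea above can be sketched in a few lines: time a whole batch of runs with one pair of performance.now() calls, then divide, so each measurement is well above the timer's resolution. `timeBatched` is an illustrative helper name, not part of the benchmark:

```javascript
// Hypothetical sketch: performance.now() resolution can be ~0.1 ms, so a
// single sub-resolution run reads as 0 ms or 0.1 ms. Timing a batch and
// dividing gives a meaningful average per run.
function timeBatched(run, batchSize = 100) {
  const start = performance.now();
  for (let i = 0; i < batchSize; i++) {
    run();
  }
  return (performance.now() - start) / batchSize; // avg ms per run
}
```

The trade-off is that batching hides run-to-run variance (min/max), so it suits workloads where each run does the same work.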

I guess overall, the Roc version with Effects that I shared above in my edit should meet all of your criteria. It is parameterized over the memory setup, uses the equivalent of IO for handling interactions, and is pure, immutable functional code, all while significantly outperforming the JS version. Of course, it is compiled directly to wasm, which should greatly help with performance.

gergoerdi commented 2 years ago

> As for the mutability of the fields, I guess I forgot that the other languages are compiling to JS and don't get llvm levels of optimizations. Because roc runs more passes and uses the llvm stack, most of the struct updates should be converted to being inplace updates due to having one unique reference to the CPU struct at any given time. Thus creating a copy with one changed value should be unneeded.

OK, that's all fine -- the benchmark should be written in idiomatic style for the given language.

> Another interesting browser API piece I learned from testing this more is that performance.now() is not actually accurate enough to measure the speed of the Roc version of the code. It seems that the resolution on my browser is about 0.1 ms. Which is basically the same speed as the Roc version. As such, the roc version basically always records 1 of two times for me. Either 0 ms or slightly over 0.1 ms. So to measure something this fast in the browser requires running it multiple times within a single time reading. So that may be something you want to change in your general testing.

Perhaps I should simply change the test to do more? I think running the same program but for longer, with some predetermined sequence of inputs at the right points, could be a good idea, because it would force the main runner to be a bit more complex. Or I could just try hooking up some comprehensive 6502 test program, but that might actually find some bugs in all the implementations ;)

gergoerdi commented 2 years ago

I've emailed Richard to get access to the Roc repo so I can try this all out myself.