copy / v86

x86 PC emulator and x86-to-wasm JIT, running in the browser
https://copy.sh/v86/
BSD 2-Clause "Simplified" License
19.76k stars, 1.39k forks

NodeJS performance on alpine/arch running on v86 #1005

Closed: farhan-ct closed this 1 month ago

farhan-ct commented 7 months ago

Hi,

I've been trying to run some Node.js programs and projects on v86. So far this has involved using npm/pnpm to install packages both online and offline (from the npm cache), and trying to run simple http and express server programs, etc.
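For context, the kind of minimal express program involved is roughly the following (a sketch, not the exact code; the port and handler are illustrative):

```js
// Minimal express server of the sort being tested (illustrative sketch)
const express = require("express"); // assumes express was installed via npm/pnpm

const app = express();
app.get("/", (req, res) => res.send("Hello from v86"));
app.listen(3000, () => console.log("listening on port 3000"));
```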

Below are some of my experiences.

  1. Most of the stuff, like installing packages or running node express programs, takes ~2 minutes just to get started.
  2. Installing npm packages takes far too long, even through a web proxy with throttling removed. It's slow even when I'm installing packages from the local npm cache.
  3. Even when the chunks of the OS image are fetched from the disk cache, nodejs programs run very slowly. (Ignore boot times; I am loading from a saved booted state.)

I know all this sounds very vague and perhaps unrelated to v86, but I want to dig deeper, explore, and perhaps fix/improve things if I can.

@copy @SuperMaxusa

Could you kindly point me to where I should start looking and studying to improve v86 performance, at least for nodejs for the time being (though it's understandable that it's running in a web browser)?

Thank you!

copy commented 7 months ago

Some ideas:

  • If you're running in the linux console, try an X session and run it in xterm instead

If it's none of the above, it's probably cpu bound. Note that a 10x slowdown compared to native is expected. Profiling could help (but we really need that minimal testcase).

SuperMaxusa commented 7 months ago

running node express programs takes ~2 minutes just to get started

You probably need to load node_modules into the cache early, or put it on tmpfs, as I suggested before.

As for Node.js itself taking long to start, I have an idea (tested only on 9pfs): cache the node binary and libraries manually on cold boot, e.g. cat /usr/lib/libnode.* /usr/bin/node > /dev/null, or use vmtouch (if the problem is I/O-bound).

By the way, I found that a node binary compiled without the --shared flag (for example, v20.11.1 from unofficial-builds.nodejs.org) is much faster than one compiled with it (for example, the older nodejs version in Arch Linux (x32), commit 26a5d0):

log:

```shell
root@localhost:~# vmtouch -v /usr/bin/node
/usr/bin/node
[    ] 0/4

           Files: 1
     Directories: 0
  Resident Pages: 0/4  0/16K  0%
         Elapsed: 0.0654 seconds
root@localhost:~# vmtouch -v /usr/lib/libnode.so.111
/usr/lib/libnode.so.111
[    ] 0/12781

           Files: 1
     Directories: 0
  Resident Pages: 0/12781  0/49M  0%
         Elapsed: 0.10016 seconds
root@localhost:~# time node /root/hello.js
Hello, world
real    1m53.508s
user    0m39.370s
sys     0m3.309s
root@localhost:~# time node /root/hello.js
Hello, world
real    0m30.253s
user    0m25.674s
sys     0m1.164s
root@localhost:~# vmtouch -v /node
/node
[    ] 0/22540

           Files: 1
     Directories: 0
  Resident Pages: 0/22540  0/88M  0%
         Elapsed: 0.12426 seconds
root@localhost:~# time /node /root/hello.js
Hello, world
real    0m39.985s
user    0m6.341s
sys     0m2.483s
root@localhost:~# time /node /root/hello.js
Hello, world
real    0m5.739s
user    0m4.173s
sys     0m0.796s
root@localhost:~# /node -v
v20.11.1
root@localhost:~# node -v
v19.0.0
```
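The hello.js timed above isn't shown in the thread; judging by the output, it is presumably just a one-liner:

```js
// hello.js - assumed contents of the minimal testcase timed above
console.log("Hello, world");
```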

As a benchmark, I wrote a simple program that calculates pi using the Leibniz formula:

pitest.js:

```js
var pi = 0, k = 1;

function picalc(n) {
    for (var i = 0; i < n; i++) {
        if (i % 2 == 0) {
            pi += (4 / k);
        } else {
            pi -= (4 / k);
        }
        k += 2;
    }
}

console.time("pi");
picalc(10000000);
console.timeEnd("pi");
```

Results:

root@localhost:~# node /pitest.js 
pi: 55.320s

root@localhost:~# node /pitest.js 
pi: 17.297s

root@localhost:~# /node /pitest.js 
pi: 15.025s

root@localhost:~# /node /pitest.js 
pi: 13.192s

And out of interest, I tested on Void Linux (on an unpartitioned image); unlike Arch, it is built without the --shared flag (config from Void Linux), i.e. without a separate libnode.so.

results:

```shell
# cold start
[supermaxusa@void ~]$ time node hello.js
Hello, world
real    0m38.685s
user    0m15.197s
sys     0m10.290s

[supermaxusa@void ~]$ time node hello.js
Hello, world
real    0m24.024s
user    0m16.345s
sys     0m5.421s

[supermaxusa@void ~]$ node -v
v18.19.1

[supermaxusa@void ~]$ node pitest.js
pi: 21.739s
[supermaxusa@void ~]$ node pitest.js
pi: 20.642s
```

npm is also slow; in my case, I installed express.js from the global node_modules into the project's node_modules (ext4 -> ext4), so I guess it's an I/O bottleneck.

package.json:

```json
{
  "name": "npm-test",
  "version": "1.0.0",
  "description": "",
  "main": "/dev/null",
  "scripts": {},
  "author": "",
  "license": "ISC",
  "dependencies": {
    "express": "*"
  }
}
```
[supermaxusa@void express-test]$ time npm install

up to date, audited 65 packages in 3m

12 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

real    4m19.227s
user    2m57.512s
sys     0m17.530s

...but with precached files the result is somewhat better in various cases. For example, I tried importing the express module: first a cold start, then a warm start:

[supermaxusa@void ~]$ time NODE_PATH=/usr/lib/node_modules node -e "var e = require('express')"

real    1m4.244s
user    0m30.458s
sys     0m7.530s

[supermaxusa@void ~]$ time NODE_PATH=/usr/lib/node_modules node -e "var e = require('express')"

real    0m11.048s
user    0m7.352s
sys     0m1.954s
farhan-ct commented 7 months ago

Thank you guys for the pointers. I'll go through each and every point, do some testing, and update here with the results. Thanks again!

SuperMaxusa commented 7 months ago

@farhan-ct, to clarify, did you try to test this on a 9p root fs? For now I've only been thinking about I/O bottlenecks (I don't think CPU limits matter yet).

  • If you're running in the linux console, try an X session and run it in xterm instead

Do you mean that node needs to run in an X session instead of in ttyS0 or tty1 sessions?

copy commented 7 months ago

Do you mean that node needs to run in an X session instead of in ttyS0 or tty1 sessions?

Yes, the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.

farhan-ct commented 7 months ago

@SuperMaxusa

to clarify, did you try to test this on a 9p root fs?

I ran all these tests on Alpine Standard Linux, both on my packer version and the docker version you shared in the discussions (both in v86). However, my packer version doesn't have 9pfs at all, so the performance issues exist both with and without 9p.

Also, I am having issues running node v20.11.1 downloaded from the unofficial builds site. However, the nodejs version installed using apk add is the same version for me. And loading the node binary into memory using vmtouch didn't help in my case either.

I just saw that you edited your above answer, let me go through your results once again :smiley:

but with precached files the result is somewhat better in various cases

What do 'precached files' refer to here? Are they the files that were loaded into memory when installing modules from global to project?

farhan-ct commented 7 months ago

the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars: https://github.com/npm/npm/issues/11283). Using ttyS0 could also work.

@copy I am using pnpm instead; it prints less output, and it gave me relatively better performance in most cases.

Both cases show the same performance issues.

I am trying to prepare a well-written document that includes testcases (minimal, simple imports, and a server) and benchmarks covering the pointers that you and @SuperMaxusa provided.

SuperMaxusa commented 7 months ago

Also, I am having issues running node v20.11.1 downloaded from the unofficial builds site.

Maybe gcompat and some dependencies (ldd node) are missing?

Yes, the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.

Interesting, thanks for the info! My previous tests were run on ttyS0. About progress output: on xterm, npm runs a bit faster with progress=false than without it (warmed):

(screenshot: xterm)

...but the results on the framebuffer were (for some reason) the opposite.

(screenshot: tty0)

Based on these results, I tried reducing npm's output and logs, and this helps.

(screenshot: xterm2)

What do 'precached files' refer to here? Are they the files that were loaded into memory when installing modules from global to project?

This is file caching (the page cache), which reduces repeated I/O accesses; see https://www.linuxatemyram.com/ and https://en.wikipedia.org/wiki/Page_cache.

SuperMaxusa commented 7 months ago

3. Even when the chunks of the OS image are fetched from the disk cache, nodejs programs run very slowly. (Ignore boot times; I am loading from a saved booted state.)

Did you try using different chunk sizes for the image? #710

farhan-ct commented 7 months ago

Did you try using different chunk sizes for the image?

I split my image into fixed-size 1 MB chunks, with the filename format from one of the Windows 95/98 demos on copy.sh.

While testing, I've tried both the pre-split version and providing async: true and size, which effectively has v86 chunk the image itself, on the fly.
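For reference, the second variant corresponds to a disk configuration roughly like this (a sketch; the constructor name follows current v86, and the path and size are placeholders):

```js
// Sketch: let v86 fetch the disk image in chunks on demand instead of pre-splitting it.
// With async: true, the exact image size must be supplied up front.
var emulator = new V86({
    wasm_path: "build/v86.wasm",
    memory_size: 512 * 1024 * 1024,
    hda: {
        url: "images/alpine.img", // placeholder path
        async: true,              // fetch chunks on the fly
        size: 1073741824,         // exact image size in bytes
    },
});
```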

Maybe gcompat and some dependencies (ldd node) are missing?

Yes, I had to install gcompat for the missing loader, and the node binary at least executed after that. But when I tried any node command, even node --help, I constantly got the following error: Fatal javascript OOM in MemoryChunk allocation failed during deserialization.

SuperMaxusa commented 7 months ago

Yes, I had to install gcompat for the missing loader, and the node binary at least executed after that.

That node binary is built for glibc, so it may work incorrectly with gcompat. Anyway, nodejs on Alpine is compiled without --shared: https://git.alpinelinux.org/aports/tree/main/nodejs/APKBUILD#n200

SuperMaxusa commented 7 months ago

In case it's useful: node with mimalloc2 (https://github.com/gliderlabs/docker-alpine/issues/161#issuecomment-902291370) or jemalloc (https://stackoverflow.com/questions/53234410/how-to-use-node-js-with-jemalloc) is faster (in some cases) than without (tested on Alpine 3.18.6 and Node 18.19.1).

localhost:~# time node hello.js
Hello, world

real    0m 10.31s
user    0m 7.15s
sys     0m 2.23s

localhost:~# LD_PRELOAD=/usr/lib/libmimalloc.so.2 time node hello.js 
Hello, world

real    0m 7.81s
user    0m 6.34s
sys     0m 0.92s

localhost:~# LD_PRELOAD=/usr/lib/libjemalloc.so.2 time node hello.js
Hello, world

real    0m 7.43s
user    0m 6.06s
sys     0m 0.87s

localhost:~# time node pitest.js 
pi: 13.077s
real    0m 21.50s
user    0m 16.69s
sys     0m 2.51s

localhost:~# LD_PRELOAD=/usr/lib/libjemalloc.so.2 time node pitest.js 
pi: 9.576s
real    0m 16.20s
user    0m 14.06s
sys     0m 1.20s

localhost:~# LD_PRELOAD=/usr/lib/libmimalloc.so.2 time node pitest.js
pi: 9.961s
real    0m 16.19s
user    0m 14.28s
sys     0m 1.02s
farhan-ct commented 7 months ago

Thank you @SuperMaxusa for the clear details and tests, will try it out :D

copy commented 6 months ago

I've pushed a fix for slow disk IO into the wip branch (dfacd61). Please test.

farhan-ct commented 6 months ago

Thank you @copy

Will test now

farhan-ct commented 6 months ago

I've pushed a fix for slow disk IO into the wip branch (dfacd61). Please test.

Hi @copy

Thanks for the update on the fix. I built the wip branch and did some quick testing using the provided docker debian example, and took some screenshots. I couldn't arrive at a fair conclusion after testing the changes.

@SuperMaxusa, could you please take a look at this and test/confirm whether you see similar results?

(screenshot: test results)

All 4 of these tests were performed under the same conditions.

SuperMaxusa commented 6 months ago

fix for slow disk IO into the wip branch (dfacd61)

Here are my results (browser: Chrome 124 on localhost, guest OS: Alpine Linux 3.19.1, fs: ext4 with relatime); after every test command I ran sync and echo 3 > /proc/.../drop_caches:

| Branch | dd [write] (MB/s) | dd [write] (sec) | dd [write dsync] (MB/s) | dd [write dsync] (sec) | cat 128MB [read] | dd [read] (MB/s) | dd [read] (sec) | time node hello.js [cold start, mostly read] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [master (9a9d6f3)](https://github.com/copy/v86/commit/9a9d6f33072464d735863dfff4efd308191aba68) | 6.6 | 20 | 8.6 | 15 | real 7.65s, user 0.14s, sys 6.17s | 22.7 | 6 | real 38.4s, user 18.66s, sys 9.02s |
| [wip (dfacd61)](https://github.com/copy/v86/commit/dfacd6109dd915104ad916f87fb2f2d3e6cee623) | 7.2 | 18 | 8.2 | 16 | real 6.02s, user 0.14s, sys 5.29s | 23.7 | 5.6 | real 33.08s, user 20.91s, sys 9.20s |
some graphs:

![bar1](https://github.com/copy/v86/assets/41739128/b6084e3b-fed9-467c-9b36-566ada2a6c5a) (full size: [svg](https://github.com/copy/v86/assets/41739128/b6084e3b-fed9-467c-9b36-566ada2a6c5a))

![bar2](https://github.com/copy/v86/assets/41739128/053a46d8-bca7-4fdb-b076-419902975e44) (full size: [svg](https://github.com/copy/v86/assets/41739128/053a46d8-bca7-4fdb-b076-419902975e44))
test commands:

```shell
# apk add coreutils
# hello.js is the same as in the upper post

# 1. dd [write]
dd if=/dev/zero of=/testfile bs=1M count=128 status=progress

# 2. dd [write dsync]
dd if=/dev/zero of=/testfile bs=1M count=128 status=progress oflag=dsync

# 3. cat [read]
time cat /testfile > /dev/zero

# 4. dd [read]
dd if=/testfile of=/dev/zero bs=1M count=128 status=progress

# 5. start hello.js (cold start)
time node hello.js
```

could you please take a look at this and test/confirm whether you see similar results?

I changed the version in the Dockerfile to trixie and also added apt install nodejs (server: redbean 2.2.0 with gzip compression). The first run takes about 1 minute.

Some ideas:

(screenshot: node on debian)

copy commented 6 months ago

To be clear, the fix mostly improves performance when files are actually downloaded from the server. So I wouldn't expect writes to be any faster, nor reads of any files that have been previously read (v86 caches downloaded files in most cases).

So the cold performance looks pretty solid now (in comparison to warm). I think much of the remaining cold start overhead can be explained by jit warmup.

The warm performance numbers you posted still look pretty bad, and much worse than my numbers. For example:

localhost:~# time node hello.js # cold (disk + jit warmup)
Hello, world!
real    0m 3.71s
user    0m 2.05s
sys     0m 0.97s
localhost:~# time node hello.js # warm
Hello, world!
real    0m 1.61s
user    0m 1.17s
sys     0m 0.34s
localhost:~# time node -v # cold (disk + jit warmup)
v20.12.1
real    0m 0.98s
user    0m 0.20s
sys     0m 0.36s
localhost:~# time node -v # warm
v20.12.1
real    0m 0.18s
user    0m 0.09s
sys     0m 0.07s

This is on Alpine 3.19.1 built with the docker script (I had to add modules=virtio_pci to the kernel command line).

SuperMaxusa commented 6 months ago

Note: my previous test was run on a hard drive image loaded from the browser; I hadn't tested speed on 9pfs yet.

I have now tested on a 9pfs rootfs with Alpine Linux 3.19.1 as you said (+ COOP/COEP headers). Rechecking with other cache modes gives roughly the same results (a test with cache=mmap is below):

localhost:~# cat /proc/mounts 
< ... >
host9p / 9p rw,relatime,cache=5,access=client,msize=512000,trans=virtio 0 0
< ... >

localhost:~# time node hello.js # cold
Hello World!
real    0m 32.07s
user    0m 16.04s
sys     0m 7.63s

localhost:~# time node hello.js # warm
Hello World!
real    0m 30.33s
user    0m 15.17s
sys     0m 7.17s

localhost:~# time node -v
v20.12.1
real    0m 5.74s
user    0m 0.75s
sys     0m 1.28s

(v86 caches downloaded files in most cases)

Is this caching done by the browser, or is it the filestorage's cache? (sorry if that's an incorrect question) https://github.com/copy/v86/blob/c69cbd7854d7cbda08679c389973aa87c9fb37bd/src/browser/filestorage.js#L15-L21

copy commented 6 months ago

Is this caching done by the browser, or is it the filestorage's cache? (sorry if that's an incorrect question)

Hmm, 9pfs data is not currently cached; my previous statement was incorrect. However, files are only re-downloaded if the 9pfs closes the file and it hasn't been written to.

Let's try to narrow down why node runs ~20x slower for you:

  • I've pushed my full alpine build process and examples/alpine.html to the wip branch (there's probably not much difference to yours)
  • What's your cpu and ram size?
  • Could you add $(CLOSURE_READABLE)\ after $(CLOSURE_FLAGS)\ in the libv86.js target in the makefile, rebuild build/libv86.js (you may need to delete the file) and profile a run of node with firefox profiler?

SuperMaxusa commented 6 months ago

I've pushed my full alpine build process and examples/alpine.html to the wip branch (there's probably not much difference to yours)

Thanks! Previously I had added a profile to src/browser/main.js and used it for my tests. It's almost the same; at least I also have "irqpoll" in the kernel params, because on loading it gets stuck after the nmi selftest.

  • What's your cpu and ram size?

For now I have only tested on an i3 with 6 GB of RAM (running only the browser and an HTTP server) on Windows, and this may be the main problem. However, a test with qemu-tcg (single-threaded, qemu-system-i386 -accel tcg,thread=single) gives these results:

localhost:~$ time node hello.js # cold (not relevant in our case now)
Hello World!
real    0m 6.97s
user    0m 3.14s
sys     0m 1.53s

localhost:~$ time node hello.js # warm
Hello World!
real    0m 2.48s
user    0m 1.51s
sys     0m 0.85s

localhost:~$ time node -v # warm
v20.12.1
real    0m 0.33s
user    0m 0.17s
sys     0m 0.13s

  • Could you add $(CLOSURE_READABLE)\ after $(CLOSURE_FLAGS)\ in the libv86.js target in the makefile, rebuild build/libv86.js (you may need to delete the file) and profile a run of node with firefox profiler?

I tried the wip branch (https://github.com/copy/v86/commit/34ebae1b4c8b0973d89721e58648f744afc2847c) and the new alpine container; for some reason, I get around 11 seconds for a warm node.js start on Firefox (release and Developer Edition) and around 18 seconds on Chrome (where I can't record properly with the performance profiler):

localhost:~# time node hello.js 
Hello, world!
real    0m 18.18s
user    0m 13.20s
sys     0m 3.06s

localhost:~# time node -v
v20.12.1
real    0m 1.23s
user    0m 0.54s
sys     0m 0.20s

I have no idea why this happened, because I used the same conditions for the tests (just a server with COOP and a browser with a clean profile, without extensions).

copy commented 6 months ago

For now I have only tested on an i3 with 6 GB of RAM

I think that explains most of the performance problems. You could try setting MAX_PAGES to 1 or 2, which might improve performance on low-end systems: https://github.com/copy/v86/blob/c69cbd7854d7cbda08679c389973aa87c9fb37bd/src/rust/jit.rs#L67

v86 isn't as fast as qemu here, but there are three layers of jit warmup going on (js->x86, x86->wasm, wasm->hostcpu). I don't think there is any single optimisation that will be an "easy win" here. The following could net a few percent:

  • adding some sse instructions used by node to the jit

@farhan-ct I'd suggest playing around with the alpine image and the various suggestions. Disk performance should be fine now.

farhan-ct commented 6 months ago

Will do @copy

Thank you for sharing lots of info that I can learn :D

I've been using debian trixie until now and it's fine so far. The exception: I am creating states after a nodejs cold run and saving them without clearing caches. The state file is big, but it works nicely. I will also try this updated alpine with the optimizations and suggestions provided.

Maybe out of context, but here are a few observations:

  1. The state file is smaller when v86 is run under node instead of in the browser
  2. Networking isn't working properly in debian when running v86 under node
SuperMaxusa commented 6 months ago

  • adding some sse instructions used by node to the jit

It seems that V8 uses SSE2 (https://github.com/v8/v8/commit/3fa6100ed3296175423fcfceb33feb2ef3f98b31), SSE3 (can be turned on/off at runtime, along with popcnt), the FPU (not sure how often this is used; at least it's required on all CPU arches: https://forum.openwrt.org/t/building-node-js-for-ramips/183164, https://github.com/v8/v8/blob/main/test/unittests/base/cpu-unittest.cc#L64) and SSSE3 (#1038).

The state file is big, but it works nicely.

With zstd -19 compression too?

farhan-ct commented 6 months ago

With zstd -19 compression too?

I did it with -22 (zstd --ultra -22 state.bin), and for some reason it seems the loaded cache is lost after compression.

Example: I boot for the first time, run some node command, and create a state without clearing VM caches and such. The state is obviously big, and when I boot using this initial_state, node is fast. But if I compress it with zstd and then boot with the initial_state, node is slow again.
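The flow looks roughly like this (a sketch: paths are placeholders, the Promise-returning save_state follows the current v86 API, and loading a zstd-compressed state relies on v86's built-in zstd support mentioned at the end of this thread):

```js
// Sketch: save a state after a warm node run, then boot from it later.
const fs = require("fs");
const { V86 } = require("./build/libv86.js"); // placeholder path

const emulator = new V86({ /* bios/hda options elided */ });

// After booting and running a node command once (caches warm):
emulator.save_state().then((state) => {
    fs.writeFileSync("state.bin", Buffer.from(state));
    // compressed afterwards with: zstd --ultra -22 state.bin
});

// Later, boot directly from the saved (optionally zstd-compressed) state:
const restored = new V86({
    /* same options as above */
    initial_state: { url: "./state.bin.zst" },
});
```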

SuperMaxusa commented 6 months ago

it seems the loaded cache is lost after compression.

Can you try comparing free -h before and after zstd?

2. Networking isn't working properly in debian when running v86 under node

Node.js 22.0 (or 21.x with node --experimental-websocket) has a WebSocket client, so you can add network_relay_url to the V86 constructor and get working networking on that version (tested on Buildroot for me).
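A minimal sketch of this setup (paths are placeholders; the relay URL is the public example relay from the v86 README):

```js
// Sketch: v86 under Node.js >= 21 (with a WebSocket client) using a network relay.
// Run with: node --experimental-websocket run.js   (flag not needed on Node 22+)
const { V86 } = require("./build/libv86.js"); // placeholder path

const emulator = new V86({
    bios: { url: "./bios/seabios.bin" },       // placeholder paths
    vga_bios: { url: "./bios/vgabios.bin" },
    hda: { url: "./images/image.img", async: true, size: 1073741824 },
    network_relay_url: "wss://relay.widgetry.org/", // public example relay
    autostart: true,
});
```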

farhan-ct commented 6 months ago

Node.js 22.0 (or 21.x with node --experimental-websocket)

This worked, thank you!

I'll update about free -h too once I've checked.

copy commented 1 month ago

From the latest tests, v86 takes about one second to run a warm nodejs hello world on a low-end machine, which looks reasonable to me considering the constraints.

I'll gently close this issue, but feel free to create a new issue if you have another testcase where v86 is significantly slower than qemu-tcg.

Also, if you have any tuning suggestions that improve fs9p performance on Linux, please send a PR to update examples/alpine.html or docs/archlinux.md.

I did it with -22 (zstd --ultra -22 state.bin), and for some reason it seems the loaded cache is lost after compression.

Just a note on this: zstd -22 uses significantly more memory than -19 (by design). That wouldn't usually be a problem; however, wasm currently doesn't have an API to give memory back to the OS (and since wasm memory is a single chunk, the browser doesn't know which parts are "unused"). So the amount of memory used by zstd decompression stays in use until the page is refreshed. In my tests, I didn't find the extra memory usage of zstd -22 worth the compression it provides.