Some ideas:
add_header Cross-Origin-Opener-Policy same-origin;
add_header Cross-Origin-Embedder-Policy require-corp;
If it's none of the above, it's probably CPU bound. Note that a 10x slowdown compared to native is expected. Profiling could help (but we really need that minimal testcase):
make all
make debug-with-profiler
and run the program in debug.html. It has a bunch of profiling information at the bottom.
running node express programs takes ~2 minutes just to get started
You probably need to load node_modules into the cache early, or into tmpfs, as I suggested before.
About the long startup of node.js itself: I have an idea (tested only on 9pfs) to cache the node binary and libs manually on cold boot, like this: cat /usr/lib/libnode.* /usr/bin/node > /dev/null, or via vmtouch (if the problem is I/O bound).
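Sketched as commands (the paths come from the comment above; the vmtouch invocation is my assumption about how it would be used here):

```shell
# Pull node's binary and shared libraries into the page cache on cold boot:
cat /usr/lib/libnode.* /usr/bin/node > /dev/null

# Or with vmtouch: -t touches (loads) the files' pages into memory.
vmtouch -t /usr/bin/node /usr/lib/libnode.*
```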
By the way, I found that a node binary compiled without the --shared flag (for example, v20.11.1 from unofficial-builds.nodejs.org) is much faster than one compiled with this flag (for example, the older version in Arch Linux (x32) - nodejs, commit: 26a5d0):
And I wrote a simple program that calculates Pi by Leibniz formula as benchmark:
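The benchmark itself wasn't posted; a minimal sketch of such a program (the function name and term count are my own choices) could look like:

```javascript
// Hypothetical reconstruction of a pitest.js-style benchmark: approximate
// Pi with the Leibniz series pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
function leibnizPi(terms) {
    let sum = 0;
    for (let i = 0; i < terms; i++) {
        sum += (i % 2 === 0 ? 1 : -1) / (2 * i + 1);
    }
    return 4 * sum;
}

console.time("pi");
const pi = leibnizPi(1e8); // enough iterations to keep the CPU busy for seconds
console.timeEnd("pi");     // prints e.g. "pi: 15.025s" when run under v86
console.log(pi);
```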
Results:
root@localhost:~# node /pitest.js
pi: 55.320s
root@localhost:~# node /pitest.js
pi: 17.297s
root@localhost:~# /node /pitest.js
pi: 15.025s
root@localhost:~# /node /pitest.js
pi: 13.192s
Out of interest, I also tested on Void Linux (on an unpartitioned image); unlike Arch, it is built without the --shared flag (config from Void Linux), i.e. without a separate libnode.so.
NPM is also slow; in my case, I installed express.js from the global node_modules into the project's node_modules (ext4 -> ext4), so I guess it's an I/O bottleneck.
[supermaxusa@void express-test]$ time npm install
up to date, audited 65 packages in 3m
12 packages are looking for funding
run `npm fund` for details
found 0 vulnerabilities
real 4m19.227s
user 2m57.512s
sys 0m17.530s
...but with precached files, the result is somewhat better in some cases. For example, I tried to import the express module: first a cold start, second a warm start:
[supermaxusa@void ~]$ time NODE_PATH=/usr/lib/node_modules node -e "var e = require('express')"
real 1m4.244s
user 0m30.458s
sys 0m7.530s
[supermaxusa@void ~]$ time NODE_PATH=/usr/lib/node_modules node -e "var e = require('express')"
real 0m11.048s
user 0m7.352s
sys 0m1.954s
Thank you guys for the pointers, I'll go through each and every point, do some testing and update here on the results. Thanks again!
@farhan-ct, to clarify, did you try to test this on a 9p root fs? Right now I'm only thinking about I/O bottlenecks (I think CPU bottlenecks don't matter yet).
- If you're running in the linux console, try an X session and run it in xterm instead
Do you mean that node needs to run in an X session, instead of in a ttyS0 or tty1 session?
Do you mean that node needs to run in an X session, instead of in a ttyS0 or tty1 session?
Yes, the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.
@SuperMaxusa
to clarify, did you try to test this on a 9p root fs?
I ran all these tests on Alpine Standard Linux, both on my packer version and the docker version you shared in the discussions (both in v86). However, my packer version doesn't have 9pfs at all, so the performance issues exist both with and without 9p.
Also, I am having issues running the node v20.11.1 downloaded from the unofficial builds site, even though the nodejs version installed via apk add is the same version for me. Loading the node binary into memory using vmtouch also didn't help in my case.
I just saw that you edited your above answer, let me go through your results once again :smiley:
but with precached files, the result is somewhat better in some cases
What does 'precached files' here stand for? Is it the files that were loaded into memory when installing modules from global to project?
the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars: https://github.com/npm/npm/issues/11283). Using ttyS0 could also work.
@copy I am using pnpm instead; it doesn't print much output, and it gave me relatively better performance in most cases.
Both the cases have the same performance issues.
I am trying to prepare a well written document that includes testcases (minimal, simple imports, and a server) and benchmarks that covers the pointers that you and @SuperMaxusa provided.
Also, I am having issues running the node v20.11.1 downloaded from the unofficial builds site.
Maybe gcompat and some dependencies (ldd node) are missing?
Yes, the Linux framebuffer console emulator is pretty slow (might be relevant if npm prints progress bars). Using ttyS0 could also work.
Interesting, thanks for the info! My previous tests were run on ttyS0. About progress bars: in xterm, npm runs a bit faster with progress=false than without it (warm):
...but for some reason, the results on the framebuffer were the opposite.
Based on these results, I tried reducing npm's output and logging, and it helps.
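For reference, the standard npm options involved in cutting down output (a sketch; whether each one helps will depend on the terminal, as discussed above) are:

```shell
# Reduce npm's terminal output for a single run:
npm install --no-progress --loglevel=error

# Or persist the settings in npm's config:
npm config set progress false
npm config set loglevel error
```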
What does 'precached files' here stand for? Is it the files that were loaded into memory when installing modules from global to project?
This is file caching (the page cache), which reduces repeated I/O access; see https://www.linuxatemyram.com/ and https://en.wikipedia.org/wiki/Page_cache.
3. Even if the chunks of the os image are fetched from disk cache, the performance of nodejs programs is very slow. (ignore boot times, I am using a saved booted state to load)
Did you try using different chunk sizes for the image? #710
Did you try using different chunk sizes for the image?
I split my image into chunks of a fixed size of 1 MB, with a filename format from one of the Windows 95/98 demos on copy.sh.
While testing, I've been trying both the split version, as well as providing async: true and size, which is like auto-chunking the image (done by v86 itself, on the fly).
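The second variant, as a sketch (option names as used in v86's examples; the URL and size here are placeholders):

```javascript
// Let v86 fetch parts of a single image on demand instead of serving
// pre-split chunk files. Requires the server to support Range requests.
const emulator = new V86({
    hda: {
        url: "images/alpine.img",
        async: true,       // fetch byte ranges on demand
        size: 1073741824,  // total image size in bytes, needed with async
    },
    // ...
});
```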
Maybe gcompat and some dependencies (ldd node) are missing?
Yes, I had to install gcompat for the missing loader, and the node binary at least executed after that. But when I tried any node command, even node --help, I constantly got the error: Fatal javascript OOM in MemoryChunk allocation failed during deserialization.
Yes I had to install gcompat for the missing loader and the node binary atleast executed after that,
This node binary is built for glibc, so it may work incorrectly with gcompat. Anyway, nodejs on Alpine is compiled without --shared: https://git.alpinelinux.org/aports/tree/main/nodejs/APKBUILD#n200
If this is useful: node with mimalloc2 (https://github.com/gliderlabs/docker-alpine/issues/161#issuecomment-902291370) or jemalloc (https://stackoverflow.com/questions/53234410/how-to-use-node-js-with-jemalloc) is faster (in some cases) than without (tested on Alpine v3.18.6 and Node 18.19.1).
localhost:~# time node hello.js
Hello, world
real 0m 10.31s
user 0m 7.15s
sys 0m 2.23s
localhost:~# LD_PRELOAD=/usr/lib/libmimalloc.so.2 time node hello.js
Hello, world
real 0m 7.81s
user 0m 6.34s
sys 0m 0.92s
localhost:~# LD_PRELOAD=/usr/lib/libjemalloc.so.2 time node hello.js
Hello, world
real 0m 7.43s
user 0m 6.06s
sys 0m 0.87s
localhost:~# time node pitest.js
pi: 13.077s
real 0m 21.50s
user 0m 16.69s
sys 0m 2.51s
localhost:~# LD_PRELOAD=/usr/lib/libjemalloc.so.2 time node pitest.js
pi: 9.576s
real 0m 16.20s
user 0m 14.06s
sys 0m 1.20s
localhost:~# LD_PRELOAD=/usr/lib/libmimalloc.so.2 time node pitest.js
pi: 9.961s
real 0m 16.19s
user 0m 14.28s
sys 0m 1.02s
Thank you @SuperMaxusa for the clear details and tests, will try it out :D
I've pushed a fix for slow disk IO into the wip branch (dfacd61). Please test.
Thank you @copy
Will test now
I've pushed a fix for slow disk IO into the wip branch (dfacd61). Please test.
Hi @copy
Thanks for the update on the fix. I built on the wip branch and did some quick testing using the provided docker debian example and took some screenshots. I couldn't arrive at a fair conclusion after testing the changes.
@SuperMaxusa I request you to take a look at this and test/confirm if you are facing similar results.
All 4 of these tests were performed under the same conditions.
fix for slow disk IO into the wip branch (dfacd61)
Here are my results (browser: Chrome 124 on localhost, guest OS: Alpine Linux 3.19.1, fs: ext4 with relatime); after every test command I ran sync and echo 3 > /proc/.../drop_caches:
| Branch | dd [write] (MB/s) | dd [write] (sec) | dd [write dsync] (MB/s) | dd [write dsync] (sec) | cat 128MB [read] | dd [read] (MB/s) | dd [read] (sec) | time node hello.js [cold start, mostly read] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| master (https://github.com/copy/v86/commit/9a9d6f33072464d735863dfff4efd308191aba68) | 6.6 | 20 | 8.6 | 15 | real 7.65s, user 0.14s, sys 6.17s | 22.7 | 6 | real 38.4s, user 18.66s, sys 9.02s |
| wip (https://github.com/copy/v86/commit/dfacd6109dd915104ad916f87fb2f2d3e6cee623) | 7.2 | 18 | 8.2 | 16 | real 6.02s, user 0.14s, sys 5.29s | 23.7 | 5.6 | real 33.08s, user 20.91s, sys 9.20s |
I request you to take a look at this and test/confirm if you are facing similar results.
I changed the version in the Dockerfile to trixie and also added apt install nodejs (server: redbean 2.2.0 with gzip compression). The first run takes about 1 minute.
Some ideas:
msize was 8 KiB, which slows down the filesystem (this may have affected Debian Buster in Figure 3; see 9psetup/Performance Considerations on the QEMU wiki), so I changed it to 128 KiB.
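As a sketch, the mount options involved (values matching the /proc/mounts output later in this thread; the share tag host9p and the mountpoint are assumptions):

```shell
# Mount a 9p share over virtio with a larger msize and mmap cache mode:
mount -t 9p -o trans=virtio,msize=512000,cache=mmap host9p /mnt
```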
I once tried to experiment with this (on the master branch), but for me it only affected write speed.
To be clear, the fix mostly improves performance when files are actually downloaded from the server. So I wouldn't expect writes to be any faster, nor reads of any files that have been previously read (v86 caches downloaded files in most cases).
So the cold performance looks pretty solid now (in comparison to warm). I think much of the remaining cold start overhead can be explained by jit warmup.
The warm performance numbers you posted still look pretty bad, and much worse than my numbers. For example:
localhost:~# time node hello.js # cold (disk + jit warmup)
Hello, world!
real 0m 3.71s
user 0m 2.05s
sys 0m 0.97s
localhost:~# time node hello.js # warm
Hello, world!
real 0m 1.61s
user 0m 1.17s
sys 0m 0.34s
localhost:~# time node -v # cold (disk + jit warmup)
v20.12.1
real 0m 0.98s
user 0m 0.20s
sys 0m 0.36s
localhost:~# time node -v # warm
v20.12.1
real 0m 0.18s
user 0m 0.09s
sys 0m 0.07s
This is on Alpine 3.19.1 built with the docker script (I had to add modules=virtio_pci to the kernel command line).
Note: my previous test was run on a hard drive image loaded from the browser; I haven't tested speed on 9pfs yet.
I have now tested on a 9pfs rootfs with Alpine Linux 3.19.1 as you said (+ COOP/COEP headers).
Rechecking with other cache modes gives roughly the same results (the test with cache=mmap is below):
localhost:~# cat /proc/mounts
< ... >
host9p / 9p rw,relatime,cache=5,access=client,msize=512000,trans=virtio 0 0
< ... >
localhost:~# time node hello.js # cold
Hello World!
real 0m 32.07s
user 0m 16.04s
sys 0m 7.63s
localhost:~# time node hello.js # warm
Hello World!
real 0m 30.33s
user 0m 15.17s
sys 0m 7.17s
localhost:~# time node -v
v20.12.1
real 0m 5.74s
user 0m 0.75s
sys 0m 1.28s
(v86 caches downloaded files in most cases)
Is it caching by the browser, or the filestorage's cache? (sorry if this is an incorrect question) https://github.com/copy/v86/blob/c69cbd7854d7cbda08679c389973aa87c9fb37bd/src/browser/filestorage.js#L15-L21
Is it caching by the browser, or the filestorage's cache? (sorry if this is an incorrect question)
Hmm, 9pfs data is not cached currently, my previous statement was incorrect. However, files are only re-downloaded if the 9pfs closes the file and it hasn't been written to.
Let's try to narrow down why node runs ~20x slower for you:
- Could you add $(CLOSURE_READABLE)\ after $(CLOSURE_FLAGS)\ in the libv86.js target in the makefile, rebuild build/libv86.js (you may need to delete the file) and profile a run of node with the firefox profiler?
- I've pushed my full alpine build process and examples/alpine.html to the wip branch (there's probably not much difference to yours)
Thanks, previously I had added profile in src/browser/main.js and used it for my tests. It's almost the same; at least I also have "irqpoll" in the kernel params, because on boot it gets stuck after the nmi selftest.
- What's your cpu and ram size?
For now I have only tested on an i3 with 6 GB of RAM (running only the browser and an HTTP server) on Windows, and this could be the main problem. However, a test with qemu-tcg (single-threaded, qemu-system-i386 -accel tcg,thread=single) gives these results:
localhost:~$ time node hello.js # cold (not relevant in our case now)
Hello World!
real 0m 6.97s
user 0m 3.14s
sys 0m 1.53s
localhost:~$ time node hello.js # warm
Hello World!
real 0m 2.48s
user 0m 1.51s
sys 0m 0.85s
localhost:~$ time node -v # warm
v20.12.1
real 0m 0.33s
user 0m 0.17s
sys 0m 0.13s
- Could you add $(CLOSURE_READABLE)\ after $(CLOSURE_FLAGS)\ in the libv86.js target in the makefile, rebuild build/libv86.js (you may need to delete the file) and profile a run of node with firefox profiler?
I tried the wip branch (https://github.com/copy/v86/commit/34ebae1b4c8b0973d89721e58648f744afc2847c) with the new alpine container; for some reason, I get around 11 seconds for a warm node.js start on Firefox (release and Developer Edition) and around 18 seconds on Chrome (which I can't record properly with the performance profiler):
localhost:~# time node hello.js
Hello, world!
real 0m 18.18s
user 0m 13.20s
sys 0m 3.06s
localhost:~# time node -v
v20.12.1
real 0m 1.23s
user 0m 0.54s
sys 0m 0.20s
I have no idea why this happened, because I used the same conditions for the tests (just a server with COOP headers and a browser with a clean profile, without extensions).
For now I have only tested on an i3 with 6 GB of RAM
I think that explains most of the performance problems. You could try setting MAX_PAGES to 1 or 2, which might improve performance on low-end systems: https://github.com/copy/v86/blob/c69cbd7854d7cbda08679c389973aa87c9fb37bd/src/rust/jit.rs#L67
v86 isn't as fast as qemu here, but there are three layers of jit warmups going on (js->x86, x86->wasm, wasm->hostcpu). I don't think there is any single optimisation that will be an "easy win" here. The following could net a few percent:
- Setting JIT_USE_LOOP_SAFETY to false: https://github.com/copy/v86/blob/c69cbd7854d7cbda08679c389973aa87c9fb37bd/src/rust/jit.rs#L69
- Adding cpuid_level: 0x14 to the constructor (for #797)

@farhan-ct I'd suggest playing around with the alpine image and the various suggestions. Disk performance should be fine now
Will do @copy
Thank you for sharing lots of info that I can learn :D
I've been using debian trixie until now and it's fine so far. Except, I am creating states after a nodejs cold run and saving them without clearing caches. State size is big but works nice. I will try this updated alpine too with the optimizations and suggestions provided.
Maybe out of context but here are a few observations:
- adding some sse instructions used by node to the jit
It seems that v8 uses SSE2 (https://github.com/v8/v8/commit/3fa6100ed3296175423fcfceb33feb2ef3f98b31), SSE3 (can be turned on/off at runtime, together with popcnt), the FPU (not sure how often this is used; at least it is required for all CPU arches: https://forum.openwrt.org/t/building-node-js-for-ramips/183164, https://github.com/v8/v8/blob/main/test/unittests/base/cpu-unittest.cc#L64) and SSSE3 (#1038).
State size is big but works nice.
With zstd -19 compression too?
With zstd -19 compression too?
I did it with -22 (zstd --ultra -22 state.bin), and for some reason, it seems that the loaded cache is being lost after compression.
Example: I boot for the 1st time, run some node command, and create a state without clearing VM caches and such. The state is obviously big, and when I boot using this initial_state, node is fast. But if I compress it using zstd and then boot with the initial_state, node is slow again.
it seems that the loaded cache is being lost after compression.
Can you try comparing free -h before and after zstd?
2. Network isn't working properly in debian when running v86 on node
Node.js 22.0 (or 21.x with node --experimental-websocket) has a WebSocket client, so you can add network_relay_url to the V86 constructor and run with working networking on this version (tested by me on Buildroot).
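A minimal sketch of this setup under Node.js (the require path and relay URL are placeholders, not values from this thread):

```javascript
// Run v86 under Node.js >= 22, which ships a built-in WebSocket client,
// so the network relay works without a browser.
const { V86 } = require("./build/libv86.js");

const emulator = new V86({
    bios: { url: "./bios/seabios.bin" },
    vga_bios: { url: "./bios/vgabios.bin" },
    cdrom: { url: "./images/linux.iso" },
    network_relay_url: "wss://relay.example.org/", // placeholder relay
    autostart: true,
});
```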
Node.JS 22.0 (or 21.x with node --experimental-websocket)
This worked, thank you!
Will update about free -h too once I check.
From the latest tests, v86 takes about one second to run a warm nodejs hello world on a low-end machine, which looks reasonable to me considering the constraints.
I'll gently close this issue, but feel free to create a new issue if you have another testcase where v86 is significantly slower than qemu-tcg.
Also, if you have any tuning suggestions that improve fs9p performance on Linux, please send a PR updating examples/alpine.html or docs/archlinux.md.
I did it with -22 (zstd --ultra -22 state.bin), and for some reason, it seems that the loaded cache is being lost after compression.
Just a note on this: zstd -22 uses significantly more memory than -19 (by design). That wouldn't usually be a problem; however, wasm currently doesn't have an API to give memory back to the OS (and since wasm memory is a single chunk, the browser doesn't know which parts are "unused"). So the amount of memory used by zstd decompression stays in use until the page is refreshed. In my tests, I didn't find the extra memory usage of zstd -22 worth the compression it provides.
Hi,
I've been trying to run some nodejs programs and projects on v86. So far, this has involved using npm/pnpm to install packages online & offline (npm cache), trying to run simple http and express server programs, etc.
Below are some of my experiences.
I know all this sounds very vague and perhaps unrelated to v86 but I want to dig deeper, explore and perhaps fix/improve things if I can.
@copy @SuperMaxusa
I request you to kindly point me to where I should start looking and studying to improve v86 performance, at least for nodejs for now (though it's understandable that it's running in a web browser).
Thank you!