darrachequesne / notepack

A fast Node.js implementation of the latest MessagePack spec
MIT License

`encode` is a crazy memory hog #27

Open Domiii opened 3 years ago

Domiii commented 3 years ago

Due to an unoptimized algorithm (as also discussed in #12), `encode` is a memory hog (I have not looked at `decode` yet). I decided to post this as a separate issue, since the other issue's title does not capture the problem and its discussion mostly focuses on execution speed, not on memory usage.

In my case, I am sending data with socket.io, and this is my journey:

Possible Solution

I strongly suggest heeding manast's suggestion to use a direct buffer allocation approach. If the buffer size is unknown, just run the algorithm once to compute the buffer size and index positions, then re-run it to actually populate the buffer, rather than using the current approach of creating temporary utility objects. This should come at a much lower memory (and probably CPU) cost than the current version.
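Roughly what I have in mind, as a minimal sketch (the `sizeOf`/`writeInto` names are made up, only small non-negative integers, strings and arrays are handled, and this is not notepack's actual code):

```js
// Pass 1 (sizeOf) computes the exact encoded size; pass 2 (writeInto) fills
// one preallocated buffer, so no temporary "defer" objects are ever created.

function sizeOf(value) {
  if (Number.isInteger(value) && value >= 0 && value <= 0x7f) {
    return 1; // positive fixint
  }
  if (typeof value === 'string') {
    const len = Buffer.byteLength(value, 'utf8');
    return (len < 32 ? 1 : len < 256 ? 2 : len < 65536 ? 3 : 5) + len;
  }
  if (Array.isArray(value)) {
    let size = value.length < 16 ? 1 : value.length < 65536 ? 3 : 5;
    for (const item of value) size += sizeOf(item);
    return size;
  }
  throw new TypeError('type not covered in this sketch');
}

function writeInto(value, buf, offset) {
  if (Number.isInteger(value) && value >= 0 && value <= 0x7f) {
    buf[offset] = value; // positive fixint
    return offset + 1;
  }
  if (typeof value === 'string') {
    const len = Buffer.byteLength(value, 'utf8');
    if (len < 32) { buf[offset++] = 0xa0 | len; }                                             // fixstr
    else if (len < 256) { buf[offset++] = 0xd9; buf[offset++] = len; }                        // str 8
    else if (len < 65536) { buf[offset++] = 0xda; buf.writeUInt16BE(len, offset); offset += 2; } // str 16
    else { buf[offset++] = 0xdb; buf.writeUInt32BE(len, offset); offset += 4; }               // str 32
    buf.write(value, offset, 'utf8');
    return offset + len;
  }
  if (Array.isArray(value)) {
    const n = value.length;
    if (n < 16) { buf[offset++] = 0x90 | n; }                                                 // fixarray
    else if (n < 65536) { buf[offset++] = 0xdc; buf.writeUInt16BE(n, offset); offset += 2; }  // array 16
    else { buf[offset++] = 0xdd; buf.writeUInt32BE(n, offset); offset += 4; }                 // array 32
    for (const item of value) offset = writeInto(item, buf, offset);
    return offset;
  }
  throw new TypeError('type not covered in this sketch');
}

function encode(value) {
  const buf = Buffer.allocUnsafe(sizeOf(value)); // single allocation for the whole payload
  writeInto(value, buf, 0);
  return buf;
}
```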

I know the owner currently does not have time to work on this, but one can dream :)

darrachequesne commented 3 years ago

Thanks for the detailed report :+1:

What are the 100M values? Plain strings? Could you please share some code reproducing the issue?

Did you try with another messagepack implementation like @msgpack/msgpack or what-the-pack? Do you encounter the same behavior?

Domiii commented 3 years ago
  1. The values are mostly objects in arrays, nested a few levels deep (some 5 to 7 layers). The raw values are mostly numbers and some strings. (There are no circular references; I'm rather certain of that.)
  2. I don't think other msgpack implementations have a socket.io parser, do they?
  3. I cannot really reproduce an isolated sample right now (time-wise).

But I can offer a few more insights regarding defers. I just ran a sample:

It does not seem impossible that the defers array is the culprit.

Do you want to try to create your own sample with some dummy arrays containing a ton of strings?
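Something along these lines could be a starting point (a rough sketch, not code from this thread; `notepack.io` is assumed to be the published package name, and the sizes are arbitrary):

```js
const notepack = require('notepack.io'); // assumed package name for this repo

// Build a large nested structure of arrays/objects/strings/numbers,
// then watch how much heap a single encode() call costs.
function buildSample(arrays, itemsPerArray) {
  const data = [];
  for (let i = 0; i < arrays; i++) {
    const inner = [];
    for (let j = 0; j < itemsPerArray; j++) {
      inner.push({ id: j, label: `item-${i}-${j}`, value: Math.random() });
    }
    data.push(inner);
  }
  return data;
}

const sample = buildSample(1000, 1000); // ~1M small objects
global.gc && global.gc(); // run with --expose-gc for a cleaner baseline

const before = process.memoryUsage().heapUsed;
const encoded = notepack.encode(sample);
const after = process.memoryUsage().heapUsed;

console.log('encoded size (MB):', (encoded.length / 1024 / 1024).toFixed(1));
console.log('heap growth during encode (MB):', ((after - before) / 1024 / 1024).toFixed(1));
```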

joshxyzhimself commented 2 years ago

Wouldn't it be better to use a compression algorithm (like gzip) for data of huge sizes (e.g. 10 MB and above)? On the client side (browser), pako could decompress it on the main thread or in a worker thread.
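A rough sketch of the server side of that idea (the `notepack.io` package name is an assumption; this is not code from this project):

```js
const zlib = require('zlib');
const notepack = require('notepack.io'); // assumed package name for this repo

// Encode with MessagePack first, then gzip the resulting buffer before sending it.
function encodeCompressed(value) {
  return zlib.gzipSync(notepack.encode(value));
}

// The browser would then call pako.ungzip(payload) before decoding the MessagePack bytes.
```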

uWebSockets.js is also a great alternative to socket.io and other web frameworks (e.g. express, koa, hapi, ws).

Domiii commented 2 years ago

@joshxyzhimself The issue here is with `encode`, not with compression or the transport layer. `encode` has a memory leak that causes it to gobble up 4+ GB of memory (and then crash) to encode only 200+ MB of data (arrays, objects, strings, numbers).

@darrachequesne To answer your question: Things are working after switching to a custom parser around @msgpack/msgpack for socket.io.
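For reference, a minimal custom parser along those lines might look roughly like this (a simplified sketch following socket.io's Encoder/Decoder parser contract; packet validation and error handling are omitted, and this is not the exact code used):

```js
const msgpack = require('@msgpack/msgpack');
const { EventEmitter } = require('events');

// Encoder: turn a socket.io packet into a single binary chunk.
class Encoder {
  encode(packet) {
    return [msgpack.encode(packet)];
  }
}

// Decoder: emit a "decoded" event for every chunk received.
class Decoder extends EventEmitter {
  add(chunk) {
    this.emit('decoded', msgpack.decode(chunk));
  }
  destroy() {}
}

module.exports = { Encoder, Decoder };

// Usage (server side): new Server(httpServer, { parser: require('./parser') })
```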