frost-beta / node-mlx

Machine learning framework for Node.js.

JavaScript's garbage collector is not very ready for machine learning #2

Open zcbenz opened 2 months ago

zcbenz commented 2 months ago

The problem

When using node-mlx to train an LLM, RAM was quickly exhausted. After some profiling, it turned out that MLX's tensors (mx.array) were not being garbage collected fast enough, and the process ran out of RAM.

How tensors are garbage collected in JavaScript

To understand the problem, we need to know how GC works in JavaScript. I'm mostly referring to V8 here but as far as I know other popular JavaScript engines share the same behavior.

JavaScript engines usually use mark-and-sweep garbage collectors, which find the unreachable objects first (mark) and then release them (sweep). In modern engines these steps are done in a concurrent and incremental way.

A tensor is exposed to JavaScript as a special wrapper object: the pointer to the C++ object is stored as an internal field of the JavaScript object, and the C++ memory is only freed when the JavaScript object is garbage collected.
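Conceptually, the wrapping looks something like the sketch below. This is only an illustration of the idea using a FinalizationRegistry; node-mlx's real binding keeps the pointer in an internal field of the native wrapper, and the Tensor class and the nativeHandle / releaseNativeTensor names here are made up for the example.

// Illustration only: the native memory is released in a finalizer that runs
// some time after the JS wrapper object has been garbage collected.
const registry = new FinalizationRegistry((nativeHandle) => {
  releaseNativeTensor(nativeHandle)
})

class Tensor {
  constructor(nativeHandle) {
    this.nativeHandle = nativeHandle  // stand-in for the pointer to the C++ object
    registry.register(this, nativeHandle)
  }
}

function releaseNativeTensor(handle) {
  // Placeholder: the real addon would call into C++ here to free MLX's buffer.
}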

Timing of GC

JavaScript engines only do GC when they find it necessary, so from the user's point of view, when an object gets garbage collected is non-deterministic. The only way developers can affect GC is to send a memory pressure notification to the engine to hint that it should do GC now.

However, even when using internal APIs (for example node --expose_gc) to force GC, you can only make the engine release the objects that have already been marked unreachable (the sweep step); there is no way to force the engine to do a full mark step and find all the unreachable objects.
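For reference, forcing GC from JavaScript looks like this; it is a general Node.js/V8 facility, not anything specific to node-mlx:

// Run with: node --expose-gc script.js
// gc() asks V8 to collect right away, but as described above it mostly
// releases what has already been marked unreachable.
if (typeof globalThis.gc === 'function') {
  globalThis.gc()
}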

This is a very good design for most applications, as GC has minimal impact on performance. However, for machine learning it leads to disaster.

Unexpected case for large RAM use

When doing machine learning, especially when training a model, you usually want to use as much of the available RAM as possible. Even for casual users that number can be as large as 100GB, thanks to the M3 Max.

This becomes a problem with JavaScript's GC: after iterating over a mini-batch you want its tensors released as soon as possible, because you immediately start iterating over the next mini-batch. However, since there is no way to force the JavaScript engine to do a full mark and sweep, the RAM taken by the last iteration has usually not been released yet when you start the next one, and you get OOM.

Solutions

The provided (not working) solution

Apart from machine learning, there are other applications that use quite a lot of RAM, like WebRTC and WebGPU, so V8 does provide an API that lets an application report how much external RAM it has allocated (V8's AdjustAmountOfExternalAllocatedMemory, available to native addons as napi_adjust_external_memory), and V8 will do GC more frequently as that amount grows.

But the timing of collecting the tensors is still non-deterministic, as it is not possible to force V8 to do a full mark to find all the unreachable tensors. So in most cases the GC comes much later than needed, and RAM is exhausted before the tensors are released.

The TensorFlow.js solution

To solve the problem, TensorFlow.js provides a set of APIs for manually releasing tensors. The downside is obvious: the code gets uglier, there are more chances to make mistakes, and much of the point of using JavaScript is lost.
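For comparison, this is roughly what manual management looks like with TensorFlow.js's tidy and dispose APIs (the computation itself is just a toy example):

const tf = require('@tensorflow/tfjs')

// tf.tidy() releases every intermediate tensor created inside the callback,
// keeping only the returned tensor alive.
const result = tf.tidy(() => {
  const a = tf.tensor([1, 2, 3])
  const b = a.square()  // intermediate, released when tidy() returns
  return b.sum()
})

result.dispose()  // the caller still has to release the result explicitly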

What about Python?

Why does Python not have this GC problem? It is because Python's garbage collector primarily uses reference counting, so tensors are released immediately once they are no longer referenced.

zcbenz commented 2 months ago

What can we do

I'm going to use both solutions in node-mlx: first, add APIs to MLX that let us report RAM usage to V8 so that GC happens at better times; then, add JS-only APIs that allow manual management of tensors.
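As a rough idea of what the JS-only manual-management part could look like, here is a scope-based helper that disposes everything created inside a callback except the returned value. This is purely illustrative and not node-mlx's actual or planned API; the tidy name, the track callback and the dispose method are assumptions.

// Hypothetical sketch: track the tensors created inside a scope and dispose
// all of them except the one that is returned.
function tidy(fn) {
  const created = []
  const track = (tensor) => { created.push(tensor); return tensor }
  const result = fn(track)
  for (const t of created) {
    if (t !== result && typeof t.dispose === 'function')
      t.dispose()
  }
  return result
}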

What can JavaScript engines do

I hope that in the future, engines like V8 can add the ability to reclaim certain objects using reference counting; for applications like machine learning and GUIs, the timing of GC is much more important than GC performance.

mikehearn commented 2 months ago

Is the issue here that the memory underlying the tensors is outside of the GC'd heap, so the GC doesn't know you're running out of memory? Because normally, in any GC, timing doesn't matter at all. If an allocation is about to fail because there isn't enough heap space, a full stop-the-world GC will be triggered before the engine gives up. The only way this can fail is if you have tiny GC controlled objects that point to huge native allocations. Then the GC doesn't understand it's running out of RAM and won't cause the memory to be released.

At any rate, the issue here is not so much JavaScript as V8, as you notice. You could try an alternative implementation like GraalJS, which can benefit from the more server-scale GC implementations in the JVM.

zcbenz commented 2 months ago

Is the issue here that the memory underlying the tensors is outside of the GC'd heap, so the GC doesn't know you're running out of memory?

The only way this can fail is if you have tiny GC controlled objects that point to huge native allocations.

Yeah, this is exactly what I'm doing in this module. The GPU-accelerated machine learning libraries use their own memory allocators, so it is impossible to manage the tensors' memory inside the JS engine's heap.

If an allocation is about to fail because there isn't enough heap space, a full stop-the-world GC will be triggered before the engine gives up.

Only doing a full GC when an allocation would fail is not enough in our case. For example, when running inference with a local LLM that takes 8GB of RAM on a laptop, we would want the app to occupy a consistent amount of memory instead of bursting to use most of the RAM, which would make other apps unusable.

At any rate, the issue here is not so much JavaScript as V8, as you notice. You could try an alternative implementation like GraalJS, which can benefit from the more server-scale GC implementations in the JVM.

Thanks for letting me know about GraalJS, it seems like a promising project. However, V8 is something that I must support: being able to npm install this module and then start doing machine learning is the reason why I started this project.

DomThePorcupine commented 1 month ago

Do you have the training code you were using available somewhere? I can't seem to quite find an example in the tests or other docs - perhaps this is because of the performance. But I'd love to play around with it if you'd be open to sharing the code :)

zcbenz commented 1 month ago

I will upload the training code after (at least partially) fixing this issue. I'm currently on a long trip so I am not able to do that until next month.

zcbenz commented 3 weeks ago

Just observed another interesting behavior of V8's GC.

If you keep creating native objects in a single tick, they won't ever get collected:

// Run with node --expose-gc so that gc() is available.
while (true) {
  new Object()
  gc()
}

You have to give the current event loop tick a chance to end; it seems that the mark step only runs after that:

setInterval(() => {
  new Object()
  gc()
}, 10)

This creates another challenge when training or running inference on models, because you cannot just use plain while loops, which would use up all your RAM.
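One common workaround is to yield back to the event loop between iterations; this is a general Node.js pattern rather than a node-mlx API, and trainLoop / runOneStep are placeholder names:

const { setImmediate: yieldToEventLoop } = require('timers/promises')

async function trainLoop(steps, runOneStep) {
  for (let i = 0; i < steps; i++) {
    runOneStep(i)
    // Ending the current tick gives V8 a chance to mark and sweep the
    // tensors created by this iteration before the next one starts.
    await yieldToEventLoop()
  }
}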

zcbenz commented 2 weeks ago

Do you have the training code you were using available somewhere?

@DomThePorcupine I have put up some training code at https://github.com/frost-beta/train-model-with-js