Document possible fix to huge memory usage.

Gertian commented 2 months ago

I recently found out that the excessive memory usage of Gradient Descent (perhaps also Vumps but this didn't cause crashes for me) can be tamed by including julia.gc() inside the finalize! funtion.

I provided some documentation on this.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

see 1 file with indirect coverage changes

lkdvos commented 2 months ago

I have to admit that I am not a huge fan of this. Keep in mind that if you add GC.gc(), this has major performance implications. In particular, running the garbage collector means stopping everything (including all other threads), and we should really just invest some time and effort into reducing our memory usage (which is going on, once we get the TensorOperations v5 changes pushed through). I think my opinion here is mostly that if you know what you are doing, you might experiment with manually running the GC, and this extra line in the documentation won't really help. In situations where you are limited by memory, you really might want to just turn off the MPSKit multithreading features (as already suggested in the docs). Furthermore, this is something that is really up to Julia itself, and I am not super keen on working actively against the julia compiler, as this is very version dependent, and what might work on one version might hurt on another.

On the other hand, if you don't know what you are doing, then this suggestion is something along the lines of copy-pasting stack-exchange code in your terminal and keeping your fingers crossed. It might work, but also could really hinder you in the long run. Properly explaining this is really beyond the scope of MPSKit, so I would prefer not including this.

As a side note, have you experimented with the new --heap-size-hint flag? This was added in 1.9, and tells Julia to run the GC more aggressively, which might also solve your issues.

Gertian commented 2 months ago

Thanks for the input.

I tried removing the MPSKit parallelism, and this worked well.But given the fact that I'm running a model with a 16-site unit cell and lots of blocks in the Hamiltonian it lead to very slow performance.

On the other hand, calling gc() every 100 (I agree this number is totally ad hoc sadly) gradient descent steps leads to stable code which can happily run fully parallel.

I wasn't aware of this --heap-size-hint flag, I'll give it a try somewhere in the future.

Gertian commented 2 months ago

ps : I've just read a bit about this flag. It looks very interesting and indeed, potentially a better thing to do.

Currently my numerics are going to remain running for a few days on the cluster (and I can't really test this on my laptop) but I'll implement this in the next sbatch.

Thanks for pointing this out :+1:

lkdvos commented 2 months ago

On the other hand, calling gc() every 100 (I agree this number is totally ad hoc sadly) gradient descent steps leads to stable code which can happily run fully parallel.

The problem is that indeed you need a lot of heuristics, and just a single line in the docs will more likely lead to more confusion than it would help. I am definitely working on improving this in any case, it's the eternal struggle of allocating temporary tensors in the eigsolver loops, and I am pretty sure we can alleviate them, it's just going to take me a bit of time :)

Gertian commented 2 months ago

Doesn't the fact that my "fix" works point out that the issue is simply that julia doesn't run the garbage collector enough ?

The difference is really huge, before my code crashed with 500GB RAM and now it's running smoothly with 200GB. So somehow 300GB was being filled up with unneccecairy stuff.

Do I understand it correctly that the temporary tensors are just not being removed from RAM quick enough ?

lkdvos commented 2 months ago

It's not entirely clear what is going on, but indeed, that is the main gist of it. It seems like Julia is not to keen at running the garbage collector in multithreaded code -- for good reason, as this needs to stop all threads -- and thus it just does not free the memory that is being allocated fast enough. As we are really spawning temporary tensors in a loop in each task, this puts quite a bit of pressure on the heap, and julia's heuristics seem to not trigger in time. As a sidenote, I also bumped into GC.safepoint() recently, which we maybe should consider including. This should give Julia more options to run the GC, which might (or might not) help. Again, these are all symptoms of the underlying problem, we are creating to much garbage, and should really be recycling things.

QuantumKitHub / MPSKit.jl

Document possible fix to huge memory usage. #176

Codecov Report