JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

OOM despite `--heap-size-hint` #50658

Closed ufechner7 closed 8 months ago

ufechner7 commented 1 year ago
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 4 on 8 virtual cores
Environment:
  JULIA_CONDAPKG_OFFLINE = yes

I often see my code killed because the system runs out of memory. This happens when using pmap, but also when I repeatedly run single-threaded, single-process code that allocates a lot from the REPL. I tried adding --heap-size-hint, but it did not help.

My workaround: I added the following code to all functions that allocate a lot:

if Sys.free_memory()/2^30 < 6.0
    GC.gc()
end

This should not be needed: the garbage collector should do a full collection on its own before the system runs out of memory.
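For anyone who wants to try the same workaround, it can be wrapped in a small helper so the threshold lives in one place. This is only a sketch: the name `gc_if_low_memory` and its default of 6 GiB are illustrative and come from the snippet above, not from any Julia API.

```julia
# Hypothetical helper wrapping the workaround above; name and default are illustrative.
# Runs a full collection when the OS reports less free memory than `min_free_gib`
# and returns whether a collection was triggered.
function gc_if_low_memory(min_free_gib::Real = 6.0)
    free_gib = Sys.free_memory() / 2^30   # Sys.free_memory() returns bytes
    if free_gib < min_free_gib
        GC.gc()                           # full collection
        return true
    end
    return false
end
```

Calling `gc_if_low_memory()` at the top of each allocation-heavy function reproduces the behaviour described above.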

jishnub commented 1 year ago

Could you post a small example that leads to the error? This would help a lot in narrowing the issue down.

elextr commented 1 year ago

Possibly a duplicate of #42566; see from here on.

ufechner7 commented 1 year ago

> Could you post a small example that leads to the error? This would help a lot in narrowing the issue down.

It happens reproducibly with my production code, but I am not allowed to share it. So far it has not happened with the smaller code examples I tried; I will keep trying to create an MWE.

ufechner7 commented 1 year ago

> Possibly a duplicate of #42566; see from here on.

But in #42566 they say that "GC.gc(true); GC.gc() Does not fix it."

But for me GC.gc() frees the unreleased memory. So it might be a different issue.

elextr commented 1 year ago

> But for me GC.gc() frees the unreleased memory. So it might be a different issue.

Indeed. If manually running the GC stops the OOM killer from bumping off your process, then the problem is likely not a failure to return freed memory to the system, but how the GC learns that OOM is approaching so that it can work harder to collect unreferenced memory. IIRC there are several Julia issues about that, but my search for them is failing just now.

vchuravy commented 1 year ago

What --heap-size-hint did you set?

ufechner7 commented 1 year ago

> What --heap-size-hint did you set?

julia -J bin/kps-image-1.9.so --project -i -q -p 16 --heap-size-hint=1G

And I have 32 GB of memory.

oscardssmith commented 1 year ago

It would be good to see whether this still happens on recent Julia nightlies. @gbaraldi's recent GC logic changes should have fixed this.

elextr commented 1 year ago

Just a note that, IIUC, the OOM killer is triggered by the total memory of your cgroup, not just by the parent process, so it would likely count the memory used by any worker processes as well.

Does --heap-size-hint propagate to the workers?
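To look at memory across the parent and all workers, something like the following could be run from the parent. It is a rough sketch: `cluster_maxrss` is a made-up name, and summing peak RSS per process can overstate the real total because pages shared between processes (for example a memory-mapped system image) are counted once per process.

```julia
using Distributed

# Hypothetical helper: peak resident set size (bytes) of every Julia process
# in the cluster, keyed by process id. Process 1 is the parent.
function cluster_maxrss()
    return Dict(p => Int(remotecall_fetch(Sys.maxrss, p)) for p in procs())
end

rss = cluster_maxrss()
total_gib = sum(values(rss)) / 2^30
println("processes: ", length(rss),
        ", combined peak RSS: ", round(total_gib; digits = 2), " GiB")
```

Without any workers this reports the parent only; after `addprocs` or `-p N` it includes every worker.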

vchuravy commented 1 year ago

How big is bin/kps-image-1.9.so? Or, just after starting Julia, how much memory does ps aux say you are using?

--heap-size-hint is currently not strict: it measures only the live heap, not the sysimage, shared libraries, etc.
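The gap between the two numbers can be inspected from inside a session. A minimal sketch, assuming that the internal (unexported) `Base.gc_live_bytes()` is a fair proxy for what the GC counts as live, and using `Sys.maxrss()` for what the OS has charged to the whole process at its peak:

```julia
# Compare what the GC counts as live with the peak footprint of the whole process.
live_mib = Base.gc_live_bytes() / 2^20   # internal API: bytes the GC considers live
rss_mib  = Sys.maxrss() / 2^20           # peak resident set size, includes sysimage etc.

println("GC live heap: ", round(live_mib; digits = 1), " MiB")
println("peak RSS:     ", round(rss_mib; digits = 1), " MiB")
```

A large difference right after startup would point at memory outside the GC heap rather than at garbage that has not been collected yet.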

MilesCranmer commented 1 year ago

> @elextr Does --heap-size-hint propagate to the workers?

I also noticed this in a different context. It seems like the interaction between processes and heap-size-hint is not yet defined (?). I posted an issue here: https://github.com/JuliaLang/julia/issues/50673.
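Until that interaction is defined, one possible way to give each worker its own hint is to pass the flag explicitly through the `exeflags` keyword of `addprocs`. The sketch below is an assumption about how one might do it, not something confirmed in this thread; `heap_hint_flag` and `add_workers_with_hint` are hypothetical names, and splitting the budget evenly is just one policy.

```julia
using Distributed

# Hypothetical helper: split a total memory budget (bytes) evenly across
# workers and format it as a --heap-size-hint flag in megabytes.
heap_hint_flag(budget_bytes::Integer, nworkers::Integer) =
    "--heap-size-hint=$(budget_bytes ÷ nworkers ÷ 2^20)M"

# Hypothetical wrapper: start workers with the flag set on each of them.
# `exeflags` is an existing keyword argument of `addprocs`.
function add_workers_with_hint(nworkers::Integer, budget_bytes::Integer)
    return addprocs(nworkers; exeflags = heap_hint_flag(budget_bytes, nworkers))
end
```

For example, `add_workers_with_hint(16, 16 * 2^30)` would start 16 workers, each launched with `--heap-size-hint=1024M`.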

MilesCranmer commented 1 year ago

@oscardssmith can you link the PRs you mentioned?

oscardssmith commented 1 year ago

https://github.com/JuliaLang/julia/pull/50144

ufechner7 commented 1 year ago

> How big is bin/kps-image-1.9.so? Or, just after starting Julia, how much memory does ps aux say you are using?
>
> --heap-size-hint is currently not strict: it measures only the live heap, not the sysimage, shared libraries, etc.

ufechner@ufryzen:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            30Gi        11Gi        12Gi        74Mi       6,7Gi        18Gi
Swap:          1,9Gi          0B       1,9Gi

and in ps aux, 16 times, a line like this:

ufechner   10971  3.5  3.5 2397148 1143096 ?     Ssl  08:18   0:07 /home/ufechner/packages/julias/julia-1.9/bin/julia -Cnative -J/home/ufechner/repos/WindTurbines/bin/kps-image-1.9.so -g1 --bind-to 127.0.0.1 --worker

and

ufechner@ufryzen:~/repos/WindTurbines/bin$ ls -lah kps-image-1.9.so 
-rwxrwxr-x 1 ufechner ufechner 808M jul 24 11:48 kps-image-1.9.so
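
As a rough check on these numbers, the worker footprint can be added up. This assumes all 16 workers have about the same RSS as the one line shown, and it ignores that pages of the system image may be shared between processes, so the real total is probably lower.

```julia
# Numbers copied from the ps aux line above; assumes every worker looks the same.
rss_kib_per_worker = 1_143_096   # RSS column, in KiB
nworkers           = 16

total_gib = rss_kib_per_worker * nworkers / 2^20   # KiB -> GiB
println(round(total_gib; digits = 1), " GiB")      # prints "17.4 GiB"
```

That is an upper bound of about 17.4 GiB for the workers alone, on a machine that reports 30Gi in total.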

vtjnash commented 8 months ago

We have updated the heuristics further, so the GC should try harder to avoid exceeding this memory limit. We do not, however, control how much memory external libraries (e.g. LLVM) require, so we expect precompilation to take substantial amounts of memory and to be feasible only on large build machines. That is a requirement for building, not for running.