`OutOfMemoryError` is generally caused by the operating system: Julia tried to allocate more pages and the OS told us "no". What does `ulimit -a` tell you? Alternatively, you could look at `dmesg` and see if there is something relevant there.
```
$ ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 24771016
max locked memory (kbytes, -l) 792687112
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 24771016
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
```
If I start another Julia session, it allocates without any problems.
EDIT: ignore this (it wasn't the problem): "machine has a lot of available RAM", like 3970 GB of 6 TB?! That doesn't seem plausible; for comparison, all I can check is my 32 GB machine, where `free -g` shows 31 for total.
Are you running in some kind of container (Docker?) that might be lying about the true amount of memory available, maybe limiting accessible memory to much less, and confusing Julia?
[Most of your values from `ulimit -a`, such as "max locked memory", are about 194 times larger than mine.]
This is a very big machine, yes, 6 TB. No Docker.
What does `Sys.free_memory()/2^20` show in your open session?
> What does `Sys.free_memory()/2^20` show in your open session?

`4.06578748046875e6` (MiB, i.e. about 3970 GiB free)
And I did a test consuming over 5 TB of RAM on this machine, so I would guess faulty hardware is unlikely.
Also, in case this is useful, `read(open("/proc/self/stat", "r"), String)` gives:

```
2488981 (julia) R 1639744 2488981 2488981 0 -1 4194304 494642993 100 97 0 197761 799102 0 0 20 0 157 0 671868918 2186985312256 491878547 18446744073709551615 4198400 4199090 140720573026864 0 0 0 16902 4112 1073745642 0 0 0 17 61 0 0 0 0 0 4210136 4210752 30289920 140720573034988 140720573035236 140720573035236 140720573038508 0
```
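(If it helps interpretation: per proc(5), field 23 of `/proc/self/stat` is the process's virtual memory size in bytes. A small sketch to pull it out; splitting on whitespace is an assumption that only holds because the comm field, `(julia)`, contains no spaces:)

```julia
# Field 23 of /proc/self/stat is vsize, the virtual memory size in bytes
# (see proc(5)); here it is 2186985312256, i.e. about 2 TB of virtual memory.
stat_fields = split(read("/proc/self/stat", String))
vsize = parse(Int, stat_fields[23])
```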
This is probably a bug related to multi-threading, but I'm not the expert here, so don't trust me. Do some threads not see all the memory available? Maybe it helps to tune with the `--gcthreads` option (or try switching to single-threaded with `-t 1`, to show the problem goes away when memory is really available).
Your stack trace ends at https://github.com/JuliaLang/julia/blob/v1.11.0-rc3/src/genericmemory.c#L73 (`GenericMemory`), but it doesn't show the C code called from https://github.com/JuliaLang/julia/blob/v1.11.0-rc3/base/boot.jl#L516.
Since `GenericMemory` and the new `Memory` type are new in 1.11, and this was also a problem in 1.10, I think it's not worth it for you (or me) to look too closely there.
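(For context, a minimal illustration of the new `Memory` type in 1.11, unrelated to the bug itself:)

```julia
# Memory{T} is the fixed-length, low-level buffer type introduced in
# Julia 1.11; Array is now implemented on top of it.
m = Memory{Float64}(undef, 4)
fill!(m, 0.0)
m[1] = 1.5
```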
How big are those "custom binary files"? Do you use `mmap` for them, or some package to read them that might? (I'm not sure that can be a problem unless they are really huge, TBs or more, and maybe not even then, since mmap only consumes virtual memory.)
Do you actually have 6 TB of RAM, or is that the virtual memory size? Might the physical RAM be (much) smaller, and that be the problem? You can see its size with:

```
grep MemTotal /proc/meminfo
```
> How big are those "custom binary files"? Do you use `mmap` for them (I'm not sure that can be a problem unless they are really huge, TBs or more), or some package to read them that might?
About 7_000 files of around 33 MB each is the smallest set I've seen trigger this (roughly 230 GB on disk). The data is compressed with LEB128 encoding, so the resulting size in RAM is about 8 times bigger, on the order of 1.8 TB.
Files are read with something like `open(parser, filepath)`.
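(For reference, LEB128 is a variable-length integer encoding where each byte carries 7 payload bits; a minimal unsigned decoder sketch in Julia, purely illustrative and not the actual parser used here:)

```julia
# Minimal unsigned LEB128 decoder: each byte contributes 7 payload bits,
# and the high bit signals that another byte follows.
function decode_uleb128(io::IO)
    result = UInt64(0)
    shift = 0
    while true
        b = read(io, UInt8)
        result |= UInt64(b & 0x7f) << shift
        (b & 0x80) == 0 && return result
        shift += 7
    end
end
```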
I am positive there is 6 TB of real RAM:

```
$ grep MemTotal /proc/meminfo
MemTotal: 6341496900 kB
```
> This is probably a bug related to multi-threading, but I'm not the expert here, so don't trust me. Do some threads not see all the memory available? Maybe it helps to tune with the `--gcthreads` option (or try switching to single-threaded with `-t 1`, to show the problem goes away when memory is really available).
>
> Your stack trace ends at https://github.com/JuliaLang/julia/blob/v1.11.0-rc3/src/genericmemory.c#L73 (`GenericMemory`), but it doesn't show the C code called from https://github.com/JuliaLang/julia/blob/v1.11.0-rc3/base/boot.jl#L516. Since `GenericMemory` and the new `Memory` type are new in 1.11, and this was also a problem in 1.10, I think it's not worth it for you (or me) to look too closely there.
I did run the code on a single thread and it didn't produce this problem, but that takes about 1 hour rather than ~2 minutes. I have not tried a single GC thread though; I will do so.
There was a longer trace printed on 1.10, with references to lower-level functions; I will try to reproduce it. Here is some info produced on 1.10:
```
jl_gc_try_alloc_pages at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc-pages.c:84 [inlined]
jl_gc_try_alloc_pages at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc-pages.c:65
jl_gc_alloc_page at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc-pages.c:131
gc_add_page at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:1277
jl_gc_pool_alloc_inner at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:1336
jl_gc_pool_alloc_noinline at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:1357 [inlined]
jl_gc_alloc_ at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia_internal.h:476 [inlined]
jl_gc_alloc at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:3583
_new_array_ at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/array.c:134
jl_new_array_for_deserialization at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/array.c:204
jl_decode_value_array at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/ircode.c:493
jl_decode_value at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/ircode.c:680
ijl_uncompress_ir at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/ircode.c:902
unknown function (ip: 0x7f2c3370d07b)
....
....
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_type_infer at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:394
jl_generate_fptr_impl at /home/artem.solod/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/bin/../lib/julia/libjulia-codegen.so.1.10 (unknown line)
jl_compile_method_internal at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2887 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
eventloop at /home/artem.solod/.julia/packages/IJulia/bHdNn/src/eventloop.jl:15
#15 at /home/artem.solod/.julia/packages/IJulia/bHdNn/src/eventloop.jl:38
unknown function (ip: 0x7f2c3168a3b2)
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
start_task at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/task.c:1238
```
Your `versioninfo` suggests you are on Linux. Could you print the value output by `sysctl vm.max_map_count`? This is the number of mmaps allowed per process. Might need `sudo`. You can see this page (or Google `vm.max_map_count`) for a description of what it does before running the command as root: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html.
We've seen OOMs before when this parameter is too low.
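(If `sudo` is a problem, the current value can also be read without privileges on Linux; a small sketch:)

```julia
# /proc/sys/vm/max_map_count holds the per-process limit on the number
# of memory mappings; reading the file needs no special privileges.
max_maps = parse(Int, readchomp("/proc/sys/vm/max_map_count"))
```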
> Could you print the value output by `sysctl vm.max_map_count`?
```
$ sysctl vm.max_map_count
vm.max_map_count = 65530
```
OK, so you have the default value. Does the OOM reproduce if you set a larger value (4x or maybe even 10x) for `max_map_count`?

```
sysctl -w vm.max_map_count=262144
```

See the Elasticsearch link above for confirmation that this increases the number of mmaps allowed per process.
> Does the OOM reproduce if you set a larger value (4x or maybe even 10x) for `max_map_count`?
I've sent a request to up this value. Meanwhile, I currently see in the session:

```julia
www = @ccall jl_alloc_memory_any((10^6)::Csize_t)::Ptr{Cvoid}
```
```
OutOfMemoryError()

Stacktrace:
 [1] top-level scope
   @ ./In[166]:2
```

but for a smaller request it works:

```julia
www = @ccall jl_alloc_memory_any((10^2)::Csize_t)::Ptr{Cvoid}
```
```
Ptr{Nothing} @0x00007da6739c2a10
```

A direct `malloc` call also seems to work (I am not sure I am doing it right):

```julia
mlcd = @ccall malloc((10^6)::Csize_t)::Ptr{Cvoid}
```
```
Ptr{Nothing} @0x0000001185ccb2c0
```
`jl_alloc_memory_any` uses pooled allocations for small sizes and only goes to `malloc` for larger sizes.
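(To see that crossover in a live session, one could probe a few sizes; the exact pool cutoff is an internal detail, so the sizes below are just guesses:)

```julia
# Probe jl_alloc_memory_any at increasing sizes. In the failing session
# above, the small (pooled) requests succeeded while the ~1 MB request threw.
for n in (10^2, 10^4, 10^6)
    try
        p = @ccall jl_alloc_memory_any(n::Csize_t)::Ptr{Cvoid}
        println(n, " => ", p)
    catch err
        println(n, " => ", err)
    end
end
```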
It's intriguing that this works:

```julia
mlcd = @ccall malloc((10^6)::Csize_t)::Ptr{Cvoid}
```

while this doesn't (assuming both are in the same open session):

```julia
www = @ccall jl_alloc_memory_any((10^6)::Csize_t)::Ptr{Cvoid}
```

[Note: you would usually use `Libc.malloc` for malloc, but it seems equivalent to the former.]
So I'm thinking: could this be a bug in the latter, not an error passed on from the kernel for memory actually running out?
I'm looking into all the ways you can get such exceptions, and there seem to be alternative possibilities: https://github.com/JuliaLang/julia/blob/378f1920db0243224fbea8aea4e69ce36cd85173/src/runtime_ccall.cpp#L202
[Strangely, that function is only used by `jl_get_cfunction_trampoline`, which seems never used by Julia?! Or it's related to `ccall`; do you do something strange (directly) with `ccall`?]
https://github.com/JuliaLang/julia/blob/378f1920db0243224fbea8aea4e69ce36cd85173/src/gc-pages.c#L94 https://github.com/JuliaLang/julia/blob/378f1920db0243224fbea8aea4e69ce36cd85173/src/flisp/cvalues.c#L34 https://github.com/JuliaLang/julia/blob/378f1920db0243224fbea8aea4e69ce36cd85173/stdlib/Sockets/src/addrinfo.jl#L78
Could it be `UV_ENOBUFS` ("no buffer space available") or `UV_ENOMEM` ("not enough memory")? I'm not sure what the former means, i.e. what the difference between them is.
E.g. do you use GMP or unusual data types? GMP also throws, but I suppose only if really out of memory.
Could it be alloca-related (i.e. blowing the stack), e.g. related to `JL_GC_PUSHFRAME` calling `JL_CPPALLOCA`? Or `jl_malloc_aligned`?
> We've seen OOMs before when this parameter is too low.
@d-netto, thank you, this appears to be the source of the problem indeed. I inspected the map count when catching the exception and the value is 65532, just over the limit. FWIW, in a single-threaded session the count after all the data is read is way smaller, at 3077. Is this normal / expected?

The function I used to check the map count just counts the number of lines in `/proc/self/maps`:

```julia
# Count this process's memory mappings: one mapping per line in /proc/self/maps.
function get_map_count()
    open(io -> sum((==)('\n'), readeach(io, Char)), "/proc/self/maps", "r")
end
```
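(To compare the count against the kernel limit directly, one could pair it with the sysctl file; a small sketch, assuming Linux:)

```julia
# Compare this process's current mapping count with the per-process limit.
limit = parse(Int, readchomp("/proc/sys/vm/max_map_count"))
println("map count: ", get_map_count(), " / limit: ", limit)
```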
@PallHaraldsson thank you for looking into this also. I think the error stems from here, which fails due to the aforementioned limit:
https://github.com/JuliaLang/julia/blob/378f1920db0243224fbea8aea4e69ce36cd85173/src/gc-pages.c#L54
and then eventually the error is raised in `char *jl_gc_try_alloc_pages(void)`:
https://github.com/JuliaLang/julia/blob/378f1920db0243224fbea8aea4e69ce36cd85173/src/gc-pages.c#L94
> FWIW, in a single-threaded session the count after all the data is read is way smaller, at 3077. Is this normal / expected?
Hard to tell without looking at your workload. It's possible that multithreading causes you to have more in-flight mmaps if all threads are calling `mmap`.
(Sorry, closed accidentally).
Possible options here seem to be: change your workload to limit the amount of concurrency so that you don't have so many mmaps in flight (e.g. put a semaphore around the code section where you are calling mmap, as sketched below), or tune your Linux kernel to increase `vm.max_map_count`.
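(A minimal sketch of the semaphore option, assuming the workload is an `open(parser, path)` loop as described earlier; `parser` and `filepaths` are stand-ins for the actual code:)

```julia
# Cap concurrent file reads so only a bounded number of mappings are
# in flight at once; other tasks block in acquire until a slot frees up.
sem = Base.Semaphore(8)  # at most 8 concurrent readers
Threads.@threads for path in filepaths
    Base.acquire(sem) do
        open(parser, path)
    end
end
```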
In any case, the discussion of which option would be better in this circumstance is probably out of scope for this thread. Given that there was also no activity in the last three weeks, I'll be closing it.
I do not think I can produce an MWE, but I have a live IJulia session that currently gives `OutOfMemoryError()` even though the machine has a lot of available RAM. This starts to happen after I read a lot of custom binary files with multiple threads; I could not reproduce it when running on a single thread. It initially happened on 1.10; now I have switched to 1.11-rc3 and it repeats.

`versioninfo()`

In one of the runs on 1.10 there were trace messages that mentioned `jl_gc_try_alloc_pages`. Since the session itself is still alive, I would appreciate guidance on what commands could help pin down the source of the problem.
Thanks!