Open thraen opened 8 years ago
Can you post the printed exception stack?
On OSX and 0.5 I see 2 different errors. Running the example as is, I get
"WARNING: An error occured during inference. Type inference is now partially disabled."
UndefVarError(var=:ArgumentError)
[inline] at /Users/amitm/Julia/julia/src/task.c:667
rec_backtrace at /Users/amitm/Julia/julia/src/task.c:868
jl_undefined_var_error at /Users/amitm/Julia/julia/usr/lib/libjulia.dylib (unknown line)
jl_get_binding_or_error at /Users/amitm/Julia/julia/usr/lib/libjulia.dylib (unknown line)
pop! at ./array.jl:433
jlcall_pop!_374 at /Users/amitm/Julia/julia/usr/lib/julia/sys.dylib (unknown line)
jl_apply_generic at /Users/amitm/Julia/julia/src/gf.c:863
inlineable at ./inference.jl:2872
jlcall_inlineable_729 at /Users/amitm/Julia/julia/usr/lib/julia/sys.dylib (unknown line)
jl_apply_generic at /Users/amitm/Julia/julia/src/gf.c:863
which goes away if I change

```julia
@everywhere @inline function slice!(x, t, p)
    # do something
end
```

to

```julia
@everywhere @inline function slice!(x, t, p)
    nothing
end
```
With the above change I get a `Too many open files` error on both 0.5 and 0.4, which is due to gc not freeing resources quickly enough. This goes away by changing the execution loop to

```julia
for i = 1:10000000000
    x = shared_test(p)
    println(i)
    gc()
end
```
The first error is a bug in type inference I guess.
https://github.com/JuliaLang/julia/issues/15419 for the former issue.
Without changes to the above I get:
signal (7): Bus error
signal (7): Bus error
_unsafe_batchsetindex! at cartesian.jl:34
setindex! at abstractarray.jl:592
_unsafe_batchsetindex! at cartesian.jl:34
setindex! at abstractarray.jl:592
jl_apply_generic at /root/Julia/usr/bin/../lib/libjulia.so (unknown line)
jl_apply_generic at /root/Julia/usr/bin/../lib/libjulia.so (unknown line)
anonymous at /root/juliabug/verfahren.jl:10
anonymous at /root/juliabug/verfahren.jl:10
jl_f_apply at /root/Julia/usr/bin/../lib/libjulia.so (unknown line)
jl_f_apply at /root/Julia/usr/bin/../lib/libjulia.so (unknown line)
anonymous at multi.jl:920
anonymous at multi.jl:920
run_work_thunk at multi.jl:651
run_work_thunk at multi.jl:651
run_work_thunk at multi.jl:660
run_work_thunk at multi.jl:660
jlcall_run_work_thunk_21307 at (unknown line)
jlcall_run_work_thunk_21339 at (unknown line)
jl_apply_generic at /root/Julia/usr/bin/../lib/libjulia.so (unknown line)
jl_apply_generic at /root/Julia/usr/bin/../lib/libjulia.so (unknown line)
anonymous at task.jl:58
anonymous at task.jl:58
unknown function (ip: 0x7fedc910ef3c)
unknown function (ip: 0x7fe69e743f3c)
unknown function (ip: (nil))
unknown function (ip: (nil))
Worker 3 terminated.
ERROR (unhandled task failure): EOFError: read end of file
Worker 2 terminated.
ERROR: LoadError: LoadError: ProcessExitedException()
in yieldto at ./task.jl:71
in wait at ./task.jl:371
in wait at ./task.jl:286
in wait at ./channels.jl:63
in fetch at channels.jl:47
in remotecall_wait at multi.jl:752
in remotecall_wait at multi.jl:758
in anonymous at task.jl:447
...and 1 other exceptions.
in sync_end at ./task.jl:413
[inlined code] from task.jl:422
in __SharedArray#138__ at sharedarray.jl:43
in doit at /root/juliabug/verfahren.jl:29
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
while loading /root/juliabug/verfahren.jl, in expression starting on line 35
while loading /root/juliabug/test.jl, in expression starting on line 13
ERROR (unhandled task failure): readcb: connection reset by peer (ECONNRESET)
If I add gc() to the outer loop, it crashes later (same error), and it seems that only one of the workers (-p3) leaks RAM.
Strangely, when I leave out the @sync @parallel loop altogether, both of the workers leak and the same crash occurs, even if I put a gc() in the outer loop.
I think I confused the error messages; I always get the same Bus Error. I'll change the original description.
This does look gc related, specifically gc not being called soon enough. I just tested on Linux, Julia 0.4.3, with an @everywhere gc() in the outer loop. No leak, and it runs fine.
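Sketched out, the loop described above looks like the following (`shared_test` and `p` are the names from the original report):

```julia
# Variant of the reporter's loop that forces a collection on the master
# and on every worker each iteration, so shm segments are released
# promptly instead of piling up until the fd limit is hit.
for i in 1:10000000000
    x = shared_test(p)
    @everywhere gc()   # collect on all processes, not just the master
end
```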
Probably the same issue: https://github.com/JuliaLang/julia/issues/15155
If you are building 0.4 from source, could you try with branch https://github.com/JuliaLang/julia/tree/amitm/backport_finalizeshmem ? With the initial code and a finalize(x) in the outer loop.
In my test the memory grows far more slowly. I interrupted the loop after 6000 iterations and performed an @everywhere gc(), which released all memory.
With your branch and finalize(x) in the outer loop, memory consumption grows more slowly than without finalize(x). But it still grows, on one worker more than on the other, and it still crashes. With @everywhere gc() in the outer loop it's OK.
I think it is apparent that the problem is being caused by gc not freeing resources quickly enough. Changed the title to reflect this.
Perhaps the GC is not aware of the true resource cost of SharedArrays, cc @carnaval @yuyichao.
SharedArray fails to gc() when called within a sequence of functions? (0.5.0-rc3)

Apparently looking at the same problem: I reduced the case to the simplest form possible; the original test case above still crashes on 0.5.0-rc3. Using SharedArray within a sequence of functions:
```julia
addprocs(4)

function chisq(n::Integer)
    A = SharedArray(Float64, n)
    @sync @parallel for i in 1:n
        A[i] = (rand() - rand())^2
    end
    sumsq = sum(A)
end

function calculate(n::Integer)
    b = 0.0
    for j in 1:n
        b += chisq(n)
    end
    return b
end

chisq(500^2)    # ok, no failure
calculate(500)  # fails
```
Evaluating the same number of elements (500 × 500), the single call does not fail, while the loop crashes before the same function has been called 500 times.
And the failure is:
ERROR: SystemError: shm_open() failed for /jl005889eze42OrPYHS9RKjHZihQ: Too many open files
in uv_error at ./libuv.jl:68 [inlined]
in _link_pipe(::Ptr{Void}, ::Ptr{Void}) at ./stream.jl:596
in link_pipe(::Base.PipeEndpoint, ::Bool, ::Base.PipeEndpoint, ::Bool) at ./stream.jl:652
in setup_stdio(::Pipe, ::Bool) at ./process.jl:419
in setup_stdio(::Base.##412#413{Cmd,Ptr{Void},Base.Process}, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}) at ./process.jl:464
in #spawn#411(::Nullable{Base.ProcessChain}, ::Function, ::Cmd, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}, ::Bool, ::Bool) at ./process.jl:477
in (::Base.#kw##spawn)(::Array{Any,1}, ::Base.#spawn, ::Cmd, ::Tuple{Base.DevNullStream,Pipe,Base.TTY}, ::Bool, ::Bool) at ./<missing>:0
in open(::Cmd, ::String, ::Base.DevNullStream) at ./process.jl:539
in read(::Cmd, ::Base.DevNullStream) at ./process.jl:574
in readstring at ./process.jl:581 [inlined] (repeats 2 times)
in print_shmem_limits(::Int64) at ./sharedarray.jl:488
in shm_mmap_array(::Type{T}, ::Tuple{Int64}, ::String, ::UInt16) at ./sharedarray.jl:515
in #SharedArray#786(::Bool, ::Array{Int64,1}, ::Type{T}, ::Type{Float64}, ::Tuple{Int64}) at ./sharedarray.jl:70
in SharedArray{T,N}(::Type{Float64}, ::Tuple{Int64}) at ./sharedarray.jl:57
in #SharedArray#793(::Array{Any,1}, ::Type{T}, ::Type{T}, ::Int64, ::Vararg{Int64,N}) at ./sharedarray.jl:113
in chisq(::Int64) at ./REPL[2]:2
in calculate(::Int64) at ./REPL[3]:4
It also happens on 0.4.6, albeit with a slightly different error:
ERROR: On worker 3:
SystemError: shm_open() failed for /jl006428a6fpOftDBFr087xQnY6F: Too many open files
in remotecall_fetch at multi.jl:747
in remotecall_fetch at multi.jl:750
in call_on_owner at multi.jl:793
in wait at multi.jl:808
in __SharedArray#138__ at sharedarray.jl:74
in SharedArray at sharedarray.jl:117
in chisq at none:2
in calculate at none:4
In fact, even without the @sync @parallel in the for loop of chisq() it still crashes; it crashes even without addprocs().
If @everywhere gc() is called in the second function (at each function call), it doesn't crash (but gc() takes a long time).
Is garbage collection failing to recognize that a function creating SharedArrays is being called many times, and so hitting the system's limit on open files?
This might be a common case, for example when adjusting parameters by optimization of a chi-square function, with each simulation done in parallel and the optimization method calling the chi-square function many times...
Or did I do something wrong?
Best regards Rafael
p.s.: I could reproduce this also in JuliaBox 0.5.0-dev (below) and 0.4.6, but not on a Julia 0.4.5 32-bit Windows system (it also does not fail on 0.5.0-rc3 Windows 32-bit):
In [4]:
calculate(500)
LoadError: On worker 2:
SystemError: shm_open() failed for /jl000034opVp2HcAjt3ix2bbeW5A: Too many open files
in _jl_spawn at ./process.jl:321
in JuliaLang/julia#293 at ./process.jl:474 [inlined]
in setup_stdio at ./process.jl:462
in #spawn#292 at ./process.jl:473
in #spawn at ./<missing>:0
in ip:0x7f5f467573de at /opt/julia-0.5.0-dev/lib/julia/sys.so:? (repeats 2 times)
in readstring at ./process.jl:577 [inlined] (repeats 2 times)
in print_shmem_limits at ./sharedarray.jl:488
in shm_mmap_array at ./sharedarray.jl:515
in JuliaLang/julia#657 at ./sharedarray.jl:80
in JuliaLang/julia#494 at ./multi.jl:1189
in run_work_thunk at ./multi.jl:844
in run_work_thunk at ./multi.jl:853 [inlined]
in JuliaLang/julia#474 at ./task.jl:54
while loading In[4], in expression starting on line 1
in #remotecall_fetch#482(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:904
in remotecall_fetch(::Function, ::Base.Worker, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:898
in #remotecall_fetch#483(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:907
in remotecall_fetch(::Function, ::Int64, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:907
in call_on_owner(::Function, ::Future, ::Int64, ::Vararg{Int64,N}) at ./multi.jl:950
in wait(::Future) at ./multi.jl:965
in #SharedArray#654(::Bool, ::Array{Int64,1}, ::Type{T}, ::Type{Float64}, ::Tuple{Int64}) at ./sharedarray.jl:89
in SharedArray{T,N}(::Type{Float64}, ::Tuple{Int64}) at ./sharedarray.jl:57
in #SharedArray#661(::Array{Any,1}, ::Type{T}, ::Type{T}, ::Int64, ::Vararg{Int64,N}) at ./sharedarray.jl:113
in chisq(::Int64) at ./In[2]:4
in calculate(::Int64) at ./In[2]:14
in execute_request(::ZMQ.Socket, ::IJulia.Msg) at /opt/julia_packages/.julia/v0.5/IJulia/src/execute_request.jl:164
in eventloop(::ZMQ.Socket) at /opt/julia_packages/.julia/v0.5/IJulia/src/IJulia.jl:138
in (::IJulia.##25#31)() at ./task.jl:309
ERROR (unhandled task failure): EOFError: read end of file
Probably best to just increase ulimit -n, as there's no particular reason for it to be small (other than to halt runaway programs; in the common case programs don't need to open many fds).
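For reference, the descriptor limit can be inspected and raised from the shell before launching Julia (4096 below is just an example value; the soft limit cannot exceed the hard limit):

```shell
# Show the current soft limit on open file descriptors
ulimit -n
# Show the hard limit (the ceiling the soft limit can be raised to)
ulimit -Hn
# Raising it for the current shell session before launching Julia would
# look like:  ulimit -n 4096 && julia -p 3 script.jl
```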
I've currently stumbled into this problem while trying to benchmark a function with a SharedArray and a @sync @parallel for.
Is there a good soul around who could look into fixing this? A bug involving parallelism and the GC is just waaay over my head. 😞
Some analysis, with workarounds at the end.
There seem to be a couple of related issues here. One is that the SharedArray is not garbage collected until all references to it disappear, and since most references to it are small there is no urgency on other nodes to run GC. The documentation mentions this, and the workaround is just to call finalize on the shared array explicitly. So this doesn't seem like that big of a deal.
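As a sketch of that explicit-finalize workaround applied to the chisq example above (rewritten in Julia 1.x syntax, where @parallel became @distributed; chisq_eager is a hypothetical name):

```julia
using Distributed, SharedArrays

# Sketch: release the shared segment eagerly instead of waiting for GC,
# so repeated calls don't exhaust the open-file limit.
function chisq_eager(n::Integer)
    A = SharedArray{Float64}(n)
    @sync @distributed for i in 1:n
        A[i] = (rand() - rand())^2
    end
    s = sum(A)
    finalize(A)  # free the SharedArray's local resources now
    return s
end
```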
The more subtle issue is that each SharedArray keeps file descriptors open, and they are not necessarily released even when the shared array is finalized. This file descriptor problem itself seems to have two causes.
1) If a SharedArray is created without a backing file, there are file descriptors created while initializing the array that are not closed, but could be. For example,
```julia
addprocs(4); @everywhere gc()
fdcount() = println(run(pipeline(`lsof`, `wc -l`)))
fdcount()  # 7724
as = [SharedArray{Int}(100,100) for i in 1:1000]
fdcount()  # 124773
@everywhere gc()
fdcount()  # 76685
```
Then, in a new session, run the same thing except replacing the definition of as with

```julia
as = [SharedArray{Int}("/dev/shm/sharedarray-$i", (100,100)) for i in 1:1000]
```

and the second and third fdcount should be about the same. If a SharedArray is created with a backing file, then the extra files are either never opened or are explicitly closed during the shared array's creation. This is probably due to SharedArrays without backing files going through the extra step of calling shm_mmap_array before Mmap.mmap, but I didn't finish looking into it.
2) For a SharedArray S, each future stored in S.refs holds an mmapped array which is created on the future's owning process (ref.where). Each of those arrays keeps a file descriptor open until that memory-mapped array is itself finalized. Depending on where the shared array was originally created, the value S.s might also be an mmapped array which keeps an fd open.
Mmap.mmap creates a finalizer for these arrays so that the backing files can be closed, but the SharedArray finalizer does not call it. (It probably should.) This is why there are file descriptors left open by a finalized shared array until gc() runs on every node.
So, workarounds:

1) Instead of just calling finalize(shared_array), run

```julia
foreach(shared_array.refs) do r
    @spawnat r.where finalize(fetch(r))
end
finalize(shared_array.s)
finalize(shared_array)
```

2) Create shared arrays with as few pids as possible to begin with.
3) Create shared arrays with a backing file, e.g. on tmpfs.

After using a SharedArray inside a function, the following sufficed for me in Julia 1.1.1 to free it:

```julia
finalize(my_shared_array)
@everywhere GC.gc()
```

Without this, the shared array was effectively not gc'ed automatically (even after returning from the function).
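The per-ref finalization workaround above can be wrapped in a small helper. deep_finalize is a hypothetical name, not a Base function, and it relies on the internal fields S.refs and S.s discussed earlier:

```julia
using Distributed, SharedArrays

# Hypothetical helper: finalize a SharedArray together with the mmapped
# arrays its futures hold on each participating worker, so the backing
# file descriptors are closed without waiting for gc() on every node.
function deep_finalize(S::SharedArray)
    @sync for r in S.refs
        @spawnat r.where finalize(fetch(r))
    end
    finalize(S.s)  # the local mmapped view, if any
    finalize(S)
end
```

Call deep_finalize(A) in place of finalize(A) when descriptors must be released promptly.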
Just had the same issue in some code running on Julia 1.1.1. I used @donm's first workaround above, and it fixed the issue. Running finalize alone wasn't enough... perhaps because the workers all used a struct that referenced the SharedArray?
@magerton @donm I have a similar issue with SharedArrays that I have not been able to solve. It does not seem to be a memory issue to me, because it works about 50% of the time. The issue is here: https://discourse.julialang.org/t/systemerror-mmap-the-operation-completed-successfully-issue-with-sharedarrays/31579 .
Using @donm 's workaround did not work out for me.
On my computer, Julia always crashes on this code:
This triggers a
signal (7): Bus error
which results in the workers being killed. I tested this because I wanted to isolate a memory leak that I suspect. I think it also occurs here, but Julia crashes before I can tell.
I think, if I leave out the initialization:
by changing the line to
there is no memory leak, but then there are other errors. I am also not sure whether the parallel loop is relevant for this.
versioninfo: