dominikkiese opened 5 years ago
So apparently the issue appears when I call a function on the shared array; just allocating it and iterating over its entries in a distributed for loop does not crash. Strangely, when the function is called in my calculation, only two or three workers seem to become active, as far as I can tell from top. Memory consumption then grows until Julia crashes, without the loop ever finishing. Does anybody know why that is? Are the workers maybe not properly connected to the master? I can't see why they wouldn't be, since they are all started on the same machine, but maybe that assumption is wrong.

This works as an MWE for me; is anybody able to reproduce it?
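For what it's worth, a quick way to check that the workers really are reachable from the master is to ping each one with `remotecall_fetch`; a minimal sketch (using 4 workers here rather than 68, an arbitrary choice for the example):

```julia
using Distributed

addprocs(4, topology=:master_worker)

# Ping every worker: remotecall_fetch throws if a worker is dead or
# unreachable, so completing this loop confirms the master can talk
# to all of them.
for w in workers()
    pid = remotecall_fetch(getpid, w)
    println("worker $w is alive (OS pid $pid)")
end
```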
```julia
using Distributed
using SharedArrays

addprocs(68, topology=:master_worker)

A = ones(Float64, 100_000_000)
B = SharedArray{Float64,1}(length(A))

@sync @distributed for i in 1:length(A)
    B[i] = A[i]
end
```
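One thing worth noting about the MWE (an assumption on my part, not something I have verified against the scheduler internals): `A` is a plain `Array` captured by the `@distributed` closure, so each worker may receive its own serialized copy of it, roughly 0.75 GiB per worker at 10^8 `Float64`s, i.e. ~50 GiB across 68 workers. Making the source a `SharedArray` as well should avoid those copies, since a `SharedArray` is sent to same-host workers as a reference to the shared memory segment. A scaled-down sketch of that variant:

```julia
using Distributed

addprocs(4, topology=:master_worker)  # 4 workers here; the report uses 68
@everywhere using SharedArrays        # load the package on all processes

n = 1_000_000  # scaled down from 10^8 so the sketch runs quickly
A = SharedArray{Float64,1}(n)
A .= 1.0
B = SharedArray{Float64,1}(n)

# Workers receive A and B as references to shared memory, not as copies.
@sync @distributed for i in 1:n
    B[i] = A[i]
end
```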
Hello everyone,

I see the following strange behavior when starting multiple processes on a KNL node with Julia 1.0.1. I add the processes using

```julia
addprocs(Sys.CPU_THREADS, topology=:master_worker)
```

Already that alone consumes roughly a fourth of the available memory (25 out of 96 GB). In my code I then allocate a large shared array (~10^6 elements) and compute its entries (just multiplications and sums; there should not be any further allocations). The job hangs when sharing the array or when trying to iterate over it via `@sync @distributed`. During that period memory consumption grows until a bus error occurs and the job is cancelled. The same code runs fine on my local machine with 4 cores, with stable memory consumption.

Any ideas where that may come from? Can anyone reproduce something similar with a shared array of similar size and many processes?
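On the memory consumed just by `addprocs`: a KNL node typically reports 272 hardware threads (68 cores × 4 threads; the 272 is my assumption about this node, since the post only says `Sys.CPU_THREADS`). Spreading 25 GB across that many worker processes works out to roughly 90 MB of baseline runtime footprint each, which is plausible for an idle Julia 1.0 process, so that part may be expected overhead rather than a leak. A quick check of the arithmetic:

```julia
total_gb = 25
n_procs  = 272  # assumed: 68 KNL cores × 4 hardware threads

# Baseline memory footprint per spawned Julia process.
per_proc_mb = total_gb * 1024 / n_procs
println(round(per_proc_mb; digits=1), " MB per process")  # prints 94.1 MB per process
```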