JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

Memory leak using shared arrays on cluster? #60

Open dominikkiese opened 5 years ago

dominikkiese commented 5 years ago

Hello everyone,

I see the following strange behavior when starting multiple processes on a KNL node with Julia 1.0.1. I add some processes using addprocs(Sys.CPU_THREADS, topology=:master_worker). That alone already consumes roughly a quarter of the available memory (25 of 96 GB). In my code I then allocate a large shared array (~10^6 elements) and compute its entries (just multiplications and sums, so there should not be any further allocations). My job now hangs when sharing the array or when trying to iterate over it via @sync @distributed. During that period memory consumption grows until a bus error occurs and the job is cancelled.

The same code runs fine on my local machine with 4 cores, with memory consumption stable.

Any ideas where that may come from? Can anyone reproduce something similar with a shared array of similar size and many processes?
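One way to put numbers on the memory use described above is to ask every process for its own footprint. The following is only a minimal sketch: the worker count of 2 and the variable name `rss` are chosen for illustration, and `Sys.maxrss` reports the peak resident set size, which is not necessarily what a job scheduler counts.

```julia
using Distributed

# Two workers keep the sketch light; the report above uses Sys.CPU_THREADS.
addprocs(2, topology=:master_worker)

# Peak resident set size in bytes for the master and every worker.
rss = [(p, remotecall_fetch(Sys.maxrss, p)) for p in procs()]

for (p, bytes) in rss
    println("process ", p, ": ", round(bytes / 2^20, digits=1), " MiB peak RSS")
end

# Node-wide view as seen from the master process.
println("free: ", round(Sys.free_memory() / 2^30, digits=2), " GiB of ",
        round(Sys.total_memory() / 2^30, digits=2), " GiB")
```

Running this once right after addprocs and again after the shared array has been allocated should show whether the growth sits in the workers or in the master.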

dominikkiese commented 5 years ago

So apparently the issue appears when I call a function on the shared array. Just allocating it and iterating over its entries in a distributed for loop does not crash. Strangely, if the function is called in my calculation, only two or three workers seem to become active, as I can see from top. Memory consumption then grows until Julia crashes, without the loop ever finishing. Does anybody know why that is? Are the workers maybe not properly connected to the master? I would not know why, because they are all initialized on the same machine, but maybe that assumption is wrong.
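Whether the workers are reachable, and whether a distributed loop really spreads over all of them, can be checked without the real computation. This is a sketch under assumptions: the worker count of 4 and the names `ids` and `active` are hypothetical, and a trivial loop body says nothing about how the actual function behaves.

```julia
using Distributed

addprocs(4, topology=:master_worker)  # hypothetical small count for illustration

# Round trip to each worker: a broken connection would throw or hang here.
for p in workers()
    @assert remotecall_fetch(myid, p) == p
end

# Record which process ran each iteration; vcat merges the per-worker results.
ids = @distributed (vcat) for i in 1:100
    myid()
end

active = sort(unique(ids))
println("workers that ran iterations: ", active)
```

If every id from workers() shows up in `active`, the loop is being distributed, and the two or three busy processes seen in top would more likely point at what happens inside the function call.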

dominikkiese commented 5 years ago

This works as an MWE for me. Is anybody able to reproduce it?

using Distributed
using SharedArrays

addprocs(68, topology=:master_worker)

A = ones(Float64, 100000000)
B = SharedArray{Float64, 1}((length(A)))

@sync @distributed for i in 1 : length(A)
    B[i] = A[i]
end
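One thing worth testing with this MWE is that A is a plain Array while only B is shared. The sketch below is not a confirmed diagnosis. It assumes that A, being referenced inside the loop body, is serialized to every worker that runs part of the loop, and it estimates how much memory that would take. It then runs a variant in which both arrays are SharedArrays. The worker count of 2 and the smaller n are chosen only so it runs quickly.

```julia
using Distributed

addprocs(2, topology=:master_worker)  # small hypothetical count; the MWE uses 68
@everywhere using SharedArrays

# Estimate for the MWE, ASSUMING each worker ends up with its own copy of A.
copy_bytes  = 100_000_000 * sizeof(Float64)   # bytes in one copy of A
total_bytes = 68 * copy_bytes                 # if all 68 workers hold a copy
println("one copy: ", copy_bytes / 1e9, " GB, 68 copies: ", total_bytes / 1e9, " GB")

# Variant with both arrays in shared memory (smaller n for a quick run).
n = 10^6
A = SharedArray{Float64}(n)
fill!(A, 1.0)
B = SharedArray{Float64}(n)

@sync @distributed for i in 1:n
    B[i] = A[i]
end

println(sum(B) == n)
```

Under that assumption the copies alone would come to 54.4 GB on top of the roughly 25 GB the idle workers already use, which would be consistent with memory running out on a 96 GB node. If the variant with a shared A stays flat in memory at full size, that would support the explanation.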