Closed — raminammour closed this issue 4 years ago.
Could you please provide the output of Pkg.status("DistributedArrays")?
Pkg.status("DistributedArrays")
- DistributedArrays 0.4.0+ master
Hm. I thought that the commit hash would be there. Could you please provide the exact commit you are on?
git show
commit 3996ddd24aa743c83c3a4e2f69c3c6444e4fa015
Merge: 6a376d1 9d9c3bb
Author: Andreas Noack <andreasnoackjensen@gmail.com>
Date: Mon Jul 3 16:07:28 2017 -0400
Merge pull request #149 from JuliaParallel/amitm/darr_leak
fix leak - make registry entry a WeakRef on the node releasing the darray
Thanks. @amitmurthy I thought the leak was fixed with your latest commit. It would be great if you could take a look at this.
I could run this example on both Julia master and 0.6 without any leaks, with 4 workers and smaller-sized darrays.
No issues with the following n values over 1000 iterations.
n1,n2,n3=201,201,71
for i=1:1000
    dA=drand((n1,n2,n3),workers()[1:(i%length(workers())+1)]);dlA=similar(dA);
end
Thanks for the assistance :)
n1,n2,n3=2001,2001,701
for i=1:16
    dA=drand((n1,n2,n3),workers()[1:i]);dlA=similar(dA);test_gc(dA,dlA);
    d_closeall();@everywhere gc();
    println(i);
    @fetchfrom 2 run(pipeline(`free`,stdout="bla",append=true))
end
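The helper test_gc is never defined in the thread. A minimal stand-in — assuming it only needs to allocate temporaries from the input arrays so the garbage collector has work to do, which is all the leak reproduction requires — could look like:

```julia
# Hypothetical stand-in for the undefined test_gc from this thread.
# Assumption: it exercises the arrays by allocating same-sized
# temporaries that immediately become garbage.
function test_gc(dA, dlA)
    for _ in 1:3
        tmp = dA .+ 1      # temporary array the size of dA
        dlA .= tmp .- 1    # copy back; tmp is now unreachable
    end
    return nothing
end
```

Any function with this shape (allocate, write, discard) should reproduce the same GC pressure; the exact body of the original helper does not matter for the leak.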
On 0.6
total used free shared buffers cached
Mem: 65985900 26317860 39668040 0 0 407180
-/+ buffers/cache: 25910680 40075220
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 26726796 39259104 0 0 407312
-/+ buffers/cache: 26319484 39666416
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 27024688 38961212 0 0 407428
-/+ buffers/cache: 26617260 39368640
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 27386168 38599732 0 0 407508
-/+ buffers/cache: 26978660 39007240
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 27578264 38407636 0 0 407648
-/+ buffers/cache: 27170616 38815284
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 36557240 29428660 0 0 407748
-/+ buffers/cache: 36149492 29836408
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 36720956 29264944 0 0 407872
-/+ buffers/cache: 36313084 29672816
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 36864676 29121224 0 0 407992
-/+ buffers/cache: 36456684 29529216
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 42546900 23439000 0 0 408104
-/+ buffers/cache: 42138796 23847104
Swap: 0 0 0
...and out of memory.
On 0.5
1
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 9451632 56534268 0 0 862796
From worker 2: -/+ buffers/cache: 8588836 57397064
From worker 2: Swap: 0 0 0
2
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 20671804 45314096 0 0 862832
From worker 2: -/+ buffers/cache: 19808972 46176928
From worker 2: Swap: 0 0 0
3
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 17041704 48944196 0 0 862832
From worker 2: -/+ buffers/cache: 16178872 49807028
From worker 2: Swap: 0 0 0
4
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 15088944 50896956 0 0 862832
From worker 2: -/+ buffers/cache: 14226112 51759788
From worker 2: Swap: 0 0 0
5
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 9560740 56425160 0 0 862840
From worker 2: -/+ buffers/cache: 8697900 57288000
From worker 2: Swap: 0 0 0
6
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 13163940 52821960 0 0 862848
From worker 2: -/+ buffers/cache: 12301092 53684808
From worker 2: Swap: 0 0 0
7
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 12654316 53331584 0 0 862852
From worker 2: -/+ buffers/cache: 11791464 54194436
From worker 2: Swap: 0 0 0
8
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 9527396 56458504 0 0 862856
From worker 2: -/+ buffers/cache: 8664540 57321360
From worker 2: Swap: 0 0 0
9
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 11977056 54008844 0 0 862860
From worker 2: -/+ buffers/cache: 11114196 54871704
From worker 2: Swap: 0 0 0
10
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 9476404 56509496 0 0 862864
From worker 2: -/+ buffers/cache: 8613540 57372360
From worker 2: Swap: 0 0 0
11
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 11452680 54533220 0 0 862864
From worker 2: -/+ buffers/cache: 10589816 55396084
From worker 2: Swap: 0 0 0
12
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 9450688 56535212 0 0 862872
From worker 2: -/+ buffers/cache: 8587816 57398084
From worker 2: Swap: 0 0 0
13
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 9463060 56522840 0 0 862876
From worker 2: -/+ buffers/cache: 8600184 57385716
From worker 2: Swap: 0 0 0
14
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 10994624 54991276 0 0 862880
From worker 2: -/+ buffers/cache: 10131744 55854156
From worker 2: Swap: 0 0 0
15
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 10874968 55110932 0 0 862884
From worker 2: -/+ buffers/cache: 10012084 55973816
From worker 2: Swap: 0 0 0
16
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 10766464 55219436 0 0 862892
From worker 2: -/+ buffers/cache: 9903572 56082328
From worker 2: Swap: 0 0 0
runs...
Without the function call test_gc(dA,dlA), on 0.6:
total used free shared buffers cached
Mem: 65985900 26557144 39428756 0 0 1286924
-/+ buffers/cache: 25270220 40715680
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 15578672 50407228 0 0 1286932
-/+ buffers/cache: 14291740 51694160
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4589208 61396692 0 0 1286932
-/+ buffers/cache: 3302276 62683624
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 10086768 55899132 0 0 1286932
-/+ buffers/cache: 8799836 57186064
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4593468 61392432 0 0 1286932
-/+ buffers/cache: 3306536 62679364
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4593928 61391972 0 0 1286932
-/+ buffers/cache: 3306996 62678904
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 7734492 58251408 0 0 1287008
-/+ buffers/cache: 6447484 59538416
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4594948 61390952 0 0 1287056
-/+ buffers/cache: 3307892 62678008
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 7034880 58951020 0 0 1287056
-/+ buffers/cache: 5747824 60238076
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4594960 61390940 0 0 1287056
-/+ buffers/cache: 3307904 62677996
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 6592388 59393512 0 0 1287056
-/+ buffers/cache: 5305332 60680568
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4596844 61389056 0 0 1287044
-/+ buffers/cache: 3309800 62676100
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4596080 61389820 0 0 1287044
-/+ buffers/cache: 3309036 62676864
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 4594936 61390964 0 0 1287044
-/+ buffers/cache: 3307892 62678008
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 6060492 59925408 0 0 1287044
-/+ buffers/cache: 4773448 61212452
Swap: 0 0 0
total used free shared buffers cached
Mem: 65985900 5968136 60017764 0 0 1287044
-/+ buffers/cache: 4681092 61304808
Swap: 0 0 0
I don't really understand why this happens, but I hope this helps...
I think it is a Julia issue rather than a DArray one. Can you try with smaller n values but a larger number of iterations? I could not detect any leaks locally over 1000 iterations with n1,n2,n3=2001,201,71.
For many iterations, the memory "oscillates"; it will only run out of memory when you strain the system enough with large n so the garbage collection of a previous run coincides with the allocation of the current run.
for i=1:16
    n1,n2,n3=2001,201,71
    dA=drand((n1,n2,n3),workers()[1:i]);dlA=similar(dA);test_gc(dA,dlA);
    d_closeall();@everywhere gc();
    println(i);
    @fetchfrom 2 run(`free`)
end
1
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 3569984 62415916 0 0 272016
From worker 2: -/+ buffers/cache: 3297968 62687932
From worker 2: Swap: 0 0 0
2
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 3844832 62141068 0 0 272112
From worker 2: -/+ buffers/cache: 3572720 62413180
From worker 2: Swap: 0 0 0
3
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 3889336 62096564 0 0 272228
From worker 2: -/+ buffers/cache: 3617108 62368792
From worker 2: Swap: 0 0 0
4
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 3918320 62067580 0 0 272336
From worker 2: -/+ buffers/cache: 3645984 62339916
From worker 2: Swap: 0 0 0
5
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 3955036 62030864 0 0 272420
From worker 2: -/+ buffers/cache: 3682616 62303284
From worker 2: Swap: 0 0 0
6
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4004760 61981140 0 0 272536
From worker 2: -/+ buffers/cache: 3732224 62253676
From worker 2: Swap: 0 0 0
7
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4146980 61838920 0 0 272632
From worker 2: -/+ buffers/cache: 3874348 62111552
From worker 2: Swap: 0 0 0
8
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4185032 61800868 0 0 272720
From worker 2: -/+ buffers/cache: 3912312 62073588
From worker 2: Swap: 0 0 0
9
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4242176 61743724 0 0 272808
From worker 2: -/+ buffers/cache: 3969368 62016532
From worker 2: Swap: 0 0 0
10
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4474444 61511456 0 0 272928
From worker 2: -/+ buffers/cache: 4201516 61784384
From worker 2: Swap: 0 0 0
11
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4573988 61411912 0 0 273028
From worker 2: -/+ buffers/cache: 4300960 61684940
From worker 2: Swap: 0 0 0
12
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4720300 61265600 0 0 273128
From worker 2: -/+ buffers/cache: 4447172 61538728
From worker 2: Swap: 0 0 0
13
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4730508 61255392 0 0 273224
From worker 2: -/+ buffers/cache: 4457284 61528616
From worker 2: Swap: 0 0 0
14
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4854036 61131864 0 0 273320
From worker 2: -/+ buffers/cache: 4580716 61405184
From worker 2: Swap: 0 0 0
15
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4749008 61236892 0 0 273416
From worker 2: -/+ buffers/cache: 4475592 61510308
From worker 2: Swap: 0 0 0
16
From worker 2: total used free shared buffers cached
From worker 2: Mem: 65985900 4782880 61203020 0 0 273504
From worker 2: -/+ buffers/cache: 4509376 61476524
From worker 2: Swap: 0 0 0
For many iterations, the memory "oscillates"; it will only run out of memory when you strain the system enough with large n so the garbage collection of a previous run coincides with the allocation of the current run.
Yes, that is why I think this is an issue with Julia rather than DistributedArrays. Does calling gc() twice make any difference, i.e., @everywhere (gc();gc())?
Does calling gc() twice make any difference, i.e., @everywhere (gc();gc())?
nope...
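(A note for readers on current Julia: on the 0.5/0.6 versions used in this thread, the bare gc() above is correct, but from Julia 1.0 onward the function moved into the GC module, so the same double collection is written as:)

```julia
using Distributed

# Julia >= 1.0 spelling of the double collection suggested above.
# @everywhere runs it on every process (just the master here if no
# workers have been added).
@everywhere (GC.gc(); GC.gc())
```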
Reopen if relevant?
Hello,

The code below is a minimal reproducible example that shows the behavior (the original code came up in an application I was writing). Sorry it is a bit convoluted, but on my machine it needed to be so to reproduce the error.

On Julia 0.5 the code runs without running out of memory, but not on 0.6.

I monitored the memory usage using top, and what happens is the following:
1. The total memory used should be the same (about 60% of a 64 GB node), split across the i procs.
2. Each function call creates temporary arrays that need to be garbage collected.
3. As the number of procs increases, the code runs faster, as one would hope.
4. For some reason, in 0.5 the memory deallocation and garbage collection is faster than in 0.6.
5. As a result, while memory is being allocated for run i, residual memory from runs i-1, i-2, ... is still being deallocated.
6. The code runs out of memory...

I am not sure if this is expected behavior, or why 0.5 was more robust.

p.s.: I am on master for DistributedArrays.

Cheers!
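As an aside, instead of shelling out to `free` as the snippets above do, free memory can be queried from Julia itself on every process. This is a sketch, not part of the original report; it assumes Sys.free_memory, which is available from Julia 0.7 onward:

```julia
using Distributed

# Sketch: report free system memory on each process, as an
# alternative to `@fetchfrom 2 run(`free`)` used in the thread.
free_gib() = Sys.free_memory() / 2^30

for p in procs()
    mem = remotecall_fetch(free_gib, p)  # run free_gib on process p
    println("proc ", p, ": ", round(mem; digits=2), " GiB free")
end
```

Sampling this inside the reproduction loop gives the same trend as the `free` logs without parsing shell output.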