JuliaParallel / DistributedArrays.jl

Distributed Arrays in Julia

Running out of memory with julia `v0.6` but not `v0.5` #151

Closed raminammour closed 4 years ago

raminammour commented 7 years ago

Hello,

The code below is a minimal reproducible example of the behavior (the original code came up in an application I was writing). Sorry it is a bit convoluted; on my machine it needed to be in order to reproduce the error.

On Julia 0.5 the code runs without running out of memory, but on 0.6 it does not.

addprocs(16)
@everywhere using DistributedArrays

# Each worker that holds part of dA updates three slices of its local part of dlA.
# Every statement allocates temporaries: slice copies, an SVD, and a local Array
# built from a slice of the global array.
function test_gc(dA,dlA)
    @sync for ip in procs(dA)              # wait for all spawned tasks
        @spawnat ip begin @time begin      # run and time the block on worker ip
            localpart(dlA)[:,:,ip]+=2localpart(dA)[:,:,1]+svd(localpart(dA)[:,:,2])[1][1]+2convert(Array,dA[1:size(localpart(dA),1),1:size(localpart(dA),2),3])
            localpart(dlA)[:,:,ip+1]+=3localpart(dA)[:,:,1]+svd(localpart(dA)[:,:,5])[1][1]+4convert(Array,dA[1:size(localpart(dA),1),1:size(localpart(dA),2),3])
            localpart(dlA)[:,:,ip+2]+=4localpart(dA)[:,:,1]+(localpart(dA)[:,:,7])+5convert(Array,dA[1:size(localpart(dA),1),1:size(localpart(dA),2),3])
        end
        end
    end
end

n1,n2,n3=2001,2001,701
for i=1:16
       dA=drand((n1,n2,n3),workers()[1:i]);dlA=similar(dA);

       println(i);
       @time test_gc(dA,dlA);

       d_closeall();
       @everywhere gc()
end

I monitored the memory usage using top and what happens is the following:

1- The total memory should be the same (about 60% on a 64 GB node), split in i procs.
2- Each function call creates temporary arrays that need to be garbage collected.
3- As the number of procs increases, the code runs faster as one would hope.
4- For some reason, in 0.5 the memory de-allocation and garbage collection is faster than 0.6.
5- As a result, as memory is allocated for run i, residual memory from runs i-1,i-2,... is still being deallocated.
6- Code runs out of memory...
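As a rough check on the 60% figure, the footprint of the two arrays can be computed directly. The sketch below assumes that `drand` produces `Float64` elements and takes the node's total memory from the `free` output posted later in this thread (65985900, assumed to be in KiB). It ignores the temporaries created inside `test_gc`, so it is a lower bound rather than a measurement.

```julia
# Back-of-the-envelope footprint of dA and dlA from the example above.
# Assumption: drand yields Float64 elements (8 bytes each).
n1, n2, n3 = 2001, 2001, 701
bytes_per_array = n1 * n2 * n3 * sizeof(Float64)

# dlA = similar(dA) has the same size and element type as dA.
total_bytes = 2 * bytes_per_array

# "total" column of `free` in this thread, assumed to be reported in KiB.
node_bytes = 65985900 * 1024

# Integer arithmetic only, so the snippet behaves the same across Julia versions.
percent_of_node = div(100 * total_bytes, node_bytes)

println("one array : ", div(bytes_per_array, 2^20), " MiB")
println("two arrays: ", div(total_bytes, 2^20), " MiB")
println("share of node memory: about ", percent_of_node, "%")
```

Under these assumptions the two arrays alone take roughly two thirds of the node, which is in line with the figure quoted above and leaves limited headroom for temporaries that have not yet been collected.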

I am not sure if this is expected behavior, or why 0.5 was more robust.

P.S. I am on master for DistributedArrays. Cheers!

andreasnoack commented 7 years ago

Could you please provide the output of Pkg.status("DistributedArrays")?

raminammour commented 7 years ago

Pkg.status("DistributedArrays")
 - DistributedArrays             0.4.0+             master

andreasnoack commented 7 years ago

Hm. I thought that the commit hash would be there. Could you please provide the exact commit you are on?

raminammour commented 7 years ago

git show

commit 3996ddd24aa743c83c3a4e2f69c3c6444e4fa015
Merge: 6a376d1 9d9c3bb
Author: Andreas Noack <andreasnoackjensen@gmail.com>
Date:   Mon Jul 3 16:07:28 2017 -0400

    Merge pull request #149 from JuliaParallel/amitm/darr_leak

    fix leak - make registry entry a WeakRef on the node releasing the darray

andreasnoack commented 7 years ago

Thanks. @amitmurthy I thought the leak was fixed with your latest commit. It would be great if you could take a look at this.

amitmurthy commented 7 years ago

I could run this example on both Julia master and 0.6 without any leaks, using 4 workers and smaller darrays.

No issues with the following n values over 1000 iterations.

n1,n2,n3=201,201,71
for i=1:1000
       dA=drand((n1,n2,n3),workers()[1:(i%length(workers())+1)]);dlA=similar(dA);

raminammour commented 7 years ago

Thanks for the assistance :)

n1,n2,n3=2001,2001,701
for i=1:16
       dA=drand((n1,n2,n3),workers()[1:i]);dlA=similar(dA);test_gc(dA,dlA);
       d_closeall();@everywhere gc();
       println(i);
       @fetchfrom 2 run(pipeline(`free`,stdout="bla",append=true))
end

On 0.6

             total       used       free     shared    buffers     cached
Mem:      65985900   26317860   39668040          0          0     407180
-/+ buffers/cache:   25910680   40075220
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   26726796   39259104          0          0     407312
-/+ buffers/cache:   26319484   39666416
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   27024688   38961212          0          0     407428
-/+ buffers/cache:   26617260   39368640
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   27386168   38599732          0          0     407508
-/+ buffers/cache:   26978660   39007240
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   27578264   38407636          0          0     407648
-/+ buffers/cache:   27170616   38815284
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   36557240   29428660          0          0     407748
-/+ buffers/cache:   36149492   29836408
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   36720956   29264944          0          0     407872
-/+ buffers/cache:   36313084   29672816
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   36864676   29121224          0          0     407992
-/+ buffers/cache:   36456684   29529216
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   42546900   23439000          0          0     408104
-/+ buffers/cache:   42138796   23847104
Swap:            0          0          0

...and then it runs out of memory.

On 0.5

1                                                                          
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    9451632   56534268          0          0     862796
        From worker 2:  -/+ buffers/cache:    8588836   57397064                                 
        From worker 2:  Swap:            0          0          0
2
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   20671804   45314096          0          0     862832
        From worker 2:  -/+ buffers/cache:   19808972   46176928
        From worker 2:  Swap:            0          0          0
3
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   17041704   48944196          0          0     862832
        From worker 2:  -/+ buffers/cache:   16178872   49807028
        From worker 2:  Swap:            0          0          0
4
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   15088944   50896956          0          0     862832
        From worker 2:  -/+ buffers/cache:   14226112   51759788
        From worker 2:  Swap:            0          0          0
5
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    9560740   56425160          0          0     862840
        From worker 2:  -/+ buffers/cache:    8697900   57288000
        From worker 2:  Swap:            0          0          0
6
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   13163940   52821960          0          0     862848
        From worker 2:  -/+ buffers/cache:   12301092   53684808
        From worker 2:  Swap:            0          0          0
7
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   12654316   53331584          0          0     862852
        From worker 2:  -/+ buffers/cache:   11791464   54194436
        From worker 2:  Swap:            0          0          0
8
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    9527396   56458504          0          0     862856
        From worker 2:  -/+ buffers/cache:    8664540   57321360
        From worker 2:  Swap:            0          0          0
9
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   11977056   54008844          0          0     862860
        From worker 2:  -/+ buffers/cache:   11114196   54871704
        From worker 2:  Swap:            0          0          0
10
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    9476404   56509496          0          0     862864
        From worker 2:  -/+ buffers/cache:    8613540   57372360
        From worker 2:  Swap:            0          0          0
11
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   11452680   54533220          0          0     862864
        From worker 2:  -/+ buffers/cache:   10589816   55396084
        From worker 2:  Swap:            0          0          0
12
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    9450688   56535212          0          0     862872
        From worker 2:  -/+ buffers/cache:    8587816   57398084
        From worker 2:  Swap:            0          0          0
13
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    9463060   56522840          0          0     862876
        From worker 2:  -/+ buffers/cache:    8600184   57385716
        From worker 2:  Swap:            0          0          0
14
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   10994624   54991276          0          0     862880
        From worker 2:  -/+ buffers/cache:   10131744   55854156
        From worker 2:  Swap:            0          0          0
15
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   10874968   55110932          0          0     862884
        From worker 2:  -/+ buffers/cache:   10012084   55973816
        From worker 2:  Swap:            0          0          0
16
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900   10766464   55219436          0          0     862892
        From worker 2:  -/+ buffers/cache:    9903572   56082328
        From worker 2:  Swap:            0          0          0

...and it runs to completion.
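The difference between the two logs above can be made explicit by comparing the `used` column of successive samples. The sketch below copies the values from the `free` output in this comment and assumes that the nine 0.6 samples correspond to iterations 1 through 9 (that log was written to a file without iteration labels).

```julia
# "used" column (KiB) of the `Mem:` rows, copied from the logs above.
used_06 = [26317860, 26726796, 27024688, 27386168, 27578264,
           36557240, 36720956, 36864676, 42546900]
used_05 = [9451632, 20671804, 17041704, 15088944, 9560740, 13163940,
           12654316, 9527396, 11977056, 9476404, 11452680, 9450688,
           9463060, 10994624, 10874968, 10766464]

# Change in used memory between consecutive samples.
delta_06 = diff(used_06)
delta_05 = diff(used_05)

# On 0.6 every sample is higher than the previous one; on 0.5 usage often drops.
grows_every_step_06 = all(d -> d > 0, delta_06)
drops_05 = count(d -> d < 0, delta_05)

println("0.6 grows at every step: ", grows_every_step_06)
println("0.6 net growth (KiB): ", used_06[end] - used_06[1])
println("0.5 steps where usage dropped: ", drops_05, " of ", length(delta_05))
```

This only summarizes the numbers already posted; it does not say where the retained memory lives or why it is not released.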

Without the function call test_gc(dA,dlA), on 0.6

              total       used       free     shared    buffers     cached
Mem:      65985900   26557144   39428756          0          0    1286924
-/+ buffers/cache:   25270220   40715680
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   15578672   50407228          0          0    1286932
-/+ buffers/cache:   14291740   51694160
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4589208   61396692          0          0    1286932
-/+ buffers/cache:    3302276   62683624
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900   10086768   55899132          0          0    1286932
-/+ buffers/cache:    8799836   57186064
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4593468   61392432          0          0    1286932
-/+ buffers/cache:    3306536   62679364
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4593928   61391972          0          0    1286932
-/+ buffers/cache:    3306996   62678904
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    7734492   58251408          0          0    1287008
-/+ buffers/cache:    6447484   59538416
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4594948   61390952          0          0    1287056
-/+ buffers/cache:    3307892   62678008
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    7034880   58951020          0          0    1287056
-/+ buffers/cache:    5747824   60238076
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4594960   61390940          0          0    1287056
-/+ buffers/cache:    3307904   62677996
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    6592388   59393512          0          0    1287056
-/+ buffers/cache:    5305332   60680568
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4596844   61389056          0          0    1287044
-/+ buffers/cache:    3309800   62676100
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4596080   61389820          0          0    1287044
-/+ buffers/cache:    3309036   62676864
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    4594936   61390964          0          0    1287044
-/+ buffers/cache:    3307892   62678008
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    6060492   59925408          0          0    1287044
-/+ buffers/cache:    4773448   61212452
Swap:            0          0          0
             total       used       free     shared    buffers     cached
Mem:      65985900    5968136   60017764          0          0    1287044
-/+ buffers/cache:    4681092   61304808
Swap:            0          0          0

I don't really understand why this happens, but I hope this helps...

amitmurthy commented 7 years ago

I think it is a Julia issue rather than a darray one. Can you try with smaller n values but a larger number of iterations? I could not detect any leaks locally over 1000 iterations with n1,n2,n3=2001,201,71.

raminammour commented 7 years ago

For many iterations, the memory "oscillates"; it will only run out of memory when you strain the system enough with large n so the garbage collection of a previous run coincides with the allocation of the current run.


for i=1:16
       n1,n2,n3=2001,201,71
       dA=drand((n1,n2,n3),workers()[1:i]);dlA=similar(dA);test_gc(dA,dlA);
       d_closeall();@everywhere gc();
       println(i);
       @fetchfrom 2 run(`free`)
end
1                                                                          
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    3569984   62415916          0          0     272016
        From worker 2:  -/+ buffers/cache:    3297968   62687932                                 
        From worker 2:  Swap:            0          0          0
2
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    3844832   62141068          0          0     272112
        From worker 2:  -/+ buffers/cache:    3572720   62413180
        From worker 2:  Swap:            0          0          0
3
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    3889336   62096564          0          0     272228
        From worker 2:  -/+ buffers/cache:    3617108   62368792
        From worker 2:  Swap:            0          0          0
4
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    3918320   62067580          0          0     272336
        From worker 2:  -/+ buffers/cache:    3645984   62339916
        From worker 2:  Swap:            0          0          0
5
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    3955036   62030864          0          0     272420
        From worker 2:  -/+ buffers/cache:    3682616   62303284
        From worker 2:  Swap:            0          0          0
6
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4004760   61981140          0          0     272536
        From worker 2:  -/+ buffers/cache:    3732224   62253676
        From worker 2:  Swap:            0          0          0
7
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4146980   61838920          0          0     272632
        From worker 2:  -/+ buffers/cache:    3874348   62111552
        From worker 2:  Swap:            0          0          0
8
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4185032   61800868          0          0     272720
        From worker 2:  -/+ buffers/cache:    3912312   62073588
        From worker 2:  Swap:            0          0          0
9
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4242176   61743724          0          0     272808
        From worker 2:  -/+ buffers/cache:    3969368   62016532
        From worker 2:  Swap:            0          0          0
10
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4474444   61511456          0          0     272928
        From worker 2:  -/+ buffers/cache:    4201516   61784384
        From worker 2:  Swap:            0          0          0
11
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4573988   61411912          0          0     273028
        From worker 2:  -/+ buffers/cache:    4300960   61684940
        From worker 2:  Swap:            0          0          0
12
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4720300   61265600          0          0     273128
        From worker 2:  -/+ buffers/cache:    4447172   61538728
        From worker 2:  Swap:            0          0          0
13
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4730508   61255392          0          0     273224
        From worker 2:  -/+ buffers/cache:    4457284   61528616
        From worker 2:  Swap:            0          0          0
14
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4854036   61131864          0          0     273320
        From worker 2:  -/+ buffers/cache:    4580716   61405184
        From worker 2:  Swap:            0          0          0
15
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4749008   61236892          0          0     273416
        From worker 2:  -/+ buffers/cache:    4475592   61510308
        From worker 2:  Swap:            0          0          0
16
        From worker 2:               total       used       free     shared    buffers     cached
        From worker 2:  Mem:      65985900    4782880   61203020          0          0     273504
        From worker 2:  -/+ buffers/cache:    4509376   61476524
        From worker 2:  Swap:            0          0          0

amitmurthy commented 7 years ago

> For many iterations, the memory "oscillates"; it will only run out of memory when you strain the system enough with large n so the garbage collection of a previous run coincides with the allocation of the current run.

Yes, that is why I think this is an issue with Julia rather than DistributedArrays. Does calling gc() twice make any difference, i.e., @everywhere (gc();gc())?
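One way to see whether extra collections change anything is to record free memory after each pass instead of reading `free` by eye. The sketch below is written for Julia 1.x, where `GC.gc()` replaces the `gc()` used in this thread; `free_mib` and `gc_passes` are hypothetical helper names, not part of Julia or DistributedArrays.

```julia
# Assumes Julia 1.x (GC.gc() instead of the 0.6-era gc()).
# `free_mib` and `gc_passes` are illustrative helpers, not library functions.

# Free physical memory on the current host, in MiB.
free_mib() = Int(Sys.free_memory() ÷ 2^20)

# Run `passes` full collections and record free memory after each one.
function gc_passes(passes::Integer = 2)
    samples = Int[free_mib()]      # baseline before any collection
    for _ in 1:passes
        GC.gc()
        push!(samples, free_mib())
    end
    return samples
end

samples = gc_passes(2)
println("free MiB before and after each pass: ", samples)
```

On a cluster the same function could be defined on every process with `@everywhere` and called through `remotecall_fetch(gc_passes, pid, 2)`. Note that `Sys.free_memory()` is host-wide, like `free`, and that memory released by the collector is not necessarily returned to the operating system right away, so flat numbers do not by themselves prove that nothing was collected.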

raminammour commented 7 years ago

> Does calling gc() twice make any difference, i.e., @everywhere (gc();gc())?

nope...

ViralBShah commented 4 years ago

Reopen if relevant?