JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

memory is not being released aggressively enough during parallel processing #36

Open CodeLenz opened 8 years ago

CodeLenz commented 8 years ago

I have a large codebase that runs without any problems in 0.4.0, but shows symptoms of memory leak when running in 0.4.3.

Basically, memory usage increases after each call to @sync @parallel for. It seems that some of the memory allocated for the parallel processing is not freed, and a call to gc() at the end of the loop does not fix the problem.

The figure below shows the memory usage for the same script under 0.4.0 (first half of the plot) and 0.4.3, where it can be seen that the memory grows after each call to @sync @parallel for.

[figure: memory usage per iteration, 0.4.0 (first half) vs 0.4.3 (second half)]

It is the first time I file an issue, so if I am not using the proper words/methodology, I apologize.

tkelman commented 8 years ago

Can you provide a self-contained runnable example that reproduces the problem?

CodeLenz commented 8 years ago

Hi.

Sorry, but my first concern was whether this behavior had already been reported by someone else.

Actually, I think it is something related to the way the GC is dealing with the variables. As you can see in the picture, the memory is steadily increasing, leading to the undesirable "hang" associated with it. When using a smaller system of equations, we observed that the memory does indeed increase, but after some iterations it drops again (I noticed this after filing the issue).

We were not aware of it when using 0.4.0, since the GC must have been cleaning up the memory after the parallel processing. 0.5-dev behaves like 0.4.3.

I am trying to make a simple prototype, since there are a lot of calls and intermediate allocations inside the parallel loop. The base algorithm is something along the lines of:

const PARALLEL = true
const NPROC    = 3

if PARALLEL
    npp = nprocs()
    if npp == 1
        println("\n Adding.... ")
        addprocs(NPROC)
    end
    # Let's avoid openblas using our cores....
    blas_set_num_threads(1)
end

function MAIN()
    # Number of "external" iterations
    NE = 10

    # Main dimension
    N = 5000

    # Loop
    @inbounds for e = 1:NE

        println("\n Iteration ", e)

        # Random R.H.S. vectors (6 cols)
        F = rand(N, 6)

        # Coefficient matrix
        K1 = sprand(N, N, 0.2)

        # Solve each one of the R.H.S. vectors, using parallel processes..

        # Diagonal scaling
        d   = diag(K1)
        P   = sqrt(d)
        Pi  = 1 ./ P
        AUX = spdiagm(Pi)
        K   = AUX*K1*AUX
        U   = zeros(N, 6)
        if PARALLEL
            U = @sync @parallel (hcat) for load = 1:6
                V    = Pi.*F[:, load]
                resp = K*V
                Pi.*resp
            end
        else
            for load = 1:6
                V    = Pi.*F[:, load]
                resp = K*V
                U[:, load] = Pi.*resp
            end
        end

        # wait a little bit...
        sleep(5)

    end #e

end

# Lets go...
MAIN()
kswietli commented 8 years ago

Not too long ago I had a memory leak somewhere in a similar situation. It got fixed in a major rewrite, so I wasn't worried about it enough to file an issue. It was also on 0.4.3. I think I still might have the code for it, but I haven't had the time to reduce it into something small.

For posterity, in case someone runs into a similar problem and finds this issue: periodic @everywhere gc(true) worked for me as a stopgap during the few days I had to deal with the leak. Nothing else helped. Try it; it might work for you.
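Roughly, the pattern was along these lines (a minimal sketch; the loop and function names are illustrative, not my actual code):

for i = 1:NITER              # NITER: however many outer iterations the job runs
    do_parallel_work(i)      # hypothetical placeholder for the pmap/@parallel workload
    if i % 10 == 0
        @everywhere gc(true) # force a full collection on the master and every worker
    end
end

(On Julia 1.x the equivalent call would be @everywhere GC.gc(true) after using Distributed.)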

CodeLenz commented 8 years ago

Hi.

I don't think it is a memory leak, since after some iterations the memory is reduced. But the behaviour with 0.4.3 is indeed different from 0.4.0, so it may be a regression in the way gc() deals with memory allocation and parallel processing.

The following code reproduces the problem and is simpler than the previous code.

# Dimension - depends on available RAM
const N = 5000

# Number of iterations
const NI = 10

# Time to wait inside the subroutine
const tw = 5

# Number of procs - depends on the available number of cores
const NP = 3

# If it is the case, add the processes
if nprocs()==1
    addprocs(NP)
end

# Main loop...
for i=1:NI

    println("\n Iteration ",i)

    # Generates a random sparse matrix
    K = sprand(N,N,0.5)

    # Allocates a 6 column array
    F = rand(N,6)

    U = @sync @parallel (hcat) for j=1:6

        # To make the visualization easier
        sleep(tw)

        # In my real code there are a lot of calls here, with many
        # local allocations...this is why the graphic has larger hills
        # when compared to this example.
        K*F[:,j]

    end #@parallel

end # i
amitmurthy commented 8 years ago

I just tested with the above code on

Julia Version 0.4.4-pre+43
Commit 147d6d0* (2016-03-07 00:34 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)

After all iterations finished, I needed 2 executions of @everywhere gc() to release all memory. There is no leak. This is the same issue as https://github.com/JuliaLang/Distributed.jl/issues/35: the generational gc is holding onto objects a bit longer.

@carnaval, @yuyichao, if we could specify the maximum heap size (like Java's -Xmx startup argument), the user could better control how much a Julia process is allowed to grow.

amitmurthy commented 8 years ago

Changed title to reflect the same.

CodeLenz commented 8 years ago

kswietli

I can confirm that @everywhere gc() works (but the code is slightly slower).

[pao: quoting]

kswietli commented 8 years ago

It was a genuine leak in my case. Unless "brings the machine down without explicit calls to gc(true)" is a normal situation. With repeated calls to gc(true), memory consumption stayed around 1% per process, and the same after the rewrite; without explicit gc(true) it reached 20% per process. The system would reliably hit swap towards the end of some of the longer runs of the program. I also think calling just gc() wasn't doing the trick, but that may just be my mind playing tricks on me. I can check, if there is interest.

amitmurthy commented 8 years ago

Yes please.

It is not a leak in the sense of losing references to objects whose memory is never freed. But the need to call gc() explicitly means that we do have an issue that needs to be fixed. It would be great if you could provide code that reproduces the problem.

carnaval commented 8 years ago

Is the problem that the GC does not know what size your objects are, since they live remotely and the remote ref itself looks really small? The GC's trigger is (mainly) "how many bytes have you allocated since last time".

There are accounting functions used for C malloc; you might be able to do more or less the same thing for remote objects (without doing any allocation, just the accounting).

See https://github.com/JuliaLang/julia/blob/master/src/gc.c#L2681
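To make the size mismatch concrete, an illustration (not from the thread; uses the 0.4-era remotecall(pid, f, args...) signature):

rr = remotecall(2, rand, 10000, 10000)  # worker 2 now holds roughly 800 MB of Float64s
sizeof(rr)                              # the local RemoteRef is only a few machine words, so creating
                                        # many of these barely advances the GC's "bytes allocated since
                                        # the last collection" counter, and the collection that would
                                        # finalize them (freeing the remote memory) is never triggered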

CodeLenz commented 8 years ago

What changed between 0.4.0 and 0.4.3 that is causing such different behaviour? I looked for something related to gc in the changelogs and could not find it.

StefanKarpinski commented 8 years ago

We should really figure out what changed here since ideally point upgrades shouldn't break people's programs that used to work (unless they depended on buggy behavior).

amitmurthy commented 8 years ago

I'll do a bisect.

Is the problem that the gc does not know what size your things are since they are remote and the remote ref looks really small ?

Yes and no. Because they are small they are not being finalized quickly enough (finalization sends a message to the remote side to actually release the resource). But we saw in JuliaLang/Distributed.jl#35 that even with the "release message" issue addressed (by manually finalizing), a slow, steady increase in memory persisted. I think it is an issue of a full gc not being performed in time.
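"Manually finalizing" here means roughly the following: finalize the remote reference as soon as its value is no longer needed, so the release message is sent immediately instead of waiting for the GC to run the finalizer. A minimal sketch (0.4-era remotecall(pid, f, args...) signature):

rr = remotecall(2, rand, 5000, 5000)  # worker 2 allocates the matrix
x  = fetch(rr)                        # copy the result back to the master
finalize(rr)                          # run the finalizer now, which sends the release message to worker 2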

Gc's trigger is (mainly) "how many bytes you allocated since last time".

While running a full gc too frequently will slow down programs, I feel it is imperative that we support a configurable max heap size: a hard limit that triggers a full gc whenever it is reached.
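To illustrate the kind of behaviour I mean (a sketch only, not an existing feature): a crude user-space approximation would be to watch the resident set size and force a full collection once it crosses a limit. This sketch assumes Linux and reads /proc/self/status:

# Illustrative only: a user-space "max heap" watchdog.
const MAX_RSS_MB = 4096   # hypothetical limit, analogous to Java's -Xmx4g

function rss_mb()
    open("/proc/self/status") do io
        for line in eachline(io)
            if startswith(line, "VmRSS:")
                return div(parse(Int, split(line)[2]), 1024)  # VmRSS is reported in kB
            end
        end
        return 0
    end
end

# Call between iterations of a memory-hungry loop.
check_heap() = rss_mb() > MAX_RSS_MB && gc(true)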

amitmurthy commented 8 years ago

It appears this is a Windows issue.

I just tested on OSX for 1000 iterations of both the original and the revised samples (without a sleep) and could not detect a leak. The residual steady-state allocations (after 1000 iterations) were approx. 500MB for the master and 400MB for the workers. These were released only after a couple of @everywhere gc() calls, when they dropped to 170MB and 110MB respectively. But this is expected behavior if GC triggering depends only on new allocation requests.

kswietli commented 8 years ago

I'm about a third of the way through reducing my case. It's going to be a few more days; I'm pretty busy.

I am running Arch Linux, so it isn't a Windows-specific problem, at least with my code.

CodeLenz commented 8 years ago

It is also happening on Linux (Fedora 23, with 0.4.3...)

amitmurthy commented 8 years ago

Could you post your versioninfo()? I am unable to detect a leak on Ubuntu 15.10 with

  | | |_| | | | (_| |  |  Version 0.4.3 (2016-01-12 21:37 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-unknown-linux-gnu

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

I have been running

# Dimension - depends on available RAM
const N = 5000

# Number of iterations
const NI = 1000

# Time to wait inside the subroutine
const tw = 5

# Number of procs - depends on the available number of cores
const NP = 3

# If it is the case, add the processes
if nprocs()==1
    addprocs(NP)
end

# Main loop...
for i=1:NI

    println("\n Iteration ",i)

    # Generates a random sparse matrix
    K = sprand(N,N,0.5)

    # Allocates a 6 column array
    F = rand(N,6)

    U = @sync @parallel (hcat) for j=1:6

        # In my real code there are a lot of calls here, with many
        # local allocations...this is why the graphic has larger hills
        # when compared to this example.
        K*F[:,j]

    end #@parallel

end # i

top shows a max memory usage of 3.2% for the workers.

[screenshot: top output showing worker memory usage]

amitmurthy commented 8 years ago

This is a generic linux image from https://julialang.s3.amazonaws.com/bin/linux/x64/0.4/julia-0.4.3-linux-x86_64.tar.gz

kswietli commented 8 years ago

I have tried that reduced code; it works fine on my Linux machine. That probably means that my problem is distinct from this one.

I hope to have some breathing room over the weekend to reduce my stuff to something postable.

StefanKarpinski commented 8 years ago

Thanks for helping track this down, @kswietli and @CodeLenz.

CodeLenz commented 8 years ago

@StefanKarpinski Thank you for such amazing software and such an open community. All my students are already using Julia in their theses and undergraduate research projects. This issue came up for one of my graduate students when upgrading from 0.4.0 to 0.4.3.

@amitmurthy Right now I don't have access to the Linux machine, but as soon as possible I will post the versioninfo().

CodeLenz commented 8 years ago

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d* (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

amitmurthy commented 8 years ago

Hmmm, I just ran the test script on a Fedora 23 instance on AWS with the generic binary and could not detect a leak.

The processes reached a steady state quickly enough.

[screenshot: process memory at steady state]
  | | |_| | | | (_| |  |  Version 0.4.3 (2016-01-12 21:37 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-unknown-linux-gnu

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

I wonder if the errors you are seeing are due to the system running out of memory. How much RAM do the machines have?

Looking forward to @kswietli's script to debug this further.

kswietli commented 8 years ago

Good news: I got it down to a very reasonable size. Bad news: it hits swap even without any parallelism on 0.4.3. You can change the pmap to a map and it will still swap.

The master from a short while back appears to cope much better. I gave it 15 minutes and it was still growing by then. 0.4.3 runs out of memory pretty much immediately. The fact that it keeps growing still unnerves me.

type Inner
    data :: Int64
end

type Outer
    inners :: Vector{Inner}
end

function generate_outer()
    inners = Array(Inner, 1000)

    for position in 1:1000
        inners[position] = Inner(0)
    end

    return Outer(inners)
end

function loop(unused)
    outers = Array(Outer, 1000)

    for position in 1:1000
        outers[position] = generate_outer()
    end

    return
end    

function sweep()
    while true
        dontcares = Array(Int64, 100)
        pmap(loop, dontcares)

        #Uncomment this to not hit swap on 0.4.3.
        #gc()
    end
end

sweep()
julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libm
  LLVM: libLLVM-3.3
julia> versioninfo()
Julia Version 0.5.0-dev+3131
Commit b3b1411 (2016-03-13 10:42 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

EDIT: Sorry about the layout. GitHub is not being nice to my code for some reason, so the attached code and the versioninfo output look scrambled.

edit: fixed formatting - Amit.

amitmurthy commented 8 years ago

Thanks @kswietli , I can confirm the same behavior on OSX.

Removing the "parallel" label as the underlying issue is not dependent on parallel execution. map has the same "leak".

kswietli commented 8 years ago

Truly my pleasure. I'm very glad to help make my favourite language better. Figured this issue is a good start. Let me know if you need anything else.

amitmurthy commented 8 years ago

The same issue exists on 0.4.0 too with the above code, so the problem that @CodeLenz is seeing as a regression between 0.4.0 and 0.4.3 may be a different one.

amitmurthy commented 8 years ago

@kswietli , I have created a new issue at JuliaLang/julia#15543 for the problem identified by you since it may be different from the one reported by @CodeLenz .

tcovert commented 8 years ago

I'm not sure if this is the same issue, but I posted a question to the julia-users mailing list about memory usage AFTER calling pmap: https://groups.google.com/forum/#!topic/julia-users/GlnokJuz_54.

I wrote a simple, self-contained MWE here: https://gist.github.com/tcovert/3b9b982ba8b13f5984bebb63887b6def.

The TL;DR is that after pmap() has done the work it's supposed to do, all of the worker processes still hold memory allocations (according to top) well in excess of what I was expecting.

CodeLenz commented 3 years ago

Closing as this is related to a very old version of Julia.

JackKenney commented 3 years ago

I think this issue is still relevant (Julia 1.6.0)

After using pmap() and interrupting the Julia execution, there are still many Julia processes with high memory allocations that have to be manually killed.

I can try to put together an example and make a new issue if that's best? Just let me know.
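For anyone else hitting this, a sketch of how the leftover workers can usually be torn down programmatically instead of being killed by hand (assuming they are still responsive to the master):

using Distributed

interrupt(workers())              # interrupt whatever the workers are still running
rmprocs(workers(); waitfor=30)    # then remove them, waiting up to 30 s for a clean shutdown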

oscardssmith commented 3 years ago

Reopening as this is still a problem in 1.6.