SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

RAM usage keeps increasing #71

Open chanil1218 opened 7 years ago

chanil1218 commented 7 years ago

I've tested deepspeech on a larger dataset and found that the training task was eventually killed by the OS before the first epoch finished. The cause, as far as I can tell, is steadily increasing memory usage (RAM, not GPU memory; for reference, I have 16GB of RAM). The training script exceeds the memory limit, fills up swap, and is finally killed by the OS. I trained on a GPU, but far more memory is consumed in RAM than in GPU memory. I don't think this is caused by the input file sizes, because when I reverse-sorted the dataset and trained on that, RAM usage was similar in both cases.

I suspect this memory leak (if that's what it is) comes from the data-loading code when there are many files. And judging from other people's successful training runs, the held memory only seems to get collected after each epoch completes.

Could you comment on where this might be occurring?

SeanNaren commented 7 years ago

I'll try to investigate further, but there is definitely something strange in the loading. I think adding more collectgarbage() calls may help.
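
For anyone trying this, a minimal sketch of what that might look like; the loop and loader names here are illustrative, not the repo's actual training code:

    -- Periodically force a full Lua GC cycle inside the batch loop so tensors
    -- that are no longer referenced actually get freed. (Sketch only; loader
    -- and feval are stand-ins for whatever the training script uses.)
    for i = 1, numBatches do
        local inputs, targets = loader:nextBatch()
        local loss = feval(inputs, targets)
        if i % 10 == 0 then
            collectgarbage() -- full collection of unreferenced Lua objects and tensor wrappers
        end
    end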

chanil1218 commented 7 years ago

@SeanNaren Even after the first epoch, the allocated memory is not collected.

I don't think cutorch is the cause, because memory also grows without limit when I train without the GPU.

I suspect,

Because I am a newbie with the Lua/Torch libraries, it is hard for me to track down the memory leak (or even to conclude that it is normal memory usage). Any suggestions of tools for debugging are welcome!
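
One low-tech way to narrow it down, without special tooling, is to log both the Lua heap and the process RSS every so often; Torch tensor storage is allocated in C, so it shows up in RSS rather than in collectgarbage("count"). A rough sketch (not code from this repo):

    -- Print the Lua-managed heap size and the process resident set size (Linux only).
    local function memoryReport(tag)
        local luaHeapKB = collectgarbage("count") -- Lua heap in KB; excludes C-allocated tensor storage
        local rssKB = 0
        local f = io.open("/proc/self/status", "r")
        if f then
            for line in f:lines() do
                local v = line:match("^VmRSS:%s+(%d+)")
                if v then
                    rssKB = tonumber(v)
                    break
                end
            end
            f:close()
        end
        print(string.format("[%s] lua heap: %.1f MB, RSS: %.1f MB", tag, luaHeapKB / 1024, rssKB / 1024))
    end

    memoryReport("after batch") -- call this inside the training loop to see where growth happens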

SeanNaren commented 7 years ago

Sorry for not being active on this; it's a major issue that I ran into myself when training (hard to replicate, but eventually the memory does run out). I'll try to see if there is a test I could use to verify this leak.

chanil1218 commented 7 years ago

I observed that memory usage eventually converges, so I was able to complete training with about 30GB of swap. That might help you pinpoint where the memory usage is increasing.

mtanana commented 7 years ago

I had this issue as well. Is it confirmed that the memory leak happens on the CPU as well? I remember having a memory leak in CUDA for this project https://github.com/mtanana/torchneuralconvo and the fix was not intuitive... but if the leak is on the CPU, it could be one of those libraries...

If anyone has ideas let me know, but I might try to narrow down the source in the next couple of days.

Awesome project by the way @SeanNaren

mtanana commented 7 years ago

collectgarbage() isn't doing the trick... I remember this was the case with my bug as well... I'll keep looking.

mtanana commented 7 years ago

Wow... no memory leak when I iterate over the data with the loader, even when I move the input to CUDA. It might really be a memory leak in the model libraries.

SeanNaren commented 7 years ago

That's concerning; thanks for taking the time to investigate this...

A few tests that would help narrow the problem down:

mtanana commented 7 years ago

Yeah- good call breaking it down that way. I'll let you know what I find

mtanana commented 7 years ago

Wow... it's in the forward step, before adding any of the others...

I moved the inputs to CUDA over many iterations with no issue... then called self.model:forward(inputs) and the memory explodes.
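
For reference, roughly the kind of isolation test being described here (loader:nextBatch() is a stand-in for however the repo's loader is actually driven):

    -- Step 1: only load batches and move them to the GPU; memory stays flat.
    -- Step 2: uncomment the forward call; this is where memory starts to grow.
    for i = 1, 1000 do
        local inputs = loader:nextBatch():cuda()
        -- local output = model:forward(inputs)
        if i % 50 == 0 then
            collectgarbage()
            print(i, collectgarbage("count"))
        end
    end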

mtanana commented 7 years ago

Pretty sure it has to do with the convolutions inside of an nn.Sequential().

Didn't solve it... but I did find some memory savings, and a major speed improvement, by switching the convolutions to cudnn:

    local n = nn
    if opt.nGPU > 0 then n = cudnn end

    local conv = nn.Sequential()
    -- (nInputPlane, nOutputPlane, kW, kH, [dW], [dH], [padW], [padH]) conv layers
    conv:add(n.SpatialConvolution(1, 32, 11, 41, 2, 2))
    conv:add(n.SpatialBatchNormalization(32))
    conv:add(nn.Clamp(0, 20))
    conv:add(n.SpatialConvolution(32, 32, 11, 21, 2, 1))
    conv:add(n.SpatialBatchNormalization(32))
    conv:add(nn.Clamp(0, 20))

    local rnnInputsize = 32 * 41 -- based on the above convolutions and 16kHz audio
    local rnnHiddenSize = opt.hiddenSize -- size of rnn hidden layers
    local nbOfHiddenLayers = opt.nbOfHiddenLayers

    conv:add(nn.View(rnnInputsize, -1):setNumInputDims(3)) -- batch x features x seqLength
    conv:add(nn.Transpose({ 2, 3 }, { 1, 2 })) -- seqLength x batch x features

    local rnns = nn.Sequential()
    local rnnModule = RNNModule(rnnInputsize, rnnHiddenSize, opt)
    rnns:add(rnnModule:clone())
    rnnModule = RNNModule(rnnHiddenSize, rnnHiddenSize, opt)

    for i = 1, nbOfHiddenLayers - 1 do
        rnns:add(nn.Bottle(n.BatchNormalization(rnnHiddenSize), 2))
        rnns:add(rnnModule:clone())
    end

    local fullyConnected = nn.Sequential()
    fullyConnected:add(n.BatchNormalization(rnnHiddenSize))
    fullyConnected:add(nn.Linear(rnnHiddenSize, 29))

    local model = nn.Sequential()
    model:add(conv)
    model:add(rnns)
    model:add(nn.Bottle(fullyConnected, 2))
    model:add(nn.Transpose({ 1, 2 })) -- batch x seqLength x features

This was based on a post from the torch nn folks:

It is because of the nn.SpatialConvolution. We compute the convolution using a Toeplitz matrix. So unfolding the input takes quite a bit of extra memory.

https://en.wikipedia.org/wiki/Toeplitz_matrix

If you want to keep the memory down, use cudnn.SpatialConvolution from the cudnn package: https://github.com/soumith/cudnn.torch
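
If editing the model definition by hand is a pain, cudnn.torch also ships a converter that swaps supported nn modules for their cudnn counterparts in place; something along these lines should have the same effect as the manual swap above (model is assumed to be the constructed network):

    -- Convert supported nn layers (SpatialConvolution, SpatialBatchNormalization, ...)
    -- to their cudnn equivalents after moving the model to the GPU.
    require 'cunn'
    require 'cudnn'
    model = model:cuda()
    cudnn.convert(model, cudnn)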

mtanana commented 7 years ago

Haha... solved!!! From the cudnn documentation:
by default, cudnn.fastest is set to false. You should set it to true if memory is not an issue and you want the fastest performance.

(See line 15 of UtilsMultiGPU, where cudnn.fastest is set to true.)

@SeanNaren I'm thinking maybe I could send a pull request with:

  1. an option to turn 'fastest' on and off, and
  2. some code to always use cudnn for the convolutions when a GPU is available.

Let me know what you'd like.

Man that bug was getting to me...glad we have it figured out.
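
For reference, a rough sketch of what option 1 could look like (the -cudnnFastest flag name is made up for illustration):

    require 'torch'

    local cmd = torch.CmdLine()
    cmd:option('-nGPU', 1, 'number of GPUs (0 for CPU-only)')
    cmd:option('-cudnnFastest', false, 'use the fastest cudnn kernels at the cost of extra GPU memory')
    local opt = cmd:parse(arg)

    if opt.nGPU > 0 then
        require 'cudnn'
        cudnn.fastest = opt.cudnnFastest -- leaving this false keeps the convolution workspace memory down
    end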

mtanana commented 7 years ago

@SeanNaren btw... I like the way the async loader works... that's a nice touch... glad that sucker wasn't leaking the memory.

mtanana commented 7 years ago

Never mind... managed to crash it again... I'll keep at it.

SeanNaren commented 7 years ago

@mtanana Thanks for the work :) I didn't think it would be anything GPU-related since it's taking down system RAM... But just to clarify: GPU memory usage should increase throughout the epoch (since the time steps get larger and larger), but CPU memory should not!

mtanana commented 7 years ago

Yeah... I think I'm realizing now that the batches are just getting larger because of how the loader works: as the sequence length increases, memory usage increases as well. If I permute the batch order before running, I get more or less constant memory usage on the GPU, and CPU memory isn't increasing. For others that ran into this problem, maybe try these steps and see if you still have issues:

  1. Comment out the "fastest" line (line 15 of UtilsMultiGPU).
  2. Move the permute line (if self.permuteBatch then self.indexer:permuteBatchOrder() end) to right after for i = 1, epochs do, and run the training command with -permuteBatch (see the sketch right after this list). If you run out of memory in the first few iterations, your model is just too big for the memory.
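
A rough sketch of step 2 in context; only the two quoted lines come from the repo, and the loop body is simplified:

    for i = 1, epochs do
        -- Moved up here so the batch order is shuffled once per epoch and
        -- sequence lengths no longer grow monotonically within an epoch.
        if self.permuteBatch then self.indexer:permuteBatchOrder() end
        for j = 1, numBatches do
            -- ... load a batch, run the forward/backward pass, update parameters ...
        end
    end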

I'll keep an eye on this thread. Tag me if you discover anything new.

mtanana commented 7 years ago

Also, I wrote some error-catching code so that if you occasionally have a talk turn that is too long for the CUDA memory, it will catch the error instead of killing the training.
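
Presumably something along these lines, wrapping the heavy GPU work in a pcall so an out-of-memory error skips the batch rather than aborting the run (a sketch with generic model/criterion names, not the exact code):

    -- Skip batches whose sequences are too long for GPU memory instead of crashing.
    local ok, err = pcall(function()
        local output = model:forward(inputs)
        local loss = criterion:forward(output, targets)
        local gradOutput = criterion:backward(output, targets)
        model:backward(inputs, gradOutput)
    end)
    if not ok then
        print('skipping batch after error: ' .. tostring(err))
        collectgarbage() -- give Torch a chance to release what it can before the next batch
    end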

fanlamda commented 7 years ago

I can understand why GPU memory usage increases during a batch. But why does GPU memory not come back down to where it was before? @SeanNaren

markmuir87 commented 7 years ago

I'm encountering this as well with GPU training. A bit of searching around reveals others are having this issue too: https://github.com/torch/cutorch/issues/379 . It sounds like it's some subtle interaction between how cutorch is implemented, changes in NVIDIA drivers, and Linux's default memory management. The proposed solutions sound sensible, although I haven't tried them yet (and I know nothing about memory management). The suggested solutions I've seen include preloading an alternative allocator such as jemalloc:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so th Train.lua

I've been thinking about another possible approach (and apologies if this sounds stupid, I'm a complete torch newbie):

Compared to pretty much everything else in ML land, this seems like something I could actually understand and implement (with a simple Python script running torch as a subprocess). I'll try to find the time in the next week or so (although don't wait on me if you think it's a good idea and want to implement it sooner).

Interested to hear your thoughts on this.