NervanaSystems / deepspeech

DeepSpeech neon implementation
Apache License 2.0

Out of memory while running more batches. #26

Closed Laqshay closed 7 years ago

Laqshay commented 7 years ago

I set the model to train on ~2500 hrs of audio distributed into ~66000 batches with 32 files/batch. After about 45000 batches, the process ended with the server crashing due to an 'out of memory' error, before finishing a single epoch.

I am assuming that the process saves the updates to be made to the nnet after each batch in memory (and doesn't clear them later), and that after a certain number of batches the space taken by the updates causes the computer to run out of memory. Is my assumption correct? If so, can this be avoided by decreasing the number of batches (taking 48/64 files per batch, or increasing the duration of each file so that more audio is contained per batch) or by adding more memory to the server? Or is this directly correlated to the duration of audio, so that I can't use more than, say, 1600 hrs of audio?

I have a GPU server with 8 GB of RAM. That should be sufficient for processing 30-second audio files, right?

tyler-nervana commented 7 years ago

Could you post a stacktrace for the error you are seeing? I can't tell whether what you are describing is a system memory error or a GPU memory error. If it's a GPU memory error, then I would first recommend trying a smaller batch size or shorter audio files. Take a look at #10 for some discussion of the memory requirements for Deep Speech 2. This model actually requires 9.3 GB of GPU memory with the default configuration, so that could just be it.

I am assuming that the process saves the updates to be made to the nnet after each batch in memory (and doesn't clear them later), and that after a certain number of batches the space taken by the updates causes the computer to run out of memory.

Neon uses the updates for each mini-batch when they are computed and will then reuse the same memory buffer on the next mini-batch. Once your model is initialized, all of your memory buffers should be allocated, so I wouldn't expect an out of memory error so late in the training process.
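
As a toy illustration of that reuse pattern (plain NumPy here, not neon's actual internals), a gradient buffer that is allocated once and then overwritten in place each mini-batch keeps memory use flat no matter how many batches are processed:

```python
import numpy as np

# Illustrative only: the gradient buffer is allocated once at "model init"
# time and reused in place on every mini-batch, so steady-state memory use
# does not grow with the number of batches.
rng = np.random.RandomState(0)
w = rng.randn(128, 1)            # toy linear model
grad_buf = np.empty_like(w)      # allocated once

for step in range(1000):         # stand-in for many mini-batches
    x = rng.randn(32, 128)       # batch of 32 inputs
    y = rng.randn(32, 1)
    err = x.dot(w) - y
    np.dot(x.T, err, out=grad_buf)   # gradient written into the same buffer
    w -= 1e-4 * grad_buf / 32        # update applied; nothing new is retained
```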

Laqshay commented 7 years ago

Unfortunately, I don't have the stacktrace. I am running the process on an AWS server which crashed immediately following the error (this also means I cannot use batch sizes < 32).

According to the memory requirements discussed in #10, since 14-sec files required 4 GB and 40-sec files required 9.3 GB, I assumed 8 GB of GPU RAM would be sufficient for a batch size of 32 and a max file duration of 30 seconds. Is that not the case? Could you tell me how to calculate the 'footprint' so that I can ascertain the memory required?

A friend of mine ran the training with the LibriSpeech dataset and faced the same issue after running multiple epochs. Finishing one epoch implies that none of his batches exceeded the memory requirements, yet the server ran out of memory afterwards.

Laqshay commented 7 years ago

Update:

I set the model to train with a batch size of 64 files and the manifest arranged in decreasing order of duration (the first files to be fed in were the 64 longest ones). I ran the process for a few batches, and the training progressed without any memory-related error.

I also checked the durations of the files. Although the max duration was set to 30 sec, only 37 files had durations of more than 20 sec, and these were the last files in the manifest, which means they hadn't even been processed when the server ran out of memory last time.

Neuroschemata commented 7 years ago

According to the memory requirements discussed in #10, since 14-sec files required 4 GB and 40-sec files required 9.3 GB, I assumed 8 GB of GPU RAM would be sufficient for a batch size of 32 and a max file duration of 30 seconds. Is that not the case? Could you tell me how to calculate the 'footprint' so that I can ascertain the memory required?

The easiest way to estimate the required memory footprint is to count the number of parameters in your model (each parameter requiring 32 bits) and to keep in mind that each audio frame gives rise to one activation, so the effective number of activations per batch is given by num_hidden_units * max_audio_length * batch_size / stride.

For example, assuming a model with a depth of 9 (and the default 1152 hidden units per layer), a batch size of 32, and a max duration of 30 s, the above count gives an estimate of 7.2 GB.
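
To make that concrete, here is a small back-of-the-envelope script for the activation term of the formula (the values are the defaults discussed in this thread; parameters, backprop deltas, and workspace buffers are not counted here and presumably account for the difference up to the 7.2-9.3 GB totals quoted above):

```python
# Rough activation-memory estimate following the formula above.
# Values are the defaults discussed in this thread; adjust to your config.
hidden_units = 1152        # rnn hidden units per layer (default)
depth = 9                  # number of rnn layers
batch_size = 32
max_audio_length = 30.0    # seconds
stride = 0.01              # seconds per audio frame (default time_stride)

frames = max_audio_length / stride                        # 3000 frames per utterance
activations = hidden_units * depth * frames * batch_size  # ~1.0e9 activations
activation_bytes = activations * 4                        # 32 bits (4 bytes) each

print("activation memory: %.1f GB" % (activation_bytes / 1e9))  # ~4.0 GB
```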

A friend of mine ran the training with the LibriSpeech dataset and faced the same issue after running multiple epochs. Finishing one epoch implies that none of his batches exceeded the memory requirements, yet the server ran out of memory afterwards.

Once the model is initialized, the memory footprint should not increase. Of course if another process is triggered on the same GPU, then you could run out of memory in the middle of a training session.

We have carried out hundreds of experiments with this model and have never encountered any kind of "memory leak". However, all our tests are carried out on local GPUs and not on AWS servers.

Laqshay commented 7 years ago

So to clarify, in the formula:
num_hidden_units = 1152 (default rnn hidden_units) * 9 (default rnn depth)
max_audio_length = max_utter_len (20/25/30 sec)
batch_size = 32 (or another multiple of 32)
stride = 0.01 s (default time_stride), so max_audio_length / stride = max number of audio frames
and each activation contributes 32 bits (4 bytes).

Thank you for your response. I have just one last query.

I have lost a lot of time because one epoch takes around a week to complete given the size of my dataset, and the training keeps getting interrupted before the first epoch finishes, for a variety of reasons. Is there any fix I can implement in the code to save the model after a fixed number of batches (let's say after every 200th or 500th batch)?
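
Would something along these lines work? This is only a rough sketch: it assumes neon's Callback base class, Callbacks.add_callback, Model.serialize, and neon.util.persist.save_obj behave as in the public neon releases, and BatchCheckpointCallback is a made-up name for illustration.

```python
# Rough sketch (not part of this repo): serialize the model every N minibatches.
from neon.callbacks.callbacks import Callback
from neon.util.persist import save_obj


class BatchCheckpointCallback(Callback):
    """Save a checkpoint every `save_freq` minibatches."""

    def __init__(self, save_path, save_freq=500):
        super(BatchCheckpointCallback, self).__init__()
        self.save_path = save_path
        self.save_freq = save_freq
        self.batches_seen = 0

    def on_minibatch_end(self, callback_data, model, epoch, minibatch):
        self.batches_seen += 1
        if self.batches_seen % self.save_freq == 0:
            # keep_states=True also stores optimizer state so training can resume
            save_obj(model.serialize(keep_states=True), self.save_path)


# usage: register before calling model.fit(...)
# callbacks.add_callback(BatchCheckpointCallback("deepspeech_ckpt.prm", save_freq=500))
```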