elliottd / GroundedTranslation

Multilingual image description
https://staff.fnwi.uva.nl/d.elliott/GroundedTranslation/
BSD 3-Clause "New" or "Revised" License
46 stars 25 forks

MemoryError #1

Closed scfrank closed 9 years ago

scfrank commented 9 years ago

Running python train.py with default arguments (flickr8k dataset etc) throws a MemoryError:

$python train.py 
Extracting vocabulary
Pickling dictionary to checkpoint//dictionary.pk
Traceback (most recent call last):
  File "train.py", line 289, in <module>
    w.trainModel()
  File "train.py", line 43, in trainModel
    trainX, trainIX, trainY, valX, valIX, valY = self.prepareInput()
  File "train.py", line 113, in prepareInput
    trainX, trainIX, trainY = self.createPaddedInputSequences(self.train, self.trainVGG)
  File "train.py", line 243, in createPaddedInputSequences
    return self.vectoriseSequences(split, vggFeats, sentences, next_words, vgg)
  File "train.py", line 248, in vectoriseSequences
    vectorised_sentences = np.zeros((len(sentences), self.maxSeqLen+1, len(self.vocab)))
MemoryError

This is on a CPU machine (4 cores, 8GB RAM, 16GB swap). The shape passed to np.zeros() is 30000 × 39 × 2763, i.e. roughly 3.2 × 10^9 items, which at 8 bytes per 64-bit zero works out to a ~24 GB array, well beyond this machine's RAM.
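A back-of-the-envelope check of the allocation above, using the shape reported in the traceback (this is just a calculation, not code from the repository):

```python
import numpy as np

# Shape of the dense one-hot tensor allocated in vectoriseSequences:
# (num_sentences, maxSeqLen + 1, vocab_size), as reported above.
num_sentences, seq_len, vocab = 30000, 39, 2763

items = num_sentences * seq_len * vocab
bytes_f64 = items * np.dtype(np.float64).itemsize
bytes_f32 = items * np.dtype(np.float32).itemsize

print(f"{items:,} items")                       # 3,232,710,000 items
print(f"float64: {bytes_f64 / 2**30:.1f} GiB")  # float64: 24.1 GiB
print(f"float32: {bytes_f32 / 2**30:.1f} GiB")  # float32: 12.0 GiB
```

So even at 32-bit precision the dense tensor cannot fit in 8GB of RAM.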

elliottd commented 9 years ago

Yup! I didn't bump into this error because our server has 256GB RAM.

I think Theano (and by association Keras) uses 32-bit floats by default, which means roughly 12 GiB of RAM (30000 × 39 × 2763 × 4 bytes per 32-bit float ≈ 12.9 × 10^9 bytes) just to allocate the vectorised_sentences structure.

I don't know how we can reduce the memory footprint; perhaps a scipy sparse matrix might help? It will become an even bigger problem in the future with larger datasets.
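For reference, a minimal sketch of the sparse idea. SciPy sparse matrices are 2-D, so each sentence would get its own (seq_len, vocab_size) matrix with a single 1 per row; the helper name and input format here are hypothetical, not part of the repository:

```python
import numpy as np
from scipy.sparse import csr_matrix

def one_hot_sparse(token_ids, vocab_size):
    """One sentence as a (seq_len, vocab_size) CSR matrix with one 1 per row.

    Storage is O(seq_len) instead of O(seq_len * vocab_size).
    """
    n = len(token_ids)
    data = np.ones(n, dtype=np.float32)
    rows = np.arange(n)
    return csr_matrix((data, (rows, token_ids)), shape=(n, vocab_size))

sent = one_hot_sparse([5, 0, 12], vocab_size=2763)
dense = sent.toarray()  # densify only when a batch is actually needed
```

The catch is that the model still expects dense input, so the sparse form only helps if densification happens per batch rather than over the whole corpus.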

scfrank commented 9 years ago

I will look at the code with a view to using iterables / generating batches on the fly instead of holding everything in memory. This would pretty soon amount to (badly) reinventing fuel (https://github.com/mila-udem/fuel), so it may be worth moving to that instead.
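The generate-on-the-fly idea could look something like the sketch below (a toy illustration, not the repository's API; `sentences` is assumed to be a list of token-id lists):

```python
import numpy as np

def batch_generator(sentences, vocab_size, max_len, batch_size=128):
    """Yield dense one-hot batches instead of materialising the full tensor."""
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        batch = np.zeros((len(chunk), max_len, vocab_size), dtype=np.float32)
        for i, ids in enumerate(chunk):
            for t, tok in enumerate(ids[:max_len]):
                batch[i, t, tok] = 1.0
        yield batch
```

Peak memory then scales with batch_size × max_len × vocab_size (e.g. 128 × 39 × 2763 float32s ≈ 55 MB) rather than with the corpus size, which is essentially what fuel's data streams provide.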


elliottd commented 9 years ago

Fixed by 17c9eff