elliottd / GroundedTranslation

Multilingual image description
https://staff.fnwi.uva.nl/d.elliott/GroundedTranslation/
BSD 3-Clause "New" or "Revised" License

MemoryError and rewriting data_generator #17

Closed evanmiltenburg closed 8 years ago

evanmiltenburg commented 8 years ago

Since I don't want to bombard your inbox with emails, I'll make an issue here for the MemoryError I got with the Flickr30k data. Loading everything at once into memory seems to be too cumbersome for my machine. So maybe a generator-based approach is better. Keras now provides a fit_generator function that seems perfect to keep memory requirements as low as possible.

I just wrote a simple generator that automatically randomizes the training and val data, see 6dd0b27215594d6602036588fbfd4d6379bb59e1. If I can get everything to run, this should make the module scale really well.
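
Roughly, the generator has this shape (the vectorisation helper below is a placeholder, not the current data_generator API):

import numpy as np

def training_generator(image_feats, descriptions, batch_size, vectorise_batch):
    # Yield (inputs, targets) batches forever, reshuffling on every pass.
    # image_feats and descriptions are parallel lists; vectorise_batch is a
    # placeholder for whatever turns raw examples into numpy arrays.
    n = len(descriptions)
    while True:
        order = np.random.permutation(n)   # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield vectorise_batch([image_feats[i] for i in idx],
                                  [descriptions[i] for i in idx])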

scfrank commented 8 years ago

Hi Emiel, I haven't looked at data_generator for a while now, but have you tried using yield_training_batch? It is a generator over "big_batches" for keras.fit(), precisely to keep the memory usage within limits. If you're running into Memory Errors, maybe lowering big_batch_size will help. Admittedly this is hardcoded for training data, so if the problem is in val this won't work. Also IIRC it doesn't randomize, certainly not outside the big_batch. But +1 for rewriting data_generator, it has turned into a monster.
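
For reference, the big-batch idea boils down to something like this (simplified; build_arrays stands in for the real vectorisation code, and the per-chunk fit() call is shown as a comment):

def yield_big_batches(examples, big_batch_size, build_arrays):
    # Yield the training data in large chunks so that only one chunk of
    # numpy arrays is resident at a time.
    for start in range(0, len(examples), big_batch_size):
        chunk = examples[start:start + big_batch_size]
        yield build_arrays(chunk)   # (inputs, targets) numpy arrays

# for inputs, targets in yield_big_batches(train_data, 10000, build_arrays):
#     model.fit(inputs, targets, nb_epoch=1)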

evanmiltenburg commented 8 years ago

Thanks for the response, but I'm afraid that won't help. For some reason my machine keeps running out of memory after the first epoch, while computing the perplexity of the model. Here is the full error message:

Traceback (most recent call last):
  File "train.py", line 271, in <module>
    model.train_model()
  File "train.py", line 131, in train_model
    shuffle=True)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 490, in fit
    shuffle=shuffle, metrics=metrics)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 231, in _fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/callbacks.py", line 36, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/data/GroundedTranslation-0.1/Callbacks.py", line 96, in on_epoch_end
    val_pplx = self.calculate_pplx()
  File "/data/GroundedTranslation-0.1/Callbacks.py", line 402, in calculate_pplx
    self.use_sourcelang, self.use_image)
  File "/data/GroundedTranslation-0.1/data_generator.py", line 451, in get_data_by_split
    targets = self.get_target_descriptions(arrays[0])
  File "/data/GroundedTranslation-0.1/data_generator.py", line 735, in get_target_descriptions
    target_array = np.zeros(input_array.shape)
MemoryError

scfrank commented 8 years ago

Ok - it looks like it's running out of memory on the val data, which we definitely don't have a solution for (and which will require rewriting calculate_pplx too). Quick for-now fixes: can you use a smaller vocabulary? A smaller dev set?

evanmiltenburg commented 8 years ago

Good call, I could use --unk for this! Raising the lower bound will get rid of most of the long tail while maintaining the size of the dataset. (Rewriting calculate_pplx is also on the list).

evanmiltenburg commented 8 years ago

Using a larger value for --unk was a good idea. The smaller the vocabulary, the further I get.

If I understand the model correctly, this is a fundamental problem with 1-hot encoding. As the dataset grows, the vocabulary grows. And as the vocabulary size grows, the size of the matrix the model has to predict grows along with it. As it stands, I have to remove about 5% from the Flickr30k data in order for everything to fit in memory. From the model:

Retained / Original Tokens: 1695886 / 1785683 (94.97 pc)

The discarded fraction will be even bigger for the MS COCO dataset, which would mean throwing out words that aren't actually that rare. (In fact, we probably already are with the Flickr30k data. I am currently seeing if the system holds up with unk=30. It would be interesting to log which words get thrown out.) One solution is to make the whole algorithm leaner by keeping only one sentence in memory at any time (plus one for the prediction). This would raise the upper bound on vocabulary size somewhat, but it remains a hard limit. That limit might just be enough for most use cases, though.
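
A rough back-of-the-envelope calculation (the numbers below are purely illustrative, not taken from the actual splits) shows how quickly the dense one-hot target array grows:

# Dense one-hot target array: sentences x timesteps x vocabulary cells.
# All numbers are illustrative.
sentences = 5000        # e.g. a val split of ~1000 images x 5 descriptions
timesteps = 30          # padded sentence length
vocab = 18000           # vocabulary size after --unk filtering
bytes_per_cell = 8      # np.zeros defaults to float64

print("%.1f GB" % (sentences * timesteps * vocab * bytes_per_cell / 1e9))
# -> 21.6 GB just for the val targets; the training split is far larger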

Another solution is to drop the one-hot encodings and instead use word vectors from a word2vec model, such as the GoogleNews one. This might also make learning a bit easier for the description model. (Also, I really wonder how well the predictions would generalize to unseen words that aren't in the training data but that are in the word2vec model.)
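
For illustration only (this is not wired into the current model, and the path and vocabulary mapping are placeholders), loading pretrained vectors with gensim and feeding them into a frozen Keras Embedding layer could look like this:

import numpy as np
from gensim.models import KeyedVectors   # recent gensim API
from keras.layers import Embedding

# Path is illustrative; the GoogleNews vectors are 300-dimensional.
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                        binary=True)

def build_embedding_layer(word2index, dim=300):
    # word2index is whatever integer vocabulary mapping data_generator produces.
    # Words without a pretrained vector keep a small random one.
    matrix = np.random.uniform(-0.05, 0.05, (len(word2index), dim))
    for word, idx in word2index.items():
        if word in w2v:
            matrix[idx] = w2v[word]
    return Embedding(input_dim=len(word2index), output_dim=dim,
                     weights=[matrix], trainable=False)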

evanmiltenburg commented 8 years ago

Ok, by increasing --unk a little at a time, I think I've managed to localize the bottlenecks. The first bottleneck is computing the perplexity, but if you increase --unk just enough, it's no longer an issue. The bigger problem comes with generating descriptions and computing the BLEU score. This requires more memory, so I needed to increase --unk even more.

This was a bit surprising to me, as predicting the next word and building a sentence is exactly what the model had been doing for the entire epoch. (Or isn't this what training is all about?) So I suspect there might be a more memory-efficient way to generate sentences. (Perhaps in a trade-off with time-efficiency.)
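
For example, decoding one description at a time would keep only a tiny input array in memory. A greedy argmax sketch (the model's input layout and the special tokens here are placeholders, not the real interface):

import numpy as np

def generate_greedy(model, image_feat, word2index, index2word, maxlen=30):
    # Decode a single description; only one batch of size 1 ever exists.
    words = ['<BOS>']
    for _ in range(maxlen):
        x_words = np.array([[word2index[w] for w in words]])
        x_image = np.array([image_feat])
        probs = model.predict([x_words, x_image], verbose=0)
        next_word = index2word[int(np.argmax(probs[0, -1]))]
        if next_word == '<EOS>':
            break
        words.append(next_word)
    return ' '.join(words[1:])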

One thing that worries me is the BLEU score. For the first two epochs (now in the third), the BLEU score has been 0.00. I'd hoped for the score to be at least a little bit better than that. Could this also be because guessing the correct word for a huge vocabulary is extremely difficult? If so, could beam search help here? Or a different encoding of the vocabulary? (E.g. word2vec or something similar.)

scfrank commented 8 years ago

Both perplexity and generation happen with val, so you're running into issues whenever you need to load a val data matrix into memory. I suspect it happens more when you're generating because you've already filled up most of your ram with perplexity calculations and python isn't garbage collecting hyperefficiently, so you start swapping.

Does Keras now support passing generators for the validation data to keras.fit()? When we wrote data_generator, it required a numpy array, which is why there's no generator for val. If that's changed, moving to generators would be ideal (and maybe it would be worth using generators in our own Callbacks anyway). Des also said something about Keras now supporting the kind of integer -> embedding mapping you were talking about, for moving away from one-hot mappings; this would help a lot. I'm not using Keras myself at the moment, so I'm not really up to speed on recent developments.

For your current experiments: can you cut your validation data down to a very small number (50-100 sentences) and train on --unk 3 or so? I.e. is the bottleneck for the vocabulary just due to the val array sizes? In this scenario, you might also see better BLEU scores, since you'll have fewer unks in your output. Is your training cost going down? Val cost/error? These measures are less brittle than BLEU.

evanmiltenburg commented 8 years ago

I suspect it happens more when you're generating because you've already filled up most of your ram with perplexity calculations and python isn't garbage collecting hyperefficiently, so you start swapping.

Ah, so adding a couple of del statements (i.e. actively stepping in to manage the memory) might already help a lot in terms of efficiency.
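
Something like this pattern at the end of the callback, where the array below is a stand-in for the real val arrays:

import gc
import numpy as np

big = np.zeros((1000, 1000))   # stand-in for a big val array
# ... perplexity computed from it here ...
del big        # drop the reference as soon as it's no longer needed
gc.collect()   # encourage CPython to release the memory before generation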

Does Keras now support passing generators for the validation data to keras.fit()?

Yes, or at least there's a new function called fit_generator. It has arguments for both a training data generator and a validation data generator. Link to models.py, which has very informative docstrings.
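
To make the call shape concrete, here's a toy example with a throwaway model and dummy generators (Keras 1.x keyword names; in our case the generators would wrap data_generator instead):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(10, input_dim=20, activation='softmax')])
model.compile(loss='categorical_crossentropy', optimizer='adam')

def dummy_generator(batch_size=32):
    # Stand-in for a real (inputs, one-hot targets) generator.
    while True:
        x = np.random.rand(batch_size, 20)
        y = np.eye(10)[np.random.randint(10, size=batch_size)]
        yield x, y

model.fit_generator(dummy_generator(),
                    samples_per_epoch=640,
                    nb_epoch=2,
                    validation_data=dummy_generator(),
                    nb_val_samples=160)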

For your current experiments: can you cut your validation data down to a very small number (50-100 sentences) and train on --unk 3 or so? I.e. is the bottleneck for the vocabulary just due to the val array sizes?

Good idea. Will try that later today.

evanmiltenburg commented 8 years ago

The idea seems to work. I started the training this evening with --small_val and (i) it didn't get a memory error after the first two epochs, and (ii) it returns nonzero BLEU scores:

INFO:Callbacks:Best BLEU: 1 | val pplx 53.43271 bleu 10.13
INFO:Callbacks:Best BLEU: 2 | val pplx 32.22855 bleu 10.27

Now I'm definitely convinced that a generator-based solution is the way to go. With generators, the one-hot encoding strategy should still work even with a huge vocabulary, and the size of val will no longer be the limiting factor.

evanmiltenburg commented 8 years ago

So here is a short summary of the run last night:

INFO:Callbacks:Checkpoint 1 | val pplx: 53.43271 bleu 10.13
INFO:Callbacks:Checkpoint 2 | val pplx: 32.22855 bleu 10.27
INFO:Callbacks:Checkpoint 3 | val pplx: 27.18124 bleu 8.66
INFO:Callbacks:Checkpoint 4 | val pplx: 24.51813 bleu 9.16
INFO:Callbacks:Checkpoint 5 | val pplx: 22.97953 bleu 8.70
INFO:Callbacks:Checkpoint 6 | val pplx: 21.89622 bleu 7.13
INFO:Callbacks:Checkpoint 7 | val pplx: 21.17004 bleu 8.80

It's still going, but because BLEU hasn't improved since the second checkpoint it will not go on for much longer. Anyway, this is more evidence that the model will keep training if we don't process everything all at once.

elliottd commented 8 years ago

You've raised some great issues about how the data_generator is currently built. We do need to fix this if we're to ever work with substantially larger datasets.

I've identified a few key places we can reduce the memory footprint without changing the input data types. I'll fix this up sometime in the next week. But a validation / test data generator would be a great addition to the project!

elliottd commented 8 years ago

Memory issues have now been addressed by my merge commit 3b735585d123151f6b7c4ce8e2605094600d0141