Helsinki-NLP / FoTraNMT

Open Source Neural Machine Translation in PyTorch
http://opennmt.net/
MIT License

Neural interlingua: Torch not enough memory after 8 hours #11

Open Gldkslfmsd opened 6 years ago

Gldkslfmsd commented 6 years ago

Hello,

I'm experimenting with my PR branch of neural-interlingua. After 8 hours of problem-free training, this happened:

[2018-11-07 21:47:35,402 INFO] Current language pair: ('de', 'cs')
[2018-11-07 21:47:35,433 INFO] Loading valid dataset from all_pairs_preprocessed/de-cs/data.valid.1.pt, number of examples: 1014
[2018-11-07 21:47:36,350 INFO] Validation perplexity: 6.25632
[2018-11-07 21:47:36,351 INFO] Validation accuracy: 62.3635
[2018-11-07 21:48:45,551 INFO] >> BLEU = 23.28, 55.7/29.4/17.9/11.0 (BP=0.976, ratio=0.977, hyp_len=10099, ref_len=10342)
[2018-11-07 21:48:45,555 INFO] Current language pair: ('fr', 'cs')
[2018-11-07 21:48:45,581 INFO] Loading valid dataset from all_pairs_preprocessed/fr-cs/data.valid.1.pt, number of examples: 1014

[2018-11-07 21:48:46,478 INFO] Validation perplexity: 5.42896
[2018-11-07 21:48:46,478 INFO] Validation accuracy: 64.3625
Traceback (most recent call last):
  File "../train.py", line 40, in <module>
    main(opt)
  File "../train.py", line 27, in main
    single_main(opt)
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/train_single.py", line 239, in main
    opt.valid_steps)
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/trainer.py", line 271, in train
    parser = argparse.ArgumentParser(prog = 'translate.py',
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/translate/translator.py", line 43, in build_translator
    model = onmt.model_builder.load_test_multitask_model(opt)
  File "/lnet/spec/work/people/machacek/neural-interlingua/OpenNMT-py-dm/onmt/model_builder.py", line 133, in load_test_multitask_model
    map_location=lambda storage, loc: storage)
  File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/lnet/spec/work/people/machacek/neural-interlingua/p2-onmt/local/lib/python2.7/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.cpp:204

The dataset is multi30k, so no big file is loaded into memory. I observed this error twice: once after 15 hours of training with 12 src-tgt training pairs, and once after 8 hours with 10 pairs. I don't have any comparable error-free run.

Most probably no other process was using the same machine at the same time, but I can't be sure.

Any suggestions or ideas about what is happening and how to fix it? My only idea is to merge the newest master from OpenNMT-py and hope that it's already fixed there.
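For reference, the `map_location` lambda visible in the traceback maps every storage to CPU when the checkpoint is loaded for in-training translation. A minimal sketch of that loading pattern, with an explicit cleanup step in case the repeated validation/translation runs are what accumulates memory (`load_checkpoint_cpu` is a hypothetical helper, not part of OpenNMT-py):

```python
import gc
import torch

def load_checkpoint_cpu(path):
    # Same map_location pattern as in the traceback above: every
    # storage is mapped to CPU, so torch.load never allocates on the
    # GPU, but the tensors still occupy host RAM until released.
    return torch.load(path, map_location=lambda storage, loc: storage)

# Usage inside a long training loop: drop the reference and collect
# as soon as the translator is built, so periodic in-training
# validation does not keep old checkpoints alive:
#   checkpoint = load_checkpoint_cpu("model_step_1000.pt")
#   ... build translator from checkpoint ...
#   del checkpoint
#   gc.collect()
```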

chrishokamp commented 5 years ago

hi @Gldkslfmsd, I guess the serializer is storing some state that grows as training continues. My fork is too different to even think about merging now, but you might be able to get an idea of how to fix serialization here: https://github.com/chrishokamp/OpenNMT-py/blob/multi-decoder-generator/onmt/models/model_saver.py#L115-L121

BTW why are you using python 2.7? :fearful:
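A hedged sketch of what a model_saver fix along those lines might amount to (I haven't verified the linked lines; `lean_checkpoint` is a hypothetical helper): move the state dict to CPU and keep only what translation needs, so checkpoints don't carry training-only state.

```python
import torch

def lean_checkpoint(model, optimizer=None, keep_optimizer=False):
    # Hypothetical helper, not the actual model_saver code: serialize
    # only what translation needs. Moving tensors to CPU first keeps
    # GPU storages out of the pickle, and dropping optimizer state by
    # default keeps checkpoints from carrying training-only buffers.
    state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    checkpoint = {"model": state}
    if keep_optimizer and optimizer is not None:
        checkpoint["optim"] = optimizer.state_dict()
    return checkpoint
```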

Gldkslfmsd commented 5 years ago

Hi Chris, thanks for the reply.

hi @Gldkslfmsd, I guess the serializer is storing some state that grows as training continues,

I don't understand what happens there. All the checkpoint files (except the very first one) have the same size in bytes. Is there a memory leak inside the torch loader?

my fork is too different to even think about merging now,

I saw you merged the recent master of the original OpenNMT-py repo into your branch. Is your branch an extension of Helsinki-NLP/neural-interlingua? Let's merge it. I wanted to merge master on my own, but then I realized it's too much work...

but you might be able to get an idea how to fix serialization here: https://github.com/chrishokamp/OpenNMT-py/blob/multi-decoder-generator/onmt/models/model_saver.py#L115-L121

Thanks.

BTW why are you using python 2.7? :fearful:

Yes. Does OpenNMT-py support Python 3 nowadays? I didn't notice...

chrishokamp commented 5 years ago

Hi, my fork is an extension in a way, but unfortunately it depends on some stuff that I can't make public right now, so merging is not a good idea. Yes, I've been trying to keep up with the upstream changes in master in my fork. Where are you stuck in merging with master? There will be lots of conflicts, but most of them shouldn't be too hard to resolve.

Gldkslfmsd commented 5 years ago

There were around 5 conflicts. It seemed easy, but I wasn't patient enough to go through them. I would need to either test the code manually or write some unit tests from scratch. Both are pretty demanding.

I tried to run the existing automated tests on plain recent master and they didn't work. They failed in some audio-related code, on something like `import torchaudio`. Is there any documentation about the tests? I didn't notice any.

chrishokamp commented 5 years ago

I don't know about the tests in OpenNMT-py, but since what you care about is the multi-task setting, you can do a manual integration test by asserting that you get the same BLEU scores before and after merging.
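Such a check could be as simple as comparing per-pair BLEU dictionaries from the two runs. A sketch, using the de-cs score from the log above plus made-up numbers; the 0.05 tolerance is an arbitrary choice:

```python
def assert_bleu_unchanged(before, after, tol=0.05):
    # Compare BLEU per language pair; fail loudly on any drift larger
    # than the tolerance, which would suggest the merge changed
    # behavior rather than just code layout.
    for pair, score in before.items():
        drift = abs(score - after[pair])
        assert drift <= tol, f"{pair}: {score} -> {after[pair]}"

# de-cs comes from the validation log above; fr-cs is made up.
before = {("de", "cs"): 23.28, ("fr", "cs"): 24.10}
after  = {("de", "cs"): 23.28, ("fr", "cs"): 24.12}
assert_bleu_unchanged(before, after)  # passes: max drift is 0.02
```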

I don't understand what happens there. All the checkpoint files (except the very first one) have the same size in bytes. Is there a memory leak inside the torch loader?

If the checkpoint files all have the same size, then my theory is wrong.
