OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

how many seq2seq pairs does the toolkit support? #460

Closed · wolfshow closed this issue 6 years ago

wolfshow commented 6 years ago

I would like to pre-process a 20 million pair dataset with vocab=100k, but preprocessing always reports a memory error like this:

Preparing for training ...
Building training data...
Building vocabulary...
Building validation data...
Saving train/valid/vocab...
Traceback (most recent call last):
  File "preprocess.py", line 108, in <module>
    main()
  File "preprocess.py", line 103, in main
    torch.save(train, open(opt.save_data + '.train.pt', 'wb'))
  File "/home/wolfshow/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 135, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/wolfshow/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 117, in _with_file_like
    return body(f)
  File "/home/wolfshow/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 135, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/wolfshow/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 198, in _save
    pickler.dump(obj)
MemoryError
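
For context, a run of this shape would typically be launched with a command along the following lines; the file paths are placeholders and only the 100k vocabulary size comes from the report above, so this is a sketch rather than the reporter's actual command:

    # paths below are placeholders; the 100k vocab size is taken from the report
    python preprocess.py \
        -train_src data/train.src -train_tgt data/train.tgt \
        -valid_src data/valid.src -valid_tgt data/valid.tgt \
        -src_vocab_size 100000 -tgt_vocab_size 100000 \
        -save_data data/demo
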
JianyuZhan commented 6 years ago

We will have #384 merged soon; I think it will solve this problem to some degree. Would you mind testing it and reporting back?

Also, this memory error happens inside torch.save(), so I think it should be reported to PyTorch as well.

wolfshow commented 6 years ago

Sure. I will test it later.

srush commented 6 years ago

@JianyuZhan now that this is merged, do you think this problem will be fixed or just pushed to training time? Should we turn on sharding automatically if the source is too big?

JianyuZhan commented 6 years ago

@wolfshow, please pull the latest code and test again. I'd be glad to hear your results.

@srush, for this specific problem, I think it is fixed now, because we now do a sharded torch.save. And yes, the training side might still have memory issues, because we currently load all of the datasets before passing them to the Iterator; that is the next issue I plan to tackle.

As for automatic sharding, I currently print a warning to the user if the corpus is too big (>10MB) rather than making the decision for them. I don't yet have enough experience to say what an optimal max_shard_size should be, so I'd prefer to let users test this feature first; once concrete numbers are reported, we can add automatic sharding.

srush commented 6 years ago

So how many shards should @wolfshow try? Can you give a sample command?

JianyuZhan commented 6 years ago

@wolfshow, you could probably start with -max_shard_size set to 1 MB, and increase it to 2 MB, 4 MB, etc. if you want to test different settings. I'd appreciate it if you could report the results.
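
A full sample command would then look roughly like the following; the data paths and vocabulary flags are placeholders, and 1000000 stands for the 1 MB starting point, since the thread discusses -max_shard_size in bytes:

    # placeholder paths/vocab sizes; -max_shard_size given in bytes (1 MB here)
    python preprocess.py \
        -train_src data/train.src -train_tgt data/train.tgt \
        -valid_src data/valid.src -valid_tgt data/valid.tgt \
        -src_vocab_size 100000 -tgt_vocab_size 100000 \
        -save_data data/demo \
        -max_shard_size 1000000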

wolfshow commented 6 years ago

Thanks @JianyuZhan! I will try it and get back to you soon.

wolfshow commented 6 years ago

@JianyuZhan I tried to re-run the pre-processing with -max_shard_size set to 1 MB, but the process was automatically killed. The dataset contains 45 million pairs.

JianyuZhan commented 6 years ago

@wolfshow, please post the traceback.

wolfshow commented 6 years ago

@JianyuZhan I used it for a summarization task, so I also added "-dynamic_dict -share_vocab". But the process was automatically killed by the kernel as follows:

Preparing for training ...
Building & saving training data...
Killed

No other information.

JianyuZhan commented 6 years ago

Then you can try a smaller shard size to see if it works. If not, please report back and I will investigate and help you get more detailed info.
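
Concretely, that would mean re-running the earlier command with the summarization flags kept and a smaller shard size, for example (paths and the exact byte value are illustrative, not a recommendation from the maintainers):

    # same placeholder paths as above; shard size halved to 512 KB as one possible smaller setting
    python preprocess.py \
        -train_src data/train.src -train_tgt data/train.tgt \
        -valid_src data/valid.src -valid_tgt data/valid.tgt \
        -share_vocab -dynamic_dict \
        -save_data data/demo \
        -max_shard_size 524288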
