I didn't notice the logging-twice thing.
But it is indeed loading the data again. I ran the same code again with a smaller batch size (128), and I was at 48% MEM when it logged "Start training...". After that, it processed for another 45 minutes (so I'm guessing it takes about 65 minutes to load the data), and now it has been killed by the server:
```
[1873031.362219] Out of memory: Kill process 19577 (python2) score 502 or sacrifice child
[1873031.362231] Killed process 19577 (python2) total-vm:449309268kB, anon-rss:398061500kB, file-rss:89088kB
```
Also, is OpenNMT loading the entire training and validation data into memory? If yes, how can I control that? Is there an option to tell it to read the data in batches? If not, why is it taking so much memory and time?
No, it's just logging twice when using 1 GPU. There is a PR to fix it, but it's not merged yet.
Why is there an hour-long wait between the two log lines then? (Referring to the snapshot in the initial question:)
Start training... [10:24:49]
Loading dataset.. [11:28:39]
Step 50... [11:30:06]
For 22M sentences you need to shard; look for similar questions in the issues. Try no more than 1M segments per shard (it's the way torchtext works).
One thing I have observed is that you don't incur much of a penalty for having many shards, so erring on the side of smaller shards seems to work fine in practice (I currently use 10 MB shards; the recommendation elsewhere may be 131 MB shards).
As long as you shuffle beforehand it should be fine, but working with 131 MB shards gives you about 1M segments per shard, which works fine too.
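For reference, here is a minimal sketch of the shuffle-then-shard step described above (the file names, output naming scheme, and 1M-segment shard size are my assumptions, not OpenNMT-py conventions). Holding the raw text in memory once like this is far cheaper than the fully processed tensor datasets.

```python
# Sketch: shuffle a parallel corpus, then split it into fixed-size shards.
# File names and shard size below are hypothetical; adjust for your data.
import random

SHARD_SIZE = 1_000_000  # at most ~1M segments per shard, as suggested above

def shuffle_and_shard(src_path, tgt_path, out_prefix, shard_size=SHARD_SIZE):
    # Read the parallel corpus; both files must have the same number of lines.
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    assert len(src_lines) == len(tgt_lines), "src/tgt files are not parallel"

    # Shuffle source and target together so the sentence pairs stay aligned.
    pairs = list(zip(src_lines, tgt_lines))
    random.shuffle(pairs)

    # Write fixed-size shards: out_prefix.src.0 / out_prefix.tgt.0, .1, ...
    for start in range(0, len(pairs), shard_size):
        n = start // shard_size
        shard = pairs[start:start + shard_size]
        with open(f"{out_prefix}.src.{n}", "w", encoding="utf-8") as fs, \
             open(f"{out_prefix}.tgt.{n}", "w", encoding="utf-8") as ft:
            for src, tgt in shard:
                fs.write(src)
                ft.write(tgt)

if __name__ == "__main__":
    # Hypothetical corpus names; replace with your own files.
    shuffle_and_shard("train.src", "train.tgt", "train_shard")
```

Each shard pair can then be run through preprocess.py separately (or via whatever sharding option your OpenNMT-py version exposes), so peak memory is bounded by one shard rather than the whole corpus.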
To make the OpenNMT project audio list and sentence text list, see the following: https://github.com/eeric/OpenNMT-py-make-tgt-train.txt/blob/master/openNMT_make_label.py
Hi all,
I have successfully used an older version of OpenNMT-py (with pytorch 0.3). I recently saw that all the dependencies have been upgraded, and now I'm facing out-of-memory issues with the same configuration as before. However, that's not my main concern.
train.py first loads the data and takes a LOT of time doing so. Then, once the model is built, it loads the data again for another hour before starting training. Why is this done?