I didn't notice the logging-twice thing.
But it is indeed loading the data again. I ran the same code again with a smaller batch size (128), and I was at 48% MEM when it logged "Start training...". After that, it processed for another 45 minutes (so I'm guessing it takes about 65 minutes to load the data), and now it has been killed by the server:
```
[1873031.362219] Out of memory: Kill process 19577 (python2) score 502 or sacrifice child
[1873031.362231] Killed process 19577 (python2) total-vm:449309268kB, anon-rss:398061500kB, file-rss:89088kB
```
Also, is OpenNMT loading the entire training and validation data into memory? If yes, how can I control that? Is there an option to tell it to read the data in batches? If not, why is it taking so much memory and time?
No, it's just logging twice when using 1 GPU. There is a PR to fix it, but it's not merged yet.
Why is there an hour-long wait between the two log lines then? (Referring to the snapshot in the initial question:)
Start training... [10:24:49]
Loading dataset.. [11:28:39]
Step 50... [11:30:06]
For 22M sentences you need to shard; look for similar questions in the issues. Try no more than 1M segments per shard (it's the way torchtext works).
One thing I have observed is that you don't incur much of a penalty for having many shards, so erring on the side of smaller shards seems to work fine in practice (I currently use 10 MB shards; the recommendation elsewhere may be 131 MB shards).
As long as you shuffle beforehand it should be fine, but working with 131 MB shards gives you about 1M segments per shard, which works fine too.
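For reference, here is a minimal sketch of the shuffle-then-shard step described above (the file names, output naming scheme, and 1M-segment shard size are my assumptions, not OpenNMT-py conventions). Holding the raw text in memory once like this is far cheaper than the fully processed tensor datasets.

```python
# Sketch: shuffle a parallel corpus, then split it into fixed-size shards.
# File names and shard size below are hypothetical; adjust for your data.
import random

SHARD_SIZE = 1_000_000  # at most ~1M segments per shard, as suggested above

def shuffle_and_shard(src_path, tgt_path, out_prefix, shard_size=SHARD_SIZE):
    # Read the parallel corpus; both files must have the same number of lines.
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    assert len(src_lines) == len(tgt_lines), "src/tgt files are not parallel"

    # Shuffle source and target together so the sentence pairs stay aligned.
    pairs = list(zip(src_lines, tgt_lines))
    random.shuffle(pairs)

    # Write fixed-size shards: out_prefix.src.0 / out_prefix.tgt.0, .1, ...
    for start in range(0, len(pairs), shard_size):
        n = start // shard_size
        shard = pairs[start:start + shard_size]
        with open(f"{out_prefix}.src.{n}", "w", encoding="utf-8") as fs, \
             open(f"{out_prefix}.tgt.{n}", "w", encoding="utf-8") as ft:
            for src, tgt in shard:
                fs.write(src)
                ft.write(tgt)

if __name__ == "__main__":
    # Hypothetical corpus names; replace with your own files.
    shuffle_and_shard("train.src", "train.tgt", "train_shard")
```

Each shard pair can then be run through preprocess.py separately (or via whatever sharding option your OpenNMT-py version exposes), so peak memory is bounded by one shard rather than the whole corpus.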
To make the OpenNMT project audio list and sentence text list, see the following: https://github.com/eeric/OpenNMT-py-make-tgt-train.txt/blob/master/openNMT_make_label.py
Hi all,
I have successfully used an older version of OpenNMT-py (with pytorch 0.3). I recently saw that all the dependencies have been upgraded, and now I'm facing out-of-memory issues with the same configuration as before. However, that's not my main concern.
train.py first loads the data and takes a LOT of time doing so. Then, once the model is built, it loads the data again for another hour before starting training. Why is this done?