Gldkslfmsd opened this issue 6 years ago
Hi @Gldkslfmsd, I guess the serializer is storing some state which grows as the training continues. My fork is too different to even think about merging now, but you might be able to get an idea of how to fix serialization here: https://github.com/chrishokamp/OpenNMT-py/blob/multi-decoder-generator/onmt/models/model_saver.py#L115-L121
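Roughly the idea, as an untested sketch rather than the actual code in that file (the helper name is just illustrative): save only plain state dicts instead of pickling the model or trainer objects, so the checkpoint can't drag growing training state along with it.

```python
import torch

def save_checkpoint(model, optimizer, step, path):
    # Keep only plain tensors/state dicts in the checkpoint so pickling
    # doesn't pull in references to the trainer, data iterators, etc.
    checkpoint = {
        "model": {k: v.cpu() for k, v in model.state_dict().items()},
        "optim": optimizer.state_dict(),
        "step": step,
    }
    torch.save(checkpoint, path)
```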
BTW why are you using python 2.7? :fearful:
Hi Chris, thanks for the reply.
> Hi @Gldkslfmsd, I guess the serializer is storing some state which grows as the training continues.
I don't understand what happens there. All the checkpoint files (except the very first one) have the same size in bytes. Is there a memory leak inside the torch loader?
> My fork is too different to even think about merging now,
I saw you merged the recent master of the original OpenNMT-py repo into your branch. Is your branch an extension of Helsinki-NLP/neural-interlingua? Let's merge it. I wanted to merge master on my own, but then I realized it's too much work...
> but you might be able to get an idea of how to fix serialization here: https://github.com/chrishokamp/OpenNMT-py/blob/multi-decoder-generator/onmt/models/model_saver.py#L115-L121
Thanks.
> BTW why are you using python 2.7? :fearful:
Yes. Does OpenNMT-py support Python 3 nowadays? I didn't notice...
Hi, my fork is an extension in a way, but unfortunately it depends on some stuff that I can't make public right now, so it's not a good idea to merge. Yes, I've been trying to keep up with the upstream changes in master in my fork. Where are you stuck in merging with master? There will be lots of conflicts, but most of them shouldn't be too hard to resolve.
There were around 5 conflicts. It seemed easy, but I wasn't patient enough to go through them. I would need to either test the code manually or write some unit tests on my own from scratch. Both are pretty demanding.
I tried to run the existing automatic tests on plain recent master and they didn't work. They failed on some audio piece of code, on something like "import torchaudio". Is there any documentation about the tests? I didn't notice any.
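(If it's just a missing optional dependency, I guess the audio tests could be guarded so they're skipped when torchaudio isn't installed; a sketch with a made-up test class name:)

```python
import unittest

try:
    import torchaudio  # optional dependency, only needed for the audio tests
    HAS_TORCHAUDIO = True
except ImportError:
    HAS_TORCHAUDIO = False

@unittest.skipUnless(HAS_TORCHAUDIO, "torchaudio is not installed")
class TestAudioDataset(unittest.TestCase):  # placeholder name
    def test_audio_loading(self):
        ...
```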
I don't know about the tests in OpenNMT-py, but since what you care about is the multi-task setting you can do a manual integration test by asserting that you get the same BLEU scores before and after merging.
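Something along these lines, as a rough sketch (the file names are placeholders, and I'm assuming you translate the same dev set with both branches and score the outputs with sacrebleu):

```python
# Rough integration check: translate the same dev set with the pre-merge and
# post-merge branches, then compare corpus BLEU of the two outputs.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

refs = read_lines("dev.ref.txt")                   # placeholder file names
hyp_before = read_lines("dev.hyp.before_merge.txt")
hyp_after = read_lines("dev.hyp.after_merge.txt")

bleu_before = sacrebleu.corpus_bleu(hyp_before, [refs]).score
bleu_after = sacrebleu.corpus_bleu(hyp_after, [refs]).score
print(f"BLEU before merge: {bleu_before:.2f}, after merge: {bleu_after:.2f}")
assert abs(bleu_before - bleu_after) < 0.1, "merge changed translation quality"
```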
> I don't understand what happens there. All the checkpoint files (except the very first one) have the same size in bytes. Is there a memory leak inside the torch loader?
If the checkpoint files all have the same size, then my theory is wrong.
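You could still double-check what the checkpoints actually contain; a quick sketch, assuming they are ordinary torch.save dicts (the path is a placeholder):

```python
import torch

# Load one checkpoint on CPU and see how big each part of it is,
# to check whether any entry is accumulating state over time.
ckpt = torch.load("model_step_10000.pt", map_location="cpu")  # placeholder path
for key, value in ckpt.items():
    if isinstance(value, dict):
        print(key, "->", len(value), "entries")
    else:
        print(key, "->", type(value).__name__)
```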
Hello,
I'm experimenting with my PR branch of neural-interlingua. After 8 hours of problem-free training, this error appeared:
The dataset is multi30k, so no big file is loaded into memory. I observed this error twice, once after 15 hours of training with 12 src-tgt training pairs, and once after 8 hours with 10 pairs. I don't have any comparable error-free run.
Most probably, no other process was using the same machine at the same time, but I can't be sure.
Any suggestions or ideas about what is happening and how to fix it? My only idea is to merge the newest master from OpenNMT-py and hope that it's already fixed there.