We will have #384 merged soon. I think it would solve this problem to some degree; would you mind testing it and reporting back?
And this is a memory error during torch.save(); I think it should be reported to PyTorch.
Sure. I will test it later.
@JianyuZhan now that this is merged, do you think this problem will be fixed or just pushed to training time? Should we turn on sharding automatically if the source is too big?
@wolfshow, please pull the latest code and test again. I'd be glad to hear your report.
@srush, for this specific problem, I think it should be fixed now, because we now do a sharded torch.save. And yes, the training side might still have memory issues, because we currently load all datasets to pass to the Iterator; that is the next issue I plan to tackle.
As for automatic sharding, I currently print a warning to the user if the corpus is too big (>10 MB) and don't make the decision for them. I don't yet have a sense of what an optimal max_shard_size should be, so I prefer to let users test this feature first; once some concrete numbers are reported, we can look at automatic sharding.
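For context, the sharded save is roughly the following idea. This is only a minimal sketch, not the actual OpenNMT-py code; the example format and the per-example size heuristic are my own assumptions here.

    import torch

    def save_in_shards(examples, path_prefix, max_shard_size):
        """Write the dataset in pieces so that no single torch.save()
        call has to serialize the whole corpus at once."""
        shard, shard_bytes, shard_idx = [], 0, 0
        for ex in examples:
            shard.append(ex)
            # rough size estimate; the real criterion may differ
            shard_bytes += len(ex["src"]) + len(ex["tgt"])
            if shard_bytes >= max_shard_size:
                torch.save(shard, "%s.%d.pt" % (path_prefix, shard_idx))
                shard, shard_bytes, shard_idx = [], 0, shard_idx + 1
        if shard:
            torch.save(shard, "%s.%d.pt" % (path_prefix, shard_idx))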
So how many shards should @wolfshow try? Can you give a sample command?
@wolfshow, you could probably start with -max_shard_size set to 1 MB, and increase to 2 MB, 4 MB, etc., if you want to test different settings. I'd appreciate it if you could report the results.
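For a concrete starting point, something along these lines (the paths are placeholders, and the flags other than -max_shard_size are just the usual preprocess.py options; 1048576 bytes = 1 MB):

    python preprocess.py -train_src data/train.src -train_tgt data/train.tgt \
        -valid_src data/valid.src -valid_tgt data/valid.tgt \
        -save_data data/processed -max_shard_size 1048576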
Thanks @JianyuZhan! I will try it and get back to you soon.
@JianyuZhan I tried to re-run the preprocessing with -max_shard_size set to 1 MB, but the process still gets killed automatically. The dataset contains 45 million pairs.
@wolfshow, please post the traceback.
@JianyuZhan I used it for a summarization task, so I also added "-dynamic_dict -share_vocab". The process is killed by the kernel as follows:
Preparing for training ... Building & saving training data... Killed
No other information.
Then you can try a smaller shard size and see if it works. If it doesn't, please report back and I will investigate to help you get more detailed info.
I would like to preprocess a dataset of 20 million pairs with vocab=100k. It always reports a memory error like this: