claws-lab / jodie

A PyTorch implementation of ACM SIGKDD 2019 paper "Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks"
MIT License

model save error for big dataset && cuda OOM problem #6

Closed wenjf closed 5 years ago

wenjf commented 5 years ago

Thanks for sharing the source code of JODIE! It's a really impressive method with great potential for recommendation!

However, I'm now trying to apply this code to an online shopping dataset with 10k users and 20k items, and I've run into two hard problems.

  1. The code crashes at the end of the first epoch, when the save_model function in jodie.py is called, with the following error message:

```
Traceback (most recent call last):
  File "jodie.py", line 219, in <module>
    save_model(model, optimizer, args, ep, user_embeddings_dystat, item_embeddings_dystat, train_end_idx, user_embeddings_timeseries, item_embeddings_timeseries)
  File "/home/jianfeng/dl/jodie_dip/library_models.py", line 163, in save_model
    torch.save(state, filename)
  File "/home/jianfeng/.conda/envs/jodie/lib/python2.7/site-packages/torch/serialization.py", line 260, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/jianfeng/.conda/envs/jodie/lib/python2.7/site-packages/torch/serialization.py", line 185, in _with_file_like
    return body(f)
  File "/home/jianfeng/.conda/envs/jodie/lib/python2.7/site-packages/torch/serialization.py", line 260, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/jianfeng/.conda/envs/jodie/lib/python2.7/site-packages/torch/serialization.py", line 332, in _save
    pickler.dump(obj)
OverflowError: cannot serialize a string larger than 2 GiB
```

  2. For another dataset of similar size, the code crashes with the following error message:

```
Initializing the JODIE model
Initializing user and item embeddings
Initializing user and item RNNs
Initializing linear layers
JODIE initialization complete

Training the JODIE model for 1 epochs
Epoch 0 of 1:   0%|          | 0/1 [00:00<?, ?it/s]
 |█▊        | 9382/50209 [00:49<01:25, 478.87it/s]
 |████| 4/4 [00:00<00:00, 18.88it/s]
Traceback (most recent call last):
  File "jodie.py", line 193, in <module>
    loss.backward()
  File "/home/jianfeng/.conda/envs/jodie/lib/python2.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/jianfeng/.conda/envs/jodie/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.81 GiB (GPU 0; 15.90 GiB total capacity; 12.60 GiB already allocated; 819.88 MiB free; 1.80 GiB cached)
```
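For what it's worth, the OverflowError in the first problem comes from Python 2's pickle, which cannot serialize a single string larger than 2 GiB; torch.save hits this when the checkpoint dict contains the full embedding time series. A minimal sketch of one possible workaround, not the repo's actual code: pull the largest arrays out of the checkpoint dict and save them with np.save (which streams to disk and has no pickle-string limit), pickling only the rest. The key names passed in `big_keys` are placeholders, not necessarily the names used in library_models.py.

```python
import pickle
import numpy as np

def save_state_split(state, filename, big_keys):
    """Save a checkpoint dict, storing the huge arrays listed in
    big_keys as separate .npy files to dodge the 2 GiB pickle limit."""
    small = {k: v for k, v in state.items() if k not in big_keys}
    for k in big_keys:
        # np.save writes the array directly to disk, no giant pickle string
        np.save("%s.%s.npy" % (filename, k), state[k])
    with open(filename, "wb") as f:
        pickle.dump(small, f, protocol=2)

def load_state_split(filename, big_keys):
    """Reload the small pickled state and re-attach the big arrays."""
    with open(filename, "rb") as f:
        state = pickle.load(f)
    for k in big_keys:
        state[k] = np.load("%s.%s.npy" % (filename, k))
    return state
```

(For GPU tensors you would call `.cpu().numpy()` before np.save; on Python 3, simply passing `pickle_protocol=4` to torch.save also lifts the 2 GiB limit.)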

It seems that CUDA ran out of memory. I think this is caused by a t-batch that is too large, but the source code only exposes the tbatch_timespan variable. How can I fix this so the model can run on this dataset?
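Not a confirmed fix, but one common mitigation when a single t-batch overflows GPU memory is to cap the number of interactions processed per backward pass, splitting any oversized t-batch into chunks. A minimal sketch under that assumption; `max_batch_size` is a made-up knob, not an existing JODIE option:

```python
def split_tbatch(interaction_ids, max_batch_size=512):
    """Yield successive chunks of at most max_batch_size interaction
    ids from one t-batch, so each chunk fits in GPU memory."""
    for start in range(0, len(interaction_ids), max_batch_size):
        yield interaction_ids[start:start + max_batch_size]
```

Calling loss.backward() and optimizer.step() once per chunk instead of once per t-batch also frees the autograd graph earlier, which reduces peak memory at the cost of slightly different gradient accumulation.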

Thanks again for your attention to this issue.

wenjf commented 5 years ago

Solved it myself.

florianscheidl commented 3 years ago

Hi Wenjf,

I have experienced a similar error. Could you describe how you solved it?

Thank you!