Closed: GittiHab closed this issue 5 years ago
Hi @TimDettmers, I have the same issue with the ConvE and ComplEx models in combination with the nations dataset. When I run
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE input_drop 0.2 hidden_drop 0.3 feat_drop 0.2 lr 0.003 lr_decay 0.995 dataset nations process True
it gets stuck with the following output:
saving to saved_models/nations_ConvE_0.2_0.3.model
2019-06-05 21:04:02.992287 (INFO):
2019-06-05 21:04:02.992432 (INFO): --------------------------------------------------
2019-06-05 21:04:02.992562 (INFO): dev_evaluation
2019-06-05 21:04:02.992637 (INFO): --------------------------------------------------
2019-06-05 21:04:02.992738 (INFO):
Could you provide any help with that? I am trying to reproduce the results from your paper. Thank you.
I think these are two separate problems. @BelindaBecker In your case, the problem is the batch size combined with the number of background loader threads. The nations dataset is too small to provide all the background data loaders with data at the given batch size. This command with a batch size of 32 should work fine:
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE input_drop 0.2 hidden_drop 0.3 feat_drop 0.2 lr 0.003 lr_decay 0.995 dataset nations process True batch_size 32
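The constraint behind this fix can be sketched roughly as follows (a minimal illustration, not code from the repo; the loader count and the exact dataset size here are assumptions — nations has on the order of 1,600 training triples):

```python
def loaders_can_be_fed(num_triples, batch_size, num_loaders):
    """Return True if every background loader thread can be
    handed at least one full batch of data (an assumed model
    of the hang: starved loaders never produce a batch)."""
    return num_triples >= batch_size * num_loaders

# Hypothetical numbers: ~1,600 training triples, 16 loader threads.
print(loaders_can_be_fed(1600, 128, 16))  # large batch: loaders starve
print(loaders_can_be_fed(1600, 32, 16))   # batch_size 32 fits
```

With a small dataset, shrinking `batch_size` (or the number of loader threads) is what brings the product back under the dataset size.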
@GittiHab I think I encountered your bug once before, but I do not remember what the exact problem was. It might be related to the samples_per_file
variable in main.py
(increasing the samples per file should help); if so, it is probably a RAM issue. Otherwise, a batch size problem like @BelindaBecker's might cause the same behavior. You might want to decrease the batch_size of the validation StreamBatcher.
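The role of samples_per_file can be pictured like this (a minimal sketch of chunked preprocessing; the variable name comes from main.py, but the chunking logic shown here is an assumption, not the repo's actual implementation):

```python
def chunk_into_files(samples, samples_per_file):
    """Split preprocessed samples into file-sized chunks;
    larger samples_per_file means fewer, bigger files."""
    return [samples[i:i + samples_per_file]
            for i in range(0, len(samples), samples_per_file)]

chunks = chunk_into_files(list(range(10)), 4)
print([len(c) for c in chunks])  # [4, 4, 2]
```

Increasing samples_per_file reduces the number of file boundaries the batchers have to cross, at the cost of holding more samples in memory at once, which is why it can interact with RAM limits.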
Another workaround is to load the saved model. You do this by calculating the learning rate at that epoch, given the learning rate decay you used, and then resuming from that model with this learning rate.
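The learning rate calculation for the resume workaround can be sketched like this (assuming simple exponential decay, i.e. the rate is multiplied by lr_decay once per epoch; epoch 100 below is just an illustrative value):

```python
def lr_at_epoch(base_lr, lr_decay, epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return base_lr * lr_decay ** epoch

# With the lr 0.003 and lr_decay 0.995 from the command above,
# resuming at (hypothetical) epoch 100 would use:
print(lr_at_epoch(0.003, 0.995, 100))
```

You would then pass the resulting value as the lr argument when resuming from the saved model.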
Please let me know if this helps. Otherwise, I need a bit more data to debug this issue.
Thank you very much for the fast reply, it solved my problem!
I ran it in an nvidia-docker container with Anaconda (Python 3.6.4) and all the dependencies listed in the readme installed. The script does the preprocessing and the training of the first few epochs correctly, then the output is
and it won't continue. I ran the following command: