Closed: GittiHab closed this issue 5 years ago
Hi @TimDettmers, I have the same issue with the ConvE and ComplEx models in combination with the nations dataset. When I run
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE input_drop 0.2 hidden_drop 0.3 feat_drop 0.2 lr 0.003 lr_decay 0.995 dataset nations process True
it gets stuck with the following output:
saving to saved_models/nations_ConvE_0.2_0.3.model
2019-06-05 21:04:02.992287 (INFO):
2019-06-05 21:04:02.992432 (INFO): --------------------------------------------------
2019-06-05 21:04:02.992562 (INFO): dev_evaluation
2019-06-05 21:04:02.992637 (INFO): --------------------------------------------------
2019-06-05 21:04:02.992738 (INFO):
Could you provide any help with that? I am trying to reproduce the results from your paper. Thank you.
I think these are two separate problems. @BelindaBecker In your case, the problem is the batch size combined with the number of background loader threads. The nations dataset is too small to provide all the background data loaders with data at the given batch size. This command with a batch size of 32 should work fine:
CUDA_VISIBLE_DEVICES=0 python main.py model ConvE input_drop 0.2 hidden_drop 0.3 feat_drop 0.2 lr 0.003 lr_decay 0.995 dataset nations process True batch_size 32
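The constraint behind this fix can be sketched roughly as follows (a minimal illustration, not code from the repo; the loader count and the exact dataset size here are assumptions — nations has on the order of 1,600 training triples):

```python
def loaders_can_be_fed(num_triples, batch_size, num_loaders):
    """Return True if every background loader thread can be
    handed at least one full batch of data (an assumed model
    of the hang: starved loaders never produce a batch)."""
    return num_triples >= batch_size * num_loaders

# Hypothetical numbers: ~1,600 training triples, 16 loader threads.
print(loaders_can_be_fed(1600, 128, 16))  # large batch: loaders starve
print(loaders_can_be_fed(1600, 32, 16))   # batch_size 32 fits
```

With a small dataset, shrinking `batch_size` (or the number of loader threads) is what brings the product back under the dataset size.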
@GittiHab I think I encountered your bug once before, but I do not remember what the exact problem was. It might be related to the samples_per_file
variable in main.py
(increasing the samples per file should help); if so, it is probably a RAM issue. Otherwise, a batch size problem like @BelindaBecker's might cause the same behavior. You might want to decrease the batch_size of the validation StreamBatcher.
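The role of samples_per_file can be pictured like this (a minimal sketch of chunked preprocessing; the variable name comes from main.py, but the chunking logic shown here is an assumption, not the repo's actual implementation):

```python
def chunk_into_files(samples, samples_per_file):
    """Split preprocessed samples into file-sized chunks;
    larger samples_per_file means fewer, bigger files."""
    return [samples[i:i + samples_per_file]
            for i in range(0, len(samples), samples_per_file)]

chunks = chunk_into_files(list(range(10)), 4)
print([len(c) for c in chunks])  # [4, 4, 2]
```

Increasing samples_per_file reduces the number of file boundaries the batchers have to cross, at the cost of holding more samples in memory at once, which is why it can interact with RAM limits.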
Another workaround is to load the saved model. You do this by calculating the learning rate at that epoch, given the learning rate decay you used, and then resuming from that model with this learning rate.
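The learning rate calculation for the resume workaround can be sketched like this (assuming simple exponential decay, i.e. the rate is multiplied by lr_decay once per epoch; epoch 100 below is just an illustrative value):

```python
def lr_at_epoch(base_lr, lr_decay, epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return base_lr * lr_decay ** epoch

# With the lr 0.003 and lr_decay 0.995 from the command above,
# resuming at (hypothetical) epoch 100 would use:
print(lr_at_epoch(0.003, 0.995, 100))
```

You would then pass the resulting value as the lr argument when resuming from the saved model.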
Please let me know if this helps. Otherwise, I need a bit more data to debug this issue.
Thank you very much for the fast reply, it solved my problem!
I ran it in an nvidia-docker container with Anaconda (Python 3.6.4) and all the dependencies listed in the readme installed. The script does the preprocessing and the training of the first few epochs correctly, then the output is
and it won't continue. I ran the following command: