Paper: Cannot fully reproduce results in Table 2

qibinc commented 5 years ago

Hi Raymond, @RaymondLi0

This work is super interesting. Congrads to you that it got into NeurIPS 2018!

My name is Qibin, from Tsinghua University, China. I'm currently trying to follow up this work and have started to playing with your baselines. However, I cannot reproduce some part of the results in Table 2: RMSE for movie recommendations.

Here are the details:

For Experiments on REDIAL (validation RMSE), No pre-training on MovieLens, the best validation loss I got is 0.0756 (RMSE = sqrt(0.0756) = 0.275) while the result in the paper is 0.127. There is a large gap in there and I have no idea what is wrong. In particular, could you check the hyperparameters in the repo?

Here are the changes I made to the code:

In this line, I modified data="db_pretrain" to data="db". To the best of my knowledge, this is sufficient to produce the No pre-training result on REDIAL.

Looking forward to your reply. Thanks!

In the end, I provide my training output.

CUDA_VISIBLE_DEVICES=2 python train_autorec.py
Saving in saved-no-pretrain-standard/autorec with parameters : {'f': 'sigmoid', 'g': 'sigmoid', 'layer_sizes': [1000]}, {'learning_rate': 0.001, 'batch_size': 64, 'patience': 5, 'batch_input': 'full', 'max_num_inputs': 10000000000.0, 'nb_epochs': 50}
loaded 59944 movies from redial/movies_merged.csv
6924 movies
Loading and processing data
('Mean training rating ', 0.9433204182821627)
validation MSE made by mean estimator: 0.0556195151183
Loading vocabulary from redial/vocabulary.p
Vocabulary size : 15005 words.
/home/qibin/anaconda3/envs/alchemy2/lib/python2.7/site-packages/torch/nn/f
unctional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.   warnings.warn(warning.format(ret))
0%|          | 0/122 [00:00<?, ?it/s]
/home/qibin/TalkAndRecommend/models/autorec.py:152: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  losses.append(loss.data[0])
100%|██████████| 122/122 [00:08<00:00, 13.68it/s]
valid loss with input=full : 0.259392401395
--------------------------------------------------------------
  0%|          | 0/490 [00:00<?, ?it/s]train_autorec.py:77: UserWarning: i
nvalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use 
tensor.item() to convert a 0-dim tensor to a Python number
  loss = loss.data[0]
100%|██████████| 490/490 [00:02<00:00, 194.55it/s]
Epoch : 0 Training Loss : 0.0930548577511
100%|██████████| 122/122 [00:08<00:00, 15.82it/s]
valid loss with input=full : 0.0773077955255
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 198.48it/s]
Epoch : 1 Training Loss : 0.0548694368867
100%|██████████| 122/122 [00:09<00:00, 13.90it/s]
valid loss with input=full : 0.0768081429436
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 202.61it/s]
Epoch : 2 Training Loss : 0.0543349974933
100%|██████████| 122/122 [00:08<00:00, 12.92it/s]
valid loss with input=full : 0.0755859776912
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 198.54it/s]
Epoch : 3 Training Loss : 0.0541957180774
100%|██████████| 122/122 [00:08<00:00, 15.43it/s]
valid loss with input=full : 0.0783623081245
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 202.62it/s]
Epoch : 4 Training Loss : 0.0542582329866
100%|██████████| 122/122 [00:08<00:00, 15.37it/s]
valid loss with input=full : 0.0804945125459
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 198.14it/s]
Epoch : 5 Training Loss : 0.0547528188295
100%|██████████| 122/122 [00:08<00:00, 14.99it/s]
valid loss with input=full : 0.0810510425758
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 202.63it/s]
Epoch : 6 Training Loss : 0.0530067476191
100%|██████████| 122/122 [00:08<00:00, 14.58it/s]
valid loss with input=full : 0.0809844291611
--------------------------------------------------------------
100%|██████████| 490/490 [00:02<00:00, 202.63it/s]
Epoch : 7 Training Loss : 0.0506737201622
100%|██████████| 122/122 [00:08<00:00, 15.34it/s]
valid loss with input=full : 0.0851037339878
--------------------------------------------------------------
Early stopping, 5 epochs without best
Training done.

RaymondLi0 commented 5 years ago

Hi Qibin,

Thank you for your interest in this work!

Sorry for the confusion, indeed an error has slipped in this table and we updated our paper on Arxiv a few weeks ago https://arxiv.org/abs/1812.07617
- In this line, I modified data="db_pretrain" to data="db". To the best of my knowledge, this is sufficient to produce the No pre-training result on REDIAL.

That is correct

As for the hyper-parameters, we used the following in the experiments on ReDial without pre-training:
```
autorec_params = {
'layer_sizes': [100],
'f': "sigmoid",
'g': "sigmoid",
}
```
These should produce the results in table 2 of the updated paper on Arxiv. The hyper-parameters in test_params.py correspond to those we used to train the complete conversational recommender system. In the future I'll try to add which hyper-parameters we used to produce the results on the different tables.
Finally, I would just like to mention that this table only aims to justify our approach that consists in pre-training the recommender system on another dataset, and I am sure there are more intelligent ways to evaluate a recommender system

Hope this helped!

qibinc commented 5 years ago

Hi Raymond,

Thanks for your help!

Thanks for sharing the new link. I should have check the latest version first.. BTW, I suggest that you also change the paper link in this repo's README to the arxiv link.

Finally, I would just like to mention that this table only aims to justify our approach that consists in pre-training the recommender system on another dataset, and I am sure there are more intelligent ways to evaluate a recommender system

Thanks for mention it. I understand that you used RMSE to demonstrate straightforwardly the effectiveness of pretraining and also to provide some insights on the data. I'll definitely try to adopt a more common evaluation.

You can close the issue.

Best wishes!

RaymondLi0 commented 5 years ago

Thanks for catching that ;) I updated the link

RaymondLi0 / conversational-recommendations