TimDettmers / ConvE

Convolutional 2D Knowledge Graph Embeddings resources
MIT License

Some differences between paper and code #25

Closed shanry closed 6 years ago

shanry commented 6 years ago
  1. The paper says early stopping is used, but it seems the code just trains for 1000 epochs?
  2. The paper says an L2 norm is enforced for DistMult and ComplEx, but I didn't see where that happens.
  3. I don't think we should see the test results during training, which would be cheating. So why does main.py run a test evaluation every 3 epochs?
  4. The paper says DistMult and ComplEx use a margin-based loss and ConvE uses a cross-entropy loss, but what I see is that all models use torch.nn.BCELoss for model.loss?

Maybe these are a lot of questions, but I am working on a new model for this task, which is hard, so I hope you can give some answers. That would be a great help and I would appreciate it a lot. Thank you! (Please forgive my poor English.)

TimDettmers commented 6 years ago

Thank you for your interest in our work. Here are the answers to your questions.

  1. The early stopping procedure that I used is not hard-coded: I look in the log files for the highest mean reciprocal rank (MRR) on the dev set and report the test error from that same point (see the sketch after this list).
  2. We used the inferbeddings framework to produce standard 1-1 scoring results for DistMult and ComplEx. You can find the Python script parameters used for WN18, WN18RR and FB15k-237 in issue #22.
  3. I would agree that this can be seen as cheating. However, I follow the procedure described in (1), so what I report is effectively an early-stopped result. Of course, the additional information I get from the test set error induces some bias, but during development of ConvE I worked only with the validation set (I included the test error only in the later stages). Also, note that the correlation between validation error and test error is almost 1.0, so seeing the test scores provides almost no advantage over seeing the validation error alone.
  4. Please see (2)
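
To make (1) concrete, here is a minimal sketch of the log scan. The log line format in the regex is only an illustration, not the actual output of main.py, so adapt the parsing to the real log lines.

```python
import re

# Hypothetical log format, one line per evaluation, e.g.:
#   epoch 120 dev_mrr 0.4301 test_mrr 0.3250
LINE_RE = re.compile(r"epoch (\d+) dev_mrr ([\d.]+) test_mrr ([\d.]+)")

def best_by_dev_mrr(log_path):
    """Return (epoch, dev_mrr, test_mrr) for the epoch with the highest dev MRR."""
    best = None
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if m is None:
                continue
            epoch, dev_mrr, test_mrr = int(m.group(1)), float(m.group(2)), float(m.group(3))
            if best is None or dev_mrr > best[1]:
                best = (epoch, dev_mrr, test_mrr)
    return best

# Example (hypothetical path): best_by_dev_mrr("logs/conve_fb15k-237.log")
```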

I hope this helps. Please let me know if you have more questions, or if I can help you further.

shanry commented 6 years ago

First, thank you very much for your patient and detailed reply. (Usually I get an email when my questions are answered, but for some reason I didn't get one this time, so it was a surprise to see your comment today.)

Second, I have recently been using your code to try out my new model for link prediction, and now that I have some decent results, here are my new questions:

  1. The Quirks section of the README says: "The model currently ignores data that does not fit into the specified batch size, for example if your batch size is 100 and your test data is 220, then 20 samples will be ignored. This is designed in that way to improve performance on small datasets". Is this also true for the training set? That would mean the model does not use all of the training data unless the training set is shuffled after every epoch, but I did not see a shuffle operation. It seems more reasonable to make full use of all the triples in the training set, and I also could not understand the sentence "This is designed in that way to improve performance on small datasets".
  2. I wonder if I could use some of your code as the experiment code for a paper?

Again, thanks a lot, and I'm looking forward to your answer.

TimDettmers commented 6 years ago
  1. This is true: some training data will be wasted. Usually, the knowledge graphs are large enough that the difference does not matter much. FB15k-237 has 149689 1-N samples in the training set, so for batch sizes around 128 you will waste on average about 64 samples, or 0.00043 of the dataset; for a batch size of exactly 128, 85 samples are wasted, which is 0.00057 of the train set. On smaller datasets like UMLS, which has 1560 1-N samples, the average wasted fraction is around 0.041, which is high. You can, for example, use a batch size of 156 to reduce the waste to 0. I do not think it matters much for large datasets because the fraction is so small, but for small datasets you definitely want to optimize around this quirk (see the sketch after this list).

  2. Of course you can use part of the code; I would be happy if you did. I would even encourage it for the ranking code, because the ranking procedure is not straightforward and is very error-prone. Many publications on link prediction in knowledge graphs cannot be replicated, and the crux of the issue may be that the authors wrote their own code and got the evaluation function wrong. The ranking implementation here has been tested not only by my co-authors but also externally by other researchers, and I believe it is correct (a rough sketch of the usual filtered ranking protocol follows below).
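
To make the arithmetic in (1) easy to reproduce, here is a small sketch for computing the wasted fraction for a given batch size and finding batch sizes that divide the training set exactly, assuming (as the README quirk describes) that the data not fitting into the last batch is simply dropped. The sample counts are just the illustrative numbers from the discussion above.

```python
def wasted_fraction(num_samples, batch_size):
    """Fraction of 1-N training samples dropped when the last partial batch is ignored."""
    return (num_samples % batch_size) / num_samples

def zero_waste_batch_sizes(num_samples, low=64, high=256):
    """Batch sizes in [low, high] that divide the training set exactly (no waste)."""
    return [b for b in range(low, high + 1) if num_samples % b == 0]

# Illustrative sizes from the discussion above:
print(wasted_fraction(149689, 128))   # FB15k-237-sized training set: a tiny fraction
print(wasted_fraction(1560, 128))     # UMLS-sized training set: a much larger fraction
print(zero_waste_batch_sizes(1560))   # includes 156, which wastes nothing
```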
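
Because the evaluation is the error-prone part, here is a rough, self-contained NumPy illustration of the filtered ranking protocol that link-prediction papers typically follow: for each test query, mask all other known true entities, take the rank of the correct entity, and average reciprocal ranks and Hits@k. This is only a sketch of the idea, not the implementation in this repository.

```python
import numpy as np

def filtered_rank(scores, target, known_true):
    """Rank of the target entity after filtering other known true entities.

    scores:      1-D array of scores over all candidate entities for one (h, r, ?) query
    target:      index of the correct entity
    known_true:  indices of all entities forming true triples with (h, r), incl. target
    """
    scores = scores.copy()
    target_score = scores[target]
    # Mask every other known true candidate so it cannot outrank the target.
    scores[known_true] = -np.inf
    scores[target] = target_score
    # Rank = 1 + number of candidates scoring strictly higher than the target.
    return int(np.sum(scores > target_score)) + 1

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean reciprocal rank and Hits@k over a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"mrr": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"hits@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Toy usage: two queries over 5 candidate entities.
ranks = [
    filtered_rank(np.array([0.1, 0.9, 0.3, 0.8, 0.2]), target=3, known_true=[1, 3]),
    filtered_rank(np.array([0.5, 0.4, 0.7, 0.1, 0.0]), target=0, known_true=[0]),
]
print(mrr_and_hits(ranks))
```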

Let me know if you have more questions.

shanry commented 6 years ago

OK, now some of my doubts are cleared up. I am still working on my model, but it seems hard to tune it to surpass your ConvE by much with respect to MRR and Hits@1 (although I can always get a much better MR). I plan to focus on it after August, and I hope we can have some more discussions then. Thank you.

TimDettmers commented 6 years ago

Sure, you can always send me an email. If you think it could also be beneficial for others, you can always create a new issue here. Thank you.