It is a simple intuition that triples having the same score should have the same rank.
Hi, I believe this is wrong, @daiquocnguyen. Following your current evaluation, a trivial model giving a score of 0 to all triples would have a perfect MRR of 1.
Two simple fixes: you can either be optimistic and use the average rank of all ties (scipy.stats.rankdata with method='average') or pessimistic and use the worst rank of all ties (method='max'), as is done, for example, here: https://github.com/facebookresearch/kbc
Currently your results for ConvKB and CapsE are not comparable to the remainder of the literature and do not make sense (since a trivial model would be optimal). Re-running your evaluation with method='max' gives an MRR of 0.21 on FB15k-237.
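For concreteness, here is a minimal sketch of how the tie-handling strategies differ. The scores are made up, with the test triple at index 0 and two corrupted triples tied with it (lower score = better):

```python
import numpy as np
from scipy.stats import rankdata

# Made-up scores: index 0 is the test triple; indices 1 and 2 are
# corrupted triples that happen to receive exactly the same score.
scores = np.array([0.5, 0.5, 0.5, 0.9, 1.3])

for method in ("ordinal", "average", "max"):
    ranks = rankdata(scores, method=method)
    print(method, "-> rank of test triple:", ranks[0])

# ordinal -> 1  (optimistic: ties broken by position, test triple comes first)
# average -> 2  (mean rank over the tied group)
# max     -> 3  (pessimistic: worst rank over the tied group)
```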
At the beginning, in order to work with a batch size when evaluating ConvKB for knowledge graph completion, I replicated each correct test triple several times and added the copies to its set of corrupted triples to fill the batch. That is the reason why I said that triples having the same score should have the same rank. The triples I mean are the correct test triple and its replicated copies.
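For readers unfamiliar with that older setup, here is a rough sketch of the padding idea; the function name pad_to_batch_size and its arguments are illustrative, not the exact code:

```python
import numpy as np

def pad_to_batch_size(candidates, test_triple, batch_size):
    """Pad the candidate array with copies of the correct test triple so its
    length becomes a multiple of batch_size (illustrative sketch only)."""
    remainder = len(candidates) % batch_size
    if remainder != 0:
        padding = np.tile(test_triple, (batch_size - remainder, 1))
        candidates = np.concatenate([candidates, padding], axis=0)
    return candidates

# The padded copies receive exactly the same score as the original test
# triple, which is why identical triples were expected to share a rank.
```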
Last year, I found that someone mentioned in an OpenReview comment (ICLR 2019) that some different triples also have the same score on FB15k-237.
I don't know how many triples they mentioned, but this probably does not happen in WN18RR, WN11, FB13 and SEARCH17. I believe my model implementation itself and its evaluation do not have any problem. And I think FB15k-237 is quite a special dataset, such that TransE outperforms DistMult, ComplEx and even ConvE (the v1 version from July 2017, which we were working with).
Hi,
"At the begging, in order to work with a batch size when evaluating ConvKB for the knowledge graph completion, I replicated each correct test triple several times to add to its set of corrupted triples to fulfill a batch size. That's reason why I said that triples having a same score should have same rank. Those triples I mean are the correct test triple and its replicated triples."
No. Each triple in the list new_x_batch is unique. There are exactly n_entities elements in new_x_batch, each with a unique head or tail. Then you "filter" all correct triples out in lines 195-202, before re-adding the correct triple (which was removed before) in lines 205-206. Since there are no duplicated triples, there is no reason why two different triples with the same score should have the same rank.
Currently, the model that performs best under your evaluation is one that gives a score of 0 to all triples.
Your numbers for ComplEx are outdated, please see https://github.com/facebookresearch/kbc for up-to-date numbers.
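For reference, the filtered ranking procedure described above looks roughly like the sketch below; all_true_triples and score_fn are assumed helpers (the set of known true triples and the model's scoring function), not names from eval.py:

```python
import numpy as np
from scipy.stats import rankdata

def filtered_tail_rank(test_triple, n_entities, all_true_triples, score_fn,
                       method="average"):
    """Rank a test triple against all tail corruptions in the filtered setting
    (sketch only; lower score = more plausible)."""
    h, r, t = test_triple
    # One candidate per entity, obtained by replacing the tail.
    candidates = np.array([(h, r, e) for e in range(n_entities)])
    # Filter out every candidate that is a known true triple,
    # except the test triple itself.
    keep = [i for i, c in enumerate(candidates)
            if c[2] == t or tuple(c) not in all_true_triples]
    candidates = candidates[keep]
    scores = score_fn(candidates)
    ranks = rankdata(scores, method=method)
    test_idx = int(np.where(candidates[:, 2] == t)[0][0])
    return ranks[test_idx]
```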
By "at the beginning" I mean a year and a half ago: when using a batch size, for each correct test triple, I replicated it several times and added the copies to its set of corrupted triples to fill the batch. That is the reason I said what I did in my previous comment above. After the discussion on OpenReview last year, I made a new implementation that does not use a batch size for the evaluation, as you now see here.
So is the problem already solved? Could we use eval.py to evaluate methods modified from ConvKB?
I would like to clarify that there is no problem in eval.py itself w.r.t. the implementation of the test_prediction function. You can integrate the test_prediction function in eval.py into your code.
As the issue doesn't appear on other datasets, I still don't know what the exact answer is.
@chenwang1701, if you use eval.py from this repo, you'll suffer from the same problems mentioned in this issue. However, the fix to obtain results that are comparable to the state of the art is simple: on line 222 of eval.py just use method='average', as recommended here: https://arxiv.org/pdf/1911.03903.pdf
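Concretely, assuming the rankdata call quoted later in this thread, the change is just the tie-handling argument (a sketch with a placeholder results array; the exact line number may differ across versions of eval.py):

```python
import numpy as np
from scipy.stats import rankdata

results = np.random.rand(100)  # placeholder for the scores of all candidates

# Current behaviour in eval.py (ties broken by order of appearance):
# results_with_id = rankdata(results, method='ordinal')

# Suggested replacement: tied scores share the average of their ranks.
results_with_id = rankdata(results, method='average')
```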
@timlacroix @chenwang1701 @zhangzy6 @AndRossi The paper above was accepted to ACL 2020 and advertised via Twitter, which is how I learned of it. I see that this paper contributes some findings to the evaluation protocol. But it is not fair that the paper does not mention that the issue appears "only" on FB15k237, for which I still don't know the exact answer. And it does not mean that ConvKB cannot serve as a good baseline at this time.
I have found the ACL 2019 paper (and its Pytorch code), where the authors used ConvKB as the decoder. The results are still good for our ConvKB, and I can get better results with suitable initialization.
So, technically speaking, a reasonable answer comes from Pytorch (instead of Tensorflow), where ConvKB does not exhibit the issue on FB15k237. I plan to reimplement our ConvKB in Pytorch to answer you further.
@chenwang1701 I believe that the issue does not come from the implementation of eval.py itself. Theoretically speaking, using results_with_id = rankdata(results, method='ordinal') is similar to using results_with_id = rankdata(results, method='average'), because different triples will have different scores under an existing KG embedding model.
In the special case where you have to use a fixed batch size, and you replicate each test triple several times (adding the copies to its set of corrupted triples) to fill the batch, using method='ordinal' is the right choice because each test triple and its replicated identical copies must have the same rank. That is what I did three years ago.
The issue mentioned here is due to the finding that ConvKB returns the same score for different triples, but only on FB15K237. Theoretically speaking, this should not happen. I gave an answer in the comment above.
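A quick illustration of the equivalence claimed above, with made-up scores: when all scores are distinct the two methods agree, and they only diverge once ties appear:

```python
import numpy as np
from scipy.stats import rankdata

distinct = np.array([0.31, 0.87, 0.12, 0.55])   # all scores distinct
print(rankdata(distinct, method='ordinal'))      # ranks 2, 4, 1, 3
print(rankdata(distinct, method='average'))      # ranks 2, 4, 1, 3 (same)

tied = np.array([0.31, 0.31, 0.12, 0.55])        # two tied scores
print(rankdata(tied, method='ordinal'))           # ranks 2, 3, 1, 4
print(rankdata(tied, method='average'))           # ranks 2.5, 2.5, 1, 4
```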
@timlacroix @zhangzy6 @AndRossi I have just released the Pytorch implementation of our ConvKB, based on the OpenKE framework, to deal with this issue and to show you that the ACL 2020 paper is not fair in re-evaluating our model. The authors just changed a single line of code in eval.py to make their claims and did not give a reasonable solution for our ConvKB model on FB15K237.
The results obtained by the Pytorch implementation using the existing initialization are as follows:
| Dataset  | MR   | MRR   | Hits@10 |
|----------|------|-------|---------|
| WN18RR   | 2741 | 0.220 | 50.83   |
| FB15K237 | 196  | 0.302 | 48.31   |
These results were state-of-the-art as of 2017.
Note that ConvKB is used as a decoder to improve the task performance. You can get better results if you use the initialization produced by TransE from the OpenKE framework.
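For completeness, the reported metrics can be recomputed from the per-triple filtered ranks as follows (a generic sketch, not code from the OpenKE-based implementation):

```python
import numpy as np

def summarize(ranks):
    """Compute MR, MRR and Hits@10 from an array of 1-based filtered ranks."""
    ranks = np.asarray(ranks, dtype=float)
    mr = ranks.mean()                       # Mean Rank
    mrr = (1.0 / ranks).mean()              # Mean Reciprocal Rank
    hits10 = (ranks <= 10).mean() * 100.0   # Hits@10, as a percentage
    return mr, mrr, hits10
```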
I found a problem in eval.py: when the scores of these entities are the same, the rank of all the entities is 1. Can you explain this?