Good catch. I just checked this and it is true. 212 entities in the test set do not occur in the training set. Since the dataset has already been used in some other papers, I would not want to adjust it now. If everybody works with these unpredictable test cases it should even out and scores will be comparable (albeit low). I will add a comment about this in the README. Thank you.
How do you deal with triples whose entities do not appear in the training set during evaluation? Do you simply ignore them, or assign them a specific score, say 0? Thank you.
Just treat them like any other triple. The model will probably not be able to rank them correctly (I would expect a random rank), but that is no issue as long as everybody evaluates those triples in that way. Note that, although not much better, random ranks are better than assigning zero scores. If you are having problems with the triples not being in the vocabulary (embedding matrix), then include the test set triples in the vocabulary — this is how I deal with the issue in this repo.
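In case it helps, here is a minimal sketch of what "include the test set triples in the vocabulary" could look like. The file names, the tab-separated triple format, and the Gaussian embedding initialization are assumptions for illustration, not necessarily how this repo implements it:

```python
# Sketch: build the entity/relation vocabulary from train AND test files,
# so entities that only occur in the test set still get an (untrained,
# randomly initialized) embedding and can be scored like any other entity.
import numpy as np

def build_vocab(triple_files):
    """Collect all entity and relation names from the given triple files."""
    entities, relations = set(), set()
    for path in triple_files:
        with open(path) as f:
            for line in f:
                head, rel, tail = line.strip().split("\t")  # assumed format
                entities.update((head, tail))
                relations.add(rel)
    ent2id = {e: i for i, e in enumerate(sorted(entities))}
    rel2id = {r: i for i, r in enumerate(sorted(relations))}
    return ent2id, rel2id

# Including valid/test files means unseen test entities are in the embedding
# matrix; their rows simply stay at their random initialization.
ent2id, rel2id = build_vocab(["train.txt", "valid.txt", "test.txt"])
embedding_dim = 100
entity_embeddings = np.random.normal(0, 0.1, size=(len(ent2id), embedding_dim))
```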
Thanks!
It seems that the results still depend on how you assign random ranks (i.e. how you choose the random seeds), although the dependence may be insignificant.
Yes, I agree. It could induce bias, but I think it is unlikely. Thank you for this question; I think it will be helpful for others in the future.
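To make the seed dependence concrete, here is a toy illustration (an assumption-laden sketch with TransE-style distance scoring, not this repository's evaluation code): the rank of a triple whose entities only occur in the test set comes from their randomly initialized embeddings, so it varies with the seed unless the seed is fixed.

```python
# Toy example: the rank for an unseen-entity triple depends on the random
# initialization, hence on the seed used for it.
import numpy as np

def rank_for_seed(seed, num_entities=1000, dim=50):
    rng = np.random.default_rng(seed)
    entity_emb = rng.normal(0, 0.1, size=(num_entities, dim))
    relation = rng.normal(0, 0.1, size=dim)
    head, true_tail = 0, 1  # pretend these entities only occur in the test set
    # TransE-style score: smaller distance is better
    scores = np.linalg.norm(entity_emb[head] + relation - entity_emb, axis=1)
    return 1 + int((scores < scores[true_tail]).sum())

print([rank_for_seed(s) for s in (0, 1, 2)])  # different seeds give different ranks
```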
It seems that some entities in the test set do not appear in the training set. So, are the roughly 210 triples in the test set that contain them meaningless?