We follow the "Filter" setting proposed by Bordes et al. [1] during evaluation. For each test sample (e, t), we first compute the relevance score between e and every type, then rank all types in descending order of score. All known types of e from the training, validation, and test sets, except the target type t itself, are removed from the ranking.
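For concreteness, here is a minimal sketch of this filtered ranking; the function and variable names are ours for illustration, not the repository's API:

```python
import torch

def filtered_rank(scores: torch.Tensor, target_type: int, known_types: set) -> int:
    """Return the filtered rank of `target_type` among all candidate types.

    scores:      1-D tensor with one relevance score per type.
    target_type: index of the gold type being evaluated.
    known_types: indices of every true type of the entity (train/valid/test).
    """
    scores = scores.clone()
    # Mask all other true types so they cannot outrank the target.
    other_true = [t for t in known_types if t != target_type]
    if other_true:
        scores[other_true] = float("-inf")
    # Rank = 1 + number of types scored strictly higher than the target.
    return int((scores > scores[target_type]).sum().item()) + 1
```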
A more rigorous setting would be to filter, during validation, only the (entity, type) tuples that appear in the training and validation sets (while keeping the original "Filter" setting at test time); see the sketch after the table below. In our experiments, this setting does not change the results significantly:
| Metric | FB15kET | YAGO43kET |
|---|---|---|
| MRR | 0.702 | 0.502 |
| MR | 18 | 242 |
| Hit@1 | 0.621 | 0.400 |
| Hit@3 | 0.746 | 0.562 |
| Hit@10 | 0.859 | 0.689 |
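As a sketch of how the two settings differ (again with hypothetical names, under the assumption that tuples are stored as (entity, type) pairs), the stricter variant simply builds the validation-time mask without the test tuples:

```python
def build_filter_sets(train_tuples, valid_tuples, test_tuples):
    """Build per-entity sets of known types for the two filter variants."""
    def collect(*tuple_sets):
        known = {}
        for tuples in tuple_sets:
            for entity, etype in tuples:
                known.setdefault(entity, set()).add(etype)
        return known

    # Original "Filter" setting: mask types from train/valid/test.
    filter_all = collect(train_tuples, valid_tuples, test_tuples)
    # Stricter validation-time setting: mask only train/valid types,
    # so no test label ever influences validation ranking.
    filter_train_valid = collect(train_tuples, valid_tuples)
    return filter_all, filter_train_valid
```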
Please feel free to contact us if you have any other questions.
[1] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems, pp. 2787–2795, 2013.
Why do you use all_true as the labels for validation? all_true contains the test labels, which may cause data leakage.