DeepGraphLearning / graphvite

GraphVite: A General and High-performance Graph Embedding System
https://graphvite.io
Apache License 2.0

Is Wikidata5m complete/correct? #40

Closed by dfdazac 4 years ago

dfdazac commented 4 years ago

I recently read "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation", which proposes the Wikidata5m dataset. I found the dataset in this repository, but on inspection only the knowledge graph triples are provided. Do you plan to release the text that you obtained from Wikipedia for this project?

Second, I was checking the data and I found this:

$ grep Q1000000 wikidata5m.txt
Q1000000        P17     Q794

I checked, and Q1000000 corresponds to the Finnish TV series "Matkaoppaat", but this entity does not have an English Wikipedia page, contrary to how the paper describes the dataset. Furthermore, P17 corresponds to "country" and Q794 to "Iran", so if I interpret it correctly, the triple says that a Finnish TV series has Iran as its country, which is wrong. If so, could there be other instances of this issue in wikidata5m.txt?
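Note that `grep Q1000000` matches substrings, so it would also return lines for longer IDs such as Q10000001. A minimal Python sketch of an exact-ID lookup follows, assuming each line of wikidata5m.txt holds three whitespace-separated columns (head, relation, tail), as the grep output above suggests. The helper names `parse_triples` and `triples_mentioning` are hypothetical, and the second sample line is made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def parse_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield (head, relation, tail) for every line with exactly three columns."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            yield parts[0], parts[1], parts[2]


def triples_mentioning(lines: Iterable[str], entity_id: str) -> List[Triple]:
    """Return triples whose head or tail is exactly entity_id (no substring matches)."""
    return [t for t in parse_triples(lines) if entity_id in (t[0], t[2])]


# First line is the triple quoted above; the second is a made-up line
# that a plain substring grep for "Q1000000" would also match.
sample = [
    "Q1000000\tP17\tQ794\n",
    "Q10000001\tP31\tQ5\n",
]
matches = triples_mentioning(sample, "Q1000000")
print(matches)
```

To run this against the real file, pass an open file handle instead of `sample`, for example `with open("wikidata5m.txt") as f: triples_mentioning(f, "Q1000000")`.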

KiddoZhu commented 4 years ago

As KEPLER hasn't been officially released yet, the Wikidata5m here is a temporary version for benchmarking GraphVite. As for the text data, I guess the authors will release it soon.

Your finding is interesting. I am aware that this dataset is a little noisy, especially for entities with large IDs. In my experience, it is still statistically good enough for machine learning, so don't worry.

The knowledge graph here has a few more entities and relations than the one in the paper. It's possible that some entities don't align with Wikipedia.
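One way to see how this version differs in size from the one in the paper is to count its distinct entities and relations. The following is only a rough sketch, assuming the same three-column, whitespace-separated layout as the line quoted earlier in this thread. `graph_stats` is a hypothetical helper, and every sample line except the first is made up for illustration.

```python
from typing import Dict, Iterable


def graph_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count triples, distinct entities (heads and tails), and distinct relations."""
    entities = set()
    relations = set()
    n_triples = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:  # ignore blank or malformed lines
            continue
        head, relation, tail = parts
        entities.update((head, tail))
        relations.add(relation)
        n_triples += 1
    return {
        "triples": n_triples,
        "entities": len(entities),
        "relations": len(relations),
    }


# Only the first line comes from this thread; the rest are made up.
sample = [
    "Q1000000\tP17\tQ794\n",
    "Q42\tP31\tQ5\n",
    "Q42\tP17\tQ794\n",
]
stats = graph_stats(sample)
print(stats)
```

Passing an open handle to wikidata5m.txt instead of `sample` gives counts that can be compared with the statistics reported in the paper.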

dfdazac commented 4 years ago

I understand, thanks for your reply!