Closed mcollardanuy closed 2 years ago
From: https://github.com/Living-with-machines/toponym-resolution/pull/93#discussion_r834508049
I have had a look at this and it seems to be an error coming from the first stages of processing Wikipedia, because the missing entities (see below) are not in the `overall_mentions_freq` dictionary. Should we create an issue to address this at some point?

Missing entities in the `wikipedia2wikidata` dictionary:

- `Zante%2C%20California`
- `Borne%2C%20Haute-Loire`
- `Thos%20Hunt`
- `Tower%20Building%2C%20Liverpool`
- `Forenoon`

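A minimal sketch of how this kind of membership check could be reproduced, assuming the dictionary keys are percent-encoded Wikipedia titles as in the list of missing entities. The helper `find_missing` and the toy dictionary `toy_dict` are hypothetical names used only for illustration; in practice the real `wikipedia2wikidata` or `overall_mentions_freq` dictionary would be passed in instead.

```python
from urllib.parse import quote, unquote


def find_missing(titles, mapping):
    """Return the titles that are not keys of `mapping`, preserving input order."""
    return [t for t in titles if t not in mapping]


# Percent-encoded titles reported as missing in this issue.
titles = [
    "Zante%2C%20California",
    "Borne%2C%20Haute-Loire",
    "Thos%20Hunt",
    "Tower%20Building%2C%20Liverpool",
    "Forenoon",
]

# Hypothetical stand-in for one of the real dictionaries (assumption:
# keys use the same percent-encoding as the titles above).
toy_dict = {"Forenoon": 3, "London": 10}

missing = find_missing(titles, toy_dict)
for t in missing:
    # Show the encoded key next to its human-readable form.
    print(f"{t}  ({unquote(t)})")
```

`quote` and `unquote` from `urllib.parse` convert between the readable title and the encoded key, so a title such as "Zante, California" can be looked up in its encoded form.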
By the way, looking at the entities in `overall_mentions_freq`, I have found that some keys are not the title of a Wikipedia page but its content. Have you ever noticed this? You can see some examples by doing this:

    import json

    with open(path + 'overall_mentions_freq.json', 'r') as f:
        overall_mentions_freq = json.load(f)

    # Keys this long are far too long to be page titles.
    for k in overall_mentions_freq:
        if len(k) > 1000:
            print(k)

E.g., what's going on with: Zante, California