Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers
https://living-with-machines.github.io/T-Res/

Check wiki2wiki conversion exceptions #92

Closed mcollardanuy closed 2 years ago

mcollardanuy commented 2 years ago

E.g., what's going on with Zante, California?

mcollardanuy commented 2 years ago

From: https://github.com/Living-with-machines/toponym-resolution/pull/93#discussion_r834508049

I have had a look at this, and it seems to be an error coming from the first stages of processing Wikipedia, because the missing entities (see below) are not in the overall_mentions_freq dictionary. Should we create an issue to address this at some point?

Missing entities in the wikipedia2wikidata dictionary:

Zante%2C%20California
Borne%2C%20Haute-Loire
Thos%20Hunt
Tower%20Building%2C%20Liverpool
Forenoon
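
A minimal sketch of how a check like this could be reproduced, assuming both dictionaries are keyed by percent-encoded Wikipedia titles (an assumption based on the entries listed above). The helper name `find_missing_titles` and the toy dictionaries are hypothetical stand-ins, not the project's actual data or API:

```python
from urllib.parse import unquote


def find_missing_titles(titles, lookup):
    """Return the titles (as given, percent-encoded) that are not keys of `lookup`."""
    return [t for t in titles if t not in lookup]


# Toy stand-ins for the real dictionaries (hypothetical contents).
wikipedia2wikidata = {"London": "Q84"}
overall_mentions_freq = {"London": {"London": 10}}

titles = ["London", "Zante%2C%20California", "Forenoon"]

missing_in_w2w = find_missing_titles(titles, wikipedia2wikidata)
missing_in_freq = find_missing_titles(titles, overall_mentions_freq)

# Show each missing title in readable form and whether it is also
# absent from the mentions dictionary.
for t in missing_in_w2w:
    print(unquote(t), "| also missing from mentions:", t in missing_in_freq)
```

Decoding with `unquote` is only for display here; the lookup itself uses the encoded form, since that is how the titles appear in the list above.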

By the way, looking at the entities in overall_mentions_freq, I have found that some keys are not the title of a Wikipedia page but its content. Have you ever noticed this? You can see some examples by running this:

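import json

# `path` must already be defined (a directory prefix ending in a separator).
with open(path + 'overall_mentions_freq.json', 'r') as f:
    overall_mentions_freq = json.load(f)

# Print keys that are far too long to be a Wikipedia page title.
for k in overall_mentions_freq:
    if len(k) > 1000:
        print(k)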
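
Printing whole keys of that size can flood the terminal. A hedged variant that reports only the length and a short preview of each suspicious key might look like the sketch below. The helper name `suspicious_keys`, the newline heuristic, and the toy dictionary are assumptions for illustration, not part of the pipeline:

```python
def suspicious_keys(mentions, max_len=1000, preview=80):
    """Return (length, preview) pairs for keys that look like page content.

    A key is flagged if it is longer than `max_len` characters or contains
    a newline (the newline test is an assumed heuristic).
    """
    flagged = [k for k in mentions if len(k) > max_len or "\n" in k]
    flagged.sort(key=len, reverse=True)  # longest first
    return [(len(k), k[:preview]) for k in flagged]


# Toy stand-in for overall_mentions_freq (hypothetical contents).
toy = {"London": 1, "x" * 1500: 1, "Title\nbody text": 1}

for length, head in suspicious_keys(toy):
    print(length, repr(head))
```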