Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers

Investigate Wikidata pre-trained node embeddings #69

Closed: mcollardanuy closed this issue 2 years ago

mcollardanuy commented 2 years ago

https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html

mcollardanuy commented 2 years ago

Embeddings and names downloaded in toponymVM at:

/resources/wikidata

From the documentation linked above:

We used as entities all the distinct strings that appeared as either source or target nodes in this dump: this means that entities include URLs of Wikidata entities (in the form http://www.wikidata.org/entity/Q123), plain quoted strings (e.g., "Foo"), strings with language annotation (e.g., "Bar"@fr), dates and times, and possibly more. Similarly, we used as relation types all the distinct strings that appeared as properties. We then filtered out entities and relation types that appeared less than 5 times in the data dump.

You can load the JSON like this:

import json
with open("wikidata_translation_v1_names.json", "rt") as tf:
    names = json.load(tf)

names is a list of 78413185 items; these are the first few. (I think we can filter most of them out to get a more manageable embeddings file; a sketch follows the sample output. At the moment, loading both the JSON and the embeddings together is not possible because of a memory error!)

>>> names[0]
'<http://schema.org/Dataset>'
>>> names[1]
'<http://wikiba.se/ontology#Item>'
>>> names[2]
'<http://www.wikidata.org/entity/Q13442814>'
>>> names[3]
'"wetenschappelijk artikel"@nl'
>>> names[4]
'"article cient\\u00EDfic"@ca'
>>> names[5]
'"bilimsel makale"@tr'

You can load the embeddings like this:

import numpy as np
embeddings = np.load("wikidata_translation_v1_vectors.npy")

embeddings is an ndarray with shape (78413185, 200).

mcollardanuy commented 2 years ago

We will need to figure out the impact of this step:

We then filtered out entities and relation types that appeared less than 5 times in the data dump.
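
One way to quantify that impact, sketched here assuming names is loaded as above and gazetteer is the DataFrame with a wikidata_id column used further down: count how many gazetteer QIDs have no entry at all in the names list.

# Build a set once so that membership tests are O(1) instead of a scan.
name_set = set(names)

gazetteer_qids = set(gazetteer.wikidata_id)
missing = {
    q for q in gazetteer_qids
    if '<http://www.wikidata.org/entity/' + q + '>' not in name_set
}
print(len(missing), "of", len(gazetteer_qids), "gazetteer QIDs have no embedding")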

mcollardanuy commented 2 years ago

A quick search shows that we have embeddings for 55026744 Wikidata entities:

>>> import re
>>> counter_qid = 0
>>> for n in names:
...     if re.match(r"^.*/entity/(Q[0-9]+).*$", n):
...         counter_qid += 1
...
>>> counter_qid
55026744

kasparvonbeelen commented 2 years ago

Figured out how to read the large numpy file with embeddings: the trick was using mmap_mode.

embeddings = np.load("wikidata_translation_v1_vectors.npy", mmap_mode='r')
embeddings[0]
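
With mmap_mode='r', the array data stays on disk and numpy pages rows in on access, so the full matrix never has to fit in RAM; indexing a single row, as above, reads only that row.
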
kasparvonbeelen commented 2 years ago

Looking up the embedding for a given name:

import json

with open('wikidata_translation_v1_names.json') as in_json:
    names = json.load(in_json)

offset = names.index('"London Bridge"@en')
embeddings[offset]
kasparvonbeelen commented 2 years ago

Make a submatrix for place names (see the sketch below).
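
A minimal sketch, assuming a list of row offsets for the place names has already been collected (the offsets below are placeholders): fancy indexing on the memory-mapped array copies just the selected rows into an in-memory submatrix.

import numpy as np

embeddings = np.load("wikidata_translation_v1_vectors.npy", mmap_mode='r')

# Placeholder offsets; in practice these would come from matching the
# place names against the names list.
offsets = [1234, 5678, 91011]

# Fancy indexing copies only the selected rows into memory.
submatrix = np.asarray(embeddings[offsets])
print(submatrix.shape)  # (len(offsets), 200)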

kasparvonbeelen commented 2 years ago

Fine-tune entity embeddings on additional data?

kasparvonbeelen commented 2 years ago

Started writing a simple script for selecting (and saving) the embeddings of the gazetteer entities:

from tqdm import tqdm

wiki_entities = []
wiki_embeddings = []

for q in tqdm(set(gazetteer.wikidata_id)):
    entity = '<http://www.wikidata.org/entity/' + q + '>'
    try:
        offset = names.index(entity)
    except ValueError:
        print(entity)  # no embedding found for this QID
        continue
    wiki_entities.append(q)
    wiki_embeddings.append(embeddings[offset])

Script will take 120 hours!
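
Most of those 120 hours are spent in the repeated names.index(entity) calls, each of which is a linear scan over the 78 million names. Building a one-off name-to-offset dictionary should cut every lookup to O(1); a sketch, with the same assumed names, gazetteer and embeddings as above:

from tqdm import tqdm

# One linear pass to build the lookup table; each query after that is O(1).
name_to_offset = {name: offset for offset, name in enumerate(names)}

wiki_entities = []
wiki_embeddings = []

for q in tqdm(set(gazetteer.wikidata_id)):
    entity = '<http://www.wikidata.org/entity/' + q + '>'
    offset = name_to_offset.get(entity)
    if offset is None:
        print(entity)  # no embedding in the file
        continue
    wiki_entities.append(q)
    wiki_embeddings.append(embeddings[offset])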

mcollardanuy commented 2 years ago

Great @kasparvonbeelen, thanks!

@fedenanni and @kasparvonbeelen, let's discuss on Tuesday how to integrate the downloading and processing of the embeddings into our repo.

mcollardanuy commented 2 years ago

By @kasparvonbeelen:

Started writing a simple script for selecting (and saving) the embeddings of the gazetteer entities. (Script quoted in full above.)

For future reference: in the toponymVM.

mcollardanuy commented 2 years ago

We'll consider using Wikipedia embeddings as well: https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

Stored under toponymVM:/resources/wikipedia2vec:

enwiki_20180420_win10_300d.pkl.bz2
enwiki_20180420_win10_300d.txt.bz2

Instructions to use them: https://wikipedia2vec.github.io/wikipedia2vec/usage/
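
From those usage docs, loading looks roughly like this (a sketch: it assumes the .pkl.bz2 file has been decompressed first, and the entity name is just an example):

from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("enwiki_20180420_win10_300d.pkl")

# Entity and word vectors, both 300-dimensional for this model.
entity_vector = wiki2vec.get_entity_vector("London Bridge")
word_vector = wiki2vec.get_word_vector("bridge")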

mcollardanuy commented 2 years ago

I think we can close this for now, as it's probably superseded by https://github.com/Living-with-machines/toponym-resolution/issues/121.