IBCNServices / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

Using Wikidata SPARQL Endpoint #30

Closed ghost closed 3 years ago

ghost commented 3 years ago

❓ Question

Hi, we want to create embeddings for items from DBpedia and Wikidata using SPARQL endpoints.

DBpedia works as expected: with kg = KG("https://dbpedia.org/sparql", is_remote=True) we get fine results.

For Wikidata, kg = KG("https://query.wikidata.org/sparql", is_remote=True), we edited kg.py to define and pass a user agent to SPARQLWrapper: self.endpoint = SPARQLWrapper(location, agent=user_agent). We get results, but they seem to be random numbers; no clusters are recognizable in any way. Checking against non-existing URIs also leads to similar results, so we assume that even with correct URIs there is no real processing going on.
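
For reference, this is roughly the modification we made, shown as a standalone sketch (the user_agent value is only illustrative; Wikidata expects a descriptive agent for programmatic access):

from SPARQLWrapper import JSON, SPARQLWrapper

location = "https://query.wikidata.org/sparql"
user_agent = "my-rdf2vec-experiment/0.1 (mailto:someone@example.org)"  # illustrative

# In kg.py this is assigned to self.endpoint instead of a local variable.
endpoint = SPARQLWrapper(location, agent=user_agent)
endpoint.setReturnFormat(JSON)
# Any Wikidata entity URI works here; Q83894 is just an example.
endpoint.setQuery("SELECT ?p ?o WHERE { <http://www.wikidata.org/entity/Q83894> ?p ?o . } LIMIT 5")
print(len(endpoint.query().convert()["results"]["bindings"]))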

Is there something special to take care of when working with the Wikidata SPARQL endpoint?

Best regards, werner

GillesVandewiele commented 3 years ago

Hi there,

Thank you for showing interest in pyRDF2Vec, and great to hear that you are already making your own modifications to the code base to suit your use case. We appreciate any kind of feedback on how we could make it easier for people to make these modifications!

How many walks are you extracting for each of the entities? Could you perhaps also print out the walks after extraction to check whether that went well? A point in the code where you could print these walks is around this line: https://github.com/IBCNServices/pyRDF2Vec/blob/master/pyrdf2vec/rdf2vec.py#L72

bsteenwi commented 3 years ago

The DBpedia endpoint is based on Virtuoso (which provides the /sparql option). Most Wikidata queries are just concatenations of the query string onto the following URL: https://query.wikidata.org/ So I'm not sure whether this /sparql is needed for Wikidata?

Example: https://query.wikidata.org/#SELECT%20%3Fp%20%3Fo%0AWHERE%0A%7B%0A%20%20wd%3AQ83894%20%3Fp%20%3Fo.%0A%7D
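
For reference, that URL encodes the query:

SELECT ?p ?o
WHERE
{
  wd:Q83894 ?p ?o.
}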

ghost commented 3 years ago

Thanks to both of you for responding so quickly. @GillesVandewiele, your answer gave me a new idea for verifying whether there is a useful response from Wikidata - and there is! I had concentrated too little on what actually happens and too much on what the results look like. I now have to refine the walking strategy and define a better set of label_predicates to improve the quality of the embeddings.

@bsteenwi thanks for the idea. I could now verify that "https://query.wikidata.org/sparql" works.

GillesVandewiele commented 3 years ago

Hi @krewer, this could indeed help! It's also important to note that our remote KG is typically very slow when you have to query a server over the web, due to the many HTTP requests. Having your own local server would speed this up significantly. Because the remote KG is so slow, a very low number of walks typically has to be extracted in order to get results in a timely manner. Unfortunately, those results are often not that great with a low number of walks...

It is on our roadmap to optimize the efficiency of our remote KG!

nnadine25 commented 3 years ago

Hi, I also tried Alternative 2: using a DBpedia endpoint (nothing is loaded into memory), but it gives me this error:

Extracted 0 walks for 0 entities!
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

How does it work with SPARQL and DBpedia? Using a SPARQL property path CONSTRUCT query and rdflib? Where do we put the query?

GillesVandewiele commented 3 years ago

Hi @nnadine25

Can we see the code you are using in order to help you debug? It should work by just providing the URL of the endpoint and then a list of entities (URIs/URLs) for which you want to generate embeddings. pyRDF2Vec will perform SPARQL queries under the hood:

SELECT ?p ?o WHERE {
   <entity> ?p ?o.
}

With <entity> one of the provided URLs in your list.
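
If you want to sanity-check an endpoint manually, you can run that same kind of query yourself with SPARQLWrapper, for example (a minimal sketch; the entity URI is just an example):

from SPARQLWrapper import JSON, SPARQLWrapper

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Brussels> ?p ?o . }")
# A non-empty result set means the endpoint answers the kind of query
# pyRDF2Vec issues during walk extraction.
print(len(endpoint.query().convert()["results"]["bindings"]))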

GillesVandewiele commented 3 years ago

I do not understand your question. Please re-phrase.

GillesVandewiele commented 3 years ago

Yes. You should be able to provide any URI that can be found at the endpoint you provided.

nnadine25 commented 3 years ago

Thanks sir, I have another question: Word2Vec sometimes gives different values for the same item(s), and we need to ensure that gensim generates the same Word2Vec model across different runs on the same data. Is it the same in RDF2Vec? Are the values in the embeddings array the same in each execution?

GillesVandewiele commented 3 years ago

See our README:

For a more elaborate example, check the example.py file:

PYTHONHASHSEED=42 python3 example.py

NOTE: the PYTHONHASHSEED (e.g., 42) is to ensure determinism.

nnadine25 commented 3 years ago

I tried this code and it gives me different results in each execution. I set PYTHONHASHSEED=0 in PyCharm.

from pyrdf2vec.graphs import KG

import numpy as np
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec

entities = ['http://dbpedia.org/resource/Ralf_Schumacher', 'http://dbpedia.org/resource/Mick_Schumacher']

kg = KG(location="https://dbpedia.org/sparql", is_remote=True)

walkers = [RandomWalker(1, 200, UniformSampler())]
embedder = Word2Vec(size=200)
transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)

embeddings = transformer.fit_transform(kg, entities)
print(embeddings)

vector1 = embeddings[0]
vector2 = embeddings[1]
# Cosine similarity between the two embeddings.
unit_vector_1 = vector1 / np.linalg.norm(vector1)
unit_vector_2 = vector2 / np.linalg.norm(vector2)
dot_product = np.dot(unit_vector_1, unit_vector_2)
print(dot_product)

bsteenwi commented 3 years ago

Hi,

To get reproducible embeddings, you have to set PYTHONHASHSEED=0, as already discussed by Gilles. You can set it in PyCharm as an environment variable (more info here).

But you also have to fix the sampler randomness. After you import numpy, add the following code:

np.random.seed(0)

This will ensure the np.random calls in our code are seeded.

We will add this information to our readme.

rememberYou commented 3 years ago

Thanks @bsteenwi, I added the explanation in the README.rst file of the develop branch.

Link: https://github.com/IBCNServices/pyRDF2Vec/tree/develop#how-to-ensure-the-generation-of-similar-embeddings

When the unit tests are all successful, I will make sure to do the merge.

nnadine25 commented 3 years ago

Thank you very much. I have another question: if we need to save the RDF2Vec model to a file, is there any specific extension that the file should have?

rememberYou commented 3 years ago

If you are using pyRDF2Vec 0.1.1, the library does not yet provide a function to save the RDF2VecTransformer object, which contains the entity embeddings, but nothing prevents you from serializing the object yourself, as shown below.
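
For example, with Python's standard pickle module (a sketch; the filename is arbitrary):

import pickle

# `transformer` is a fitted RDF2VecTransformer (see the example further below).
with open("transformer.pkl", "wb") as f:
    pickle.dump(transformer, f)

# Later, load it back without having to train the model again.
with open("transformer.pkl", "rb") as f:
    transformer = pickle.load(f)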

If you use a clone directly from the master branch of pyRDF2Vec, the RDF2VecTransformer object can be saved to and loaded from a binary file (transformer_data by default) using the save and load functions.

https://github.com/IBCNServices/pyRDF2Vec/blob/8181e02a866bf274d50d66f28f1ffa7100ebb9a6/pyrdf2vec/rdf2vec.py#L119

Here is an easy example (adapted from examples/countries.py):

import pandas as pd
import rdflib

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = [rdflib.URIRef(x) for x in data["location"]]
kg = KG(
    "https://dbpedia.org/sparql",
    label_predicates=["www.w3.org/1999/02/22-rdf-syntax-ns#type"],
    is_remote=True,
)
transformer = RDF2VecTransformer(walkers=[RandomWalker(2, None)])
transformer.fit_transform(kg, entities).save("foo")

# Create another RDF2VecTransformer object and load the embeddings of the
# previously saved entities, in order to avoid having to train the model again.
transformer_saved = RDF2VecTransformer.load("foo")

So no, no type of extension needs to be specified. However, if you would like to use more than one RDF2VecTransformer object, you will need to serialize them with a different filename.

Please create another issue if you still have other questions.

nnadine25 commented 3 years ago

Thank you, I will do that. But the values of the embeddings array are still different in each execution, even when I set PYTHONHASHSEED=0 in PyCharm and np.random.seed(0).

rememberYou commented 3 years ago

@nnadine25 Could you paste your current code with the modifications made? Use triple backticks followed by python to paste your code here.

SEE: https://docs.github.com/en/enterprise-server@2.20/github/writing-on-github/basic-writing-and-formatting-syntax#quoting-code

nnadine25 commented 3 years ago

I tried the code on this page and the values of the array are different in each execution, even when setting PYTHONHASHSEED=0.

the page : https://github.com/IBCNServices/pyRDF2Vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

the example:

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec

kg = KG(location="http://dbpedia.org/sparql", is_remote=True)

walkers = [RandomWalker(1, 200, UniformSampler())]
embedder = Word2Vec(size=200)
transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)

embeddings = transformer.fit_transform(kg, ['http://dbpedia.org/resource/Brussels'])
print(embeddings)

GillesVandewiele commented 3 years ago

There's no np.random.seed in that code...

rememberYou commented 3 years ago

@nnadine25 Use pyRDF2Vec 0.1.1:

pip install pyRDF2Vec

By using this code and saving it in a file (e.g., foo.py):

import random

import numpy as np

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

# Ensure the determinism of this script by initializing a pseudo-random number
# generator.
np.random.seed(42)
random.seed(42)

transformer = RDF2VecTransformer(Word2Vec(size=200), [RandomWalker(1, 200)])

embeddings = transformer.fit_transform(
    KG(location="http://dbpedia.org/sparql", is_remote=True),
    ["http://dbpedia.org/resource/Brussels"],
)
print(embeddings)

I always get the same embeddings by executing the script like this:

PYTHONHASHSEED=42 python foo.py

Let me know if the problem persists.

rememberYou commented 3 years ago

A quick update after my analysis this morning. On the master branch, random determinism works well as long as you don't use multiprocessing (i.e., an n_jobs parameter >= 2 for a walker), which is perfectly normal, as initializing multiple processes cannot guarantee determinism yet.

@nnadine25 I always have the same embeddings with your code.

EDIT: after playing a bit more with random determinism this morning: you should use Word2Vec(workers=1, size=200) in case you have many entities, to ensure determinism for Word2Vec.

SEE: https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vecdoc2vecetc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism
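
Combined with the foo.py snippet above, that would look like this (a sketch; only the embedder changes):

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker

# workers=1 keeps gensim's Word2Vec training single-threaded, which avoids
# thread-scheduling non-determinism during training.
transformer = RDF2VecTransformer(Word2Vec(workers=1, size=200), [RandomWalker(1, 200)])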