Closed ghost closed 3 years ago
Hi there,
Thank you for showing interest in pyRDF2Vec, and great to hear that you are already making your own modifications to the code base to suit your use case. We appreciate any kind of feedback on how we could make it easier for people to make these modifications!
How many walks are you extracting for each of the entities? Could you perhaps also print out the walks after extraction, to check whether that step went well? A point in the code where you could print these walks is around this line: https://github.com/IBCNServices/pyRDF2Vec/blob/master/pyrdf2vec/rdf2vec.py#L72
The DBpedia endpoint is based on Virtuoso (which provides the /sparql option). Most Wikidata queries are just concatenations of the query string to the following URL: https://query.wikidata.org/ So I'm not sure whether this /sparql suffix is needed for Wikidata.
Example:
https://query.wikidata.org/#SELECT%20%3Fp%20%3Fo%0AWHERE%0A%7B%0A%20%20wd%3AQ83894%20%3Fp%20%3Fo.%0A%7D
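The fragment after the `#` in that URL is just a percent-encoded SPARQL query. A minimal standard-library sketch that decodes it back into plain text:

```python
from urllib.parse import unquote

# Percent-encoded fragment copied from the example URL above.
fragment = "SELECT%20%3Fp%20%3Fo%0AWHERE%0A%7B%0A%20%20wd%3AQ83894%20%3Fp%20%3Fo.%0A%7D"

# unquote() reverses the percent-encoding, revealing the raw SPARQL query.
query = unquote(fragment)
print(query)
```

This prints the underlying query, `SELECT ?p ?o WHERE { wd:Q83894 ?p ?o. }`, spread over a few lines.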
Thanks to both of you for responding so quickly. @GillesVandewiele, your answer gave me a new idea for verifying whether there is a useful response from Wikidata, and there is! I had concentrated too little on what happens and too much on how the results look. I have to refine the walking strategy and define a better set of label_predicates to improve the quality of the embeddings.
@bsteenwi thanks for the idea. I could now verify that "https://query.wikidata.org/sparql" works.
Hi @krewer, this could indeed help! It's also important to note that our remote KG is typically very slow when you have to query a server over the web, due to the many HTTP requests. Having your own local server would speed this up significantly. Because the remote KG is so slow, only a very low number of walks can typically be extracted in a timely manner, and unfortunately the results are often not that great with a low number of walks...
It is on our roadmap to optimize the efficiency of our remote KG!
Hi, I also tried Alternative 2: using a DBpedia endpoint (nothing is loaded into memory), but it gives me this error:

Extracted 0 walks for 0 entities!
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

How does it work with SPARQL and DBpedia? Using a SPARQL property path CONSTRUCT query and rdflib? Where do we put the query?
Hi @nnadine25
Can we see the code you are using in order to help you debug? It should work by just providing the URL of the endpoint and then a list of entities (URIs/URLs) for which you want to generate embeddings. pyRDF2Vec will perform SPARQL queries under the hood:
```sparql
SELECT ?p ?o WHERE {
    <entity> ?p ?o .
}
```
with <entity> being one of the provided URLs in your list.
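Concretely, building that query string for one entity is just string formatting. A small sketch (the helper name is made up; pyRDF2Vec constructs an equivalent query internally):

```python
# Hypothetical helper illustrating the per-entity query pyRDF2Vec issues.
def build_query(entity: str) -> str:
    # The entity URI is wrapped in angle brackets to form a valid SPARQL IRI.
    return f"SELECT ?p ?o WHERE {{ <{entity}> ?p ?o . }}"

print(build_query("http://dbpedia.org/resource/Brussels"))
```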
I do not understand your question. Please re-phrase.
Yes. You should be able to provide any URI that can be found at the endpoint you provided.
Thank you, sir. I have another question: when we use Word2Vec, it sometimes gives different values for the same item(s), and we need to ensure gensim generates the same Word2Vec model across different runs on the same data. Is it the same in RDF2Vec? Are the values in the embeddings array the same in each execution?
See our README:
For a more elaborate example, check the example.py file:
PYTHONHASHSEED=42 python3 example.py
NOTE: the PYTHONHASHSEED (e.g., 42) is to ensure determinism.
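Note that PYTHONHASHSEED only takes effect if it is set in the environment *before* the interpreter starts; assigning it inside a running script changes nothing. A small sketch that demonstrates the effect by launching fresh interpreters:

```python
import os
import subprocess
import sys

# hash() on strings is randomized per interpreter run unless PYTHONHASHSEED
# is fixed in the environment before Python starts.
cmd = [sys.executable, "-c", "print(hash('Brussels'))"]
env = {**os.environ, "PYTHONHASHSEED": "42"}

out1 = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout
out2 = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout

print(out1 == out2)  # True: with a fixed seed, the hash is reproducible
```

Without the PYTHONHASHSEED entry, the two runs would almost certainly print different numbers.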
I tried this code and it gives me different results in each execution, even though I set PYTHONHASHSEED=0 in PyCharm:
```python
import numpy as np

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

entities = [
    "http://dbpedia.org/resource/Ralf_Schumacher",
    "http://dbpedia.org/resource/Mick_Schumacher",
]
kg = KG(location="https://dbpedia.org/sparql", is_remote=True)
walkers = [RandomWalker(1, 200, UniformSampler())]
embedder = Word2Vec(size=200)
transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)

embeddings = transformer.fit_transform(kg, entities)
print(embeddings)

vector1 = embeddings[0]
vector2 = embeddings[1]
unit_vector_1 = vector1 / np.linalg.norm(vector1)
unit_vector_2 = vector2 / np.linalg.norm(vector2)
# Note: this should multiply the two *different* unit vectors; the original
# snippet took the dot product of unit_vector_2 with itself.
dot_product = np.dot(unit_vector_1, unit_vector_2)
print(dot_product)
```
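As an aside, the cosine-similarity computation at the end of that snippet can be sanity-checked on toy vectors with plain NumPy, independent of pyRDF2Vec:

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Normalize both vectors, then take the dot product of the unit vectors.
    return float(np.dot(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)))

print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0 (parallel)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # ~0.0 (orthogonal)
```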
Hi,
To get reproducible embeddings, you have to set PYTHONHASHSEED=0, as already discussed by Gilles. You can set it in PyCharm as an environment variable (more info here).
But you also have to fix the sampler's randomness. After you import numpy, add the following code:
np.random.seed(0)
This will ensure the np.random calls in our code are seeded.
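Re-seeding with the same value reproduces the same sequence of draws, which is why the sampler becomes deterministic. A quick self-contained check:

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(5)

np.random.seed(0)  # same seed -> identical sequence of draws
second = np.random.rand(5)

print(np.array_equal(first, second))  # True
```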
We will add this information to our readme.
Thanks @bsteenwi, I added the explanation in the README.rst file of the develop branch.
When the unit tests are all successful, I will make sure to do the merge.
Thank you very much. I have another question: if we need to save the RDF2Vec model into a file, is there any specific extension that the file should have?
If you are using pyRDF2Vec 0.1.1, the library does not yet provide a function to save the RDF2VecTransformer object which contains the entity embeddings, but nothing prevents you from serializing the object yourself as explained below.
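For 0.1.1, serializing the object yourself could look like the sketch below. A plain dict stands in for the fitted RDF2VecTransformer instance here (pickle handles the real object the same way), and the filename transformer_data is just an example:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted RDF2VecTransformer; any picklable object works alike.
model = {
    "entities": ["http://dbpedia.org/resource/Brussels"],
    "embeddings": [[0.1, 0.2, 0.3]],
}

path = os.path.join(tempfile.mkdtemp(), "transformer_data")

# Serialize to a binary file...
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and load it back later, avoiding a retraining run.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True: the round-trip preserved the object
```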
If you use a clone directly from the master branch of pyRDF2Vec, the RDF2VecTransformer object is saved to and loaded from a binary file (transformer_data by default) using the save and load functions.
Here is an easy example (from examples/countries.py) to use:
```python
import pandas as pd
import rdflib

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = [rdflib.URIRef(x) for x in data["location"]]

kg = KG(
    "https://dbpedia.org/sparql",
    label_predicates=["www.w3.org/1999/02/22-rdf-syntax-ns#type"],
    is_remote=True,
)

transformer = RDF2VecTransformer(walkers=[RandomWalker(2, None)])
transformer.fit_transform(kg, entities).save("foo")

# Create another RDF2VecTransformer object and load the embeddings of the
# previously saved entities, to avoid having to train the model again.
transformer_saved = RDF2VecTransformer.load("foo")
```
So no, no specific extension needs to be used. However, if you would like to use more than one RDF2VecTransformer object, you will need to serialize them with different filenames.
Please create another issue if you still have other questions.
Thank you, I will do that. But the values of the embeddings array are still different in each execution, even when I set PYTHONHASHSEED=0 in PyCharm and call np.random.seed(0).
@nnadine25 Could you paste your current code with the modifications made? Use triple backticks followed by python to paste your code here.
I tried the code on this page, and the values of the array are different in each execution, even when PYTHONHASHSEED=0 is set. The example:
```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

kg = KG(location="http://dbpedia.org/sparql", is_remote=True)
walkers = [RandomWalker(1, 200, UniformSampler())]
embedder = Word2Vec(size=200)
transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)

embeddings = transformer.fit_transform(kg, ["http://dbpedia.org/resource/Brussels"])
print(embeddings)
```
There's no np.random.seed in that code...
@nnadine25 Use pyRDF2Vec 0.1.1:

```shell
pip install pyRDF2Vec
```

By using this code and saving it in a file (e.g., foo.py):
```python
import random

import numpy as np

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

# Ensure the determinism of this script by initializing a pseudo-random number
# generator.
np.random.seed(42)
random.seed(42)

transformer = RDF2VecTransformer(Word2Vec(size=200), [RandomWalker(1, 200)])
embeddings = transformer.fit_transform(
    KG(location="http://dbpedia.org/sparql", is_remote=True),
    ["http://dbpedia.org/resource/Brussels"],
)
print(embeddings)
```
I always get the same embeddings by executing the script like this:
PYTHONHASHSEED=42 python foo.py
Let me know if the problem persists.
A quick update after my analysis this morning. For the master branch, random determinism works well as long as you don't use multiprocessing (i.e., an n_jobs parameter >= 2 for a Walker), which is perfectly normal: initializing multiple processes cannot guarantee determinism yet.
@nnadine25 I always get the same embeddings with your code.
EDIT: after playing a bit with random determinism this morning: you should use Word2Vec(workers=1, size=200) in case you have many entities, to ensure determinism for Word2Vec.
❓ Question
Hi, we want to create embeddings for items from DBpedia and Wikidata using SPARQL endpoints.
DBpedia works as expected with kg = KG("https://dbpedia.org/sparql", is_remote=True), and we get fine results.
For Wikidata we use kg = KG("https://query.wikidata.org/sparql", is_remote=True) and edited kg.py to define and pass an agent to SPARQLWrapper: self.endpoint = SPARQLWrapper(location, agent=user_agent). We get results, but they seem to be random numbers; no clusters are recognizable in any way. Checking against non-existing URIs also leads to similar results, so we assume that with correct URIs there is also no real processing happening.
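One thing worth double-checking: the Wikidata endpoint expects a descriptive User-Agent, and requests with a missing or generic agent can be throttled, which may silently degrade results. A minimal standard-library sketch of a request with an explicit agent (the agent string and query are made-up examples; this only builds the request object without sending it):

```python
from urllib.parse import quote
from urllib.request import Request

query = "SELECT ?p ?o WHERE { wd:Q83894 ?p ?o . } LIMIT 5"
req = Request(
    "https://query.wikidata.org/sparql?query=" + quote(query),
    headers={
        # Wikidata asks for an identifiable agent with contact information.
        "User-Agent": "my-rdf2vec-experiment/0.1 (mailto:user@example.org)",
        "Accept": "application/sparql-results+json",
    },
)
print(req.get_header("User-agent"))
```

If a plain curl or urllib request with such an agent returns sensible JSON for one of your entities, the endpoint side is fine and the problem lies elsewhere.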
Is there something special to take care of when working with the Wikidata SPARQL endpoint?
Best regards, werner