Closed dinani65 closed 3 years ago
If you have an RDF file (many syntaxes are supported: turtle, n3, owl, xml, ...), you can use the RDFLib KG. You can also host your own SPARQL endpoint with this data and use that to initialize a KG.
I am not sure whether my question makes sense or not.
The sample.n3 file includes one fact:
<fb:m.0100zv6s> <fb:common.topic.notable_for> <fb:g.1q3sj7rb3> .
kg = KG("/freebase_2hops/sample.n3")
print(len(kg._entities))
print((kg._entities))
Could you explain why the entities of the KG are displayed this way:
{<pyrdf2vec.graphs.kg.Vertex object at 0x7f9ab818acd0>, <pyrdf2vec.graphs.kg.Vertex object at 0x7f9a65fec250>}
Do these entities correspond to fb:m.0100zv6s and fb:g.1q3sj7rb3, respectively?
How can I access the entities (in the form used in the input file) via the KG?
The fact that _entities is a private variable should put you on notice that it should only be used in rare exceptions.
To give a little more context, in the case of multiple online learning where a model is trained incrementally, someone may want to retrieve all the entities that have been trained. For example, to be able to plot these entities in a graph. It is mainly for this reason that this variable was created.
Vertex objects should be displayed by their name (e.g., Vertex(name="fb:m.0100zv6s")) and not by their memory address. Are you using the latest version of pyRDF2Vec? Otherwise, you can always access the name of a vertex with the .name attribute. So the first three vertices of your input file should be displayed as follows:
# _entities is a set (see the output above), so convert it to a list before slicing.
print([entity.name for entity in list(kg._entities)[:3]])
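To illustrate why the output shows memory addresses, here is a minimal sketch in plain Python. PlainVertex and NamedVertex are hypothetical stand-ins written for this example, not the actual pyRDF2Vec Vertex class: an object without a custom __repr__ falls back to Python's default representation, while the .name attribute is readable either way.

```python
class PlainVertex:
    """Hypothetical stand-in for a vertex WITHOUT a custom __repr__."""

    def __init__(self, name: str) -> None:
        self.name = name


class NamedVertex(PlainVertex):
    """Hypothetical stand-in whose repr shows the vertex name."""

    def __repr__(self) -> str:
        return f'Vertex(name="{self.name}")'


# Python's default repr contains the class name and a memory address.
print(repr(PlainVertex("fb:m.0100zv6s")))

# A set has no order and cannot be sliced, so sort the names for display.
entities = {NamedVertex("fb:m.0100zv6s"), NamedVertex("fb:g.1q3sj7rb3")}
names = sorted(vertex.name for vertex in entities)
print(names)
```

Whichever representation your installed version uses, reading .name gives you the identifiers as they appear in the input file.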
I need to generate the embeddings of all the entities and relations of the input dataset. I created my own SPARQL endpoint. My question now is: should I put all the entities in a file and run the following command for them one by one?
kg = KG(location="http://localhost:5820/db1/query", is_remote=True)
for e in allEntities:
    embeddings.append(transformer.fit_transform(kg, e))
Assuming that allEntities corresponds to the training entities from your file, here is how you can train the model for them:
# ...
transformer = RDF2VecTransformer(
    walkers=[RandomWalker(4, 10, n_jobs=2, random_state=RANDOM_STATE)],
    verbose=1,
)
embeddings, literals = transformer.fit_transform(
    KG("http://localhost:5820/db1"), allEntities
)
print(embeddings[:-3])
NOTE: with pyRDF2Vec>=0.2.0 you don't need to use is_remote=True anymore to indicate a remote KG. Specifically, pyRDF2Vec>=0.2.0 relies on the fact that a KG location with an "http" prefix corresponds to a remote KG. Also note that you should not add the /query suffix to your KG location.
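As a rough illustration of the rule described in the note, the check could look something like the sketch below. Both is_remote_location and strip_query_suffix are hypothetical helpers written for this example; they are not functions provided by pyRDF2Vec, and the library's own detection logic may differ in its details.

```python
def is_remote_location(location: str) -> bool:
    """Hypothetical helper: treat locations starting with http(s) as remote."""
    return location.startswith(("http://", "https://"))


def strip_query_suffix(location: str) -> str:
    """Hypothetical helper: drop a trailing /query from a KG location."""
    suffix = "/query"
    if location.endswith(suffix):
        return location[: -len(suffix)]
    return location


print(is_remote_location("http://localhost:5820/db1"))
print(is_remote_location("/freebase_2hops/sample.n3"))
print(strip_query_suffix("http://localhost:5820/db1/query"))
```

With such a rule, a local file path and an endpoint URL are told apart by the location string alone, which is why the is_remote flag becomes redundant.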
I think I am missing some information. I ran some simple code based on what you mentioned. It does not seem to check whether the entities are in the KG. Could you please tell me what I am doing wrong?
allEntities=['hello', 'fb:m.03_r3']
kg = KG("http://localhost:5820/WebQSP", is_remote=True)
transformer = RDF2VecTransformer(
    walkers=[RandomWalker(4, 10)]
)
transformer.fit(kg, allEntities, verbose=True)
walk_embeddings = transformer.transform(allEntities)
print(walk_embeddings)
It returns the embedding for two entities while "hello" does not belong to the dataset.
The way pyRDF2Vec works is as follows:
If the Knowledge Graph is stored locally on your machine, its entire content is loaded into the KG class. The extraction of walks is therefore easy, since we can work directly with the content of the Knowledge Graph.
Otherwise, if the Knowledge Graph is served by a SPARQL endpoint, pyRDF2Vec only extracts the walks of this Knowledge Graph without storing its content in the KG class. Some Knowledge Graphs (e.g., DBpedia) are too large to be stored in RAM.
So, since your Knowledge Graph is not stored on your physical machine, the entities you want to create embeddings of may or may not be in the Knowledge Graph: https://github.com/IBCNServices/pyRDF2Vec/blob/7b1a41e589dbd2894382bbc2879eab46aed7f759/pyrdf2vec/rdf2vec.py#L163-L168
In your case, the walk generated for an entity that does not exist in the KG is [('hello',)], and it is injected into the Word2Vec training. Checking that all provided entities exist in a remote KG would require additional SPARQL queries and thus slow down performance. I doubt that this is desired behavior for a future release of pyRDF2Vec. @GillesVandewiele do we need to display a warning message in the code/FAQ about this?
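To make the [('hello',)] case concrete, here is a toy sketch of walk extraction over an in-memory adjacency list. This is not pyRDF2Vec's walker implementation; the Graph type and extract_walks function are assumptions made for illustration. It only shows how an entity with no known outgoing edges ends up with a walk that contains nothing but itself.

```python
from typing import Dict, List, Tuple

# Toy adjacency list: subject -> list of (predicate, object) pairs.
Graph = Dict[str, List[Tuple[str, str]]]


def extract_walks(graph: Graph, root: str, max_depth: int) -> List[Tuple[str, ...]]:
    """Enumerate all walks of up to max_depth hops starting at root."""
    walks: List[Tuple[str, ...]] = [(root,)]
    for _ in range(max_depth):
        extended: List[Tuple[str, ...]] = []
        for walk in walks:
            hops = graph.get(walk[-1], [])
            if not hops:
                # Dead end or unknown entity: keep the walk unchanged.
                extended.append(walk)
                continue
            for predicate, obj in hops:
                extended.append(walk + (predicate, obj))
        walks = extended
    return walks


graph: Graph = {
    "fb:m.0100zv6s": [("fb:common.topic.notable_for", "fb:g.1q3sj7rb3")],
}
known_walks = extract_walks(graph, "fb:m.0100zv6s", 2)
unknown_walks = extract_walks(graph, "hello", 2)
print(known_walks)
print(unknown_walks)
```

The unknown entity still produces a one-element walk, so Word2Vec sees the token 'hello' and returns a vector for it even though it carries no graph information.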
Hi, I think it would be an interesting addition to have this check for remote KGs as well. The queries (ASK queries) should not be too expensive AFAIK.
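For reference, an existence check based on ASK queries could be sketched roughly as follows. The exact query used by pyRDF2Vec may differ; build_ask_query and find_missing are hypothetical helpers, and the ask callable is injected so that the example runs without a live endpoint (here it is replaced by a fake that only knows one entity).

```python
from typing import Callable, List


def build_ask_query(entity: str) -> str:
    """Build a SPARQL ASK query testing whether entity is the subject of any triple."""
    return f"ASK WHERE {{ <{entity}> ?p ?o . }}"


def find_missing(entities: List[str], ask: Callable[[str], bool]) -> List[str]:
    """Return the entities for which the ASK query evaluates to false."""
    return [entity for entity in entities if not ask(build_ask_query(entity))]


def fake_ask(query: str) -> bool:
    """Stand-in for a real endpoint call: only fb:m.03_r3 is 'in the KG'."""
    return "<fb:m.03_r3>" in query


missing = find_missing(["hello", "fb:m.03_r3"], fake_ask)
print(build_ask_query("fb:m.03_r3"))
print(missing)
```

In a real setup, ask would send the query to the SPARQL endpoint and read the boolean result, which costs one lightweight request per entity.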
@dinani65 The issue has been fixed with this commit on the master branch: https://github.com/IBCNServices/pyRDF2Vec/commit/f53b46ce6060930e8e769b11457d544509aeb110
You can clone the repository to take advantage of this. I also added a skip_verify attribute (defaults to False) to the KG class to skip this check.
I have a subset of facts from the Freebase dataset. Please take a look at the following example of my input data:
Based on the documentation, I think the KG should be initialized from scratch. Do you think so? If so, do I have to traverse the file (around 2 GB) line by line to add walks?