Remote KG does not check whether provided entities exist.

dinani65 commented 3 years ago

I have a subset of facts from the Freebase dataset in the form of triples. Please take a look at the following example of my input data:

<fb:m.0100zv6s> <fb:common.topic.notable_for>   <fb:g.1q3sj7rb3>
<fb:m.0100zv6s> <fb:common.topic.notable_types> <fb:m.0kpv1_>
<fb:m.0100zv6s> <fb:music.group_member.membership>  <fb:m.0100zv6q>
<fb:m.0100zv6s> <fb:people.person.nationality>  <fb:m.03_r3>
<fb:m.0100zv6s> <fb:people.person.place_of_birth>   <fb:m.03_r3>

Based on the documentation, I think the KG should be initialized from scratch. Do you think so? If so, do I have to traverse the file (around 2 GB) line by line to add walks?

GillesVandewiele commented 3 years ago

If you have an RDF file (many syntaxes are supported: turtle, n3, owl, xml, ...), you can use the RDFLib KG. You can also host your own SPARQL endpoint with this data and use that to initialize a KG.

dinani65 commented 3 years ago

I am not sure whether my question makes sense or not. The sample.n3 file includes one fact: <fb:m.0100zv6s> <fb:common.topic.notable_for> <fb:g.1q3sj7rb3> .

from pyrdf2vec.graphs import KG

kg = KG("/freebase_2hops/sample.n3")
print(len(kg._entities))
print(kg._entities)

Could you explain why the entities of the KG are displayed this way: {<pyrdf2vec.graphs.kg.Vertex object at 0x7f9ab818acd0>, <pyrdf2vec.graphs.kg.Vertex object at 0x7f9a65fec250>}? Do these entities correspond to fb:m.0100zv6s and fb:g.1q3sj7rb3, respectively? How can I access the entities (in the form used in the input file) via the KG?

rememberYou commented 3 years ago

The fact that _entities is a private variable signals that it should only be used in rare cases.

To give a little more context: with online learning, where a model is trained incrementally, someone may want to retrieve all the entities that have been trained, for example to plot these entities in a graph. It is mainly for this reason that this variable was created.

Vertex objects should be displayed by their name (e.g., Vertex(name="fb:m.0100zv6s")) and not by their memory address. Are you using the latest version of pyRDF2Vec? Otherwise, you can always access the name of a vertex with the .name attribute. For example, three of the vertices from your input file can be displayed as follows:

# _entities is a set (unordered), so convert it to a list before slicing
print([entity.name for entity in list(kg._entities)[:3]])

dinani65 commented 3 years ago

I need to generate the embeddings of all the entities and relations in the input dataset. I created my own SPARQL endpoint. Should I now put all entities in a file and run the following for them one by one?

kg = KG(location="http://localhost:5820/db1/query", is_remote=True)
embeddings = []
for e in allEntities:
    # One fit_transform call per entity
    embeddings.append(transformer.fit_transform(kg, [e]))

rememberYou commented 3 years ago

Assuming that allEntities corresponds to the training entities from your file, here is how you can train the model for them:

# ...

transformer = RDF2VecTransformer(
    walkers=[RandomWalker(4, 10, n_jobs=2, random_state=RANDOM_STATE)],
    verbose=1,
)
embeddings, literals = transformer.fit_transform(
    KG("http://localhost:5820/db1"), allEntities
)
print(embeddings[:-3])

NOTE: with pyRDF2Vec>=0.2.0 you don't need to use is_remote=True anymore to indicate a remote KG. Specifically, pyRDF2Vec>=0.2.0 assumes that a KG location starting with "http" corresponds to a remote KG. Also note that you should not append /query to your KG location.
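To build allEntities from a triples file like the one at the top of this thread, something along these lines may help. It is only a sketch: collect_entities is a hypothetical helper, not part of pyRDF2Vec, and it assumes every line holds exactly three whitespace-free terms (so literals containing spaces would be skipped). Whether the angle brackets should be stripped depends on how the entities are named in your endpoint.

```python
from typing import Iterable, List


def collect_entities(lines: Iterable[str]) -> List[str]:
    """Return the distinct subjects and objects, in first-seen order."""
    seen = {}  # dict used as an ordered set
    for line in lines:
        parts = line.split()
        # Drop a trailing "." terminator if the file uses one.
        if parts and parts[-1] == ".":
            parts = parts[:-1]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        subject, _predicate, obj = parts
        for term in (subject, obj):
            # Assumption: entity names are the terms without angle brackets.
            seen.setdefault(term.strip("<>"), None)
    return list(seen)


sample = [
    "<fb:m.0100zv6s> <fb:common.topic.notable_for>   <fb:g.1q3sj7rb3>",
    "<fb:m.0100zv6s> <fb:people.person.nationality>  <fb:m.03_r3>",
]
all_entities = collect_entities(sample)
print(all_entities)
```

Passing an open file handle instead of a list (with open(path) as f: collect_entities(f)) reads the file line by line, so only the set of entity names is kept in memory, not the whole 2 GB file.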

dinani65 commented 3 years ago

I think I am missing some information. I ran some simple code based on what you mentioned. It does not seem to check whether the entities are in the KG. Could you please tell me what I am doing wrong?

allEntities=['hello', 'fb:m.03_r3']
kg = KG("http://localhost:5820/WebQSP", is_remote=True)
transformer = RDF2VecTransformer(
    walkers=[RandomWalker(4, 10)]
)
transformer.fit(kg, allEntities, verbose=True)
walk_embeddings = transformer.transform(allEntities)
print(walk_embeddings)

It returns embeddings for both entities, even though "hello" does not belong to the dataset.

rememberYou commented 3 years ago

The way pyRDF2Vec works is as follows:

If the Knowledge Graph is stored locally, its entire content is loaded into the KG class. The extraction of walks is therefore easy, since we can work directly with the content of the Knowledge Graph.
Otherwise, if the Knowledge Graph is served by a SPARQL endpoint, pyRDF2Vec only extracts the walks from this Knowledge Graph without storing its content in the KG class, because some Knowledge Graphs (e.g., DBpedia) are too large to fit in RAM.

So, since your Knowledge Graph is accessed through a SPARQL endpoint rather than loaded locally, the entities you want to create embeddings for may or may not be in the Knowledge Graph, and pyRDF2Vec does not check: https://github.com/IBCNServices/pyRDF2Vec/blob/7b1a41e589dbd2894382bbc2879eab46aed7f759/pyrdf2vec/rdf2vec.py#L163-L168

In your case, the walk generated for an entity that does not exist in the KG is [('hello',)], and it is injected into the Word2Vec training, which is why "hello" still gets an embedding. Checking that all the provided entities exist in a remote KG would require additional SPARQL queries and thus slow things down. I doubt that this is desired behavior for a future release of pyRDF2Vec. @GillesVandewiele do we need to add a warning about this to the code/FAQ?
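To illustrate why an unknown entity still ends up with a walk, here is a toy walk extractor. It is not pyRDF2Vec's implementation: the Graph type and extract_walks function are made up for this example, and the graph is a plain in-memory dictionary rather than a SPARQL endpoint.

```python
from typing import Dict, List, Tuple

# Toy adjacency map: entity -> list of (predicate, object) pairs.
Graph = Dict[str, List[Tuple[str, str]]]


def extract_walks(graph: Graph, root: str, max_depth: int) -> List[Tuple[str, ...]]:
    """Enumerate all walks of up to max_depth hops starting at root."""
    walks = [(root,)]
    for _ in range(max_depth):
        extended = []
        for walk in walks:
            hops = graph.get(walk[-1], [])
            if not hops:
                extended.append(walk)  # dead end: keep the walk as it is
            for predicate, obj in hops:
                extended.append(walk + (predicate, obj))
        walks = extended
    return walks


toy_graph = {
    "fb:m.0100zv6s": [("fb:people.person.nationality", "fb:m.03_r3")],
}
known_walks = extract_walks(toy_graph, "fb:m.0100zv6s", 2)
unknown_walks = extract_walks(toy_graph, "hello", 2)
print(known_walks)
print(unknown_walks)
```

An entity with no outgoing edges and an entity that does not exist at all look the same to this extractor: both yield a single walk containing only the root, which is still a valid "sentence" for Word2Vec.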

GillesVandewiele commented 3 years ago

Hi, I think it would be an interesting addition to have this check for remote KGs as well. The queries (ASK queries) should not be too expensive AFAIK.
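As a rough sketch of what such a check could look like (not necessarily what pyRDF2Vec ends up doing): the helper names below are hypothetical, the query treats an entity as existing if it appears as a subject or an object, and the endpoint is assumed to accept the query as a GET parameter and to answer in the SPARQL JSON results format.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_ask_query(entity: str) -> str:
    """Build an ASK query that is true if the entity is a subject or an object."""
    return (
        "ASK { { <" + entity + "> ?p ?o } "
        "UNION { ?s ?p <" + entity + "> } }"
    )


def parse_ask_response(body: str) -> bool:
    """Read the "boolean" member of a SPARQL JSON results document."""
    return bool(json.loads(body)["boolean"])


def entity_exists(endpoint: str, entity: str) -> bool:
    """Send the ASK query to the endpoint (assumed to accept HTTP GET)."""
    params = urlencode({"query": build_ask_query(entity)})
    request = Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return parse_ask_response(response.read().decode("utf-8"))


# No network access needed for these two helpers:
query = build_ask_query("fb:m.03_r3")
print(query)
print(parse_ask_response('{"head": {}, "boolean": false}'))
```

One query per entity keeps the logic simple; for a long entity list, the cost is one extra round trip per entity before any walks are extracted.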

rememberYou commented 3 years ago

@dinani65 The issue has been fixed with this commit on the master branch: https://github.com/IBCNServices/pyRDF2Vec/commit/f53b46ce6060930e8e769b11457d544509aeb110

You can clone the repository to take advantage of this. I also added a skip_verify attribute (defaults to False) to the KG class to skip this check.

IBCNServices/pyRDF2Vec, issue #47