IBCNServices / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

Remote KG does not check whether provided entities exist. #47

Closed dinani65 closed 3 years ago

dinani65 commented 3 years ago

I have a subset of facts from the Freebase dataset in the form of triples. Please take a look at the following example of my input data:

<fb:m.0100zv6s> <fb:common.topic.notable_for>   <fb:g.1q3sj7rb3>
<fb:m.0100zv6s> <fb:common.topic.notable_types> <fb:m.0kpv1_>
<fb:m.0100zv6s> <fb:music.group_member.membership>  <fb:m.0100zv6q>
<fb:m.0100zv6s> <fb:people.person.nationality>  <fb:m.03_r3>
<fb:m.0100zv6s> <fb:people.person.place_of_birth>   <fb:m.03_r3>

Based on the documentation, I think the KG should be initialized from scratch. Do you think so? If so, do I have to traverse the file (around 2 GB) line by line to add walks?

GillesVandewiele commented 3 years ago

If you have an RDF file (many syntaxes are supported: turtle, n3, owl, xml, ...), you can use the RDFLib KG. You can also host your own SPARQL endpoint with this data and use that to initialize a KG.

dinani65 commented 3 years ago

I am not sure whether my question makes sense or not. The sample.n3 file includes one fact: <fb:m.0100zv6s> <fb:common.topic.notable_for> <fb:g.1q3sj7rb3> .

kg = KG("/freebase_2hops/sample.n3")
print(len(kg._entities))
print((kg._entities))

Could you explain why the entities of the KG are displayed this way: {<pyrdf2vec.graphs.kg.Vertex object at 0x7f9ab818acd0>, <pyrdf2vec.graphs.kg.Vertex object at 0x7f9a65fec250>}? Do these entities correspond to fb:m.0100zv6s and fb:g.1q3sj7rb3, respectively? How can I access the entities (in the form used in the input file) via the KG?

rememberYou commented 3 years ago

The fact that _entities is a private variable should signal that it is meant to be used only in rare cases.

To give a little more context: in online learning, where a model is trained incrementally over several rounds, someone may want to retrieve all the entities that have been trained so far, for example to plot them in a graph. This variable was created mainly for that reason.

Vertex objects should be displayed by their name (e.g., Vertex(name="fb:m.0100zv6s")) and not by their memory address. Are you using the latest version of pyRDF2Vec? Otherwise, you can always access the name of a vertex with the .name attribute. So the names of three vertices from your input file can be displayed as follows:

# _entities is a set (see the {...} output above), so convert it to a list before slicing
print([entity.name for entity in list(kg._entities)[:3]])

dinani65 commented 3 years ago

I need to generate embeddings for all entities and relations of the input dataset. I created my own SPARQL endpoint. Now my question is: should I put all entities in a file and run the following command for them one by one?

kg = KG(location="http://localhost:5820/db1/query", is_remote=True)
for e in allEntities:
    # fit_transform expects a list of entities, hence [e]
    embeddings.append(transformer.fit_transform(kg, [e]))

rememberYou commented 3 years ago

Assuming that allEntities corresponds to the training entities from your file, here is how you can train the model for them:

# ...

transformer = RDF2VecTransformer(
    walkers=[RandomWalker(4, 10, n_jobs=2, random_state=RANDOM_STATE)],
    verbose=1,
)
embeddings, literals = transformer.fit_transform(
    KG("http://localhost:5820/db1"), allEntities
)
print(embeddings[:-3]) 
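
If allEntities has not been built yet, it could be collected from the triples themselves. The following is only a minimal sketch, assuming the whitespace-separated `<s> <p> <o>` layout shown at the top of this thread; the helper name `collect_entities` is hypothetical and not part of pyRDF2Vec.

```python
import re
from typing import Iterable, List

# Matches every <...> term on a line, e.g. <fb:m.0100zv6s>.
TERM = re.compile(r"<([^<>\s]+)>")


def collect_entities(lines: Iterable[str]) -> List[str]:
    """Return the unique subjects and objects, in first-seen order."""
    seen = {}  # dict keys keep insertion order and drop duplicates
    for line in lines:
        terms = TERM.findall(line)
        if len(terms) != 3:
            continue  # skip blank lines and lines without exactly three <...> terms
        subject, _predicate, obj = terms
        seen.setdefault(subject, None)
        seen.setdefault(obj, None)
    return list(seen)


sample = [
    "<fb:m.0100zv6s>\t<fb:common.topic.notable_for>\t<fb:g.1q3sj7rb3>",
    "<fb:m.0100zv6s>\t<fb:people.person.nationality>\t<fb:m.03_r3>",
]
all_entities = collect_entities(sample)
print(all_entities)  # ['fb:m.0100zv6s', 'fb:g.1q3sj7rb3', 'fb:m.03_r3']
```

For a large file, passing the open file handle instead of a list lets the lines stream through without loading everything into memory. Whether the names must be expanded to full IRIs before being handed to the KG depends on how the data was loaded into the endpoint.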

NOTE: with pyRDF2Vec>=0.2.0 you no longer need is_remote=True to indicate a remote KG. Specifically, pyRDF2Vec>=0.2.0 treats a KG location with an "http" prefix as a remote KG. Also note that you should not add /query to your KG location.
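
The rule in the note, that a location beginning with "http" is treated as remote, amounts to a simple check on the start of the string. The sketch below only illustrates that idea; the function name `looks_remote` is hypothetical and this is not the library's actual code.

```python
def looks_remote(location: str) -> bool:
    """Guess whether a KG location is a SPARQL endpoint rather than a local file."""
    # Assumption: only the URL scheme at the start of the string matters.
    return location.startswith(("http://", "https://"))


print(looks_remote("http://localhost:5820/db1"))  # True
print(looks_remote("/freebase_2hops/sample.n3"))  # False
```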

dinani65 commented 3 years ago

I think I am missing some information. I ran some simple code based on what you mentioned. It does not seem to check whether the entities are in the KG. Could you please tell me what I am doing wrong?

allEntities=['hello', 'fb:m.03_r3']
kg = KG("http://localhost:5820/WebQSP", is_remote=True)
transformer = RDF2VecTransformer(
    walkers=[RandomWalker(4, 10)]
)
transformer.fit(kg, allEntities, verbose=True)
walk_embeddings = transformer.transform(allEntities)
print(walk_embeddings)

It returns embeddings for both entities even though "hello" does not belong to the dataset.

rememberYou commented 3 years ago

The way pyRDF2Vec works is as follows:

So, since your Knowledge Graph is not stored on your physical machine, the entities you want to create embeddings of may or may not be in the Knowledge Graph: https://github.com/IBCNServices/pyRDF2Vec/blob/7b1a41e589dbd2894382bbc2879eab46aed7f759/pyrdf2vec/rdf2vec.py#L163-L168

In your case, the walk generated for an entity that does not exist in the KG is [('hello',)], and it is injected into the Word2Vec training. Checking that all provided entities exist in a remote KG would require additional SPARQL queries and thus slow down performance. I doubt that this is desired behavior for a future release of pyRDF2Vec. @GillesVandewiele do we need to display a warning message in the code/FAQ about this?
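
To illustrate why an unknown entity still ends up with a walk (and therefore an embedding), here is a toy random-walk extractor over an in-memory adjacency list. It is a simplified sketch, not pyRDF2Vec's RandomWalker; `GRAPH` and `random_walks` are hypothetical names made up for this example.

```python
import random
from typing import Dict, List, Tuple

# Toy adjacency list: subject -> [(predicate, object), ...]
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "fb:m.0100zv6s": [
        ("fb:people.person.nationality", "fb:m.03_r3"),
        ("fb:common.topic.notable_types", "fb:m.0kpv1_"),
    ],
}


def random_walks(
    root: str, depth: int, n_walks: int, rng: random.Random
) -> List[Tuple[str, ...]]:
    """Extract up to n_walks distinct walks of at most `depth` hops from root."""
    walks = set()
    for _ in range(n_walks):
        walk = [root]
        node = root
        for _ in range(depth):
            hops = GRAPH.get(node, [])
            if not hops:
                break  # no outgoing edges: the walk stops here
            predicate, node = rng.choice(hops)
            walk.extend([predicate, node])
        walks.add(tuple(walk))
    return sorted(walks)


rng = random.Random(42)
print(random_walks("hello", 4, 10, rng))  # [('hello',)]
known = random_walks("fb:m.0100zv6s", 4, 10, rng)
```

An entity that is absent from the graph has no outgoing hops, so its only walk is the one-element tuple containing its own name. That token is still a valid "sentence" for Word2Vec, which is why an embedding comes back for it.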

GillesVandewiele commented 3 years ago

Hi, I think it would be an interesting addition to have this check for remote KGs as well. The queries (ASK queries) should not be too expensive AFAIK.
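
A rough sketch of what such an ASK check could look like, using only the Python standard library. The function names are hypothetical, the query shape (entity as subject of at least one triple) is an assumption rather than the query pyRDF2Vec ends up using, and the endpoint is assumed to accept GET requests with a `query` parameter and to return SPARQL JSON results.

```python
import json
import urllib.parse
import urllib.request


def build_ask_query(entity: str) -> str:
    """ASK whether the entity appears as the subject of at least one triple."""
    return f"ASK WHERE {{ <{entity}> ?p ?o . }}"


def entity_exists(endpoint: str, entity: str, timeout: float = 10.0) -> bool:
    """Send the ASK query to a SPARQL endpoint and read the boolean answer."""
    params = urllib.parse.urlencode({"query": build_ask_query(entity)})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # ASK results in the SPARQL JSON format carry a top-level "boolean" key.
        return bool(json.load(response)["boolean"])


print(build_ask_query("http://example.org/hello"))
# ASK WHERE { <http://example.org/hello> ?p ?o . }
```

Each check costs one round trip per entity, which is the performance trade-off raised above. Note that this query shape would also report False for an entity that only ever occurs as an object.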

rememberYou commented 3 years ago

@dinani65 The issue has been fixed with this commit on the master branch: https://github.com/IBCNServices/pyRDF2Vec/commit/f53b46ce6060930e8e769b11457d544509aeb110

You can clone the repository to take advantage of this. I also added a skip_verify attribute (defaults to False) to the KG class to skip this check.