IBCNServices / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

Order of rdf triples embeddings #63

Closed maya-25 closed 2 years ago

maya-25 commented 2 years ago

❓ Question

I have generated embeddings for RDF triple URIs (from DBpedia) using pyRDF2Vec. When I pass list(set(entities)) to transformer.fit_transform(), I am not sure in what order the transformer returns the embeddings. Will this order affect the results when I concatenate the RDF embeddings with sentence context embeddings while training the model?

Code:

def rdftriplestovec(filepath, entities):
    kg = KG(filepath)
    transformer = RDF2VecTransformer(walkers=[RandomWalker(3, None)],
                                     embedder=Word2Vec(size=500))
    entities_names = [entity.name for entity in kg._entities]
    filtered_entities = [e for e in entities if e in entities_names]
    not_found = set(entities) - set(filtered_entities)
    print('entities could not be found in the KG! Removing them')
    entities = list(set(filtered_entities))
    embeddings = transformer.fit_transform(kg, entities)
    print(embeddings)
    return embeddings

Sample of the RDF triples in the .ttl file (the predicate is of owl type), passed as filepath to the rdftriplestovec function:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://dbpedia.org/resource/AT&T> owl:Ontology <http://dbpedia.org/resource/Espionage>, <http://dbpedia.org/resource/Police> .

<http://dbpedia.org/resource/Actor> owl:Ontology <http://dbpedia.org/resource/Major>, <http://dbpedia.org/resource/Plea>, <http://dbpedia.org/resource/United_States> .

<http://dbpedia.org/resource/Actor_model> owl:Ontology <http://dbpedia.org/resource/Visibility> .

<http://dbpedia.org/resource/Advertising> owl:Ontology <http://dbpedia.org/resource/Indian_Americans> .

<http://dbpedia.org/resource/Afghan_National_Army> owl:Ontology <http://dbpedia.org/resource/Enemy> .

<http://dbpedia.org/resource/Ago,_Mie> owl:Ontology <http://dbpedia.org/resource/Haunt_(comics)>, <http://dbpedia.org/resource/Human_back>, <http://dbpedia.org/resource/Jesus>


Sample of the URI list I get from the DBpedia API for my dataset (passed as entities to the rdftriplestovec function):

['http://dbpedia.org/resource/United_States_House_of_Representatives', 'http://dbpedia.org/resource/Australian_Democrats', 'http://dbpedia.org/resource/Aide-de-camp', 'http://dbpedia.org/resource/United_Kingdom', 'http://dbpedia.org/resource/Even_language', 'http://dbpedia.org/resource/James_Comey', 'http://dbpedia.org/resource/Letter_(message)', 'http://dbpedia.org/resource/Jason_Chaffetz', 'http://dbpedia.org/resource/Twitter', 'http://dbpedia.org/resource/Italian_language', 'http://dbpedia.org/resource/Robb_Flynn', 'http://dbpedia.org/resource/Hillary_Clinton', 'http://dbpedia.org/resource/Breitbart_News', 'http://dbpedia.org/resource/Truth', 'http://dbpedia.org/resource/Get_(divorce_document)', 'http://dbpedia.org/resource/Inactivated_vaccine', 'http://dbpedia.org/resource/India', 'http://dbpedia.org/resource/Single_(music)', 'http://dbpedia.org/resource/November_2017_Somalia_airstrike', 'http://dbpedia.org/resource/Identified', 'http://dbpedia.org/resource/Iranian_peoples', 'http://dbpedia.org/resource/Woman', 'http://dbpedia.org/resource/Fiction', 'http://dbpedia.org/resource/Unpublished_Story', 'http://dbpedia.org/resource/Stoning']

GillesVandewiele commented 2 years ago

The embeddings returned by fit_transform will be in the same order as the entities you pass in. However, by wrapping your entities in a set, you lose the ordering. Just pass list(entities) instead to be guaranteed the same order.
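
A minimal sketch of the ordering point, in plain Python with an illustrative entity list (the URIs are only examples, and nothing here calls pyRDF2Vec): `set()` makes no ordering promise, while `dict.fromkeys` removes duplicates and keeps the first-seen order.

```python
# Illustrative entity list with one duplicate.
entities = [
    "http://dbpedia.org/resource/Twitter",
    "http://dbpedia.org/resource/India",
    "http://dbpedia.org/resource/Twitter",
    "http://dbpedia.org/resource/Woman",
]

# set() removes duplicates, but its iteration order is arbitrary for
# strings and can differ between interpreter runs (hash randomization).
unordered = list(set(entities))

# dict keys keep insertion order (guaranteed from Python 3.7), so this
# deduplicates while leaving the first occurrence of each entity in place.
ordered = list(dict.fromkeys(entities))

print(ordered)
```

Passing `ordered` rather than `unordered` keeps row `i` of the returned embeddings tied to a position you can predict.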

maya-25 commented 2 years ago

I tried passing list(entities), or just entities (it is already a list, namely the filtered entities). In both cases it throws the error below.

It works only when passing list(set(entities)).

IndexError Traceback (most recent call last)
in ()
----> 1 rdf2vec_test2=rdftriplestovec("file_final_test.ttl",URI_list_test)

3 frames
in rdftriplestovec(filepath, entities)
     20 print(entities)
     21 # Get our embeddings
---> 22 embeddings = transformer.fit_transform(kg, entities)
     23 #print(embeddings)
     24 return embeddings

/usr/local/lib/python3.7/dist-packages/pyrdf2vec/rdf2vec.py in fit_transform(self, kg, entities, is_update)
    141 """
    142 self._is_extract_walks_literals = True
--> 143 self.fit(self.get_walks(kg, entities), is_update)
    144 return self.transform(kg, entities)
    145

/usr/local/lib/python3.7/dist-packages/pyrdf2vec/rdf2vec.py in get_walks(self, kg, entities)
    180
    181 self._update(self._entities, entities)
--> 182 self._update(self._walks, walks)
    183
    184 if self.verbose >= 1:

/usr/local/lib/python3.7/dist-packages/pyrdf2vec/rdf2vec.py in _update(self, attr, values)
    266 tmp = values
    267 for i, pos in enumerate(self._pos_entities):
--> 268 attr[pos] = tmp.pop(self._pos_walks[i])
    269 attr += tmp
    270

IndexError: list assignment index out of range

GillesVandewiele commented 2 years ago

I am a bit clueless as to what the error could be. Your triples.ttl seems extremely small to start, and many of the entities in your list are not even present in your turtle file? Could you perhaps share more complete data? Also, do the example scripts run for you?

maya-25 commented 2 years ago

Please find: URI_list.csv rdf_triples.txt

I pass rdf_triples.ttl as the file path (attached here as .txt because .ttl is not supported; please rename it to .ttl) and URI_list.csv as the entities.

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

def rdftriplestovec(filepath,entities):
    kg = KG(filepath)
    print(kg._entities)
    #transformer = RDF2VecTransformer()
    transformer = RDF2VecTransformer(walkers=[RandomWalker(3, None)], 
                                 embedder=Word2Vec(size=500))
    entities_names=[entity.name for entity in kg._entities]
    filtered_entities = [e for e in entities if e in entities_names]
    print(filtered_entities)
    not_found = set(entities) - set(filtered_entities)
    print(f'{len(not_found)} entities not found in the KG!')
    entities = list(set(filtered_entities))
    print(entities)
    # Get embeddings
    embeddings = transformer.fit_transform(kg, entities)
    return embeddings

GillesVandewiele commented 2 years ago

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker
import pandas as pd

def rdftriplestovec(filepath, entities):
    kg = KG(filepath, fmt='turtle')
    # Sets will change the order, go see this for yourself in a shell or notebook
    entities = list(set([x.name for x in kg._entities]).intersection(entities))
    transformer = RDF2VecTransformer(walkers=[RandomWalker(3, None)], 
                                 embedder=Word2Vec(vector_size=500))
    embeddings = transformer.fit_transform(kg, entities)
    return embeddings

entities = list(pd.read_csv('URI_list.csv')['0'].values)
rdftriplestovec('rdf_triples.ttl', entities)

Works like a charm. But as I warned, if you use set() in python, the order will change! Try to avoid it (which I am not doing here), or store the result of converting to set() so that you can reconstruct the order.
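
One way to avoid the reordering altogether is to use a set only for membership tests and build the result list by walking the input in order. The sketch below is plain Python; `filter_in_order` is a hypothetical helper, not part of pyRDF2Vec, and the short `urn:` names are stand-ins for the names that would come from `[x.name for x in kg._entities]` in the script above.

```python
def filter_in_order(entities, kg_entity_names):
    """Keep entities that exist in the KG, in input order, without duplicates."""
    known = set(kg_entity_names)  # set used only for fast membership tests
    seen = set()
    kept = []
    for entity in entities:
        if entity in known and entity not in seen:
            seen.add(entity)
            kept.append(entity)
    return kept


# Stand-in data (hypothetical names, not taken from the attached files).
kg_names = ["urn:b", "urn:a", "urn:c"]
requested = ["urn:c", "urn:x", "urn:a", "urn:c"]

kept = filter_in_order(requested, kg_names)
print(kept)  # ['urn:c', 'urn:a']
```

The list `kept` can then be passed to `fit_transform` and reused later as the index for the returned embeddings.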

maya-25 commented 2 years ago

Thanks a lot, it works! Just one question to clarify: does it always return the embeddings as a tuple instead of an ndarray?

GillesVandewiele commented 2 years ago

Yes, it returns two things: embeddings and literals. Both will be numpy arrays.
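
A minimal sketch of how the returned pair might be consumed, assuming (as stated in this thread) that `fit_transform` returns a 2-tuple of embeddings and literals, with one embedding per entity in input order. `embeddings_by_entity` is a hypothetical helper, and `fake_result` is a hand-written stand-in for the real return value so the example runs without pyRDF2Vec.

```python
def embeddings_by_entity(entities, fit_transform_result):
    """Unpack (embeddings, literals) and map each entity to its vector."""
    embeddings, _literals = fit_transform_result
    if len(embeddings) != len(entities):
        raise ValueError("expected one embedding per entity")
    # zip relies on embeddings[i] belonging to entities[i].
    return dict(zip(entities, embeddings))


# Stand-in for: transformer.fit_transform(kg, entities)
entities = ["urn:a", "urn:b"]
fake_result = ([[0.1, 0.2], [0.3, 0.4]], [[], []])

lookup = embeddings_by_entity(entities, fake_result)

# Concatenate a (made-up) sentence vector with the vector of one entity.
sentence_vector = [1.0, 2.0]
combined = sentence_vector + list(lookup["urn:b"])
print(combined)  # [1.0, 2.0, 0.3, 0.4]
```

Looking vectors up by entity name, instead of by position, makes the later concatenation with sentence embeddings independent of whatever order the entity list ended up in.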