Open heindorf opened 4 years ago
Possible implementation:
import urllib.parse

import rdflib
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrame2Graph(BaseEstimator, TransformerMixin):
    def __init__(self, base_url='http://example.com/'):
        self.base_url = base_url

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        # Cast all cell values to strings so they can be percent-encoded.
        df = df.astype(str)
        graph = rdflib.Graph()
        for subject, row in df.iterrows():
            # astype(str) does not convert the index, so cast it explicitly.
            subject = urllib.parse.quote(str(subject))
            subject = rdflib.term.URIRef(self.base_url + subject)
            # Series.items() replaces iteritems(), which pandas 2.0 removed.
            for predicate, obj in row.items():
                predicate = urllib.parse.quote(str(predicate))
                obj = urllib.parse.quote(obj)
                predicate = rdflib.term.URIRef(self.base_url + predicate)
                obj = rdflib.term.URIRef(self.base_url + obj)
                graph.add((subject, predicate, obj))
        return graph
We omit rdflib for scalability reasons. Please try your suggested implementation on the provided datasets; you will see that it takes ages :)
A class similar to the one suggested above is now implemented here, although creating the KG with rdflib appears to cost more than writing RDF N-Triples from scratch, i.e. avoiding `graph.add()`. We used the `Graph` class of rdflib so that the invalid N-Triples issue does not occur again (see https://github.com/dice-group/Vectograph/issues/6).
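For comparison, the "from scratch" approach mentioned above could look roughly like the following. This is only a minimal sketch, not the code used in the repository: the names `iri` and `write_ntriples` are hypothetical, the base URL is the placeholder from the suggested class, and every value is treated as an IRI as in that class. It takes `(subject, row)` pairs where each row offers `.items()`, so plain dicts work here; pairs from `df.iterrows()` should fit the same shape.

```python
import io
from urllib.parse import quote

BASE_URL = "http://example.com/"  # placeholder, same as in the suggested class


def iri(base_url, value):
    """Percent-encode a value and wrap it as an N-Triples IRI term."""
    # safe="" also encodes "/", so the value cannot add path segments.
    return "<" + base_url + quote(str(value), safe="") + ">"


def write_ntriples(rows, out, base_url=BASE_URL):
    """Write one triple per (subject, predicate, object) cell to `out`.

    `rows` is an iterable of (subject, mapping) pairs, where the mapping
    goes from predicate to object. Returns the number of triples written.
    """
    count = 0
    for subject, row in rows:
        s = iri(base_url, subject)
        for predicate, obj in row.items():
            p = iri(base_url, predicate)
            o = iri(base_url, obj)
            # An N-Triples statement is three terms followed by " ."
            out.write(f"{s} {p} {o} .\n")
            count += 1
    return count


rows = [("row 1", {"color": "red", "size": 3})]
buffer = io.StringIO()
n_written = write_ntriples(rows, buffer)
```

Because each line is streamed to `out`, nothing but the current row has to stay in memory; whether this is actually faster than `graph.add()` on the provided datasets would need to be measured.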
FYI:
The transform method still returns the path of the serialised graph. This differs from the workflow in sklearn, as sklearn is concerned neither with graphs nor with graph serialisation. Moreover, the path of the KG is sent to PYKE, so we do not need to keep the graph in memory.
Passing a logger as an argument might indeed be bad practice. However, it is better than not having a logging module, right? Moreover, your suggestion is not clear to me, as we already use the logging module. Consequently, no changes were made regarding logging; since the project started, we have not experienced any negative outcome from this style.
Please close this issue, @heindorf, if these answers are satisfactory.
Currently, `KGCreator` is initialized with `__init__(self, path, logger=None)` and the `transform` method returns a file path. I would suggest omitting both parameters:
- The `transform` method should not return a path, but the actual data (to make the class `KGCreator` more independent of the file system and because `sklearn`'s method returns the actual data instead of a file path). I would expect the return value of the `transform` method to be an rdflib `Graph`.
- Instead of passing a logger as a parameter, use the `logging` module, e.g., via `logging.getLogger(...)`.
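To make the logging suggestion more concrete, here is a minimal sketch. `KGCreatorSketch` is a hypothetical stand-in for `KGCreator`, and its `transform` body is only a placeholder; the point is that the class obtains a module-level logger via `logging.getLogger(__name__)` instead of receiving one through `__init__`.

```python
import logging

# Module-level logger: handlers and levels are configured by the
# application, not by the class that emits the messages.
logger = logging.getLogger(__name__)


class KGCreatorSketch:
    """Hypothetical stand-in for KGCreator without a logger parameter."""

    def transform(self, rows):
        # Placeholder body: a real implementation would build the graph here.
        logger.info("Transforming %d rows", len(rows))
        return rows
```

A caller that wants the messages can then configure logging once, e.g. with `logging.basicConfig(level=logging.INFO)`, without touching the class.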