dice-group / vectograph

GNU General Public License v3.0

KGCreator Interface #7

Open heindorf opened 4 years ago

heindorf commented 4 years ago

Currently, KGCreator is initialized with __init__(self, path, logger=None) and the transform method returns a file path.

I would suggest omitting both parameters:

heindorf commented 4 years ago

Possible implementation:

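import urllib.parse

import rdflib
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrame2Graph(BaseEstimator, TransformerMixin):
    def __init__(self, base_url='http://example.com/'):
        self.base_url = base_url

    def fit(self, x, y=None):
        # Stateless transformer: there is nothing to learn.
        return self

    def transform(self, df):
        # Cast every cell to str so that it can be percent-encoded.
        df = df.astype(str)
        graph = rdflib.Graph()

        for subject, row in df.iterrows():
            # Index labels may be non-strings (e.g. ints), so cast before quoting.
            subject = urllib.parse.quote(str(subject))
            subject = rdflib.term.URIRef(self.base_url + subject)
            # Series.iteritems() was removed in pandas 2.0; items() is equivalent.
            for predicate, obj in row.items():
                predicate = urllib.parse.quote(str(predicate))
                obj = urllib.parse.quote(obj)

                predicate = rdflib.term.URIRef(self.base_url + predicate)
                obj = rdflib.term.URIRef(self.base_url + obj)

                graph.add((subject, predicate, obj))

        return graph

Demirrr commented 4 years ago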

We omitted rdflib for scalability reasons. Please try your suggested implementation on the provided datasets. You will see that it takes ages :)

Demirrr commented 3 years ago

A class similar to the one suggested above is now implemented here, although creating the KG with rdflib appears to be more costly than writing RDF N-Triples from scratch, i.e. avoiding graph.add(). We used rdflib's Graph class so that the invalid N-Triples issue does not occur again (see https://github.com/dice-group/Vectograph/issues/6).
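For comparison, writing N-Triples "from scratch" could look roughly like the sketch below. This is only an illustration, not the code used in Vectograph: the names iter_ntriples and write_ntriples, the (subject, {predicate: value}) row layout, and the choice of safe="" for quoting are all assumptions made for the example. It streams one line per cell and never builds a graph in memory, which is the trade-off discussed above; the price is that nothing validates the output the way rdflib's Graph does.

```python
import urllib.parse


def iter_ntriples(rows, base_url="http://example.com/"):
    """Yield one N-Triples line per (subject, predicate, value) cell.

    `rows` is an iterable of (subject, {predicate: value}) pairs
    (a hypothetical layout chosen for this sketch).
    """
    def iri(value):
        # Percent-encode the term; safe="" also encodes "/", so a value
        # cannot change the path of the resulting IRI (an assumption of
        # this sketch, the class above keeps the default safe="/").
        return "<" + base_url + urllib.parse.quote(str(value), safe="") + ">"

    for subject, cells in rows:
        s = iri(subject)
        for predicate, value in cells.items():
            yield f"{s} {iri(predicate)} {iri(value)} .\n"


def write_ntriples(rows, path, base_url="http://example.com/"):
    """Stream the triples to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(iter_ntriples(rows, base_url))
    return path
```

With a pandas DataFrame, the rows could be supplied as ((idx, row.to_dict()) for idx, row in df.iterrows()), although that still pays the per-row cost of iterrows().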

FYI:

  1. The transform method still returns the path of the serialised graph. This differs from the sklearn workflow, as sklearn is not concerned with graphs or with graph serialisation. Moreover, the path of the KG is sent to PYKE, so we do not need to keep the graph in memory.

  2. Passing a logger as an argument might indeed be bad practice. However, it is better than not having logging at all, right? Moreover, your suggestion is not clear to me, as we already use the logging module. Consequently, no changes were made regarding logging: since the project started, we have not experienced any negative outcome from this style.
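To make the two points above concrete, here is a minimal sketch of a transformer that returns the path of the serialised file and logs through a module-level logger instead of a logger argument. It is not the project's KGCreator and not necessarily what was proposed earlier in this thread: the class name KGCreatorSketch and the assumption that transform receives ready-made N-Triples lines are hypothetical.

```python
import logging

# Module-level logger: the application configures handlers and levels once,
# so no logger has to be passed through __init__.
logger = logging.getLogger(__name__)


class KGCreatorSketch:
    """Hypothetical stand-in for KGCreator, for illustration only."""

    def __init__(self, path):
        self.path = path

    def fit(self, x, y=None):
        # Nothing to learn; present only to mirror the sklearn interface.
        return self

    def transform(self, triples):
        # Write the lines to disk and return the file path, so the
        # graph does not have to be kept in memory.
        count = 0
        with open(self.path, "w", encoding="utf-8") as handle:
            for line in triples:
                handle.write(line)
                count += 1
        logger.info("Wrote %d triples to %s", count, self.path)
        return self.path
```

A caller that wants the log output would call logging.basicConfig(level=logging.INFO) once at start-up; without any configuration, the info message is simply not emitted.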

@heindorf, please close this issue if these answers are satisfactory.