DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License
574 stars 65 forks source link

Refactor KnowledgeGraph class #237

Closed Mec-iS closed 2 years ago

Mec-iS commented 2 years ago

Currently KnowledgeGraph class is implemented in ~1000 lines of code. To improve design, readability and accessibility we should consider refactoring. I would suggest considering composition by grouping the different methods in components classes like:

class Loaders:
    ...
    def load_rdf(...):
        ...
    def load_jsonld(...):
       ...

class KnowledgeGraph:
    def __init__(...):
        self.loaders = Loaders()
    ...

So that the methods can be moved to different files like kglab/loaders.py and have a clearer layout of the code.

After refactoring the calls would look like:

import kglab

kg = KnowledgeGraph()
kg.loaders.load_rdf(...)
Mec-iS commented 2 years ago

see tests modules in #238 for a proposal how to break-down the class into: "loaders", "savers", "skos_owl", "utils", "querying"

tomaarsen commented 2 years ago

Currently the KnowledgeGraph class implements the following methods:

I agree with @Mec-iS that this can be refactored somewhat. However, I don't see the benefit in grouping e.g. load_rdf and load_jsonld into the same Loaders class. Personally, I would go for something along the lines of:

### In kglab/io/base.py
from abc import ABC, abstractmethod

class IO(ABC):
    @classmethod
    @abstractmethod
    def load(self, *args, **kwargs) -> "KnowledgeGraph":
        pass

    @classmethod
    @abstractmethod
    def save(self, kg, *args, **kwargs) -> None:
        pass
### In kglab/io/rdf.py
class RDF_IO(IO):
    @classmethod
    def load (
        path: IOPathLike,
        *,
        format: str = "ttl",
        base: str = None,
        **args: typing.Any,
        ) -> "KnowledgeGraph":
        ...

    @classmethod
    def save (
        kg,
        path: IOPathLike,
        *,
        format: str = "ttl",
        base: str = None,
        encoding: str = "utf-8",
        **args: typing.Any,
        ) -> None:
### In kglab/io/jsonld.py
class JSON_LD_IO(IO):
    @classmethod
    def load (
        path: IOPathLike,
        *,
        encoding: str = "utf-8",
        **args: typing.Any,
        ) -> "KnowledgeGraph":
        ...

    @classmethod
    def save (
        kg,
        path: IOPathLike,
        *,
        encoding: str = "utf-8",
        **args: typing.Any,
        ) -> None:
...
### In kglab/kglab.py
from kglab.io import RDF_IO, JSON_LD_IO

class KnowledgeGraph:
    def __init__(self, ...):
        ...

    load_rdf = RDF_IO.load
    loadf_jsonld = JSON_LD_IO.load
    ...
    save_rdf = RDF_IO.save
    save_jsonld = JSON_LD_IO.save
    ...

Then you can easily open up this work to open-source developers, as you can make it very clear that all it takes for someone to implement a new datatype is to implement the two class methods in the IO abstract class. This means the developer doesn't need to spit through and understand the 1500 lines in kglab.py to understand what to add/edit in order to get a new datatype implemented. Beyond that, the io classes are separated in a folder specifically for reading and writing, and we don't have the the issue that kglab.py is completetly filled with docstrings for these read/write methods.

However, if Lorenzo's method is preferred, then I would at least add backwards compatibility so that kg.load_rdf(...) is still possible, and simply becomes kg.loaders.load_rdf() under the hood.

Beyond that, I would consider that moving the implementation of a method elsewhere does not necessarily make the file more readable. For example, the SPARQL query methods make up ~140 lines or so, but only ~30 or so are the implementation. The rest is newlines, the function signature and the docstrings. If you only move the implementation (i.e. move it to another file, and import some function so the new implementation is only one line in kglab.py), then you won't actually gain any readability. Plus, you will have to duplicate the signature and docstring in that case. However, you can use the "hack" from my code snippets above to actually gain readability, as that also moves the signature and docstring as well.

ceteri commented 2 years ago

@Mec-iS, @tomaarsen – this is really excellent. I've been worried about that module growing and growing...

A few points for consideration based on what's coming up:

  1. We're preparing for a 1.0.x release and need to keep backwards compatibility until then at least
    • Let's pursue the test structure that Lorenzo has started, using what Tom suggested above leveraging @abstractmethod then assigning to the base class
  2. We shouldn't divide these methods based on functionality; instead based on integration
    • to Tom's point, this way developers who come from any particular community don't have to "hunt" (like in, say, RDFlib) to find relevant integration points
    • in lieu of loaders.* and storers.* subpaths, instead have an io subpath with modules w3c, parquet, jsonld
      1. The parquet support will become a focal point for integration work in 1.0.x, so those two methods should be handled very gently :)
    • I'm under NDA at NVIDIA plus some other companies involved, while our discussions here are public, so it's probably best to just preview by saying we'll soon have much tighter integration with Dask Distributed, Apache Arrow, NumPy / cuNumeric, and NetworkX / cuGraph
  3. We'll soon have a "dual" mode for support of W3C technologies (e.g., RDF, SPARQL, SHACL, SKOS, RML, etc.) and for labeled property graphs (e.g., openCypher, APOC, etc.)
    • much of this involves a newgraph.py module which is an RDFlib Storage plugin (Lorenzo: pure Py this time, not Cython)
  4. PSL is about ready to become much more closely coupled with RDF predicates and Cypher relations

We also need to approach refactoring both subg and topo modules in a similar way.

ceteri commented 2 years ago

Here's how I propose to divide these public methods into subpaths:

kglab core:

io serialization (each abstract class has load and store):

import practices (which are either semantic data integration or weird, special cases – definitely, let's keep these out of io):

query:

w3c standards:

plus:

tomaarsen commented 2 years ago

What great detail! I love it. I'm a tad busy, but if I have some time then I'll see if there's some room for my help - whether with commits or reviews.

ceteri commented 2 years ago

@tomaarsen and @Mec-iS, one quick question:

In the @abstractmethod examples above these are @classmethod definitions. What's a good approach for method which will require instance member, e.g., access to kglib.KnowledgeGraph._g and others?

I've started on the query portion to see how this would look, and the WIP is in the base_class_refactor branch:

Mec-iS commented 2 years ago

sorry I missed this comment @ceteri. Great work as usual!

Do you have ideas on how to carry this on? Are we done with this or what is the next milestone?

Mec-iS commented 2 years ago

Use of Mixins #262

Mec-iS commented 2 years ago

263

Mec-iS commented 2 years ago

265 and we should be ok for now.