Closed Mec-iS closed 2 years ago
See the test modules in #238 for a proposal on how to break down the class into: "loaders", "savers", "skos_owl", "utils", "querying"
Currently the KnowledgeGraph class implements the following methods:
I agree with @Mec-iS that this can be refactored somewhat. However, I don't see the benefit in grouping e.g. load_rdf and load_jsonld into the same Loaders class. Personally, I would go for something along the lines of:
### In kglab/io/base.py
from abc import ABC, abstractmethod

class IO(ABC):
    @classmethod
    @abstractmethod
    def load(cls, *args, **kwargs) -> "KnowledgeGraph":
        pass

    @classmethod
    @abstractmethod
    def save(cls, kg, *args, **kwargs) -> None:
        pass

### In kglab/io/rdf.py
import typing

from kglab.io.base import IO

class RDF_IO(IO):
    @classmethod
    def load (
        cls,
        path: IOPathLike,
        *,
        format: str = "ttl",
        base: typing.Optional[str] = None,
        **args: typing.Any,
    ) -> "KnowledgeGraph":
        ...

    @classmethod
    def save (
        cls,
        kg,
        path: IOPathLike,
        *,
        format: str = "ttl",
        base: typing.Optional[str] = None,
        encoding: str = "utf-8",
        **args: typing.Any,
    ) -> None:
        ...

### In kglab/io/jsonld.py
import typing

from kglab.io.base import IO

class JSON_LD_IO(IO):
    @classmethod
    def load (
        cls,
        path: IOPathLike,
        *,
        encoding: str = "utf-8",
        **args: typing.Any,
    ) -> "KnowledgeGraph":
        ...

    @classmethod
    def save (
        cls,
        kg,
        path: IOPathLike,
        *,
        encoding: str = "utf-8",
        **args: typing.Any,
    ) -> None:
        ...

### In kglab/kglab.py
from kglab.io import RDF_IO, JSON_LD_IO

class KnowledgeGraph:
    def __init__(self, ...):
        ...

    load_rdf = RDF_IO.load
    load_jsonld = JSON_LD_IO.load
    ...
    save_rdf = RDF_IO.save
    save_jsonld = JSON_LD_IO.save
    ...

Then you can easily open up this work to open-source developers, as you can make it very clear that all it takes for someone to implement a new datatype is to implement the two class methods in the IO abstract class. This means the developer doesn't need to sift through and understand the 1500 lines in kglab.py to work out what to add or edit in order to get a new datatype implemented.
Beyond that, the io classes are separated in a folder specifically for reading and writing, and we don't have the issue that kglab.py is completely filled with docstrings for these read/write methods.
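To make the "two class methods" claim concrete, here is a minimal sketch of what adding a format could look like under the IO abstract class proposed above. CSV_IO, the stand-in KnowledgeGraph with a plain list of triples, and the stream-based signatures are all assumptions for illustration; none of them are part of kglab.

```python
import csv
import io
import typing
from abc import ABC, abstractmethod


class KnowledgeGraph:
    """Hypothetical stand-in for kglab.KnowledgeGraph: triples in a list."""

    def __init__(self) -> None:
        self.triples: typing.List[typing.Tuple[str, str, str]] = []


class IO(ABC):
    """The abstract interface from the proposal above."""

    @classmethod
    @abstractmethod
    def load(cls, *args, **kwargs) -> KnowledgeGraph:
        ...

    @classmethod
    @abstractmethod
    def save(cls, kg, *args, **kwargs) -> None:
        ...


class CSV_IO(IO):
    """Hypothetical new format: one (subject, predicate, object) per row."""

    @classmethod
    def load(cls, stream: typing.TextIO) -> KnowledgeGraph:
        kg = KnowledgeGraph()
        for s, p, o in csv.reader(stream):
            kg.triples.append((s, p, o))
        return kg

    @classmethod
    def save(cls, kg: KnowledgeGraph, stream: typing.TextIO) -> None:
        csv.writer(stream).writerows(kg.triples)


# Round trip through an in-memory buffer instead of a file on disk.
kg = KnowledgeGraph()
kg.triples.append(("ex:alice", "foaf:knows", "ex:bob"))
buffer = io.StringIO()
CSV_IO.save(kg, buffer)
buffer.seek(0)
restored = CSV_IO.load(buffer)
```

A contributor would only touch the new module; the abstract base makes it an error to instantiate a format class that forgets either method.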
However, if Lorenzo's method is preferred, then I would at least add backwards compatibility so that kg.load_rdf(...) is still possible, and simply becomes kg.loaders.load_rdf() under the hood.
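A rough sketch of that backwards-compatible delegation, assuming a composition layout with a loaders component. The Loaders class, its constructor, and the placeholder body that records calls instead of parsing files are hypothetical.

```python
class Loaders:
    """Hypothetical component that groups the load methods."""

    def __init__(self, kg: "KnowledgeGraph") -> None:
        self._kg = kg

    def load_rdf(self, path: str, *, format: str = "ttl") -> "KnowledgeGraph":
        # Placeholder body: record the call instead of parsing a file.
        self._kg.loaded.append((path, format))
        return self._kg


class KnowledgeGraph:
    def __init__(self) -> None:
        self.loaded = []
        self.loaders = Loaders(self)

    def load_rdf(self, *args, **kwargs) -> "KnowledgeGraph":
        """Backwards-compatible alias for kg.loaders.load_rdf()."""
        return self.loaders.load_rdf(*args, **kwargs)


kg = KnowledgeGraph()
kg.load_rdf("data.ttl")                      # old call style still works
kg.loaders.load_rdf("more.nt", format="nt")  # new call style
```

The cost of this variant is that every delegating method needs its own stub in kglab.py, which is the duplication the next paragraph argues against.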
Beyond that, I would consider that moving the implementation of a method elsewhere does not necessarily make the file more readable. For example, the SPARQL query methods make up ~140 lines or so, but only ~30 or so are the implementation. The rest is newlines, the function signature and the docstrings. If you only move the implementation (i.e. move it to another file, and import some function so the new implementation is only one line in kglab.py), then you won't actually gain any readability. Plus, you will have to duplicate the signature and docstring in that case. However, you can use the "hack" from my code snippets above to actually gain readability, as that moves the signature and docstring as well.
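The point about the signature and docstring travelling with the function can be checked with a small sketch. This is a variation on the "hack" that assigns a plain function taking self rather than a class method; sparql_query and its placeholder body are made up for illustration.

```python
import inspect


# Imagine this function lives in a separate module such as kglab/query.py.
def sparql_query(self, sparql: str, *, limit: int = 10) -> list:
    """Run a SPARQL query and return at most `limit` rows."""
    # Placeholder body: slice some canned rows instead of querying a graph.
    return self.rows[:limit]


class KnowledgeGraph:
    def __init__(self) -> None:
        self.rows = ["row-1", "row-2"]

    # One line in kglab.py; signature and docstring come along with it.
    query = sparql_query


kg = KnowledgeGraph()
result = kg.query("SELECT * WHERE { ?s ?p ?o }", limit=1)
signature = str(inspect.signature(KnowledgeGraph.query))
docstring = KnowledgeGraph.query.__doc__
```

Because a plain function assigned in the class body becomes an ordinary method, it also receives the instance as self.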
@Mec-iS, @tomaarsen – this is really excellent. I've been worried about that module growing and growing...
A few points for consideration based on what's coming up:
- [...] 1.0.x release and need to keep backwards compatibility until then at least
- [...] @abstractmethod then assigning to the base class
- [...] RDFlib) to find relevant integration points
- [...] loaders.* and storers.* subpaths, instead have an io subpath with modules w3c, parquet, jsonld
- parquet support will become a focal point for integration work in 1.0.x, so those two methods should be handled very gently :)
- [...] NumPy / cuNumeric, and NetworkX / cuGraph
- [...] openCypher, APOC, etc.)
- [...] graph.py module which is an RDFlib Storage plugin (Lorenzo: pure Py this time, not Cython)

We also need to approach refactoring both subg and topo modules in a similar way.
Here's how I propose to divide these public methods into subpaths:

kglab core:
- rdf_graph
- add_ns
- get_ns
- get_ns_dict
- describe_ns
- get_context
- encode_date
- add
- remove

io serialization (each abstract class has load and store):
- parquet
  - load_parquet
  - save_parquet
- w3c
  - load_rdf
  - load_rdf_text
  - save_rdf
  - save_rdf_text
- jsonld (TODO: can be deprecated since RDFlib absorbed rdflib-jsonld)
  - load_jsonld
  - save_jsonld

import practices (which are either semantic data integration or weird, special cases; definitely, let's keep these out of io):
- import_roam becomes roam
- external_import.import_from_neo4j becomes neo4j (TODO: refactor to move over here)
- materialize becomes rml
- load_csv becomes csvw

query:
- sparql
  - query (TODO rename as sparql_query)
  - query_as_df (TODO rename as sparql_query_as_df)
  - n3fy
  - n3fy_row
  - visualize_query

w3c standards:
- validate becomes validate_shacl (SHACL)
- infer_owlrl_closure (OWL-RL)
- infer_rdfs_closure (OWL-RL)
- infer_rdfs_properties (OWL-RL)
- infer_rdfs_classes (OWL-RL)
- infer_skos_related (SKOS)
- infer_skos_concept (SKOS)
- infer_skos_hierarchical (SKOS)
- infer_skos_transitive (SKOS)
- infer_skos_symmetric_mappings (SKOS)
- infer_skos_hierarchical_mappings (SKOS)

plus:
- subgraph
- topology

What great detail! I love it. I'm a tad busy, but if I have some time then I'll see if there's some room for my help - whether with commits or reviews.
@tomaarsen and @Mec-iS, one quick question:
In the @abstractmethod examples above these are @classmethod definitions.
What's a good approach for methods that will require instance members, e.g., access to kglab.KnowledgeGraph._g and others?
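One possible answer, sketched under assumptions: declare the abstract methods as instance methods and let KnowledgeGraph inherit the implementation as a mixin, so that self._g resolves on the instance. QueryBase, QueryMixin, and the list standing in for the graph object held in _g are hypothetical names for illustration.

```python
from abc import ABC, abstractmethod


class QueryBase(ABC):
    """Hypothetical abstract interface using instance methods."""

    @abstractmethod
    def query(self, pattern: str) -> list:
        ...


class QueryMixin(QueryBase):
    """Implementation kept in its own module; the host class supplies _g."""

    _g: list  # provided by the class that mixes this in

    def query(self, pattern: str) -> list:
        # Placeholder matching: keep triples that contain the given term.
        return [triple for triple in self._g if pattern in triple]


class KnowledgeGraph(QueryMixin):
    def __init__(self) -> None:
        # Stand-in for the graph object that kglab keeps in _g.
        self._g = [("ex:alice", "foaf:knows", "ex:bob")]


kg = KnowledgeGraph()
matches = kg.query("foaf:knows")
```

The abstract base still documents what each group of methods must provide, while the mixin keeps the code out of kglab.py without losing access to instance state.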
I've started on the query portion to see how this would look, and the WIP is in the base_class_refactor branch:
Sorry I missed this comment, @ceteri. Great work as usual!
Do you have ideas on how to carry this on? Are we done with this or what is the next milestone?
Use of Mixins #262
Currently the KnowledgeGraph class is implemented in ~1000 lines of code. To improve design, readability and accessibility we should consider refactoring. I would suggest considering composition by grouping the different methods in component classes like:
So that the methods can be moved to different files like kglab/loaders.py and have a clearer layout of the code.
After refactoring the calls would look like: