DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License

Make easy to load default datasets #269

Closed Mec-iS closed 2 years ago

Mec-iS commented 2 years ago

I'm submitting a

Current Behaviour:

It is hard to load any of the default datasets.

Expected Behaviour:

There should be a straightforward way of loading existing datasets, for example:

kg = KnowledgeGraph()
load_dataset("wikings-families", kg=kg)

Every dataset should have a name that, when passed to load_dataset, automatically imports that dataset into the given graph, as provided, for example, by the scikit-network load collection.

ceteri commented 2 years ago

That's a helpful feature. It's specific to scikit-network and should be denoted as that in the method name.

Two concerns:

  1. We need to keep our serialization methods following a similar pattern:
    • Loads get applied to a graph
    • The effects of loads are cumulative (although does this make sense for scikit-network datasets?)
  2. File locators get passed as PathLike, to allow for working consistently with non-POSIX systems, such as cloud storage buckets

Instead I would use a pattern such as:

import pathlib
from kglab import KnowledgeGraph

kg = KnowledgeGraph()
path = pathlib.Path("wikings-families")
kg.load_scikit_dataset(path)  # proposed method name
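
To make the two concerns concrete, here is a minimal sketch of a load method that is applied to a graph, accumulates across calls, and takes a PathLike locator. `TinyGraph` and `load_triples` are hypothetical stand-ins, not kglab APIs, and the whitespace-separated line format and sample lines are assumptions chosen only to keep the example self-contained.

```python
import os
import pathlib
import tempfile
import typing


class TinyGraph:
    """Hypothetical stand-in for a graph object; not part of kglab."""

    def __init__(self) -> None:
        self.triples: typing.Set[typing.Tuple[str, str, str]] = set()

    def load_triples(self, path: "os.PathLike[str]") -> "TinyGraph":
        """Add triples from `path` to this graph, keeping those already loaded."""
        # os.fspath() accepts any PathLike object, not only str
        with open(os.fspath(path), encoding="utf-8") as handle:
            for line in handle:
                parts = line.split()
                if len(parts) == 3:
                    self.triples.add((parts[0], parts[1], parts[2]))
        return self


# usage: two loads on the same graph are cumulative (made-up sample lines)
with tempfile.TemporaryDirectory() as tmp:
    first = pathlib.Path(tmp) / "first.txt"
    second = pathlib.Path(tmp) / "second.txt"
    first.write_text("gorm parentOf harald\n", encoding="utf-8")
    second.write_text("harald parentOf sweyn\n", encoding="utf-8")

    graph = TinyGraph()
    graph.load_triples(first).load_triples(second)
    triple_count = len(graph.triples)
```

Returning the graph from the load method is a design choice that lets loads be chained; whether cumulative loads suit scikit-network datasets is the open question raised above.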

BTW, this reminded me that the cloudpathlib library which our team uses elsewhere has become more general than the urlpath library which we used here in kglab, and we'll need to make that update throughout the serialization methods.

Mec-iS commented 2 years ago

> It's specific to scikit-network and should be denoted as that in the method name.

No, it is a common pattern used by popular libraries; PyTorch and TensorFlow provide it too, for example.

The idea is just to encapsulate all this logic:

from os.path import dirname
import kglab
import os

namespaces = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gorm": "http://example.org/sagas#",
    "rel":  "http://purl.org/vocab/relationship/",
    }

kg = kglab.KnowledgeGraph(
    name="Happy Vikings KG example for SKOS/OWL inference",
    namespaces=namespaces,
    )

kg.load_rdf(dirname(dirname(os.getcwd())) + "/dat/gorm.ttl")

into a method, so that the user can avoid knowing all these details.

Taking your notes into account, that could be:

kg = KnowledgeGraph()
load_dataset("wikings-families", kg=kg, path=None, title=None, namespaces=None)

So that parameters can be passed if needed.

Users will still be able to use kg.load_* explicitly if they need to. The new one is just a convenience method for newcomers to quickly load one of the default datasets for experimentation.
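
A minimal sketch of how such a convenience function might look, assuming a small registry that maps a dataset name to a file and its namespaces. The `DATASETS` registry, the `DatasetSpec` type, the `dat/gorm.ttl` entry and the `graph_factory` parameter are all hypothetical; only the `load_dataset(name, kg=..., path=..., title=..., namespaces=...)` shape comes from the proposal above. The function relies only on the graph having a `load_rdf` method, and a recording stub stands in for `kglab.KnowledgeGraph` so the example runs without kglab installed.

```python
import pathlib
import typing
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical registry entry: where a default dataset lives and how to describe it."""
    path: pathlib.Path
    title: str
    namespaces: typing.Dict[str, str] = field(default_factory=dict)


# hypothetical registry; the name, title and namespaces are taken from this thread
DATASETS: typing.Dict[str, DatasetSpec] = {
    "wikings-families": DatasetSpec(
        path=pathlib.Path("dat") / "gorm.ttl",
        title="Happy Vikings KG example for SKOS/OWL inference",
        namespaces={
            "foaf": "http://xmlns.com/foaf/0.1/",
            "gorm": "http://example.org/sagas#",
            "rel": "http://purl.org/vocab/relationship/",
        },
    ),
}


def load_dataset(name, kg=None, path=None, title=None, namespaces=None, graph_factory=None):
    """Load a named default dataset into `kg`, creating a graph if none is given."""
    try:
        spec = DATASETS[name]
    except KeyError:
        raise ValueError(f"unknown dataset {name!r}; known: {sorted(DATASETS)}") from None

    if kg is None:
        if graph_factory is None:
            raise ValueError("pass an existing graph as kg, or a graph_factory to build one")
        # assumes the factory takes name/namespaces keywords, as in the snippet above
        kg = graph_factory(name=title or spec.title, namespaces=namespaces or spec.namespaces)

    # an explicit path overrides the registry default
    kg.load_rdf(path or spec.path)
    return kg


class RecordingGraph:
    """Stub standing in for kglab.KnowledgeGraph so the sketch runs on its own."""

    def __init__(self, name="", namespaces=None):
        self.name = name
        self.namespaces = dict(namespaces or {})
        self.loaded = []

    def load_rdf(self, path):
        self.loaded.append(pathlib.Path(path))


# usage: build a graph from the registry defaults, then load again with an override
kg = load_dataset("wikings-families", graph_factory=RecordingGraph)
load_dataset("wikings-families", kg=kg, path="elsewhere/gorm.ttl")
```

In this sketch `title` and `namespaces` only take effect when a new graph is built; applying them to a graph that was passed in would need a separate decision.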

ceteri commented 2 years ago

Thank you @Mec-iS, that helps me understand much better.

I see the point of the convenience method, although arguably this is a practice that creates extra cognitive load, with PyTorch being an example cited.

For files used in our tutorials, we want to emphasize examples of how to load or save files in storage, ideally as POSIX files. The thinking is that this way there are fewer differences to overcome when people try to apply code from our examples to their own projects.

One problem we've encountered during Q&A is that there are namespaces which are difficult to understand, such as the RDF prefix namespace. Moving between different libraries (e.g., RDF vs. NetworkX) also introduces API namespaces to navigate. I'm apprehensive about adding a dataset namespace, since these datasets are only for tutorial examples and not part of the library usage in production.

FWIW, I found this exchange between the fsspec and cloudpathlib communities entertaining :) https://github.com/drivendataorg/cloudpathlib/issues/96