DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License

Integration with scipy and scikit-learn #261

Open Mec-iS opened 2 years ago

Mec-iS commented 2 years ago

One of the integrations we are going to work on is the one with scikit-learn.

This conversation is to collect requirements and features for calling scikit-learn through kglab's abstraction layer.

From my point of view, after taking a look at the APIs provided by popular data science libraries, these are the interesting scikit-learn and scipy functionalities that we could start with:

  1. Allow converting kglab's KnowledgeGraph data structures to an observations matrix (to be defined), an adjacency matrix, and a condensed distance matrix as defined by scipy. This will allow building further flows (or "pipelines", chains of function calls) that users can assemble to go from a KnowledgeGraph representation to a graph algebra representation. This is critical, as we need either to pick first principles or to provide different alternatives according to the type of graph or the tasks users may want to accomplish.
  2. After 1, let's start with an example flow in kglab for SciPy's hierarchical clustering. It would be nice to have a flow that allows simple clustering. This implies providing switches for:
    1. Linkage procedures
    2. Tree building like sklearn.cluster.ward_tree
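
To make points 1 and 2 concrete, here is a minimal sketch of such a flow using only numpy and scipy. Everything kglab-specific is an assumption: the `TRIPLES` list is a hypothetical stand-in for data pulled out of a KnowledgeGraph, `to_adjacency` and `cluster_nodes` are illustrative names rather than existing kglab methods, and the choice of observations matrix (each node's neighbourhood as a boolean row) and of Jaccard distance is just one possible option, since the observations matrix is still "to be defined". The `method` argument plays the role of the linkage switch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical (subject, predicate, object) triples standing in for the
# contents of a KnowledgeGraph: two triangles joined by a single edge (c-d).
TRIPLES = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "knows", "c"),
    ("d", "knows", "e"),
    ("e", "knows", "f"),
    ("d", "knows", "f"),
    ("c", "knows", "d"),
]


def to_adjacency(triples):
    """Return (sorted node labels, symmetric 0/1 adjacency matrix)."""
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    index = {node: i for i, node in enumerate(nodes)}
    adj = np.zeros((len(nodes), len(nodes)), dtype=int)
    for s, _, o in triples:
        # Edge direction and predicate are ignored in this sketch.
        adj[index[s], index[o]] = 1
        adj[index[o], index[s]] = 1
    return nodes, adj


def cluster_nodes(adj, method="average", n_clusters=2):
    """Cluster nodes hierarchically; `method` is the linkage switch."""
    # Observations matrix (an assumption): one boolean row per node,
    # marking the node itself and its direct neighbours.
    obs = (adj + np.eye(len(adj), dtype=int)) > 0
    # pdist returns the condensed distance matrix that scipy's linkage expects.
    condensed = pdist(obs, metric="jaccard")
    tree = linkage(condensed, method=method)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return condensed, labels


nodes, adj = to_adjacency(TRIPLES)
condensed, labels = cluster_nodes(adj)
print(dict(zip(nodes, labels)))
```

On this toy graph the two triangles end up in separate clusters. Swapping `method` for `"single"` or `"complete"` changes the linkage procedure without touching the rest of the flow, which is the kind of switch described above.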

Other possible examples:

These are listed in no particular order for now. It will take some time to figure out which principles to import from scikit-learn and scipy in order to build proper user flows from knowledge graphs as represented in RDF/kglab to graph algebra representations.

Please provide feedback and suggestions. I will create a GitHub project around this effort.

cc: @tomaarsen @SultanOrazbayev

ceteri commented 2 years ago

Wonderful! This is super helpful. The nearest neighbor parts would have some immediate use cases.

BTW, there's already the SubgraphMatrix class in subg.py which handles the transform/inverse_transform from an RDF graph to:

Mec-iS commented 2 years ago

We probably want some methods that return a numpy.array. I will reuse what is already there for sure.

Mec-iS commented 2 years ago

@SultanOrazbayev mentioned the importance of having a descriptive summary of general metrics about a graph, something like pandas.DataFrame.describe(). These are the metrics that could be useful in a hypothetical SubgraphMatrix.describe():
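
As a rough illustration of what such a summary could look like, here is a small sketch working from an adjacency matrix. The function name `describe_graph` and the particular metrics it reports (node and edge counts, density, degree statistics, connected components) are assumptions chosen for the example, not an agreed list and not an existing kglab API. It assumes an undirected graph without self-loops.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def describe_graph(adj):
    """Summarise an undirected, loop-free graph given as a 0/1 adjacency matrix.

    Hypothetical helper: the metrics below are illustrative assumptions.
    """
    adj = np.asarray(adj)
    n_nodes = adj.shape[0]
    degrees = adj.sum(axis=1)
    # Each undirected edge is counted twice in the degree sum.
    n_edges = int(degrees.sum()) // 2
    possible_edges = n_nodes * (n_nodes - 1) / 2
    n_components, _ = connected_components(csr_matrix(adj), directed=False)
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "density": n_edges / possible_edges if possible_edges else 0.0,
        "mean_degree": float(degrees.mean()),
        "max_degree": int(degrees.max()),
        "components": int(n_components),
    }


# A path of three nodes plus one isolated node.
example_adj = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ]
)
summary = describe_graph(example_adj)
print(summary)
```

Returning a plain dict keeps the sketch dependency-light; wrapping the result in a pandas Series would make it look closer to the pandas output.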

tomaarsen commented 2 years ago

Agreed, sometimes it's hard to actually understand what kind of graph you're using.