agoose77 / jupyterlab-knowledge-graph

Knowledge graph for JupyterLab
BSD 3-Clause "New" or "Revised" License

Use graph database #7

Open agoose77 opened 3 years ago

agoose77 commented 3 years ago

The current graph interface is a URI-indexed sequence of Record objects with associated links. This is not a convenient API for performing "interesting" queries. Rather than requiring the user to reconstruct these data into a graph-like structure themselves, we could adopt a graph database as our primary data model. That would further decouple the view from the model and provide richer ways to query the data. It would also mean moving the data layer into a server extension.
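To make the pain concrete, here is a hedged sketch (the `records` shape and `doc:` URIs are invented for illustration, not the extension's actual data) of the kind of traversal every consumer currently has to hand-roll over the URI-indexed records:

```python
# Hypothetical sketch: the record shape and URIs below are illustrative,
# not the extension's real API.

# Current shape: a URI-indexed mapping of records with outgoing links.
records = {
    "doc:a": {"links": ["doc:b"]},
    "doc:b": {"links": ["doc:c"]},
    "doc:c": {"links": []},
}

def reachable(uri, seen=None):
    """Even a basic query like "everything reachable from doc:a" forces
    the caller to rebuild graph traversal by hand."""
    seen = seen if seen is not None else set()
    for target in records[uri]["links"]:
        if target not in seen:
            seen.add(target)
            reachable(target, seen)
    return seen

print(sorted(reachable("doc:a")))  # ['doc:b', 'doc:c']
```

A graph database would answer this with a one-line path query instead of bespoke recursion in every caller.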

bollwyvl commented 3 years ago

Yep, getting some persistence is big.

Despite its warts, SPARQL is still my go-to "gotta-have" feature of a graph store: it leaves at least a few options open without having to re-engineer everything once someone decides they want to scale up (e.g. to a team using JupyterHub against 100s or 1000s of separate documents that change over time). SPARQL is a beast, but over on jupyrdf we've got some stuff in the pipeline to make writing said queries less terrible. Still, non-buzzword, actually-getting-stuff-done, thinking-about-thinking graph work is boring as hell and hard to "sneak" into a more "mainstream" data science/engineering problem, vs drawing graphs of graphs, which can be cross-applied to many problems.
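For readers who haven't met SPARQL: its core is matching basic graph patterns (triples containing variables) against a triple store. A toy stdlib sketch of that idea, with made-up `nb:`/`dc:` data standing in for real knowledge-graph triples:

```python
# Toy illustration of basic graph pattern matching, the core of SPARQL.
# All triples and prefixes here are invented for the example.
triples = {
    ("nb:analysis", "dc:references", "nb:cleaning"),
    ("nb:cleaning", "dc:references", "nb:raw-data"),
    ("nb:analysis", "dc:creator", "alice"),
}

def match(pattern, triples):
    """Yield variable bindings for one triple pattern.
    Variables are strings starting with '?'."""
    for triple in triples:
        binding = {}
        for part, value in zip(pattern, triple):
            if part.startswith("?"):
                if binding.get(part, value) != value:
                    break  # same variable bound to two values
                binding[part] = value
            elif part != value:
                break  # constant term doesn't match
        else:
            yield binding

# Roughly: SELECT ?doc WHERE { ?doc dc:creator "alice" }
print([b["?doc"] for b in match(("?doc", "dc:creator", "alice"), triples)])
# ['nb:analysis']
```

A real engine joins many such patterns (plus filters, paths, aggregation), which is exactly the machinery you don't want to reinvent per store.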

oxigraph just might be able to do this on the front- and backend, and is rather exciting to me. I haven't used it in anger, and it doesn't say "works in browser" on the tin... but WASM means it might be possible. I would not deploy a node-in-wasm-from-rust database... though if they figured out a deno integration...

If that doesn't work out, the lowest-pain server SPARQL store would probably be rdflib-sqlalchemy, backed by sqlite for a casual individual, or postgresql for a team (or PhD candidate). Difficulty-of-casual-deployment ramps up rapidly from there with more industrial-grade stores such as virtuoso, but then you start getting things like inference "for free," as well as more predictable performance/robustness. AWS will take your money, too.
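This is not rdflib-sqlalchemy's actual API — just a stdlib sketch of the idea it implements: triples living in an ordinary SQL table, so persistence and multi-user backends come from the database layer rather than the graph layer. The table layout and data are invented:

```python
# Hypothetical sketch of a SQL-backed triple store (the idea behind
# rdflib-sqlalchemy + sqlite), not that library's real schema or API.
import sqlite3

db = sqlite3.connect(":memory:")  # swap for a file path; postgresql for a team
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("nb:analysis", "dc:references", "nb:cleaning"),
        ("nb:cleaning", "dc:references", "nb:raw"),
    ],
)

# Triple-pattern lookups become plain, indexable SQL:
rows = db.execute(
    "SELECT o FROM triples WHERE s = ? AND p = ?",
    ("nb:analysis", "dc:references"),
).fetchall()
print(rows)  # [('nb:cleaning',)]
```

The SPARQL layer then compiles queries down to this kind of SQL, which is why swapping sqlite for postgresql is mostly a connection-string change.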

The only other thing that has tempted me away from SPARQL might be GraphBLAS, for which there are already some implementations, but it's still quite early, and more focused on shredding honking huge graphs, rather than semantic data.

If query language portability and ease-of-deployment were not concerns, there would be some very interesting offerings out there.

I personally avoid neo4j, as one must almost entirely buy into their ecosystem, which hits a brick wall, scale-wise.

bollwyvl commented 3 years ago

So in some (not-yet-pushed) code, cribbing heavily from an (unmerged) sqlite.js wrapper, i was able to wrap the oxigraph WASM in a comlink webworker, load in some triples with turtle, and query it back out with SPARQL into a datagrid (as a stand-in for a better UI) inside a labextension.

Time-to-first-query-results is of course bounded by starting the worker, downloading the WASM, and filling it with data. It has no persistence that i could find (though i'd wager we could lazily persist/restore stuff into IndexedDB).
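IndexedDB lives in the browser, so this is only a stdlib stand-in for the lazy persist/restore idea: dump the in-memory triples to a serialized blob on idle, rehydrate on worker startup so the next session starts warm. Names and triple values are invented:

```python
# Hypothetical sketch of lazy persist/restore, with json standing in
# for a browser key-value store like IndexedDB.
import json

def persist(triples):
    """Serialize the in-memory store to a blob (e.g. on an idle
    callback after each batch of updates)."""
    return json.dumps(sorted(triples))

def restore(blob):
    """Rehydrate the store on startup, before the first query."""
    return {tuple(t) for t in json.loads(blob)}

store = {("nb:a", "dc:references", "nb:b")}
blob = persist(store)
assert restore(blob) == store  # next session skips the cold load
```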

I'll also see if i can get the python version building on conda-forge, which is kind of my litmus test for "could use this for something i can recommend using," but i'm heartened by what i've seen.

bollwyvl commented 3 years ago

So I've gotten (py)oxigraph up on conda-forge... it took some doing (e.g. new versions of rust and maturin, but i digress), and i'm now checking it and its rdflib bindings out. I haven't had the chance to get back on the machine where I've got it working in the browser, but in the main, I think this might be the most reasonable path forward for the single-user-to-small-team case.