Relations - Githubissues

arrdem commented 6 years ago

At present, Shelving's API is adequate for traditional K/V store access. Records go in, IDs come out, IDs go in and records come back out again.

core.logic.pldb and traditional Prolog/Datalog style stores work by defining "facts" being "relations" between "entities". These relations fit in a (entity rel entity) pattern, and select queries can pretty easily be written in terms of selecting from such tuples with logic variables.

I've found pldb to be cumbersome in the past because I was constantly defining mappings back and forth between the relational data model and the more record structured data format I wanted to be processing in "normal" Clojure code - there was a hard edge and a significant semantic mismatch between my data storage layer and the structures I wanted to be manipulating.

The Big Idea

The original idea for Tupelov was that, using specs, data structures can automatically be translated tuples to the form of (parent-id kw-or-idx child-spec child-id) and back again. Similarly, queries on nested data structures can be mapped down to traditional logic queries, albeit with unfortunately many intermediate "parent" IDs to unify.

This would enable you to get a content hash based storage layer with logic queries on your data as you understand it basically for free. Which would be awesome.

The 0.1.0

This is a pretty meaty idea for a project, and rapidly begins to run afoul of many database implementation details such as what indices and records are loaded at any point in time, how expensive different scans are and soforth. Furthermore to serve Shelving's primary goal of backing Grimoire v3 / Stacks / cljdoc, it doesn't need anything nearly so ambitious.

With Reference to Grimoire

The current Grimoire case study defines three specs: ::package, ::entity and ::annotation all of which are multispecs.

The idea is that ::package abstracts over the many possible identifiers for a jarfile or other Clojure resource. A Maven repository is expected to be the common case and is the reference implementation.

The ::entity spec abstracts over the elements of a package, and requires that :parent be a package of some kind.

The ::annotation spec abstracts over annotations such as articles, docstrings and metadata on either entities or packages. It requires that :parent be either a package or an entity.

Without relationships, one could implement database search say for all annotations on a package by simply doing a linear scan over the annotation table by ID, deserializing each annotation record and testing for equality between the annotation and the desired parent package.

In lib-grimoire, this access pattern was somewhat acceptable because the original data model was strictly hierarchical, and scans only ever occurred over a set of records which already shared a single parent structure keeping the scan space small. This also meant that the parent was an implicit part of every persisted structure, and was not repeated. Naively, without relationships, we'd persist multiple copies of the parent structure and have to use a compressed serialization target to sort that out.

Proposed Implementation

For an initial version, relations should simply seek to provide the abstraction of generating relating indices. Implementations may choose to play tricks using relating indices under the covers, but that's beyond the scope of the intentional API.

shelving.core/shelf-rel

(shelf-rel schema from-spec conform-k to-spec)

Defines a relation on a spec in the in terms of the key in the s/conform output on from-spec where the substructure(s) conforming to to-spec are to be found. from-spec and to-spec must both be present in the schema when writes occur. Rels are uniquely identified by pairs [from-spec, to-spec].

The abstraction should be [from-id [from-spec to-spec] to-id] in terms of a triple store.

shelving.core/enumerate-rels

(enumerate-rels conn)

Returns a sequence of pairs identifying relations in shelf.

shelving.core/enumerate-rel

(enumerate-rel conn rel)

Returns a sequence of (from to) pairs according to the identified relation on the shelf.

martinklepsch commented 6 years ago

Hey @arrdem! I've been thinking about shelving yesterday a bit and so here are some thoughts primarily from a perspective of Grimoire and usage as part of cljdoc:

Observation: In contrast to current Grimoire store it seems to trade an obvious/easy to understand file system layout for a more generally usable store
Observation: As you mentioned in the README Grimoire does ~indexing via it's hierarchical storage. Shelving seems to effectively be a blob store with "buckets" for different record types so this kind of implicit index goes away for anything but the records type. This means we'll need something extra for indexing.
Question: Where do you see the main benefits over an actual database (e.g. Postgres) or other existing server-less Databases? (e.g. H2 or MapDB)
Question: If/when a schema gets updated, how does that affect existing records. Is rewriting all data required? Can they be retrieved even though their schema may be lacking a newly added required field? Migrations?

Maybe I'm a little too attached to storing everything on S3 but it seems especially the last (indexing) observation may make any remote storage somewhat impractical (even with single writer) as there is no append operation for S3 objects (If there were the index could be a log-store of sorts). Also indexes would need constant syncing since even trivial retrieval has some indirection — unlike the current Grimoire store where you can just write out the path to an object immediately.

Grimoire's model is limited but I think much of it's beauty lies in the simplicity of it's underlying storage layer (filesystem tree). If you can't use Grimoire because you're not in Clojure it's still relatively trivial to get data out of a Grimoire store. (This is probably not worth optimizing for but it's still nice.)

I hope these points make sense and don't come across as negative. I like what Shelving aspires to be but the trade-offs are certainly different to Grimoire's.

arrdem commented 6 years ago

Hey @martinklepsch! I really appreciate the feedback!

I agree that the current "trivial EDN" backend is very different from lib-grimoire's current filesystem storage layer. Some of this is by accident, and some of it is by design. The original Grimoire store was designed to optimize for me editing its elements by hand with the hope that people would contribute to that effort and that the Grimoire document store itself would be both repository and deployment artifact. The reality is that it's largely proven to be a compilation artifact produced by lein-grim that's only rarely edited by hand.

The existing lib-grimoire store is VERY tightly coupled to the existing Thing hierarchy, with the benefits that you note. Unfortunately as we've both experienced this means that, when the Thing hierarchy is semantically inadequate significant changes or significant abuse of lib-grimoire's structures are required. The ::package, ::artifact, ::annotation sketch of v2 lib-grimoire and this library are both deeply motivated by wanting to get me and my code off the critical path for you and other users who wish to interact with Grimoire stores.

It should be possible for me to specify an open data schema leveraging shared multispecs, publish those multispecs and a persistence layer that only requires them, and then let people go crazy. That's the design vision here.

Re: Amazon S3, this is an appropriate use for a storage layer like the trivial EDN backend which retains everything in-memory and does single bulk load/dump operations. It'd be nice to build backends which have finer grained semantics than that, but the API is intended to defer those decisions to implementers.

Re: Schema updates, the expectation is that "real" schema updates never occur. Schemas can be extended - new records added, but you can't do a full table migration. That'd be a new database to Rich's points about never re-using names. The intent of the open Grimoire v2 schema is again to provide a minimal basis that other things can build over by supersetting, so this isn't really a concern. Some attempt at spec migration could be supported.

Re: MapDB, MapDB would be appropriate as a storage layer implementation given what I'm seeing about it, but it still doesn't give you indices or any search capabilities. This ticket for adding such would still be relevant.

Re: H2, just using JDBC + H2 would definitely be a way to sidestep all of this. But it does incur a similar semantic mismatch to say PLDB or some other structure in that we're now dragging around a full SQL implementation + Clojure's JDBC layer probably + one of the Clojure SQL toolkits. It's a lot, and frankly I'm not that comfortable with SQL or JDBC. If this is tractable, simple and provides adequate performance I'd rather take this route.

At least this is my thinking.

arrdem / shelving

Relations #2