Pluggable persistence backends

michielbdejong commented 5 years ago

There's a problem we're running into for pluggable backends now, since we switched from BlobTree to IResourceStore. The IResourceStore interface does a lot more than just persisting data, for instance it can parse a sparql-update query, apply it to a json-ld document, and save the result as turtle. That's not something we want to reimplement over and over for every persistence technology out there.

I'm thinking the answer is probably 'decorator', but then it still doesn't make sense because the thing that gets decorated would still need to implement the IResourceStore interface, just not the IResourceStore functionality.

Is there some way we can describe the functionality we require from a persistence backend, instead of only pointing to the interface signature, and without going back to the BlobTree abstraction we rejected?

If someone wanted to use pod-server on top of, say, Amazon S3, they need to write an S3 backend, but can we give them some way to reuse all the RDF-aware stuff from the in-memory backend?

RubenVerborgh commented 5 years ago

Very good point, thanks for bringing it up. This is indeed something we want to carefully discuss.

That's not something we want to reimplement over and over for every persistence technology out there.

Definitely.

it can parse a sparql-update query, apply it to a json-ld document, and save the result as turtle

Note that the parsing of the patch is done by specific BodyParsers, which are instructed to do so by the LdpHandler. This is because LdpHandler needs to parse the patch to determine required permissions before sending it to the operation. The operation receives a parsed patch, and passes this to the store. So the store never needs to worry about parsing patches.

The application of the patch that would also not belong in a store, but in a patcher, to similarly allow reuse.

There's another reason why we separate patchers from stores, namely because some stores can apply some patches more optimally than others. For instance, here is a patching strategy that works for all backends. We could implement it as, for instance, MemorySparqlPatcher:

Receive a Turtle file in its entirety. Parse it. Patch it. Serialize back as Turtle.

However, there are several cases where this algorithm is suboptimal:

If the back-end is a SPARQL endpoint, just send the UPDATE query to it directly.
If the query is insertion-only, and the back-end is file-based and supports appending, just append triples rather than reading the whole file.
If the store does not support insertions, only deletions, reject patches before even reading anything.

For the above cases, we can implement patchers that are reused across store implementations.

The concept of patchers is introduced on https://rubenverborgh.github.io/solid-server-architecture/solid-architecture-v1-2-0.pdf#page=6

I'm thinking the answer is probably 'decorator',

Decorator doesn't cover this case, as it is the store itself that needs to decide which patching strategies it supports.

Is there some way we can describe the functionality we require from a persistence backend

So that functionality is the one described by the interface. Some implementation-specific patchers are allowed to assume more about the back-end as needed.

If someone wanted to use pod-server on top of, say, Amazon S3, they need to write an S3 backend, but can we give them some way to reuse all the RDF-aware stuff from the in-memory backend?

The concept of patchers allows for that indeed.

Do you have other use cases in mind besides patch?

michielbdejong commented 5 years ago

Thanks Ruben, your solution is very useful for taking advantage of backends that have some RDF-, image-, or anything- awareness. But in the real world (file systems etc.), the vast majority of persistence system, unfortunately don't speak RDF, image manipulation, etc. yet. So for all of those, what if we just:

create a BlobTreeBasedResourceStore, that takes (something like) a BlobTree in its constructor and adds all the higher level RDF-aware stuff to it.
Apart from our existing BlobTreeInMem, BlobTreeNssCompat and BlobTreeRedis, implement a simple BlobTreeS3 that only deals with tree-structure-aware persistence.
Use the BlobTreeBasedResourceStore class to create InMemoryResourceStore, NssCompatResourceStore, RedisResourceStore and S3ResourceStore.

This is basically what I already started doing with https://github.com/inrupt/wac-ldp/blob/master/src/lib/storage/BlobTreeInMem.ts#L97-L109 and https://github.com/inrupt/wac-ldp/blob/master/src/lib/storage/BlobTreeNssCompat.ts#L138-L149.

RubenVerborgh commented 5 years ago

very useful for taking advantage of backends that have some RDF awareness.

The architectural pattern is not RDF-specific. See examples at https://rubenverborgh.github.io/solid-server-architecture/solid-architecture-v1-2-0.pdf#page=6 where there is LineBasedPatch, BinaryPatch, ImageFilter. Note in particular that the above MemorySparqlPatcher does not assume RDF knowledge; it just receives and returns a file.

Basically, every ResourceStore accepts a list of patchers in its constructor, and, when receiving a modifyResource instruction, loops through all of them, asking if they support the given patch, and if yes, to apply it. Whether or not these patchers are RDF-aware, image-aware, or anything else, does not need to be known by the store.

michielbdejong commented 5 years ago

Whether or not these patchers are RDF-aware, image-aware, or anything else, does not need to be known by the store.

Thanks for pointing that out, I edited the intro to my question to also mention images etc.

michielbdejong commented 5 years ago

My actual question there was the part that starts with "what if we just ...", the part you commented on was just an intro to that.

RubenVerborgh commented 5 years ago

what if we just:

create a BlobTreeBasedResourceStore, that takes (something like) a BlobTree in its constructor and adds all the higher level RDF-aware stuff to it.

Apart from our existing BlobTreeInMem, BlobTreeNssCompat and BlobTreeRedis, implement a simple BlobTreeS3 that only deals with tree-structure-aware persistence.

Use the BlobTreeBasedResourceStore class to create InMemoryResourceStore, NssCompatResourceStore, RedisResourceStore and S3ResourceStore.

First of all, I currently don't have all materials to properly make the case either way. I would need a detailed architectural diagram with this solution to be able to comment meaningfully. I realize that doing this is quite a burden, but it's unfortunately the only way that I know how to evaluate architectures. So given these limitations on my side, take everything below with a grain of salt.

Recall that one of the key points in the architecture is to minimize the knowledge each component needs. In the proposal with Patcher, the knowledge of any ResourceStore can be limited to:

handing off to a patcher

This pattern could even be implemented in a base class, such that subclasses do not need any knowledge about patching at all. Importantly, when the ResourceStore wants to take control of patching (for specific optimizations), it can; it just does not have to.

Based on the terms and links above, these are the things a ResourceStore needs to be aware of with the BlobTreeBasedResourceStore:

blobs (not every back-end has them, e.g., SPARQL endpoint)
trees (not every back-end has them, e.g., SPARQL endpoint)

And sure, these stores could be implemented with another base class; but then we also need another implementation of patch logic, in addition to all of the implementations provided by the individual BlobTrees, which is my next point.

It also seems that every BlobTree need to know about:

patching
- picking the right parser for the format
- all possible patch algorithms
RDF

The above can be seen in the repetition in https://github.com/inrupt/wac-ldp/blob/350c5f1141e35cc8d5896511fead034aeaa01f20/src/lib/storage/BlobTreeInMem.ts#L69-L164 and https://github.com/inrupt/wac-ldp/blob/master/src/lib/storage/BlobTreeNssCompat.ts#L77-L205. I.e., the code is so highly similar that it indicates a problem of reuse: all these implementations need the above knowledge again and again. Which means we need to test the same things repeatedly for all of them, ensure the same bug fixes are applied to all of them, etc. Note also how these BlobTree implementations, because of their hardcoding of the patch logic, currently only support RDF patches, and are not extensible for optimized scenarios such as append-only patches, triple store backends, etc. So extending them to support these cases will lead to even more complexity, that is repeated across all trees.

The goal of the Patcher-based design is exactly to ensure that one piece of knowledge is only encoded in one single place, such that it needs to be tested only once, and changed only in one place whenever needed. And indeed, the above list of things that every BlobTree implementation needs to implement again, would instead be implemented in one specific parser and be reused across different stores. Adding support for a new store in the Patcher design comes down to only coding read/write operations for that store.

In that sense, we can see the Patcher design as a version of the BlobTree design where all common knowledge has been abstracted away into one single place, making the BlobTrees themselves empty and leaving only the store-specific code to be implemented, which is about as minimal as it gets. And there can be multiple versions of the abtracted logic (different patches) and multiple stores (different back-ends), and they can be combined in different ways and evolve independently.

With a simple example: the impact of adding support for line-based patches involves writing one small class with the Patcher design. With the BlobTree design, it involves changing every single BlobTree in existence (based on the code that I am seeing right now). In both designs, adding support for one store requires implementing one class, but the code to be implemented in the Patcher design is much smaller, as no patching logic is needed for a store.

So my answer to the "what if" question is, based on my current (limited) understanding: the design is less flexible with regard to what kind of patches can be supported and how they are applied, and results in the same knowledge being spread out across multiple components, and repeatedly so, which negatively impacts quality, testability, maintenance, and evolvability. Plus large costs to add and maintain new patches and back-ends.

acoburn commented 5 years ago

This is an excellent issue to raise. I encountered what appears to be a very similar issue when building out the Trellis architecture: I had multiple persistence backends and there was a lot of repetition of code/logic in those implementations.

In a word, this is all about trade-offs. If the common code is put into a higher layer (as was done in Trellis), then it simplifies the implementation of the backend. Doing so also puts greater constraints on those backends: implementation-based optimizations (such as appending to a file with PATCH rather than having to replace the entire file) become less possible. My perspective was that I absolutely wanted multiple persistence implementations, and so I "optimized" the ability of the developer to write them, as opposed to making the interfaces more flexible and optimizing the potential runtime of particular implementations. That is a judgement call, and as stated above, there are trade-offs.

An alternative that might allow you to factor out common code while also keeping this architecture (which has certain types of optimizations built-in), is to create a separate solid-utils module. It would contain reusable methods that different implementations could use. Because this code would be used internally by implementations, the utility classes can be highly pragmatic and scoped to exactly what you find useful. So these classes could sit alongside the public interfaces described in this repository, without needing the formal semantics of this architecture.

michielbdejong commented 5 years ago

OK thanks for the detailed answers!

OK, then we stick to IResourceStore as the one and only pluggable backend interface, and leave it at that.

I do like @acoburn's suggestion of a sort of 'utils' pattern as a way to reuse code without imposing assumptions. Will also keep that in mind, thanks!

michielbdejong commented 5 years ago

Just discussing this with @jaxoncreed, it may make sense to have an abstract class KeyValueResourceStore, in which there are two abstract methods, get(key) and set(key, value). That would at least be useful to easily build for instance an in-memory resource store and a bunch of other simple ones. But I'll leave that discussion to you to have directly.

RubenVerborgh commented 5 years ago

Yes, that’s in there: KeyValueStore on page 4.

RubenVerborgh / solid-server-architecture

Pluggable persistence backends #35