RubenVerborgh / solid-server-architecture

Proposed architecture for a Solid server
https://rubenverborgh.github.io/solid-server-architecture/solid-architecture-v1-3-0.pdf

Can you clarify the use of both 'Resource' and 'Representation' in the ResourceStore interface? #10

Closed pmcb55 closed 5 years ago

pmcb55 commented 5 years ago

The methods of the ResourceStore interface use both 'Resource' and 'Representation', yet there is no interface for 'Resource', but there is for a Representation (https://github.com/RubenVerborgh/solid-server-ts/blob/master/src/ldp/Representation.ts).

I think at least 'addResource()' should be renamed to 'addRepresentation()' to be consistent, but in fact I think the interface would be simpler if it just used 'Resource' everywhere and dropped 'Representation'. But maybe you could clarify why you may still think a distinction is necessary (and if so then why isn't there a need for a 'Resource' interface)...?

RubenVerborgh commented 5 years ago

TL;DR: In the REST architectural style, resources are manipulated (only) through representations; I believe we need to adhere to this throughout.

The methods of the ResourceStore interface use both 'Resource' and 'Representation', yet there is no interface for 'Resource'

That is correct; "representing" a Resource would kind of be a contradiction in terms. Resources are identified, but manifest themselves concretely as representations. So a Resource in that sense is nothing but a set of representations; and I nowhere have found the need to explicitly represent that full set.

I am at the moment pedantically sticking to the REST distinction of representation/resource (https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm#sec_5_2_1_1) to make sure we use exact terms when discussing. I am okay with using "resource" more relaxed in concrete implementations (although I'm still in favor of the distinction).

Question: is everyone aware of these definitions? https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm#sec_5_2_1_1

I think at least 'addResource()' should be renamed to 'addRepresentation()' to be consistent

This is actually deliberate, and in line with the above definitions. addResource creates a new resource within the container through a representation. addRepresentation (technically) would be when we are adding a new representation to a resource. (For instance, adding a Turtle representation to a resource that currently only has a JSON-LD representation.)
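A minimal sketch of that difference, assuming a toy in-memory store. The class name `InMemoryStore`, the simplified `Representation` shape, and the `getContentTypes` helper are hypothetical and are not the actual solid-server-ts interfaces:

```typescript
// Hypothetical shapes for illustration; not the actual solid-server-ts interfaces.
interface Representation {
  contentType: string; // e.g. "text/turtle"
  data: string;        // a byte stream in practice; a string keeps the sketch short
}

// A resource is identified, never represented directly: here it is only a key
// that maps to the set of representations currently known for it.
class InMemoryStore {
  private readonly resources = new Map<string, Map<string, Representation>>();
  private counter = 0;

  // Creates a NEW resource inside a container, described through one representation.
  addResource(container: string, representation: Representation): string {
    this.counter += 1;
    const identifier = `${container}${this.counter}`;
    const representations = new Map<string, Representation>();
    representations.set(representation.contentType, representation);
    this.resources.set(identifier, representations);
    return identifier;
  }

  // Adds another representation to an EXISTING resource.
  addRepresentation(identifier: string, representation: Representation): void {
    const representations = this.resources.get(identifier);
    if (!representations) {
      throw new Error(`Unknown resource: ${identifier}`);
    }
    representations.set(representation.contentType, representation);
  }

  // Lists the media types currently available for a resource.
  getContentTypes(identifier: string): string[] {
    const representations = this.resources.get(identifier);
    return representations ? Array.from(representations.keys()) : [];
  }
}

const profileStore = new InMemoryStore();
// One new resource, created through a JSON-LD representation...
const profileId = profileStore.addResource("/profiles/", { contentType: "application/ld+json", data: "{}" });
// ...which later gains a second (Turtle) representation of the same resource.
profileStore.addRepresentation(profileId, { contentType: "text/turtle", data: "" });
```

After both calls there is still exactly one resource, now with two representations.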

But maybe you could clarify why you may still think a distinction is necessary (and if so then why isn't there a need for a 'Resource' interface)...?

A crucial concept in the REpresentational State Transfer architectural style is the fact that resources are always manipulated through representations. Given that we have a very resource-centric world (with RDF and LDP) and that we will deal with different representations of the same resources, I feel that it is important to have clear terminology throughout. Understanding the distinction is key to understanding the architectural underpinnings. I'm afraid it will bite us if we start thinking about it too loosely. Doing LDP/Solid right means correctly carrying the distinction of a resource and a representation throughout the architecture for me.

pmcb55 commented 5 years ago

I really like the idea of sticking to the formal terminology from REST (even though it's 'just' a PhD thesis, and not a normative standard or anything). And you're absolutely right about us needing to be very precise in our distinctions, and I also agree that Resource and Representation are indeed distinct things.

So from a Solid perspective, built on LDP, can we capture this distinction by formally stating these two things:

  1. A single Resource can only ever be expressed as an RDF Source or a Non-RDF Source, yet regardless:
    • it is always identified with an IRI
    • its most fundamental physical form is a byte stream
  2. A single Representation can only ever be expressed as a media-type:
    • it is always identified with an official IANA media-type string (e.g. 'text/turtle' or 'application/ld+json' for RDF Sources, or 'image/jpeg' or 'audio/mpeg', etc. for Non-RDF Sources)
    • its most fundamental physical form is a text string

Beyond our definitions though, I think confusion arises when people conflate two things that should remain completely separate: 1) the physical storage of the byte stream that makes up the Resource data 2) the media-type used when passing a Resource from one place to another (formally we should probably say '...when passing a Representation from...', but let's keep things simple here for the moment!)

These two things become conflated when we ask: Do all Resources always have an associated set of Representations?

This following sentence from the thesis seems to explicitly say 'Yes' to that question:

5.2.1.2 Representations If the value set of a resource at a given time consists of multiple representations, content negotiation may be used to select the best representation for inclusion in a given message.

I think this makes sense (even just intuitively) for Non-RDF Sources (e.g. for a single Non-RDF Resource '/myCat' I can have an associated set of representations (such as 'image/jpeg', 'image/bmp' and 'image/gif'), each of which has a distinct physical form as a distinct stream of bytes at the physical storage level).

But my question is, do RDF Sources also have an associated set of representations (e.g. for a single RDF Resource '/myProfile' having a 'text/turtle' representation and a 'text/n3'), each of which has a distinct physical form as a distinct stream of bytes at the physical storage level?

My feeling is that supporting multiple representations of a single RDF Source at the physical storage level (i.e. as distinct streams of bytes), could lead to very confusing and conflicting situations (e.g. if I had a Turtle and a JSON-LD representation of a resource, what do I return when a client asks for Trig? How do I perform a single Patch operation on that single resource?).

Instead, RDF is an abstract model for expressing a graph, so it can be thought of as only ever having a single 'representation' - i.e. a graph of nodes and directed edges.

So if we completely ignore (or formally disallow) the notion of 'Representation' (i.e. media types) for RDF Sources at the physical storage layer, then that layer is free to persist that RDF data however it sees fit. We need this freedom if we want to simplify our support for RDBMS, triplestores, document stores (e.g. MongoDB), column stores (e.g. Cassandra), etc. None of those systems have the concept of media-types as a first-class citizen (only the file system does, with its file extensions).

Therefore I think the notion of Representation from a REST perspective (i.e. media types) for RDF Sources only ever needs to come into play for actually passing a Resource from one place to another.

But another reason for the conflation of 1) and 2) above is the need to also support using a file system as the physical storage layer. In this particular case (and I think in only the particular case of the file system), there is no inherent need for the storage layer to convert the incoming Resource data into some internal 'storage layer specific' concepts - i.e. a triplestore needs to parse RDF into subject, predicate and object to update its indexes, an RDBMS needs to parse the RDF to store the data into various tables, MongoDB needs to convert the RDF into JSON-LD to store it as BSON, etc. (Yes, an RDBMS could just store the RDF as a 'Binary Blob', and MongoDB could just store the RDF as a Base64 encoded string maybe, but that would defeat the purpose of supporting various persistence mechanisms in the first place, since it would make it impossible to use their respective query, sharding, replication, etc. mechanisms).

So how should we use the file system to persist an RDF Source if our storage layer interface completely ignores Representation? One option would be that we always convert RDF Sources into one specific serialization (e.g. Turtle or N-Triples) and always store as .ttl or .nt files (in fact, I kinda like that idea!). Another option (which is the current NSS behaviour) is simply to not convert at all, and just persist the data as-is and use the media type to create the file extension. This has the advantage of allowing us to preserve the data exactly as-is (e.g. Turtle comments are preserved), but the disadvantage is that as soon as we perform an update, we need to formally parse that RDF, perform the update and store as pure RDF again - so all that comment information will necessarily be lost.

It's this uncertainty in the latter approach that makes me like the former approach, since it ensures the data is always consistent.
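The two file-system options can be sketched roughly as follows. This is only an illustration under assumptions: the function name `fileNameFor`, the strategy labels, and the small extension table are hypothetical, and this is not how NSS actually implements it:

```typescript
// Small, hypothetical media-type-to-extension table (not exhaustive).
const extensions: Record<string, string> = {
  "text/turtle": ".ttl",
  "application/n-triples": ".nt",
  "application/ld+json": ".jsonld",
  "image/jpeg": ".jpg",
};

// Media types this sketch treats as RDF serializations.
const rdfTypes = ["text/turtle", "application/n-triples", "application/ld+json"];

// "normalize": RDF is always converted to one serialization (Turtle here).
// "as-is":     the incoming media type decides the file extension.
function fileNameFor(identifier: string, contentType: string, strategy: "normalize" | "as-is"): string {
  const isRdf = rdfTypes.indexOf(contentType) !== -1;
  const storedType = strategy === "normalize" && isRdf ? "text/turtle" : contentType;
  // Unknown media types get no extension in this sketch.
  return `${identifier}${extensions[storedType] || ""}`;
}

// The same JSON-LD upload ends up in different files depending on the strategy.
const normalizedName = fileNameFor("/profile/card", "application/ld+json", "normalize");
const asIsName = fileNameFor("/profile/card", "application/ld+json", "as-is");
```

Only the "normalize" strategy guarantees a single on-disk form per RDF Source; the "as-is" strategy keeps whatever the client sent until the first update forces a re-serialization.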

However, I do still like the concept of being able to persist a Resource (RDF or not) exactly, byte-for-byte, 'as-is' too. But for that I would force the client to explicitly state they want their data persisted that way (e.g. they should include a Link: header). The storage layer should still do its normal thing of storing that resource, but also store a link to an 'as-is-at-time-X' copy of that resource data. I think the 'at-time-X' is important to reflect the fact that subsequent updates to the resource can't be reflected in the 'as-is' copy (since that's impossible to support anyway). And I guess we'd also need some new mechanism (e.g. Link: header) to allow the client to stipulate they want the 'as-is-at-time-X' version of that data! So this would be a completely separate topic I think...

So my question again, should we support RDF Sources having an associated set of representations, each of which has a distinct physical form as a distinct stream of bytes at the physical storage level?

RubenVerborgh commented 5 years ago

Thanks a lot, @pmcb55. First of all, these discussions prove that we are on the right track; the fact that these questions and points are being raised shows that we are thinking about the important things.

I'll try to be as brief as possible below. Let me first get to the meat, and then go into details.

_Do all Resources always have an associated set of Representations?_

Non-information resources never have an associated set of representations. Information resources always have an associated set of representations, whose size is zero or more.

However, this is a distinct question from:

So my question again, should we support RDF Sources having an associated set of representations, each of which has a distinct physical form as a distinct stream of bytes at the physical storage level?

Because my answer to that is "they might". Whether or not they exist on disk is not observable by the client, so this means that implementations have a free choice.

I see good reasons for having one (or sometimes more) representations on disk, but some resources will have zero representations on disk and be generated on the fly (interesting example: a folder index, but also every RDF resource when your back-end is a triple store).
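As a rough illustration of a resource with zero representations on disk, here is a sketch of a container index generated on request. The `containerMembers` map and the `representContainer` function are hypothetical names, and the Turtle output is deliberately minimal:

```typescript
// Hypothetical representation shape for illustration only.
interface Representation {
  contentType: string;
  data: string;
}

// The back-end stores no serialized index; it only knows each container's members.
const containerMembers = new Map<string, string[]>();
containerMembers.set("/photos/", ["/photos/cat.jpg", "/photos/dog.jpg"]);

// Generates a Turtle representation of a container index on the fly.
function representContainer(identifier: string): Representation | undefined {
  const members = containerMembers.get(identifier);
  if (!members) {
    return undefined; // not a known container
  }
  const objects = members.map((member) => `<${member}>`).join(", ");
  const data = members.length > 0
    ? `<${identifier}> <http://www.w3.org/ns/ldp#contains> ${objects} .`
    : "";
  return { contentType: "text/turtle", data };
}

const photoIndex = representContainer("/photos/");
```

A client receiving `photoIndex` cannot tell whether the Turtle was read from a file or built at request time, which is the non-observability argument made above.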

the formal terminology from REST (even though it's 'just' a PhD thesis, and not a normative standard or anything)

Important to mention (for those who might read) is that, although a scientific work, it has very directly influenced the later HTTP specs and a lot of the Web architecture thinking.

1. A single Resource can only ever be expressed as an RDF Source or a Non-RDF Source

Did you mean "either"? Because in that case, I disagree: the same resource can have RDF and non-RDF representations. Generic example: I can represent a description of a person in HTML or RDF. Format-specific examples: strip an HTML+RDFa document of its RDFa, or a JSON-LD document of its context, and you respectively have HTML and JSON.

* its most fundamental physical form is a byte stream

Disagree. a) there is no "most fundamental form" for resources in general, b) the most fundamental form for an RDF document for me would be the RDF graph it represents. I know that such a graph is conceptual and does not have a fixed serialization, which is precisely my point. The physical form (in general) comes when we talk about representations.

2. A single Representation can only ever be expressed as a media-type:

Disagree. A representation is a concrete byte stream version of a resource, and fixes that resource along one or more dimensions, such as media type, charset, content language, content profile (e.g., RDF shape), date time (memento), etc.
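One way to picture those dimensions is as optional metadata fields next to the bytes. The field names and the `matches` helper below are assumptions made for this sketch, not the project's actual `Representation` interface:

```typescript
// Hypothetical metadata shape; every field is one possible dimension of negotiation.
interface RepresentationMetadata {
  contentType: string; // media type, e.g. "text/turtle"
  charset?: string;    // e.g. "utf-8"
  language?: string;   // content language, e.g. "en"
  profile?: string;    // content profile, e.g. an IRI naming an RDF shape
  datetime?: string;   // Memento-style timestamp, e.g. an ISO 8601 string
}

interface Representation {
  metadata: RepresentationMetadata;
  data: Uint8Array; // the concrete byte stream
}

// Returns true when every dimension the client fixed matches the representation.
function matches(representation: Representation, wanted: Partial<RepresentationMetadata>): boolean {
  const keys = Object.keys(wanted) as (keyof RepresentationMetadata)[];
  return keys.every((key) => wanted[key] === undefined || wanted[key] === representation.metadata[key]);
}

const englishTurtle: Representation = {
  metadata: { contentType: "text/turtle", charset: "utf-8", language: "en" },
  data: new Uint8Array(0),
};

// The media type alone does not pin down which representation is meant:
const sameType = matches(englishTurtle, { contentType: "text/turtle" });
const dutchWanted = matches(englishTurtle, { contentType: "text/turtle", language: "nl" });
```

Here `sameType` is true while `dutchWanted` is false, even though both requests name the same media type.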

* it is _always_ identified with an official IANA media-type string (e.g. 'text/turtle' or 'application/ld+json' for RDF Sources, or 'image/jpeg' or 'audio/mpeg', etc. for Non-RDF Sources)

A content-type is not always sufficiently specific (not that important), but definitely is insufficiently precise (important), given the other possible dimensions of negotiation.

* its most fundamental physical form is a text string

Byte stream.

I think this makes sense (even just intuitively) for Non-RDF Sources (e.g. for a single Non-RDF Resource '/myCat' I can have an associated set of representations (such as 'image/jpeg', 'image/bmp' and 'image/gif'), each of which has a distinct physical form as a distinct stream of bytes at the physical storage level).

Same here: non-observable behavior. For all we know, the image conversion might happen on the fly. Maybe even image generation happens on the fly. So, as in my answer at the top: zero or more representations might be on disk.

But my question is, do RDF Sources also have an associated set of representations (e.g. for a single RDF Resource '/myProfile' having a 'text/turtle' representation and a 'text/n3'), each of which has a distinct physical form as a distinct stream of bytes at the physical storage level?

My answer is the exact same.

For some people and some documents, the on-disk representation of a Turtle file is important. For instance, I might have a hand-written Turtle file with comments and a certain structure, which would inevitably be lost on re-serialization.

The on-disk version might be Turtle, or JSON-LD, or maybe both.

What is on disk doesn't matter; what matters is: are there cases where the back-end is able to quickly/efficiently determine a certain representation of a resource? And there are such cases, as I've exemplified. The back-end will not be able to generate all representations in general, but it will consider some (= zero or more) representations as native, so it makes sense for it to be able to represent when it wants.

My feeling is that supporting multiple representations of a single RDF Source at the physical storage level (i.e. as distinct streams of bytes), could lead to very confusing and conflicting situations

Not disagreeing; this can indeed be complex. But that does not mean we should take away the right of back-ends (in general) to suggest specific representations. Many back-ends will not want to do representations; fair enough, that's their right, and they will be the simpler back-ends. But some of them will, and I want to give them that option. (Hence the importance of the resource/representation distinction all the way through!)

(e.g. if I had a Turtle and a JSON-LD representation of a resource, what do I return when a client asks for Trig? How do I perform a single Patch operation on that single resource?).

Up to the back-end. The patch is for the resource, so it could for instance:

These are edge cases though; only back-ends that are willing to deal with this complexity should have to think about this.

Note that such a situation exists on-disk with Tim already: he writes ontologies in N3, and a script generates the other representations (on-disk). And Apache then content negotiates between them. So it's not all that far-fetched.

Instead, RDF is an abstract model for expressing a graph, so it can be thought of as only ever having a single 'representation' - i.e. a graph of nodes and directed edges.

I just want to point out that this conflicts with your definition of the most fundamental form of a resource 😉, but I definitely agree.

So if we completely ignore (or formally disallow) the notion of 'Representation' (i.e. media types) for RDF Sources at the physical storage layer, then that layer is free to persist that RDF data however it sees fit. We need this freedom if we want to simplify our support for RDBMS, triplestores, document stores (e.g. MongoDB), column stores (e.g. Cassandra), etc. None of those systems have the concept of media-types as a first-class citizen (only the file system does, with its file extensions).

I am in full agreement—with "ignore", not "disallow". Most stores will not want this level of control, some of them might want it.

However, take a triple store as back-end. You need to get the triples out in some way, so presumably a SPARQL query. So we could do a SPARQL query, parse the triples in-memory, then generate a representation. Or we could just say to the back-end: can you try representing this in Turtle? Then we don't have to re-serialize. Worst case, it doesn't, and then we just parse; but it can be so much more efficient if the back-end supports it, so why waste that?
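A minimal sketch of that "ask first, fall back otherwise" idea follows. The `Backend` contract with an optional `tryRepresentAsTurtle` method is an assumption made for this illustration and is not the interface from the proposed architecture; the fallback only handles triples whose terms are all IRIs:

```typescript
interface Triple { subject: string; predicate: string; object: string; }
interface Representation { contentType: string; data: string; }

// Hypothetical back-end contract: native serialization is optional.
interface Backend {
  getTriples(identifier: string): Triple[];
  // Returns undefined when the back-end cannot produce Turtle efficiently.
  tryRepresentAsTurtle?(identifier: string): Representation | undefined;
}

// Generic fallback: one statement per line with absolute IRIs only,
// which is valid N-Triples and therefore also valid Turtle.
function serialize(triples: Triple[]): string {
  return triples.map((t) => `<${t.subject}> <${t.predicate}> <${t.object}> .`).join("\n");
}

function getTurtle(backend: Backend, identifier: string): Representation {
  // First give the back-end the chance to hand over a native representation.
  const native = backend.tryRepresentAsTurtle ? backend.tryRepresentAsTurtle(identifier) : undefined;
  if (native) {
    return native;
  }
  // Worst case: fetch the triples and serialize them ourselves.
  return { contentType: "text/turtle", data: serialize(backend.getTriples(identifier)) };
}

// A simple back-end that does not care about representations at all.
const simpleBackend: Backend = {
  getTriples: () => [
    { subject: "http://example.org/me", predicate: "http://xmlns.com/foaf/0.1/knows", object: "http://example.org/you" },
  ],
};
const fallbackTurtle = getTurtle(simpleBackend, "/profile");
```

Back-ends that ignore the optional method still work; those that implement it skip the parse-and-reserialize round trip.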

But another reason for the conflation of 1) and 2) above is the need to also support using a file system as the physical storage layer. In this particular case (and I think in only the particular case of the file system), there is no inherent need for the storage layer to convert the incoming Resource data into some internal 'storage layer specific' concepts

Yeah (but my triple store is evidence to the contrary for "only"). So we agree that some storage mechanisms do have a say about representations?

Because that is why my architecture gives them the option to care about representations, but only if they want to. Most will not, and that's fine. The architecture doesn't depend on them doing representations.

And I guess we'd also need some new mechanism (e.g. Link: header) to allow the client to stipulate they want the 'as-is-at-time-X' version of that data!

Memento protocol please 🙂 RFC7089

However, I do still like the concept of being able to persist a Resource (RDF or not) exactly, byte-for-byte, 'as-is' too. But for that I would force the client to explicitly state they want their data persisted that way (e.g. they should include a Link: header).

That is an interesting point, but another topic altogether. I would be inclined to say that, if you PUT something with a certain MIME type, then all equivalent forms under that MIME type are acceptable. So with Turtle: randomize the order as you see fit. If you don't want that, PUT with text/plain. But open to discussion.

However, this concept of a fixed representation necessitates the (optional) ability of the back-end to have a suggestion for representations. So strengthening my suggestion that the architecture should support (but not mandate) that ability.

It's this uncertainty in the latter approach that makes me like the former approach, since it ensures the data is always consistent.

Different stores, different options. I don't want to mandate either, but I do want to give the store the option, because there are meaningful scenarios for either.

Hence my proposal for the architecture the way it is.

pmcb55 commented 5 years ago

Cheers Ruben - lots to pick apart in there, but maybe just a quick one (as I thought I was on safe ground here :) !):

[PMcB] A single Resource can only ever be expressed as an RDF Source or a Non-RDF Source

[RV] Did you mean "either"? Because in that case, I disagree: the same resource can have RDF- and non-RDF representations.

Yep, I did mean 'either'. I prefixed my question with 'from a Solid perspective, built on LDP' - as I thought POSTing a resource to an LDP server required it to be either an RDF Source, or a Non-RDF Source - i.e. it can't be both (as you seem to be suggesting here). According to the LDP spec, an RDF Source is 'An LDPR whose state is fully represented in RDF'. But I could be wrong here (Aaron has mentioned that the LDP spec can be very open to interpretation)...?

But your example of an HTML document with RDFa embedded in it has come up before, and it's certainly a tricky one. My suggestion for that is that the client treat it as an LDP RDF Source (i.e. they include the 'Link: http://www.w3.org/ns/ldp#RDFSource; rel="type"' header), but they also stipulate the 'persist-as-is-too' mechanism I mentioned above. This would mean the physical storage layer would treat the resource as RDF (i.e. it would parse out all the RDFa, and treat all those triples as being the 'resource'), but included with those triples would also be a reference to a full-blown copy of the original HTML+RDFa document that the storage layer also persists.

(This suggestion is always the client's choice though. They'd only do the above if they explicitly wanted the RDFa triples to be queryable and/or processable. If they only ever wanted to store it as HTML+RDFa, then they'd just ask the LDP server to treat it as a normal Non-RDF Source.)

kjetilk commented 5 years ago

This is a very interesting thread, and I support @RubenVerborgh's suggested API. There have been a couple of things that I have raised an eyebrow over, but I believe them to be well discussed. I'll add just one point that could help with concerns over conflicting representations.

The problem of the existence of several representations of a single resource has been bothering me for over a decade, and the trickiest cases are indeed those where the representations may be equivalent in the eyes of the consumer. The example of HTML+RDFa vs. plain HTML is a good one: to a human user they appear equivalent, but to a machine that expects triples they most certainly are not. The "Turtle comments" case is another example.

Pragmatically, we should see what we can do to alleviate some of the concerns that arise from having multiple representations, and I have come to the conclusion that for RDF information resources, a digest that is invariant with different serializations is terribly helpful, if it can be computed fast enough. With that, we could check quickly if two serialized representations are in fact the same graph. That would inform how we detect conflicts and make updates.
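A minimal sketch of such a serialization-invariant digest, under strong assumptions: the triples are already parsed, all terms are plain strings, and there are no blank nodes (which would require canonical labelling first). The hash is a simple non-cryptographic FNV-1a variant chosen only to keep the example self-contained; it is not one of the algorithms from the literature, and the names `fnv1a` and `graphDigest` are hypothetical:

```typescript
interface Triple { subject: string; predicate: string; object: string; }

// 32-bit FNV-1a variant over UTF-16 code units; fast but NOT cryptographic.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Digest that ignores triple order and duplicate triples, so two serializations
// of the same (blank-node-free) graph produce the same value.
function graphDigest(triples: Triple[]): string {
  const lines = triples.map((t) => JSON.stringify([t.subject, t.predicate, t.object]));
  const unique = Array.from(new Set(lines)).sort();
  return fnv1a(unique.join("\n")).toString(16);
}

// The same two triples, as they might come out of a Turtle and a JSON-LD parser.
const fromTurtle: Triple[] = [
  { subject: "http://example.org/me", predicate: "http://xmlns.com/foaf/0.1/name", object: "\"Alice\"" },
  { subject: "http://example.org/me", predicate: "http://xmlns.com/foaf/0.1/knows", object: "http://example.org/bob" },
];
const fromJsonLd: Triple[] = [fromTurtle[1], fromTurtle[0]];
const sameGraph = graphDigest(fromTurtle) === graphDigest(fromJsonLd);
```

With only 32 bits, accidental collisions are possible, so a matching digest is a cheap first check rather than proof of equivalence.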

There's a bunch of digest algorithms for RDF graphs in the literature, but AFAIK they are engineered with cryptographic strength in mind. I'm not sure that is required for this purpose, and I wonder whether we could speed up the digest by making different assumptions.

Apart from using it internally, we could also expose that as an ETag (though, there are problematic sides to that too), which is helpful when dealing with conflicts in editing, can enhance performance when used with caches and conditional requests, etc.

So, my conclusion would be to keep the resource/representation distinction, but add methods to check for representation equivalence (keeping in mind the difficulty of defining that). There is also precedent in the conneg algorithm for ranking representations on the server side, which is another thing we should explore.

pjworrall commented 5 years ago

Hi. My understanding has always been that RDF Resources are the dereferenceable objects that can be represented by different serializations. I thought that using the HTTP header media type to specify the required serialization was the standard way to do that.

I wasn't sure whether the scope of this thread was the API or the implementation. The storage is out of scope if this is just defining the API, unless it is felt necessary to "peek" at the implementation to clarify API functionality. If this is implementation architecture then, of course, you are going to have to think hard about how you persist Resources.

How Resources are stored is a detail of the implementation and I would expect this is how you, or another implementation, would differentiate yourself. For example, some triplestores target their market based on integration with relational databases, or integrate with XML databases like MarkLogic, or fast index and query like with Jena TDB's custom implementation of threaded B+Trees.

I can understand a simple store using the file system, but I would expect that to have very few features and to run a high risk of becoming a low-integrity, fragile mess if it stored the same Resources in different Representations. Yuk! But, like a basic web server, that would be the administrator's choice. Strategically you would support this option to emphasize the importance of an enterprise version with a better architecture and administrative features.

RubenVerborgh commented 5 years ago

Explained now at length by c7e884190c1d6d71910b20cc36b84ad58bfb9984.