SEMICeu / LinkedDataEventStreams

The Linked Data Event Streams specification
https://tree.linkeddatafragments.org/linked-data-event-streams/

Motivate streams not ordered by publish time #10

Open tuukka opened 3 years ago

tuukka commented 3 years ago

Intuitively, a "stream" refers to a collection of items ordered by the time they were published in the collection. Thus, the stream grows at its end. The specification seems to consider these an important type of "stream", but not the only relevant one. (There is a note saying "A 1-dimensional fragmentation based on creation time of the immutable objects is probably going to be the most interesting and highest priority fragmentation for an LDES", which then continues "sometimes the back-end of an LDES server cannot guarantee that objects will be published chronologically".)

Should these two separate cases be motivated, perhaps already in the introduction?

pietercolpaert commented 3 years ago

@ddvlanck can you answer this one?

ddvlanck commented 3 years ago

Hi @tuukka,

As indicated in the specification, ordering the events by the time they are published in the collection is indeed one of the most interesting fragmentations, because it allows us to describe more detailed relations to other pages, so that query agents can easily decide whether it is useful to visit a page. However, the backend system on which the LDES is built may not receive the events at the time they occurred. This is the case, for example, with the address and building registry in Flanders, where events that occurred in 2019 can still arrive today, due to human error (forgetting to indicate that a change was made), external systems, or just latency.
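To illustrate how such detailed relations let a query agent skip pages, here is a minimal Python sketch. It is not taken from the specification: relations are modelled as plain dicts with hypothetical keys `type`, `value` and `node`, loosely inspired by TREE's comparison relations, and the function name `should_follow` and the `?year=` URLs are made up for this example.

```python
from datetime import datetime

def should_follow(relation: dict, wanted_from: datetime, wanted_until: datetime) -> bool:
    """Return True if the page behind `relation` may hold events in the wanted interval."""
    kind = relation.get("type")
    if kind is None:
        # A plain relation promises nothing about the target page, so it must be visited.
        return True
    bound = datetime.fromisoformat(relation["value"])
    if kind == "GreaterThanOrEqualTo":
        # The target page only holds events with a timestamp >= bound.
        return wanted_until >= bound
    if kind == "LessThan":
        # The target page only holds events with a timestamp < bound.
        return wanted_from < bound
    return True  # unknown relation type: stay conservative and visit

relations = [
    {"type": "LessThan", "value": "2019-01-01T00:00:00", "node": "http://example.org?year=2018"},
    {"type": "GreaterThanOrEqualTo", "value": "2019-01-01T00:00:00", "node": "http://example.org?year=2019"},
    {"node": "http://example.org?page=2"},  # plain "next page" link
]

# The agent is only interested in events from March 2019.
to_visit = [
    r["node"]
    for r in relations
    if should_follow(r, datetime(2019, 3, 1), datetime(2019, 4, 1))
]
```

With these inputs the agent skips the page that only holds events before 2019, but it still has to follow the plain link, because that link carries no information about the target page.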

If we applied a time-based fragmentation in this situation, we would end up with pages that constantly change, and we would lose one of the main advantages of Linked Data Fragments: caching. Therefore, for the address and building registry, we choose to publish the events in the order in which the backend system receives them. This allows us to cache each page, because that order is never going to change. However, in that situation we lose the ability to describe detailed relations to other pages, because there is no real pattern in the content of the pages (events from 2012 and 2019 can be in the same page). So we just provide the link to the next page (similar to Hydra):

{
    "@id": "http://example.org?page=1",
    "tree:relation": [
        {
            "@type": "tree:Relation",
            "tree:node": "http://example.org?page=2"
        }
    ]
}

I'll update the specification to make it more clear that it is possible to have a Linked Data Event Stream without a time-based fragmentation.
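The caching argument can be made concrete with a small Python sketch. It rests on assumptions that are not in the comment itself: a page capacity of two events, events modelled as dicts with hypothetical `id` and `occurred` fields, and the function names `paginate_by_arrival` and `fragment_by_year`.

```python
from collections import defaultdict

PAGE_SIZE = 2  # assumed page capacity, kept tiny for illustration

def paginate_by_arrival(events):
    """Fill pages in arrival order; once a page is full it is never touched again."""
    pages = [[]]
    for event in events:
        if len(pages[-1]) == PAGE_SIZE:
            pages.append([])
        pages[-1].append(event)
    return pages

def fragment_by_year(events):
    """Group events by the year in which they occurred (a time-based fragmentation)."""
    pages = defaultdict(list)
    for event in events:
        pages[event["occurred"][:4]].append(event)
    return dict(pages)

received = [
    {"id": "e1", "occurred": "2019-05-02"},
    {"id": "e2", "occurred": "2021-01-10"},
    {"id": "e3", "occurred": "2021-02-11"},
]
late = {"id": "e4", "occurred": "2019-07-30"}  # arrives last, but occurred in 2019

# Arrival order: the full first page is identical before and after the late event.
first_page_unchanged = (
    paginate_by_arrival(received)[0] == paginate_by_arrival(received + [late])[0]
)

# Time-based: the late event changes the 2019 page that may already be cached.
page_2019_changed = (
    fragment_by_year(received)["2019"] != fragment_by_year(received + [late])["2019"]
)
```

In the arrival-order variant only the last, still-open page grows, so every full page can be cached indefinitely. In the year-based variant any page can change whenever a late event shows up.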

tuukka commented 3 years ago

Thank you for the reply @ddvlanck! This may be a matter of terminology that I'm not familiar with. I would like to understand why you call a collection an event stream even if it does not grow at its end; or conversely, why you don't define the time-based fragmentation by when the event arrived at the stream, as opposed to how a source system dates it.

In your example case, could it make sense to talk of two orderings of the events: one is "logical" (when the event occurred legally?) and another is "physical" (when the event arrived at the stream)? If I understand correctly, it would be possible to expose both as distinct properties of the events and distinct fragmentations of the stream. What's more, every event stream would be able to and could be required to provide both of these (in simple cases, they would be identical). The "physical" dimension would be useful for caching and synchronising, and the "logical" dimension would be necessary to capture the real-world changes represented by the data.
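A minimal Python sketch of this idea follows. All names are hypothetical: `occurred_at` stands for the "logical" time, `arrival_seq` for the "physical" position in the stream, and `changes_since` for a synchronisation helper. None of them come from the specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    id: str
    occurred_at: str  # "logical" time, as dated by the source system (ISO 8601 date)
    arrival_seq: int  # "physical" position, assigned when the event enters the stream

stream = [
    Event("e1", "2021-01-10", 1),
    Event("e2", "2019-07-30", 2),  # arrives second, but occurred in 2019
    Event("e3", "2021-02-11", 3),
]

# Two orderings over the same immutable events.
physical_order = sorted(stream, key=lambda e: e.arrival_seq)
logical_order = sorted(stream, key=lambda e: e.occurred_at)

def changes_since(seq: int) -> list:
    """Synchronisation: ids of everything that arrived after the last sequence number seen."""
    return [e.id for e in physical_order if e.arrival_seq > seq]
```

Here `changes_since(1)` yields `e2` and `e3`, which is all a client needs to catch up, while `logical_order` places `e2` first because it describes the oldest real-world change.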

pietercolpaert commented 3 years ago

The core event stream fragmentation should be chosen so that as many pages as possible are immutable and can therefore be cached indefinitely. All other fragmentations, orderings, paginations and indexes are optional.

tuukka commented 3 years ago

@pietercolpaert Right, so then you should make fragmentation by the "physical" time dimension mandatory in the spec? [Because it's what you need for 100% cache immutability, and it's always possible.]

pietercolpaert commented 3 years ago

@tuukka Good point! I think it should be a recommendation! If, for some exotic reason, you cannot fragment by the physical time dimension, I think the LDES client will still work, just not in the most optimal way!