SEMICeu / LinkedDataEventStreams

The Linked Data Event Streams specification
https://tree.linkeddatafragments.org/linked-data-event-streams/
23 stars 9 forks source link

Some ideas about streams... #24

Open sandervd opened 2 years ago

sandervd commented 2 years ago

First of all, I'm really new to the spec, so I'm sure I'll make some wrong assumptions here... But I have leaned that the easiest way to learn something on the internet is to make a statement, and someone will try to prove you wrong.

So here it goes: I under the impression that the design of the event stream was focused on being bulk-loadable in a triplestore as-is, in such a way that the data can be queried without further processing; the state is present in the stream itself. Although I think this is a really interesting feature, its biggest disadvantage is that the semantics of the event metadata and the entity data get mixed. I'll try to argue here for an alternative approach, where the event stream is seen as more as transaction log.

So this is how I understood the spec; what I thought to be the requirements that where behind the definition of spec:

So now for what I see to be the disadvantages to the current LDES approach:

So, after this, I would like to propose another event representation that tries to address these issues(Example 1 of spec):

ex:C1 a ldes:EventStream ;
    tree:shape ex:shape1.shacl .

[ 
    a ldes:Event ; # Could be implied if pages are dedicated?
    ldes:sequence "1"^xs:integer ;
    ldes:commitTime "2021-01-01T00:00:00Z"^^xsd:dateTime ; # Used as metadata in case a point-in-time recovery of the stream needs to be performed, and the recovery point needs to be determined).

    ldes:key <ex:Observation1> ; # The subject of the object represented.
    ldes:value [ # value is optional, if not present this is a tombstone event (delete).
        a sosa:Observation ;
        sosa:resultTime "2021-01-01T00:00:00Z"^^xsd:dateTime ;
        sosa:hasSimpleResult "..." .
    ]
]

This also allows for representing the stream in a key-value ledger model as used by Apache Kafka. The semantics for materialising this stream in a triplestore would be simple. For each event in stream:

Under this model, versioning of objects can be done at the application level using the isVersionOf approach if the versions concerns the application (e.g. multiple versions are kept so users can roll back to a previous version of an entity), as well by the log retention policy algorith that is chosen on the stream level, which only concern is to be able to move back in time (for instance for point in time recovery), and doesn't need to concern about application versioning.

A retention policy could keep the full state (decided by the producer what should make up the state), as well as the transaction log for a certain time. (following the idea of how topic log compaction is done in Apache Kafka)

Additional properties could be used to signal transaction boundaries (if retention policies apply, the boundaries can only be used when reading the head of the log, where messages are sequential). e.g. ldes:trx_id "1" If the stream delivery method can't guarantee that events are committed to the stream representation in an atomic way, a transaction event count could be added to the first event part of a transaction, to indicate the number of events that should be read as part of the transaction.

pietercolpaert commented 1 year ago

Seems like you propose a kind of reification model to point at an event. You can still create a data model, for example a Sander’s Data Stream vocabulary that defines sds:Event and sds:sequence and sds:key as you propose. We can then apply just default LDES techniques, on members that are shaped according to the sds:Event.

There are multiple ways of organizing streams so I’d like to keep this to application-level. I want to limit LDES to the concept of a container of immutable objects, as multiple organizational system can be built on top of that. It is true however that we do need descriptions on top of an LDES or a ViewDescription to give hints to an application about how a stream is managed. For example, that’s why tree:shape, tree:timestampPath, tree:versionKey, etc. on top of ldes:EventStream exist. This makes it more clear to LDES clients what should happen in a back-end when the stream is being processed.

Other solutions could entail:

In a Slack thread you specifically indicated problems with tombstone objects. I think we could add however a specific hint in the LDES description to the type that will be used for tombstone objects, so we can still use different vocabularies within LDES to manage the stream.