SEMICeu / LinkedDataEventStreams

The Linked Data Event Streams specification
https://tree.linkeddatafragments.org/linked-data-event-streams/

Distributed transactions in LDES #46

Open pietercolpaert opened 8 months ago

pietercolpaert commented 8 months ago

How do we indicate that we have a consistent knowledge graph across LDESes?

For instance, what if we split Linked OpenStreetMap into three LDESes: one for nodes, one for ways and one for relations? For each LDES, multiple objects are then added in one transaction. Processing members one by one would not be very valuable, as you would get an inconsistent replication: an osm:Way could, for example, point to osm:Nodes that have not arrived yet.

A solution would be a transaction system that indicates whether a transaction is still in progress and defines its bounds (probably based on a timestamp?). The set of members can then only be processed into a derived view or service from the moment the transaction is marked as completed.
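As a sketch of how a client could honour such in-progress markers, assuming a hypothetical per-member transaction id and a separate completion signal (neither is defined by the spec today):

```python
# Sketch of a client that withholds members until their transaction is
# marked complete. The "transaction" field and the completion signal are
# hypothetical -- the spec has not defined a transaction vocabulary.
from collections import defaultdict

class TransactionBuffer:
    def __init__(self):
        self.pending = defaultdict(list)   # transaction id -> buffered members
        self.completed = set()

    def on_member(self, member):
        """Return the members that may now be processed (possibly none)."""
        trx = member.get("transaction")
        if trx is None:
            return [member]                # untransacted member: pass through
        self.pending[trx].append(member)
        if trx in self.completed:          # late arrival after completion mark
            return self.pending.pop(trx)
        return []

    def on_transaction_completed(self, trx):
        """Release everything buffered for a now-completed transaction."""
        self.completed.add(trx)
        return self.pending.pop(trx, [])
```

A derived view would only ingest what these methods return, so an osm:Way never becomes visible before the osm:Nodes of the same transaction.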

Any ideas on this?

A way to circumvent transactions in LDES is to publish all members of a change at exactly the same datetime. However, we might want to introduce different parts of the knowledge graph change at different times.
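The same-timestamp workaround could be handled client-side by batching members that share an exact timestamp value and only releasing a batch once a later timestamp arrives. A minimal sketch, assuming members arrive in timestamp order and an illustrative "timestamp" field:

```python
# Group consecutive members that share an exact timestamp into one
# atomic batch; a batch is closed (yielded) as soon as a member with a
# different timestamp is seen. Field names are illustrative.
def batch_by_timestamp(members):
    batch, current = [], None
    for m in members:
        if current is not None and m["timestamp"] != current:
            yield batch
            batch = []
        current = m["timestamp"]
        batch.append(m)
    if batch:
        yield batch
```

This also illustrates the drawback noted above: everything belonging to one change must be forced onto a single datetime.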

Related work:

sandervd commented 8 months ago

I would avoid splitting up streams, and the need for distributed transactions, as much as possible. If the information is needed together, why not store it together? If we maintain the system of record (SoR) in one mixed LDES (where the server guarantees that a transaction is written atomically), we only need to concern ourselves with adding transaction semantics to the events (transaction id, number of members in the transaction, offset).
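A minimal sketch of these event-level semantics, assuming each member carries a hypothetical transaction id, its offset within the transaction, and the total member count, so a client can detect completeness without a separate commit marker:

```python
# A transaction is complete once the client has seen members at every
# offset 0..count-1 for that transaction id. The field names (trx,
# offset, count) are illustrative, not defined by the spec.
def transaction_complete(members, trx_id):
    in_trx = [m for m in members if m["trx"] == trx_id]
    if not in_trx:
        return False
    total = in_trx[0]["count"]
    return sorted(m["offset"] for m in in_trx) == list(range(total))
```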

In this case I would probably use auto-increments generated by the server to indicate the position of a member in the log (it is then rather natural to see the LDES as the binlog of the database system), as the position in the WAL needs to be exact. Given sufficient granularity we could use the timestamp, but I would prefer a clean cut: in the case of a WAL it is strictly the server that determines the position in the log. Since the spec leaves it open to use instance data (a timestamp generated elsewhere) for ordering, I feel relying on the timestamp would communicate the wrong message.
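A minimal sketch of server-assigned positions: the server alone hands out a strictly increasing position for each appended member (WAL-style), independent of any timestamp carried in the instance data:

```python
# WAL-style append log: the server, not the instance data, determines
# the exact position of each member. Names are illustrative.
import itertools

class AppendLog:
    def __init__(self):
        self._positions = itertools.count()   # server-side auto-increment
        self.entries = []

    def append(self, member):
        pos = next(self._positions)
        self.entries.append((pos, member))
        return pos
```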

When the split between metadata and data is clear (by using named graph tree members), adding the additional transaction semantics to the event shouldn't be much of a hassle. It is then up to the client to either interpret or ignore the transaction semantics.

It is also important that we can communicate in which part of the log the transactional boundaries are respected. For instance, when state + time retention (see #36) is enabled, consistency can only be guaranteed in the last part of the log (where time-based retention prevents member removal); in the first part, members could be missing because retention cleaned them up mid-transaction.
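That retention caveat can be stated as a simple predicate: with time-based retention removing members older than some cutoff, completeness can only be guaranteed for a transaction whose earliest member still falls inside the retained window. A sketch, with an illustrative "timestamp" field:

```python
# A transaction is only guaranteed complete if even its oldest member
# is newer than the retention cutoff; otherwise retention may already
# have removed some of its members. Field name is illustrative.
def guaranteed_complete(trx_members, retention_cutoff):
    return min(m["timestamp"] for m in trx_members) > retention_cutoff
```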

To come back to the multiple streams: the split into multiple streams could still be achieved later (for clients that don't need transactionality) by fragmenting or deriving multiple streams, knowing that each such stream or fragmentation breaks the borders of transactions and their integrity guarantees (integrity can only be guaranteed in the mixed stream).