Support streaming parsing of fragments (profile)

sandervd commented 10 months ago

While the most common read pattern for clients will be to read at the end of the log, from time to time new clients will show up that want to sync the full history of the stream. As I explained in #40, the issue with open fragments is that the maximum size of a fragment has an exponential impact on the bandwidth and processing required in the client. This argues for creating smaller fragments, but that has downsides too: smaller fragments mean more requests to the server. This is why it would be ideal to rewrite historical fragments (say, older than a day and immutable) into larger fragments.

Fetching bigger files (especially if the HTTP headers indicate relations, so that fetches can run concurrently) is much more efficient than fetching many smaller ones (connection setup, higher compression rate, ...), but it has a drawback in the current form: no streaming tree:Node parser exists, so the entire graph of a page has to be parsed in memory. When compacting historical fragments into these larger graphs, this could be an issue.

This is why I would suggest a default way of structuring the data in a page, such that a stream-aware parser can parse the document as a stream and emit members as they are processed. This would significantly reduce the memory requirements for large fragments.

The layout of a page (say, using the Turtle serialization, as it offers the best compression) could look something like this:

1. First, the stream membership statements, which are required to find the tree members.
2. Then the members, one by one, ordered first by object id, then by timestamp path (this allows member skipping if the client is only interested in the latest state, reducing the number of upserts on the database the stream is projected into).
3. Last, the relation pointing to the next page.
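The member skipping mentioned in the second item can be sketched as follows. This is a minimal illustration, not part of any LDES client: `Member`, its fields, and `latest_versions` are hypothetical names, and it assumes members arrive already ordered by object id and then by timestamp, as proposed above.

```python
from itertools import groupby
from typing import Iterable, Iterator, NamedTuple


class Member(NamedTuple):
    object_id: str   # hypothetical: the entity this version describes
    timestamp: str   # hypothetical: value found via the timestamp path
    payload: dict    # hypothetical: the rest of the member's data


def latest_versions(members: Iterable[Member]) -> Iterator[Member]:
    """Yield only the last version of each object on a page.

    Relies on the page ordering (object id first, then timestamp), so
    at most one member has to be held in memory at any time.
    """
    for _, versions in groupby(members, key=lambda m: m.object_id):
        last = None
        for last in versions:
            pass  # run to the end of the group; `last` ends on the newest version
        yield last


page = [
    Member("a", "2024-01-01T00:00:00Z", {"v": 1}),
    Member("a", "2024-01-02T00:00:00Z", {"v": 2}),
    Member("b", "2024-01-01T12:00:00Z", {"v": 3}),
]
upserts = list(latest_versions(iter(page)))
```

Only two of the three members reach the database here. The same object can still reappear on a later page, so this reduces upserts per page rather than guaranteeing one upsert per object.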

As all member triples are 'grouped', the parser can read one member at a time.
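A rough sketch of how a client could exploit this grouping, assuming triples are delivered in document order by some streaming RDF parser. The function name `stream_members` is hypothetical, and the sketch simplifies by assuming every triple of a member has the member IRI as its subject (real members may include blank nodes or nested resources, which would need extra bookkeeping).

```python
from typing import Iterable, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
TREE_MEMBER = "https://w3id.org/tree#member"


def stream_members(triples: Iterable[Triple]) -> Iterator[Tuple[str, List[Triple]]]:
    """Emit (member id, triples) as soon as the next member starts.

    Assumes the proposed layout: membership statements first, then each
    member's triples grouped together, then everything else.
    """
    member_ids: Set[str] = set()
    current_id: Optional[str] = None
    current: List[Triple] = []
    for s, p, o in triples:
        if p == TREE_MEMBER:
            member_ids.add(o)  # only the ids are kept, not the full graph
            continue
        if s != current_id:
            if current_id is not None:
                yield current_id, current  # previous member is complete
            # Start a new member, or skip triples that belong to no member.
            current_id, current = (s, []) if s in member_ids else (None, [])
        if current_id is not None:
            current.append((s, p, o))
    if current_id is not None:
        yield current_id, current


example = [
    ("ex:stream", TREE_MEMBER, "ex:m1"),
    ("ex:stream", TREE_MEMBER, "ex:m2"),
    ("ex:m1", "ex:value", "1"),
    ("ex:m1", "ex:time", "t1"),
    ("ex:m2", "ex:value", "2"),
    ("ex:page", "ex:next", "ex:page2"),  # stands in for the relation at the end
]
members = list(stream_members(iter(example)))
```

Memory use is bounded by the set of member ids plus the triples of a single member, rather than by the whole page.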

As the document would be a normal RDF file, and the only semantics added are there to support the streaming behavior, this should be completely backwards compatible for clients that don't support streaming tree parsing. The capability could be indicated by a statement on the view.

sandervd commented 3 months ago

Perhaps we could create a LDES/protobuf serialization?

pietercolpaert commented 3 months ago

Valid point. The biggest problem at the moment is the member extraction algorithm, which treats the full HTTP response as the boundary within which further quads of a member may still turn up. We’d need to extend existing serializations to indicate the bounds of a member in order to support streaming.

A protobuf schema for LDES based on the SHACL shape would indeed be interesting. I’ll see whether we can find a master's thesis on this.
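To illustrate what "indicating the bounds of a member" could mean in a binary serialization, here is a minimal sketch of length-prefixed framing. It is an assumption for illustration only: it uses a fixed 4-byte big-endian length prefix from the Python standard library, which is not the protobuf wire format, and `write_framed` / `read_framed` are hypothetical names. The member bodies are opaque bytes; how a member is encoded inside a frame is left open.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator


def write_framed(out: BinaryIO, members: Iterable[bytes]) -> None:
    """Write each member as a 4-byte big-endian length followed by its bytes."""
    for body in members:
        out.write(struct.pack(">I", len(body)))
        out.write(body)


def read_framed(src: BinaryIO) -> Iterator[bytes]:
    """Yield one member at a time without reading the whole page.

    Assumes `src` is a buffered stream whose read(n) returns n bytes
    unless the end of the stream is reached.
    """
    while True:
        header = src.read(4)
        if not header:
            return  # clean end of page
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack(">I", header)
        body = src.read(size)
        if len(body) < size:
            raise ValueError("truncated member")
        yield body


buf = io.BytesIO()
write_framed(buf, [b"member-1", b"", b"member-3"])
buf.seek(0)
frames = list(read_framed(buf))
```

With explicit bounds like these, the reader knows a member is complete as soon as its last byte arrives, instead of having to wait for the end of the HTTP response.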

SEMICeu / LinkedDataEventStreams

Support streaming parsing of fragments (profile) #42