Arlodotexe / OwlCore

Have you ever seen an Owl do a barrel roll? Me neither. Essential supplemental tooling for .NET development.
MIT License

Feat: PartitionedStream #2

Closed Arlodotexe closed 1 month ago

Arlodotexe commented 2 years ago

Background

This PR is a first-pass implementation of the proposal at https://github.com/CommunityToolkit/dotnet/issues/133 for a PartitionedStream.

Motivation

Streams are an easy way to store/read a chunk of data from an arbitrary source. However, the only convenient options for storing several chunks of (closely related) data are writing each chunk to its own file, or packing everything into a single file.

Both of these options have obvious issues. Writing to multiple files can quickly make a mess of a file system, when you could have had a single file instead. But, using a single file means loading the entire thing into memory and manually splitting into several Streams, which is really bad for performance.

For the purpose of seamlessly reading and writing multiple chunks of data to a single dataset, I propose the PartitionedStream.

Proposal

Overview

A PartitionedStream allows for partitioning a given stream of data into multiple Streams, which can be passed around to standard APIs and operated on without loading the entire dataset into memory.
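To make the idea concrete, here is a minimal sketch of a read-only "window" over one region of a seekable base stream. It is written in Python for brevity rather than .NET, and the PartitionView name, its constructor parameters, and the read-only restriction are all assumptions for illustration, not part of the proposal's API.

```python
import io


class PartitionView(io.RawIOBase):
    """Hypothetical read-only window over base[offset : offset + length]."""

    def __init__(self, base, offset, length):
        super().__init__()
        self._base = base      # shared, seekable underlying stream
        self._offset = offset  # where this partition starts in the base
        self._length = length  # size of this partition in bytes
        self._pos = 0          # position relative to the partition start

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = pos
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + pos
        elif whence == io.SEEK_END:
            new_pos = self._length + pos
        else:
            raise ValueError("invalid whence")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Never read past the end of this partition, even if the base has more.
        remaining = max(0, self._length - self._pos)
        count = min(len(buffer), remaining)
        if count == 0:
            return 0
        # Reposition the base on every read so several views can share it.
        self._base.seek(self._offset + self._pos)
        data = self._base.read(count)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)


base = io.BytesIO(b"headerBODYfooter")
body = PartitionView(base, offset=6, length=4)
footer = PartitionView(base, offset=10, length=6)
```

Each view only touches the bytes it is asked for, so the base stream is never loaded in full. Because the views share the base's position, this sketch is not safe for concurrent use without external locking.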

The API surface for this is open to discussion, but the base requirements are:

Technical challenges

Fast partition discovery

To discover partitions efficiently, it's vital that we have a partition map which can be read sequentially to avoid making multiple (potentially expensive) read operations just to discover partitions.

When deciding how to discover partitions, the obvious answer is to place a "map" at the start of the stream that we can read sequentially. However, since this map contains data about the partitions, it grows or shrinks whenever a partition is added or an existing one changes significantly, which shifts every single byte of every partition stored after it.

This is expensive and something we want to avoid as much as possible, so I suggest placing the partition map at the end of the stream instead.
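A minimal sketch of a trailing map, in Python for brevity, may help illustrate why this avoids shifting existing data. The layout here is purely an assumption made for the example: partition bytes first, then a JSON-encoded map, then a fixed 8-byte trailer holding the map's length. The function names are hypothetical as well.

```python
import io
import json
import struct

# Assumed trailer: 8-byte little-endian length of the map that precedes it.
TRAILER = struct.Struct("<Q")


def write_map(stream, table):
    """Write the map at the current position, then the fixed-size trailer."""
    blob = json.dumps(table).encode("utf-8")
    stream.write(blob)
    stream.write(TRAILER.pack(len(blob)))
    stream.truncate()  # drop any stale bytes left after the new trailer


def read_map(stream):
    """Return (table, map_start) by seeking to the end of the stream."""
    end = stream.seek(0, io.SEEK_END)
    if end == 0:
        return {}, 0  # empty container: no partitions, map starts at 0
    stream.seek(end - TRAILER.size)
    (map_len,) = TRAILER.unpack(stream.read(TRAILER.size))
    map_start = end - TRAILER.size - map_len
    stream.seek(map_start)
    table = json.loads(stream.read(map_len).decode("utf-8"))
    return table, map_start


def append_partition(stream, name, data):
    """Add a partition without moving any existing partition bytes."""
    table, map_start = read_map(stream)
    stream.seek(map_start)   # new data overwrites the old map
    stream.write(data)
    table[name] = [map_start, len(data)]  # [offset, length]
    write_map(stream, table)  # only the small map is rewritten


container = io.BytesIO()
append_partition(container, "a", b"hello")
append_partition(container, "b", b"world!")
table, _ = read_map(container)
```

Discovery costs one seek to the end plus two short reads, and appending a partition rewrites only the map, while earlier partitions keep their offsets. Growing a partition in the middle of the stream would still require moving the data after it, so this sketch covers only the append case.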

Fast reads, fast writes

Stream allows for synchronous byte-per-byte reads/writes, which many APIs use. This introduces an interesting technical challenge.

To transparently facilitate byte-per-byte reads and writes without breaking functionality, concurrency, or sacrificing performance, we have two options:

Arlodotexe commented 1 month ago

This feature is implemented as FullDuplexStream in Nerdbank.Streams, which is part of the .NET Foundation. I'm closing this ticket and recommending that library instead.