Streams are an easy way to store or read a chunk of data from an arbitrary source.
However, the only convenient options for storing several chunks of (closely related) data are:
Use multiple streams pointing to multiple resources (files, etc)
Manually load the entire resource into memory, separate the data into multiple streams, then pass it along.
Both of these options have obvious issues. Writing to multiple files can quickly make a mess of a file system when a single file would have sufficed. But using a single file means loading the entire thing into memory and manually splitting it into several Streams, which is really bad for performance.
For the purpose of seamlessly reading and writing multiple chunks of data to a single dataset, I propose the PartitionedStream.
Proposal
Overview
A PartitionedStream allows for partitioning a given stream of data into multiple Streams, which can be passed around to standard APIs and operated on without loading the entire dataset into memory.
The API surface for this is open to discussion, but the base requirements are:
A PartitionedStream that takes a Stream in the constructor where partitions can be read/written.
A StreamPartition : Stream that is thread-safe and can be used as a normal Stream, but only interacts with a single partition.
Method on PartitionedStream that finds and returns all existing StreamPartitions.
Method on PartitionedStream that adds a partition and returns a new StreamPartition.
Method to delete a partition, either:
Directly on StreamPartition, or
On the PartitionedStream, taking a StreamPartition as a parameter.
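The requirements above might translate into an API shape along these lines. This is a hypothetical sketch only: every member name and signature here is a placeholder for discussion, not a finalized design.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical API surface for the proposal; names and signatures are illustrative only.
public sealed class PartitionedStream : IDisposable
{
    // Wraps an existing Stream whose contents hold zero or more partitions.
    public PartitionedStream(Stream source) { /* parse the partition map here */ }

    // Finds and returns all existing partitions.
    public IReadOnlyList<StreamPartition> GetPartitions() => throw new NotImplementedException();

    // Adds a new, empty partition and returns a handle to it.
    public StreamPartition AddPartition() => throw new NotImplementedException();

    // Deletes a partition and reclaims its space in the underlying stream.
    public void DeletePartition(StreamPartition partition) => throw new NotImplementedException();

    public void Dispose() { }
}

// A thread-safe Stream view confined to a single partition of the underlying data.
public abstract class StreamPartition : Stream
{
    // Reads, writes, and seeks only operate within this partition's byte range.
}
```

Keeping creation and deletion on PartitionedStream (rather than on StreamPartition itself) keeps the partition map owned by a single object, which simplifies the thread-safety story.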
Technical challenges
Fast partition discovery
To discover partitions efficiently, it's vital that we have a partition map which can be read sequentially to avoid making multiple (potentially expensive) read operations just to discover partitions.
When deciding how to discover partitions, the obvious answer is to place a "map" at the start of the stream that we can read sequentially. However, since this map contains data about the partitions, when a new partition is added or an existing one is drastically changed, it will offset every single byte in all partitions.
This is expensive and something we want to avoid as much as possible, so I suggest placing the partition map at the end of the stream instead.
The length of the map is stored in the very last byte, allowing the initial read of the map to be done in as few requests as possible.
The map can move as data is added or removed, but it's a much smaller amount of data to shift in memory.
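As a rough illustration of the tail-map idea, reading the map takes just two seeks: one for the length byte at the very end, and one for the map bytes immediately before it. The exact encoding here is an assumption for illustration, not a finalized format.

```csharp
using System;
using System.IO;

static class TailMapExample
{
    // Reads the partition map from the end of a stream, assuming the layout:
    // [partition data...][map bytes][1 byte: map length]
    public static byte[] ReadMap(Stream stream)
    {
        // Read the final byte to learn how long the map is.
        stream.Seek(-1, SeekOrigin.End);
        int mapLength = stream.ReadByte();

        // The map sits immediately before the length byte.
        stream.Seek(-1 - mapLength, SeekOrigin.End);
        byte[] map = new byte[mapLength];
        stream.ReadExactly(map); // .NET 7+; loop over Read on older targets
        return map;
    }
}
```

A single length byte caps the map at 255 bytes; a real format would likely widen this field, but the access pattern stays the same.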
Fast reads, fast writes
Stream allows for synchronous byte-per-byte reads/writes, which many APIs use. This introduces an interesting technical challenge.
To transparently facilitate byte-per-byte reads and writes without breaking functionality, concurrency, or sacrificing performance, we have two options:
Direct insert
Directly insert new data into the correct partition.
Shifts all succeeding partitions on every write.
Every read/write operation needs to handle shifting partition positions in real time.
Significantly easier to implement than fragmented partitions.
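The cost of the direct-insert option can be made concrete with a simplified in-memory sketch. A real implementation would do this against the underlying stream with buffered copies; this array version only illustrates why every byte after the insertion point has to move.

```csharp
using System;

static class DirectInsertExample
{
    // Inserts 'data' at 'offset' within the blob. Every byte after the
    // insertion point (i.e., all succeeding partitions) must be shifted.
    public static byte[] Insert(byte[] blob, int offset, byte[] data)
    {
        var result = new byte[blob.Length + data.Length];
        Array.Copy(blob, 0, result, 0, offset);                 // bytes before the insert
        Array.Copy(data, 0, result, offset, data.Length);       // the new data
        Array.Copy(blob, offset, result, offset + data.Length,  // shifted tail
                   blob.Length - offset);
        return result;
    }
}
```

The shifted tail makes every write O(total size of all succeeding partitions), which is what the fragmented-partitions option is designed to avoid.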
Fragmented partitions
Append new data to the end of the stream (whether a single byte or a buffer of bytes).
For finding fragments, either:
The start of each fragment is a pointer to the position of the next fragment (sentinel value if none)
Each fragment range is stored separately in the map
Only fragmented in memory. Must be defragmented when flushed for fast sequential reads when read back later.
Better for performance-critical reads/writes (data isn't being constantly shifted around)
Until defragmented, uses slightly more memory to store pointers.
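The pointer-chain variant of fragment discovery could be sketched as follows. The encoding is an assumption for illustration: each fragment starts with a 4-byte little-endian offset of the next fragment, with -1 as the sentinel for the last one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FragmentChainExample
{
    private const int NoNextFragment = -1; // assumed sentinel value

    // Walks a chain of fragments starting at 'firstOffset', where each
    // fragment begins with a 4-byte pointer to the next fragment's offset.
    public static List<int> CollectFragmentOffsets(Stream stream, int firstOffset)
    {
        var offsets = new List<int>();
        int current = firstOffset;
        while (current != NoNextFragment)
        {
            offsets.Add(current);
            stream.Seek(current, SeekOrigin.Begin);
            var pointer = new byte[4];
            stream.ReadExactly(pointer); // .NET 7+; loop over Read on older targets
            current = BitConverter.ToInt32(pointer, 0);
        }
        return offsets;
    }
}
```

Note the trade-off this makes visible: each fragment costs one extra seek and a 4-byte read, which is why defragmenting on flush matters for later sequential reads.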
This feature is implemented as FullDuplexStream in Nerdbank.Streams, which is part of the .NET Foundation. I'm closing this ticket and recommending that library instead.
Background
This PR is a first-pass implementation of the proposal at https://github.com/CommunityToolkit/dotnet/issues/133 for a PartitionedStream.
Motivation
Streams are an easy way to store or read a chunk of data from an arbitrary source. However, the only convenient options for storing several chunks of (closely related) data are:
Use multiple streams pointing to multiple resources (files, etc)
Manually load the entire resource into memory, separate the data into multiple streams, then pass it along.
Both of these options have obvious issues. Writing to multiple files can quickly make a mess of a file system when a single file would have sufficed. But using a single file means loading the entire thing into memory and manually splitting it into several Streams, which is really bad for performance.
For the purpose of seamlessly reading and writing multiple chunks of data to a single dataset, I propose the PartitionedStream.
Proposal
Overview
A PartitionedStream allows for partitioning a given stream of data into multiple Streams, which can be passed around to standard APIs and operated on without loading the entire dataset into memory.
The API surface for this is open to discussion, but the base requirements are:
A PartitionedStream that takes a Stream in the constructor where partitions can be read/written.
A StreamPartition : Stream that is thread-safe and can be used as a normal Stream, but only interacts with a single partition.
Method on PartitionedStream that finds and returns all existing StreamPartitions.
Method on PartitionedStream that adds a partition and returns a new StreamPartition.
Method to delete a partition, either:
Directly on StreamPartition, or
On the PartitionedStream, taking a StreamPartition as a parameter.
Technical challenges
Fast partition discovery
To discover partitions efficiently, it's vital that we have a partition map which can be read sequentially to avoid making multiple (potentially expensive) read operations just to discover partitions.
When deciding how to discover partitions, the obvious answer is to place a "map" at the start of the stream that we can read sequentially. However, since this map contains data about the partitions, when a new partition is added or an existing one is drastically changed, it will offset every single byte in all partitions.
This is expensive and something we want to avoid as much as possible, so I suggest placing the partition map at the end of the stream instead.
Fast reads, fast writes
Stream allows for synchronous byte-per-byte reads/writes, which many APIs use. This introduces an interesting technical challenge.
To transparently facilitate byte-per-byte reads and writes without breaking functionality, concurrency, or sacrificing performance, we have two options: