Content Addressability and File Segmenting

Files larger than 16MB are split into segments, both to improve computational parallelization and to support streaming of very large files.

Segments differ from chunks in that the chunk count is fixed (there will always need to be 4/8 chunks), whereas a file can span any number of 16MB segments.
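As a rough illustration of the segmenting rule, the sketch below computes the byte ranges a file would be split into. It is not taken from the Carbonado codebase: the names `SEGMENT_SIZE` and `segment_ranges` are hypothetical, and it assumes "16MB" means 16 MiB (16 * 1024 * 1024 bytes).

```python
# Assumption: "16MB" is read as 16 MiB; the real boundary may differ.
SEGMENT_SIZE = 16 * 1024 * 1024


def segment_ranges(file_size: int, segment_size: int = SEGMENT_SIZE) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of `file_size` bytes.

    Every segment is `segment_size` bytes long except possibly the last,
    which holds whatever remains. A file no larger than one segment
    yields a single range; an empty file yields no ranges.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(segment_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges
```

Because each range is independent of the others, the segments can be handed to separate workers or streamed one at a time.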

To support parallelization, a content catalog is needed to refer back to the original content that was encoded. This content catalog is specific to each storage frontend: for BitTorrent it will be a SHA-2 hash, for IPFS a BLAKE2b multihash, and for the HTTP frontend a BLAKE3 hash. In all cases, the client is encouraged to hash the contents it receives in order to verify that it has indeed received the correct data. Content catalogs will be Carbonado-encoded on disk, with optional encryption to preserve privacy at rest.
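The client-side check described above might look something like the following sketch. It uses only Python's standard library, so it covers SHA-256 (assuming that is the SHA-2 variant meant) and a raw BLAKE2b digest without multihash wrapping; BLAKE3 is not in the standard library and is left out. The `HASHERS` table and `verify_content` function are hypothetical names, not part of any Carbonado API.

```python
import hashlib
import hmac
from typing import Iterable

# Hypothetical mapping from frontend name to a hashlib constructor.
HASHERS = {
    "bittorrent": hashlib.sha256,  # assumption: "SHA-2" means SHA-256
    "ipfs": hashlib.blake2b,       # raw BLAKE2b digest, not multihash-encoded
}


def verify_content(frontend: str, chunks: Iterable[bytes], expected_hex: str) -> bool:
    """Hash the received byte chunks as they arrive and compare to the catalog value."""
    hasher = HASHERS[frontend]()
    for chunk in chunks:
        hasher.update(chunk)
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(hasher.hexdigest(), expected_hex)
```

Feeding the hasher incrementally means the data never has to be held in memory all at once, which fits the streaming use case.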

For each supported frontend, a YAML file is used to simplify inspection; it will contain a list of segments indexed by the Bao hash used to encode them. Additional metadata, such as each segment's offset and index within the file, can also be included to align the contents with IPLD DAGs or BitTorrent chunks. For the rsync frontend, original file metadata can be stored, and files are indexed by a hash of their path. BLAKE3 hashes will be keyed with the file's public key in order to improve privacy by breaking authoritative content hash tables (such as a rainbow-table-like index of files known to state actors).
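To make the catalog layout and the keyed-hash idea more concrete, here is a minimal sketch. It swaps in keyed BLAKE2b from Python's standard library as a stand-in for keyed BLAKE3/Bao, which are not available there, so the digests will not match real Carbonado output. The YAML field names (`hash`, `index`, `offset`, `length`) and both function names are assumptions made for illustration only.

```python
import hashlib


def keyed_segment_hash(segment: bytes, public_key: bytes) -> str:
    """Keyed digest of one segment.

    Stand-in: keyed BLAKE2b, NOT the keyed BLAKE3/Bao hash the proposal
    describes. BLAKE2b accepts keys of up to 64 bytes.
    """
    return hashlib.blake2b(segment, key=public_key, digest_size=32).hexdigest()


def catalog_yaml(segments: list[bytes], public_key: bytes, segment_size: int) -> str:
    """Render a hypothetical YAML catalog listing each segment's hash and position."""
    lines = ["segments:"]
    for index, segment in enumerate(segments):
        digest = keyed_segment_hash(segment, public_key)
        # Quote the digest so an all-digit hex string is not parsed as a number.
        lines.append(f'  - hash: "{digest}"')
        lines.append(f"    index: {index}")
        lines.append(f"    offset: {index * segment_size}")
        lines.append(f"    length: {len(segment)}")
    return "\n".join(lines) + "\n"
```

Because the digest depends on the key, the same content stored under two different public keys produces unrelated hashes, so a precomputed table of unkeyed hashes for known files cannot be used to recognize it.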

diba-io / carbonado

Content Addressability and File Segmenting #17