cc @tonistiigi
This proposal might be relevant to sending build contexts with large files.
@stevvooe @dmcgowan WDYT?
@AkihiroSuda I don't think we should integrate a partial solution for this into continuity. I already have a content-based chunking design that handles the issues around digest-based storage, but this is ultimately part of the storage system, rather than built into the format. The reason behind such a design is that different applications may have different storage requirements. Baking this deep into a distribution format hinders both the distribution system's flexibility (it has to implement the chunk-based model) and the storage system's model (it must store chunks as the distribution system sees fit).
In short, the system would work like this:
chunkmap := GetChunkMap(resource.Digest)
for _, chunk := range chunkmap {
    data := GetChunk(chunk.Digest)
    // ... use data ...
}
Even more interesting are models where you can instantiate an io.ReaderAt with a chunkmap:
readerAt := NewChunkMapReaderAt(chunkmap)
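A minimal sketch of how such a reader could work, assuming a hypothetical Chunk struct and a GetChunk-style fetch callback (neither is an existing continuity API), with chunks ordered by offset and contiguous from zero:

```go
package chunkmap

import "io"

// Chunk mirrors one entry of a chunkmap: a content digest plus the byte
// range it covers within the blob.
type Chunk struct {
	Digest string
	Offset int64
	Length int64
}

// chunkMapReaderAt serves ReadAt calls by fetching only the chunks that
// overlap the requested byte range.
type chunkMapReaderAt struct {
	chunks []Chunk                             // assumed ordered by Offset, contiguous from 0
	size   int64                               // total blob length
	fetch  func(digest string) ([]byte, error) // e.g. a GetChunk-style lookup
}

func NewChunkMapReaderAt(chunks []Chunk, fetch func(string) ([]byte, error)) io.ReaderAt {
	var size int64
	if n := len(chunks); n > 0 {
		size = chunks[n-1].Offset + chunks[n-1].Length
	}
	return &chunkMapReaderAt{chunks: chunks, size: size, fetch: fetch}
}

func (r *chunkMapReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off >= r.size {
		return 0, io.EOF
	}
	n := 0
	for _, c := range r.chunks {
		if n >= len(p) {
			break
		}
		pos := off + int64(n)
		if c.Offset+c.Length <= pos {
			continue // chunk lies entirely before the current read position
		}
		data, err := r.fetch(c.Digest)
		if err != nil {
			return n, err
		}
		// Copy the portion of this chunk that overlaps the request.
		n += copy(p[n:], data[pos-c.Offset:])
	}
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}
```

The point is that only the chunks overlapping the requested range ever need to be fetched.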
But the biggest advantage of this approach is that it allows both systems to benefit mutually.
From my research, the best chunk model looked something like this:
message Chunk {
  // Digest identifies the chunk by hash. Generally, this field is always set.
  string digest = 1;

  // Offset indicates the offset into a blob. If not part of a blob, this
  // value may be omitted.
  int64 offset = 2;

  // Length specifies the length of the chunk. This should always be set.
  int64 length = 3;

  // Data contains the actual bytes for the chunk. This will be unset when
  // using the Chunk as a metadata object or query object.
  bytes data = 252; // use high number to ensure Data is always last
}
message Blob {
  // Digest identifies the blob by its content hash.
  string digest = 1;

  // Length is the total length, in bytes, of the data targeted by the blob
  // descriptor. In blobster, we use "length" and "size" interchangeably, but
  // the value is always serialized under "length".
  int64 length = 2;

  // Chunks describes the offset and size of each chunk making up the blob.
  // These should be ordered by offset, but implementations should validate
  // that before processing. Typically, the "data" field of each chunk will
  // be unset; very small blobs may include their data inline.
  repeated Chunk chunks = 4;
}
Again, in practice, this makes a lot more sense as an implementation detail of the storage system. The less you specify in the distribution format, the more flexibility the access model gains.
OK, makes sense
This proposal enables specifying the digest of a partial chunk of a file.
I'd like to integrate this into https://github.com/AkihiroSuda/filegrain
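For example, given a (digest, offset, length) triple for such a chunk, a consumer could verify just that slice of a file without hashing the whole thing. This is only an illustrative sketch, not actual filegrain code:

```go
package chunkverify

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// VerifyChunk hashes only the length bytes at offset in the file and
// compares the result against the expected chunk digest.
func VerifyChunk(path string, offset, length int64, digest string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, io.NewSectionReader(f, offset, length)); err != nil {
		return err
	}
	if got := fmt.Sprintf("sha256:%x", h.Sum(nil)); got != digest {
		return fmt.Errorf("chunk mismatch: got %s, want %s", got, digest)
	}
	return nil
}
```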