Closed: bbengfort closed 7 years ago
Based on the following:
It looks like ext4 allows an unlimited number of files per directory. However ls, find, readdir, etc. read 32k of directory entries at a time - so I’m going to call that the upper limit of # of files in a directory.
Blob hashes are configurable with the following algorithms: md5, sha1, sha224, sha256, and murmur. Signature lengths for both hex and b64 encoding are as follows (the b64 lengths include the trailing "=" padding):
sha224: 56 hex chars - 7.0 blocks of 8 chars
sha256: 64 hex chars - 8.0 blocks of 8 chars
sha1: 40 hex chars - 5.0 blocks of 8 chars
murmur: 32 hex chars - 4.0 blocks of 8 chars
md5: 32 hex chars - 4.0 blocks of 8 chars
sha224: 40 b64 chars - 5.0 blocks of 8 chars
sha256: 44 b64 chars - 5.5 blocks of 8 chars
sha1: 28 b64 chars - 3.5 blocks of 8 chars
murmur: 24 b64 chars - 3.0 blocks of 8 chars
md5: 24 b64 chars - 3.0 blocks of 8 chars
If we use subdirectories of 8 characters, the “blocks” are the depth of the storage tree.
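The lengths listed above can be reproduced with a short script. This is only a sketch: it covers the four algorithms available in Python's hashlib, and leaves out murmur, which is not in the standard library (the 32 hex chars above imply a 128-bit digest). The function name `signature_lengths` and the `block` parameter are made up for illustration.

```python
import base64
import hashlib

# Algorithms from the list above that hashlib provides (murmur is omitted).
ALGORITHMS = ("md5", "sha1", "sha224", "sha256")


def signature_lengths(name, block=8):
    """Return (hex_len, hex_blocks, b64_len, b64_blocks) for a hash algorithm.

    The b64 length includes "=" padding, matching the list above.
    """
    digest = hashlib.new(name, b"").digest()
    hex_len = len(digest.hex())
    b64_len = len(base64.b64encode(digest))
    return hex_len, hex_len / block, b64_len, b64_len / block


for name in ALGORITHMS:
    hex_len, hex_blocks, b64_len, b64_blocks = signature_lengths(name)
    print(f"{name}: {hex_len} hex chars - {hex_blocks} blocks of 8 chars")
    print(f"{name}: {b64_len} b64 chars - {b64_blocks} blocks of 8 chars")
```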
My feeling is to default to SHA256 + Base64 encoding (without padding), which gives a 43-character name and a tree depth of 5, to minimize blob name collisions and keep the likelihood of >32k files per directory low.
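A minimal sketch of how that default might map a blob to a path, under several assumptions. It assumes the 43-character unpadded name is split into five 8-character directory levels with the remaining 3 characters as the file name, which is one possible reading of "tree depth of 5". It also swaps in the URL-safe Base64 alphabet, because standard Base64 can emit "/", which would be taken as a path separator. The function name `blob_path`, the `root` directory, and the `block` parameter are hypothetical.

```python
import base64
import hashlib
from pathlib import PurePosixPath


def blob_path(data, root="blobs", block=8):
    """Build a nested path for a blob from its SHA256 digest.

    Assumption: URL-safe Base64 without padding, split into fixed-size blocks.
    """
    digest = hashlib.sha256(data).digest()
    # 32 bytes encode to 44 chars with padding, 43 once the "=" is stripped.
    name = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    # Five full 8-char blocks become directories; the 3-char tail is the file.
    parts = [name[i:i + block] for i in range(0, len(name), block)]
    return PurePosixPath(root, *parts)


print(blob_path(b"example blob contents"))
```

One thing to weigh with this layout: an 8-character Base64 block has 64^8 possible values, so the number of entries directly under the root grows roughly with the number of blobs. A smaller `block` value would cap the fan-out per level (for example, 2 characters gives at most 4096 entries).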
I also asked Kostas who asked Amol about how pages are stored in Postgres; he said that pages follow a BTree implementation, but that I should look at the source code for that. I can do that if this scheme doesn’t seem to fit.
Store blobs in a directory structure such that blobs are not in a single directory but rather in multiple directories based on the prefix of their hash.