filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Inline CIDs #2320

Open Stebalien opened 4 years ago

Stebalien commented 4 years ago

@Kubuxu took a histogram of block sizes:

Histogram of value sizes (in bytes)
Total count: 34788050
Min value: 1
Max value: 49371
Mean: 765.27
                   Range     Count
[         0,          2)         1
[         2,          4)         1
[         4,          8)        34
[         8,         16)    476039
[        16,         32)   2369324
[        32,         64)   7740959
[        64,        128)   3360095
[       128,        256)   3710247
[       256,        512)   6946622
[       512,       1024)   1477828
[      1024,       2048)   4482532
[      2048,       4096)   3761116
[      4096,       8192)    445006
[      8192,      16384)     18242
[     16384,      32768)         3
[     32768,      65536)         1

Given this, inlining small blocks into CIDs using the identity hash function would save at least 12% of disk space (probably more because these CIDs would often be smaller).
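
To get a feel for how many blocks an inlining threshold would cover, the histogram above can be tallied directly. The following is a rough sketch that only counts blocks per cutoff; it does not reproduce the 12% disk-space estimate, which also depends on per-block key and storage overhead not shown in this thread. The `bucket` type and helper names are made up for illustration.

```go
package main

import "fmt"

// bucket is one row of the histogram above: block sizes in [lo, hi) bytes.
type bucket struct {
	lo, hi int
	count  int64
}

// histogram copies the counts from the table above verbatim.
var histogram = []bucket{
	{0, 2, 1}, {2, 4, 1}, {4, 8, 34}, {8, 16, 476039},
	{16, 32, 2369324}, {32, 64, 7740959}, {64, 128, 3360095},
	{128, 256, 3710247}, {256, 512, 6946622}, {512, 1024, 1477828},
	{1024, 2048, 4482532}, {2048, 4096, 3761116}, {4096, 8192, 445006},
	{8192, 16384, 18242}, {16384, 32768, 3}, {32768, 65536, 1},
}

// total returns the number of blocks across all buckets.
func total(h []bucket) int64 {
	var n int64
	for _, b := range h {
		n += b.count
	}
	return n
}

// countBelow returns the number of blocks in buckets that lie entirely
// below limit. limit is expected to be a bucket boundary; a bucket that
// straddles limit is not counted.
func countBelow(h []bucket, limit int) int64 {
	var n int64
	for _, b := range h {
		if b.hi <= limit {
			n += b.count
		}
	}
	return n
}

func main() {
	t := total(histogram)
	for _, limit := range []int{32, 64, 128} {
		n := countBelow(histogram, limit)
		fmt.Printf("blocks under %3d bytes: %d (%.1f%% of %d)\n",
			limit, n, 100*float64(n)/float64(t), t)
	}
}
```

By count, roughly 30% of the blocks in this sample are smaller than 64 bytes, although they account for a much smaller share of the total bytes.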

It would also save us from having to write/read all these small objects. Unfortunately, we don't have an access histogram.

Here's an auto-inlining CID builder: https://github.com/ipfs/go-cidutil/blob/master/inline.go
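
For readers unfamiliar with the idea, here is a minimal standard-library sketch of what an auto-inlining builder does. It is not the go-cidutil code linked above. It assumes CIDv1 with the dag-cbor codec and falls back to sha2-256 purely because that hash is in the Go standard library (Filecoin's chain CIDs use a different hash). The `inlineBuilder` type, its inclusive `limit`, and the helper names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Codes from the multicodec table.
const (
	cidV1        = 0x01
	codecDagCBOR = 0x71
	mhIdentity   = 0x00
	mhSha256     = 0x12
)

// inlineBuilder is a hypothetical stand-in for an auto-inlining CID builder.
// Data of at most limit bytes is embedded in the CID via the identity hash.
type inlineBuilder struct {
	codec uint64
	limit int // inclusive (an assumption of this sketch)
}

// appendUvarint appends v to buf as an unsigned varint.
func appendUvarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// Sum returns the binary CIDv1 for data:
// <version><codec><hash code><digest length><digest>.
func (b inlineBuilder) Sum(data []byte) []byte {
	out := appendUvarint(nil, cidV1)
	out = appendUvarint(out, b.codec)
	if len(data) <= b.limit {
		// Identity "hash": the digest is the data itself.
		out = appendUvarint(out, mhIdentity)
		out = appendUvarint(out, uint64(len(data)))
		return append(out, data...)
	}
	digest := sha256.Sum256(data)
	out = appendUvarint(out, mhSha256)
	out = appendUvarint(out, uint64(len(digest)))
	return append(out, digest[:]...)
}

// isInline reports whether a binary CIDv1 uses the identity hash.
func isInline(c []byte) bool {
	off := 0
	for i := 0; i < 2; i++ { // skip the version and codec varints
		_, n := binary.Uvarint(c[off:])
		if n <= 0 {
			return false
		}
		off += n
	}
	code, n := binary.Uvarint(c[off:])
	return n > 0 && code == mhIdentity
}

func main() {
	b := inlineBuilder{codec: codecDagCBOR, limit: 32}
	small := b.Sum([]byte("hi"))
	large := b.Sum(make([]byte, 100))
	fmt.Printf("small: %d bytes, inline=%v\n", len(small), isInline(small))
	fmt.Printf("large: %d bytes, inline=%v\n", len(large), isInline(large))
}
```

An inlined CID carries its own data, so a store never needs to write or read a separate block for it.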

The tricky part is how to wire this in. Ideally, we'd expose the CID builder on the runtime and use it internally inside the CBOR store. Unfortunately, we have some objects that expose a Cid() function to create their own CID.

The best reasonable solution is to:

  1. Have some common package (e.g., the specs-actors?) export a common CIDBuilder.
  2. Have cbor.NewCborStore take a CIDBuilder in the constructor.
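
The two steps above could look roughly like the sketch below. Everything in it is hypothetical: the `CIDBuilder` interface, `NewStore`, and the string keys are toy stand-ins, not the real `cbor.NewCborStore` signature or real CIDs. It only illustrates a store that receives its builder in the constructor and skips the write when the key already embeds the data.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// CIDBuilder is a hypothetical shared interface: it maps encoded bytes to a
// key and reports whether that key already embeds the data.
type CIDBuilder interface {
	Sum(data []byte) (key string, inline bool)
}

// hashBuilder always hashes (toy key format "h:<hex digest>").
type hashBuilder struct{}

func (hashBuilder) Sum(data []byte) (string, bool) {
	d := sha256.Sum256(data)
	return "h:" + hex.EncodeToString(d[:]), false
}

// inlineBuilder embeds data of at most limit bytes (toy key "i:<hex data>")
// and defers to parent otherwise.
type inlineBuilder struct {
	parent CIDBuilder
	limit  int
}

func (b inlineBuilder) Sum(data []byte) (string, bool) {
	if len(data) <= b.limit {
		return "i:" + hex.EncodeToString(data), true
	}
	return b.parent.Sum(data)
}

// Store is a toy in-memory block store that takes its builder at construction.
type Store struct {
	builder CIDBuilder
	blocks  map[string][]byte
}

func NewStore(b CIDBuilder) *Store {
	return &Store{builder: b, blocks: make(map[string][]byte)}
}

// Put returns the key for data; inlined data is never written.
func (s *Store) Put(data []byte) string {
	key, inline := s.builder.Sum(data)
	if !inline {
		s.blocks[key] = append([]byte(nil), data...)
	}
	return key
}

// Get decodes inlined keys directly and looks everything else up.
func (s *Store) Get(key string) ([]byte, bool) {
	if strings.HasPrefix(key, "i:") {
		data, err := hex.DecodeString(key[2:])
		return data, err == nil
	}
	data, ok := s.blocks[key]
	return data, ok
}

func main() {
	s := NewStore(inlineBuilder{parent: hashBuilder{}, limit: 16})
	k := s.Put([]byte("tiny"))
	got, _ := s.Get(k)
	fmt.Printf("%q stored blocks=%d\n", got, len(s.blocks))
}
```

Objects that compute their own key through a Cid() method would need to use the same shared builder, otherwise their keys would disagree with the store's.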

vmx commented 4 years ago

This might not be the perfect place to bring it up, but it's closely related. While I've been working on the Rust implementation of Multihash, it came up that the identity hash currently doesn't specify any size limit. From an optimization perspective (which is why it came up in Rust), and also from a security perspective, I think it would make sense to specify an upper bound for its size.

I personally would pick a fairly low limit, similar to the digest lengths of current hash functions. So perhaps something around 64 bytes?

ribasushi commented 4 years ago

(We should probably take this into a separate issue.) @vmx there are definitely deployments out there today (e.g., Peergos) using ~2k inlined CIDs. Generally, any data that you know won't ever be repeated is a good candidate for inlining. An upper limit already exists: the limit of a network block itself (1MiB soft, 2MiB-1 hard). 64 bytes is most definitely arbitrary, and I'd be very sad if we adopted that.

vmx commented 4 years ago

I don't want to derail this issue, hence I opened https://github.com/multiformats/multihash/issues/130 (I should've from the start, sorry).

arajasek commented 4 years ago

Closed by #2568, I think?

Stebalien commented 4 years ago

No. That paved the way to support this feature, but we still don't actually inline small blocks into CIDs.