rvagg opened this issue 4 years ago
/cc @warpfork
Changed the title of this after playing a bit with the format. Dropping some thoughts here as I explore this.
The bitfield at the moment uses a big.Int, which can spit out a big-endian byte representation (some of the guts are in https://golang.org/src/math/big/nat.go with the API in https://golang.org/src/math/big/int.go). We're using the Bytes() and SetBytes() methods for serialization and deserialization. What this gives us is the most compact big-endian representation of the number it holds. We're using SetBit() to set individual bits on and off to indicate the presence or absence of an element in our data array, so the number itself is arbitrary; it's the bits that matter.
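For anyone not familiar with the Go API, the pattern is roughly this (a minimal sketch, not the actual go-hamt-ipld code):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	bf := new(big.Int)

	// mark elements present at indexes 3 and 9
	bf.SetBit(bf, 3, 1)
	bf.SetBit(bf, 9, 1)

	// Bytes() emits the most compact big-endian representation
	enc := bf.Bytes()
	fmt.Printf("% x\n", enc) // 02 08  (bit 9 -> 0x0200, bit 3 -> 0x08)

	// round-trip back in with SetBytes() and query individual bits
	dec := new(big.Int).SetBytes(enc)
	fmt.Println(dec.Bit(3), dec.Bit(9), dec.Bit(4)) // 1 1 0
}
```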
The maximum length of the bitfield should be 2^bitWidth bits, to hold enough bits to cover all the indexing we need for any node. So a bitWidth of 8 gives us 256 bits needed to store our bitmap data. So if we were to turn on all the bits because we have a full data array, we'd end up serializing 0xff... for 32 bytes, i.e. 256 1's. But if we only tinker with the first 8 bits, then we only need to serialize one byte. E.g. if we only had an element at index 0 then our bitfield would be a single byte, 0x01, but if we only set the bit at index 8 then we need two bytes so would serialize 0x0100, and so on.
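That variable-length behaviour is easy to see directly (again just a sketch, assuming a bitWidth of 8, i.e. a 256-bit bitfield):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// full data array: all 256 bits on
	full := new(big.Int)
	for i := 0; i < 256; i++ {
		full.SetBit(full, i, 1)
	}
	fmt.Println(len(full.Bytes())) // 32 (all 0xff)

	// only index 0 set: one byte on the wire
	bit0 := new(big.Int).SetBit(new(big.Int), 0, 1)
	fmt.Printf("%d % x\n", len(bit0.Bytes()), bit0.Bytes()) // 1 01

	// only index 8 set: two bytes on the wire
	bit8 := new(big.Int).SetBit(new(big.Int), 8, 1)
	fmt.Printf("%d % x\n", len(bit8.Bytes()), bit8.Bytes()) // 2 01 00
}
```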
Filecoin only uses a bitWidth of 5, so that's 32 bits, or 4 bytes, needed to represent the bitfield.
Some thoughts about this format:
It's convenient in Go where you can just take the bytes out of a big.Int and set them to a big.Int, but it's going to be slightly annoying for everyone else unless they have something already that works in exactly the same way. The ideal internal representation is for a node to have a bitfield of exactly 2^bitWidth bits ready and available to set and unset bits on. The convenience of big.Int bypasses this entirely but that's not going to be the same story across languages. I have https://github.com/rvagg/iamap/blob/master/bit-utils.js for this in JS, but to go to and from this serialization format I'd have to be trimming left-most bytes that contain zeros on the way in and padding them back on the way out.

Validation is loose because of big.Int. It'll treat as valid a block that has a byte array 1000 bytes long in the position for the bitfield (I believe big.Int can handle this kind of arbitrary size). But then it should round-trip it back out as just-long-enough if the block was re-serialized. (So this is in a similar category to the problems suggested in https://github.com/filecoin-project/specs/issues/1045.)

I don't have a strong opinion here yet, would like to hear others' thoughts. My personal preference would be for it to be stable and consistent, with the bitfield byte array in CBOR being exactly 2^bitWidth bits long (exactly 4 bytes, every time, for Filecoin; something like the sketch below) so serialization, validation and explanation of this spec is simpler than it currently is. I doubt that the number of bytes being saved here is very meaningful, but it's not zero.
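To make the fixed-width option concrete, here's a rough sketch of what encode/decode could look like for Filecoin's bitWidth of 5; the helper names here are hypothetical, not from any implementation:

```go
package main

import (
	"fmt"
	"math/big"
)

const bitWidth = 5
const bitfieldBytes = (1 << bitWidth) / 8 // 32 bits -> 4 bytes, every time

// encodeFixed left-pads the compact big.Int bytes out to the full width.
// Assumes no bit above index 2^bitWidth-1 is set.
func encodeFixed(bf *big.Int) []byte {
	out := make([]byte, bitfieldBytes)
	compact := bf.Bytes()
	copy(out[bitfieldBytes-len(compact):], compact)
	return out
}

// decodeFixed accepts only exactly-sized bitfields, so block validation is
// a simple length check rather than "whatever big.Int will swallow".
func decodeFixed(b []byte) (*big.Int, error) {
	if len(b) != bitfieldBytes {
		return nil, fmt.Errorf("bitfield must be %d bytes, got %d", bitfieldBytes, len(b))
	}
	return new(big.Int).SetBytes(b), nil
}

func main() {
	bf := new(big.Int)
	bf.SetBit(bf, 0, 1)
	fmt.Printf("% x\n", encodeFixed(bf)) // 00 00 00 01

	if _, err := decodeFixed([]byte{0x01}); err != nil {
		fmt.Println(err) // rejected: wrong length
	}
}
```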
@warpfork @Stebalien @anorth thoughts?
Thoughts in no particular order:
(4-1)/2 = 1.5 bytes as "waste" per bitfield (given max four bytes, and random distribution, as you outlined) seems near to a rounding error... and that's the cost when one bit is set (a single set bit lands in each of the four bytes with equal probability, so the waste is uniformly 0 to 3 bytes, averaging 1.5); the expected value for "waste" decreases as more bits are set. So if even a handful of bits are set, the expected value of waste is less than a single byte. Seems pretty cheap for the consistency and simplicity gain of a fixed size.

So, I haven't estimated how much work it would be to change this, but I think I broadly agree with your preference for the bitfield byte array to be exactly 2^bitWidth bits.
@warpfork For comparison, we don't use Java's BigInteger in our implementation, but BitSet, which is much simpler and I think in line with Go's big.Int? https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html
I agree with just storing the entire bitmap.
Storing the bitfield explicitly sgtm.
In https://github.com/ipld/specs/pull/282 I'm claiming that the map (bitfield) is addressed here in the same way as in the specification (and therefore the reference implementation) for the IPLD HashMap. The spec currently doesn't use super clear language about how you address a specific bit of a byte array, but it's assuming standard LE addressing where "reading the index bit of the byte array" implies a certain ordering that you can apply masks to. I don't know how the Go bitfield works, but it would be good to verify whether the bit checking used here matches operations on the same byte array as used in https://github.com/rvagg/iamap/blob/master/bit-utils.js and, if not, document exactly how it works so implementers can be crystal clear.