filecoin-project / go-hamt-ipld

An implementation of a HAMT using ipld
MIT License
29 stars 25 forks source link

Simplify bitfield format; size, ordering, etc. #54

Open rvagg opened 4 years ago

rvagg commented 4 years ago

In https://github.com/ipld/specs/pull/282 I'm claiming that the map (bitfield) is addressed here in the same way as the specification, and therefore the reference implementation for the IPLD HashMap. The spec currently doesn't use super clear language about how you address a specific bit of a byte array but it's assuming standard LE addressing where "reading index bit of the byte array" implies a certain ordering that you can apply masks to. I don't know how the Go bitfield works but it would be good to verify if the bit checking used here matches operations on the same byte array as used here https://github.com/rvagg/iamap/blob/master/bit-utils.js and if not, document exactly how it works so implementers can be crystal clear.

rvagg commented 4 years ago

/cc @warpfork

rvagg commented 4 years ago

Changed the title of this after playing a bit with the format. Dropping some thoughts here as I explore this.

The bitfield at the moment uses a big.Int which can spit out a type of big-endian format. (some of the guts are in https://golang.org/src/math/big/nat.go with the API in https://golang.org/src/math/big/int.go). We're using the Bytes() and SetBytes() methods for serialization and deserialization. What this seems to give us is the most compact big-endian representation of the number it holds. We're using SetBit() to set individual bits on and off to indicate the presence or absence of an element in our data array, so the number is arbitrary, it's the bits that matter.

The maximum length of the bitfield should be 2^bitWidth to hold enough bits to cover all the indexing we need for any node. So a bitWidth of 8 gives us 256 bits needed to store our bitmap data. So if we were to turn on all the bits because we have a full data array, we'd end up serializing 0xff... for 32-bytes, i.e. 256 1's. But if we only tinker with the first 8 bits, then we only need to serialize one byte. e.g. if we only had an element at index 0 then our bitfield would be a single byte, 0x01, but if we only set the 8th bit then we need to bytes so would serialize 0x0100, and so on.

Filecoin only uses a bitWidth of 5, so that's 32-bits, or 4-bytes needed to represent the bitfield.

Some thoughts about this format:

I don't have a strong opinion here yet, would like to hear others' thoughts. My personal preference would be for it to be stable and consistent, with the bitfield byte array in CBOR being exaclty 2^bitWidth (exactly 4 bytes, every time, for Filecoin) so serialization, validation and explanation of this spec is simpler than it currently is. I doubt that the number of bytes being saved here is very meaningful—but it's not zero.

@warpfork @Stebalien @anorth thoughts?

warpfork commented 4 years ago

Thoughts in no particular order:

So, I haven't estimated how much work it would be to change this, but I think I broadly agree with your preference for the bitfield byte array to be exactly 2^bitWidth.

ianopolous commented 4 years ago

@warpfork For comparison, we don't use Java's BigInteger in our implementation, but BitSet which is much simpler and I think inline with go's big.Int? https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html

Stebalien commented 4 years ago

I agree with just storing the entire bitmap.

anorth commented 4 years ago

Storing the bitfield explicitly sgtm.

rvagg commented 4 years ago

done in https://github.com/ipfs/go-hamt-ipld/pull/63