filecoin-project / venus

Filecoin Full Node Implementation in Go
https://venus.filecoin.io

Implement Consistent Varint Encoding #340

Closed dignifiedquire closed 6 years ago

dignifiedquire commented 6 years ago

Description

Currently the code base uses a mix of big.Int and {u}int{32|64} to represent numbers, and these get translated as-is into encoded blocks.

The spec, though, says to use LEB128, with size limits in some places when decoding.
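For reference, the core of an unsigned LEB128 codec is tiny; this sketch is essentially what Go's own encoding/binary varint routines do, not the project's actual implementation:

```go
package main

import "fmt"

// putUvarint writes x as an unsigned LEB128 varint: 7 value bits per
// byte, least-significant group first, with the high bit set on every
// byte except the last. Returns the number of bytes written.
func putUvarint(buf []byte, x uint64) int {
	i := 0
	for x >= 0x80 {
		buf[i] = byte(x) | 0x80
		x >>= 7
		i++
	}
	buf[i] = byte(x)
	return i + 1
}

// uvarint decodes an unsigned LEB128 varint, returning the value and
// the number of bytes consumed (0 if buf ends mid-varint).
func uvarint(buf []byte) (uint64, int) {
	var x uint64
	var s uint
	for i, b := range buf {
		if b < 0x80 {
			return x | uint64(b)<<s, i + 1
		}
		x |= uint64(b&0x7f) << s
		s += 7
	}
	return 0, 0
}

func main() {
	buf := make([]byte, 10)
	n := putUvarint(buf, 300)
	fmt.Printf("% x\n", buf[:n]) // ac 02
	v, _ := uvarint(buf[:n])
	fmt.Println(v) // 300
}
```

The spec's size limits would then be a thin validation layer on top of a decoder like this.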

Your task, should you accept it, is to review the code base and

  1. make a suggestion on how to best integrate this into the codebase
  2. after feedback, implement.

Acceptance criteria

Risks + pitfalls

Where to begin

mishmosh commented 6 years ago

Implementation proposal, please.

mishmosh commented 6 years ago

@acruikshank took a look, needs more info. @dignifiedquire will elaborate.

dignifiedquire commented 6 years ago

@acruikshank could you provide some questions on what is unclear please?

acruikshank commented 6 years ago

@dignifiedquire I was slightly mis-remembering when I said this was unclear (since part of the task is to clarify). The only real question is the scope of this change. There are many places we encode values. Are we only concerned with ints that will ultimately live on-chain here? That would include message parameters and anything that stays in actor memory. If that's it, I think this is ready to start (researching).

dignifiedquire commented 6 years ago

We are concerned with all values that end up on the chain, and also all the ones on disk, as it will make life easier for us in the future.

phritz commented 6 years ago

@dignifiedquire (and anyone else knowledgeable about this area):

Edit: for clarity

phritz commented 6 years ago

I started digging into this today and figured I'd surface some initial thoughts for feedback/input. This is new territory for me so please poke holes and point out oversights in what follows.

There's more but it doesn't rise to the level of discussion at this point and depends on input on the above....

phritz commented 6 years ago

Also:

dignifiedquire commented 6 years ago

Why did we choose LEB128? Do we have a reason to be opposed to using something like prefix varint which is exactly like LEB128 but just shifts the length bits to the front for efficiency (eg https://news.ycombinator.com/item?id=11263378)?
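For concreteness, one common PrefixVarint layout puts a unary length tag in the low bits of the first byte, so the decoder learns the full length from a single branch instead of a per-byte continuation-bit loop. This is an illustrative sketch, not anything the spec mandates:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// putPrefixUvarint encodes x so the first byte alone tells the decoder
// the total length: for an n-byte encoding (n <= 8) the low bits of the
// little-endian word are n-1 one-bits followed by a zero, then x.
// Values >= 2^56 use an 0xff escape byte plus 8 payload bytes.
func putPrefixUvarint(buf []byte, x uint64) int {
	if x >= 1<<56 {
		buf[0] = 0xff
		binary.LittleEndian.PutUint64(buf[1:], x)
		return 9
	}
	n := 1 + (63-bits.LeadingZeros64(x|1))/7 // ceil(bitlen/7)
	var tmp [8]byte
	binary.LittleEndian.PutUint64(tmp[:], x<<uint(n)|(uint64(1)<<uint(n-1)-1))
	copy(buf, tmp[:n])
	return n
}

// prefixUvarint reads the length from the first byte in one step.
func prefixUvarint(buf []byte) (uint64, int) {
	if buf[0] == 0xff {
		return binary.LittleEndian.Uint64(buf[1:9]), 9
	}
	n := bits.TrailingZeros8(^buf[0]) + 1
	var tmp [8]byte
	copy(tmp[:], buf[:n])
	return binary.LittleEndian.Uint64(tmp[:]) >> uint(n), n
}

func main() {
	buf := make([]byte, 9)
	n := putPrefixUvarint(buf, 300)
	v, _ := prefixUvarint(buf[:n])
	fmt.Println(n, v) // 2 300
}
```

Byte cost is identical to LEB128 (7 payload bits per byte); the claimed win is decode speed, since the length is known up front.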

We are optimizing for (1) storage space, esp efficient encoding for small values and (2) what else? I'm not asking what would be nice, I'm asking what is most important.

  1. future proofing, making sure we don't suddenly run out of number space
  2. minimizing storage size on chain
  3. performance
  4. developer ux

Why not just have two varint types, signed and unsigned, as opposed to having size limits filecoin-project/specs@7b8d7d7#diff-958e7270f96f5407d7d980f500805b1bR56? Is it to restrict the size of inputs to prevent DoS/resource consumption, or are there other reasons?

Two ideas to this

Re encoding: the encoding needs to be deterministic and unique. LEB128 is not unique out of the box (you can left-pad the encoded value with 0's or 1's if negative) but it looks like we can just disallow zero padding.

good catch, we should choose a single option and make sure to add it to the spec and enforce it.
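Enforcing a single canonical form could be as simple as rejecting a redundant final 0x00 group at decode time. A sketch for the unsigned case:

```go
package main

import (
	"errors"
	"fmt"
)

// canonicalUvarint decodes an unsigned LEB128 varint but rejects
// non-minimal encodings: a final 0x00 group is pure zero padding
// unless it is the sole byte (the value zero itself).
func canonicalUvarint(buf []byte) (uint64, int, error) {
	var x uint64
	var s uint
	for i, b := range buf {
		if b < 0x80 {
			if b == 0 && i > 0 {
				return 0, 0, errors.New("varint: non-canonical zero padding")
			}
			return x | uint64(b)<<s, i + 1, nil
		}
		x |= uint64(b&0x7f) << s
		s += 7
	}
	return 0, 0, errors.New("varint: truncated")
}

func main() {
	v, n, err := canonicalUvarint([]byte{0xac, 0x02}) // minimal form of 300
	fmt.Println(v, n, err)
	_, _, err = canonicalUvarint([]byte{0xac, 0x82, 0x00}) // padded 300
	fmt.Println(err) // rejected
}
```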

I wonder whether all of this structured data shouldn't be custom types sooner rather than later...

What exactly do you mean with custom types?

phritz commented 6 years ago

LEB128 is the only well-known, somewhat standardized varint encoding format that I could find; all other derivatives are mostly one-off projects with little standard support outside of their specific ecosystem

OK. The fact that these "standards" are like <= 40 lines of code, and therefore easy for anyone to implement, inclines me to pick the best one as opposed to the one most people have heard of. But it bears a bit more investigation.

Two ideas to this

  1. be able to optimize runtime performance by just using 32/64-bit numbers instead of actual varints for these, for now, while still knowing we can upgrade without breaking the chain format
  2. make validation easier/stricter and as such reduce the ability for DoS/resource-consumption attacks during parse time

Potential optimizations noted. My instinct is to go the other, simpler direction though: define exactly two types (one varint, one varuint) with a reasonable max size, say 256 bits, and just use that everywhere until we have a good reason to introduce a bunch of types. I'm asking for feedback from everyone: are there reasons to prefer defining different-sized varints (say, varint32, varint64, varint128, varint256) rather than just defining one size (say, var(u)int256) and being done with it for now? I'm having trouble marshaling an argument in favor of many types -- it seems slightly safer in the sense that fields and parameters are more guaranteed to be right-sized, but it's hard for me to know how much safer that makes us or if there are other reasons to prefer lots of types... Thoughts?

I wonder whether all of this structured data shouldn't be custom types sooner rather than later...

What exactly do you mean with custom types?

The idea would be to introduce types.Map and types.Array and use those on chain and in actors. That way we can be sure that maps and arrays are being used (eg, indexed) in safe/proper ways; we start formalizing the structured data api which is what actors see, what goes into the chain, and what we're going to have to implement on the vm side; and it's clear what to merkelize (at the level of the structured data types). I guess if we start going down this route we need to decide how deep to go: do we mix these types with go types or go all the way in (eg, types.Struct instead of go structs)?

porcuquine commented 6 years ago

There's a lot going on here, so I'll just comment on the explicit question above.

Abstractly, I favor minimizing the number of variable-sized formats. Just one for signed and one for unsigned sounds good to me. The alternative feels like premature optimization with a complexity cost (and especially the overhead of forcing design decisions up-front) which undermines some benefits of adopting a flexible and future-proof format in the first place.

acruikshank commented 6 years ago

I'm ambivalent on having multiple types. For me it's a trade-off between a slightly more complicated spec vs. the slight absurdity of using a 256 bit (capable) number to store something that won't get out of the thousands. I am wondering if we might decide to store something like a public key as a number, in which case we'd need something bigger than 256.

aboodman commented 6 years ago

It looks like what's required here is defining one or more varint types with an interface similar to big.Int (Add, Mul, Exp, LessThan, etc)

I would really encourage people to think of this kind of work in the other direction: from serialization up. The serialization is the most important thing. It needs to meet our use cases, needs to be guaranteed deterministic, and cannot easily be changed once shipped. It becomes part of the spec. The API is just API. It's language-specific, and can change whenever. As you say, the API can just be big.Int for now, or even int in most cases for now.

The utility of having types of varintN for different values of N is unclear to me.

Again, it sort of depends on the level you are speaking at:

picking a reasonable max value N

Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.

Re encoding: why LEB128? It seems like a good encoding on several axes but there are potentially better encodings out there. For example prefix varint which is basically LEB128 but with the continuation bits packed at the front (see previous comment); this results in significant performance gains or so the internets tell me.

I don't have a strong opinion on encoding issues, except that I prefer to just have one that is as general and well-tested as possible.

Actor code uses Go slices and maps. A slice is indexed by int64 in miner.go, but I think slice indices are only 32 bits on some platforms, so that gets me a little worried. If indexing a slice by a varint we could convert to 32/64 bits; this would just be a call like varint.ToInt64() or similar. Maps we're indexing by string, and I think this works fine for varints -- if we want to use a varint as an index, use its .String() value. But this gets me wondering whether sooner rather than later we should be using our own slice/array and map types here instead of the Go types. It would enable us to enforce how the types are used and also start giving shape to how these things should actually be implemented. @aboodman you've built this type system before, whatchoo think?

Yeah this is a big topic.

  1. People like to use the language-native types (map, slice, whatever), especially in Go as API. I think that is fine for most cases, at least for now. But whatever API we use to access the data model, we need to be absolutely sure of our serialization and validation between that and the on-chain representation. We've already run into several bugs where the serialization was non-deterministic (e.g., https://github.com/polydawn/refmt/pull/28), and this makes me very nervous. I think that we need to control this code directly.

  2. Because people generally prefer to use the Go types, my preference would be to focus on that for now and really make sure we have the translations correct there. If we introduce another API there will be tension over which to use and we'll have to ensure both are correct, and we'll just double the amount of work we have to do.

  3. Large-scale data. We are much different than, e.g., Ethereum because we want to store lots of structured data inside actor memory. Like, the entire storage market. We cannot just deserialize the entire market into a Go map and call it a day. So this is going to eventually have to get more complicated. We are going to need collections of some sort that we can read and write incrementally.

  4. Structure/types/interop. Just like how we have a current desire to restrict the size of integers in parameters to messages, we will, over time, probably want to add additional structure to message parameters. Instead of just a buffer, I want to say that you should pass me a struct that represents a multisig signing policy. Putting it on developers to reinvent deterministic serialization for complex data, and putting decoding and validation inside actors doesn't seem like it's going to work long-term. It also makes it harder for actors to interop. If I want to rely on two different actors, but they have different ways to represent list or struct, that makes it pretty hard to compose things. I believe that over time there will be pressure to increase the richness of the vocabulary you can use in message parameters for this reason.

I'm going to be thinking about the data model more as part of the vm work. For now I think it makes sense to continue the status quo, but maybe fork the serialization code so that we can improve it faster (of course making it public so that upstream can take the work if they want).

Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?

I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.

Seems like we can't launch without custom types

Why can't we launch w/o ooc?

phritz commented 6 years ago

Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.

That's an interesting idea. Does LEB128 (basically, encode 7 bits of number per 8-bit byte) qualify to you as compact? Or do you mean a fixed-size encoding like IEEE 754 floats (eg, maybe 80-bit extended precision or 128 bit)?

I wonder...

Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?

I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.

Sorry I was talking about the size of the address space. 4GB is tiny. There must be a mechanism that enables us to address >4GB of data?

Seems like we can't launch without custom types

Why can't we launch w/o ooc?

We can, but if we do it seems like we're accepting go semantics for our types. Seems like it has far-reaching implications, beyond even stuff like "you can't index a slice with an int64". For example, if we want to index a map with our varint type, how does that work? If we use the varint type directly then it has to support == properly. Or we could index with varint.String() by convention, or maybe varint.Bytes() -- or anything else that supports equality. Or we could support it as a first-class thing with our own map type and define key and equality ourselves. Maybe we want it keyed by hash? Think of the different ways these choices serialize. Whatever we choose, we're locking ourselves into it. Seems unlikely to me that what go does is exactly what we want for the long term. Similarly, map iteration order is random. I guess we could say "don't iterate maps", launch with that, and figure it out later. Or we could define our own type that prohibits iteration or defines an order we think makes sense. I'm kind of going down the line for each of these types and wondering whether we are making an explicit choice to accept go semantics, and if we should be.

Maybe it's just a matter of putting an API in place to constrain how they work, I dunno.

Edit: in the above I'm conflating two things: how data types are serialized, and their semantics (how they work). The former is the biggest concern, and I guess it doesn't require custom types; it just requires fine-grained control over the serialization format. The latter is still a concern, but a lesser one I guess.
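The map-key question above is concrete in Go: big.Int can't be used directly as a map key (it wraps a slice, so it doesn't support ==), which is exactly why a convention like keying by .String() comes up. A sketch, with a hypothetical Varint type:

```go
package main

import (
	"fmt"
	"math/big"
)

// Varint is a hypothetical chain integer type. Because it wraps
// big.Int (which contains a slice and so is not comparable), a value
// of this type cannot be a Go map key directly.
type Varint struct{ v big.Int }

// Key returns a canonical string form for map indexing. Whatever form
// is chosen here becomes part of the de facto semantics of the type.
func (x *Varint) Key() string { return x.v.String() }

func main() {
	balances := map[string]uint64{}
	var addr Varint
	addr.v.SetString("340282366920938463463374607431768211456", 10) // 2^128
	balances[addr.Key()] = 42
	fmt.Println(balances[addr.Key()]) // 42
}
```

Keying by hash, by .Bytes(), or via a custom map type would each serialize and behave differently, which is the lock-in concern raised above.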

phritz commented 6 years ago

Re:

In the serialization, the only value I can imagine is efficiency of decoding/encoding.... ... At the message level, it is nice to be able to be specific about the parameter and return types....

Yeah I was referring to the API level. At the serialization level I think we'll have exactly one or at most two serialization formats no matter how many integer types we have. Either exactly one serialization format that works for both signed and unsigned integers, or exactly one for signed and exactly one for unsigned. I wouldn't encode anything about sizes into the serialization format; that would be a higher-level API thing (basically, barf once you've read too many bits of data or overflow).

People like to use the language-native types (map, slice, whatever), especially in Go as API. I think that is fine for most cases, at least for now. But whatever API we use to access the data model, we need to be absolutely sure of our serialization and validation between that and the on-chain representation.

Totally agree and this is what's worrying me at the moment. We need a line of VM work in this area and it sounds like you are on top of it.

Because people generally prefer to use the Go types, my preference would be to focus on that for now

OK. I'm not going to worry too much about the implications of the new varint type on composite data types, I assume we're going to have work to do in that area as the data model emerges and we'll get to it then.

aboodman commented 6 years ago

Maybe stupid question, but do we even have to do that? Is it not possible to come up with a relatively compact encoding that can represent any rational value and call it a day? This is what I wish we did in Noms, FWIW.

That's an interesting idea. Does LEB128 (basically encode 7 bits of number per 8 bit byte) qualify to you as compact?

No because for values with large magnitude but low precision, there are lots of wasted bits. I mean something more like this:

https://github.com/filecoin-project/specs/pull/78/files#diff-958e7270f96f5407d7d980f500805b1bR62

But changing:

(Warning: I haven't thought this through, there might be major plotholes. Will think about it some over the weekend.)

Or do you mean a fixed-size encoding like IEEE 754 floats (eg, maybe 80-bit extended precision or 128 bit)?

I think we should declare that floats are verboten in this system.

I wonder...

  • Up to what size we want exactly represented integers? Maybe a better way to put it: what is the lowest number at which having a non-exact integer representation would be a deal-breaker? 2**64-1?

I feel weird limiting this at the serialization layer if we can avoid it.

  • How much we care about bit-twiddling?

We who? If we can isolate the bit-twiddling in a library, it seems like it's just an issue for efficiency of decoding and encoding.

Speaking of 32 bits... wasm's address space is 32 bits. How do we imagine that working? @aboodman ?

I don't understand the question. We can represent integers bigger than 32 bits in a 32 bit environment. It's just slower.

Sorry I was talking about the size of the address space. 4GB is tiny. There must be a mechanism that enables us to address >4GB of data?

I think that the storage system will be separate from address space when we introduce the vm. Think of it just like any classic storage API -- the address space inside the VM will have no bearing on the size of data storable.

Seems like we can't launch without custom types

Why can't we launch w/o ooc?

We can, but if we do it seems like we're accepting go semantics for our types.

.. this is a very deep question that I care a lot about. How about we split it out into its own bug or work stream or something?

phritz commented 6 years ago

I think that the storage system will be separate from address space when we introduce the vm. Think of it just like any classic storage API -- the address space inside the VM will have no bearing on the size of data storable.

Ah, that makes sense. If this is the case, we should stop talking about memory as a linear byte space and start talking about it as a structured data dag or merkelspace or something like that, meaning data structures are merkelized and references are hashes.

.. this is a very deep question that I care a lot about. How about we split it out into its own bug or work stream or something?

Sure. Digging into this touched on a bunch of issues that we don't have to solve now, they just got me thinking.

phritz commented 6 years ago

Popping back up to the actual problem. Here's what I've gathered.

Goals:

Assumptions/requirements for encoding scheme:

Assumptions/requirements for API:

The plan:

Other things that I considered/encountered:

BTW @aboodman thanks for the discussion this thinking in particular was very helpful:

I would really encourage people to think of this kind of work in the other direction: from serialization up. The serialization is the most important thing.

aboodman commented 6 years ago
  • For the API we'll mimic big.Int and do the small corrective surgery to convert our token/bytes types over.

Why mimic big.Int instead of just using it directly? The way we do marshaling today is we start with Go types and marshal to cbor. We frequently define custom serializations for various Go types, we could do the same for big.Int. We could also do the same for int8, int16, etc and have them all marshal to the same encoding, so it would be purely programmer convenience which Go type to choose at any random call site.

Don't we need scientific notation fairly early for token amounts? It is justified in the same spec link above. I suppose it's not "right now", but by testnet. And if we do need it, does that mean we'll end up having a few different serializations for numeric types, or would you migrate them all over at that time?

BTW @aboodman thanks for the discussion

:)

dignifiedquire commented 6 years ago

Don't we need scientific notation fairly early for token amounts?

Yes, we do. An immediate follow-on story to this is moving all token amounts to rational numbers and encoding them as such.

phritz commented 6 years ago

Why mimic big.Int instead of just using it directly?

Because these types are found in the spec and big.Int is a Go thing? I have assumed -- perhaps incorrectly -- that the spec needs to define the semantics of all the data types used in algorithms covered by the spec. For example the payment channel actor implementation: https://github.com/filecoin-project/specs/blob/master/drafts/retrieval-market.md It seems like we're going to need to give precise algorithms for the implementations of its methods, which means defining exactly what operations you can do with data types and how they should work.

Under the above assumption we could adopt big.Int as the interface and everything that comes with it, but it seems easier and simpler to define our own interface, taking the subset of big.Int that we need and making any tweaks that seem useful, as opposed to doing exactly what big.Int does. (This is another reason it seems inevitable to me that we define our own types for map, array, and the like.)

I admit I don't fully understand how the spec is supposed to facilitate separate implementations. At one extreme a strategy is to precisely define all the consensus/chain-affecting types in use and how they work. This strategy means it's pretty straightforward to do a separate implementation: filecoin defines a bunch of types and interfaces and gives you the algorithms in terms of those types (in Go seems fine, or something close to it). You implement the types and interfaces as described. A less prescriptive strategy would be to precisely define just what's encoded on chain and then describe in more general terms the algorithms in use. That seems fraught with potential for divergent behavior so I guess I never considered it, but maybe that's what's intended. Or maybe I'm missing some middle ground?

cc @dignifiedquire and @whyrusleeping to cross-check my assumptions.

Don't we need scientific notation fairly early for token amounts?

We need something for token amounts. Based on the conversations in https://github.com/filecoin-project/specs/pull/78 I'm not getting warm fuzzies that we have fully thought it through yet. Will continue the conversation there.

An immediate follow on story to this is moving all token amounts to rational numbers and encode them as such.

Let's continue this specific part of the conversation in https://github.com/filecoin-project/specs/pull/78.

aboodman commented 6 years ago

Why mimic big.Int instead of just using it directly?

Because these types are found in the spec and big.Int is a Go thing? I have assumed -- perhaps incorrectly -- that the spec needs to define the semantics of all the data types used in algorithms covered by the spec. For example the payment channel actor implementation: https://github.com/filecoin-project/specs/blob/master/drafts/retrieval-market.md It seems like we're going to need to give precise algorithms for the implementations of its methods, which means defining exactly what operations you can do with data types and how they should work.

I think I see where you are coming from now.

My thinking was just that big integer arithmetic is sufficiently well understood that the answers we get from, e.g., big.Int.Add are going to be the same answers that our spec will demand. Not because we're cribbing from Go but because there's only one possible correct answer for adding two integers (maybe I'm wrong and there's subtlety here).

As a comparison, we don't implement sha2 ourselves. The behavior is well-defined and we use a library that implements it per the spec.

Serialization is a different matter. Our spec will demand a very specific serialization and we will likely have to implement that ourselves no matter what runtime library we use. Just as we implement the serialization of sha2 hashes ourselves today.

The reason I bring this up is because as a user, it would be nice to be able to use int64 or big.Int in the cases that I can, because I already know those APIs well.

I admit I don't fully understand how the spec is supposed to facilitate separate implementations. At one extreme a strategy is to precisely define all the consensus/chain-affecting types in use and how they work. This strategy means it's pretty straightforward to do a separate implementation: filecoin defines a bunch of types and interfaces and gives you the algorithms in terms of those types (in Go seems fine, or something close to it).

I think this is the intent.

You implement the types and interfaces as described.

Or maybe you use off-the-shelf libraries for some of the types and interfaces, which have the same behavior?

phritz commented 6 years ago

I talked to aaron directly. In all worlds we have to precisely specify serialization formats. Beyond that there are a couple of questions about how prescriptive we are in the spec:

  1. Are the types we use in the spec (eg, bigints, maps, arrays, etc) supposed to be understood as abstract types or concrete types for which we give a ~full interface and description of semantics?
  2. How closely, if at all, does our implementation have to track the type interfaces in the spec? For example, if the spec shows an array used in an algorithm, is it important that in that algorithm the thing we use looks and acts like an array? Or could it, for example, be a sequence accessed by a cursor?

It seems like the answers should be "types in the spec are abstract types" and "our implementation does not have to track the type interfaces in the spec". Seems like this has to be the case to give implementors (ourselves) the freedom to choose the best interface for the job given their language and constraints. There will be some places where we'll have to dial up our prescriptiveness on implementation (eg, "rounding must work this way"), but we shouldn't feel the need to implement exactly what's in the spec.

Correct?

dignifiedquire commented 6 years ago

How closely, if at all, does our implementation have to track the type interfaces in the spec? For example, if the spec shows an array used in an algorithm, is it important that in that algorithm the thing we use looks and acts like an array? Or could it, for example, be a sequence accessed by a cursor?

The point of the spec here is to enable interop, so if your implementation gets the same result using MyFancyCursorThatIterates100000timesFasterThanYourArray, your implementation is welcome to use it. Remember, we hash our results, so as long as the hash is the same, and subsequently the data you put on the wire, things are good.

dignifiedquire commented 6 years ago

Are the types we use in the spec (eg, bigints, maps, arrays, etc) supposed to be understood as abstract types or concrete types for which we give a ~full interface and description of semantics?

I think they are more abstract types and should map to a clear definition of how they behave under certain operations. If there is ambiguity we should spell it out in the spec (if it influences the end result).

phritz commented 6 years ago

OK, finally get back around to this. @whyrusleeping points out in https://github.com/filecoin-project/go-filecoin/pull/527#issuecomment-396666377 why infinite precision is a bad idea. I want to get consensus on what to do before we move forward.

The easiest thing to do is pick a minimum unit of FIL and store a varint count of those units. This is basically a fixed-point representation with the point at 10**-18, and is what Ethereum does. So for example if the minimum unit is 10**-18, which I've heard several times, then we'd store 1 FIL as 1000000000000000000 encoded as a varint. This is simple, but it's also really expensive storage-wise: storing the quantity 1 FIL takes 9 bytes.

There are a few alternatives, for example

whyrusleeping commented 6 years ago

we could have the minimum unit be 10**-9

No, this isn't good, for the reasons in my other issue and just in general.

we could still have the representation be in units of 10**-18 so operations are fixed cost but store in scientific notation at a known

This sounds perfectly reasonable to me. We can even restrict the exponent field to be a single byte varint.

It also helps to correct our thinking on the units here. Instead of thinking of the minimum unit of filecoin being represented here by the number 10**-18, it could instead be 1 (as it is in ethereum). This should make the base-change logic for combining numbers much simpler.

phritz commented 6 years ago

I'm going to close this issue out. The only thing left to do here that is a priority is the following which I put in the Ready queue:

https://github.com/filecoin-project/go-filecoin/issues/587

Also, I dumped state into a designdoc for the benefit of posterity:

https://docs.google.com/document/d/12xNPzVCPSC2bTv7myxNRoMGx79AqFHQb89LfmRcFpGc/edit#

Other, lower priority items include: https://github.com/filecoin-project/go-filecoin/issues/559 https://github.com/filecoin-project/go-filecoin/issues/585 https://github.com/filecoin-project/go-filecoin/issues/584