ProvableHQ / leo

🦁 The Leo Programming Language. A Programming Language for Formally Verified, Zero-Knowledge Applications
https://leo-lang.org/
GNU General Public License v3.0
4.77k stars 655 forks source link

[Proposal] bits/bytes syntax #682

Open gluax opened 3 years ago

gluax commented 3 years ago

💥 Proposal bits/bytes syntax

Feature

Currently, many gadgets in Leo support a to_bit and to_bytes method on the rust side. The goal is to expose this side to Leo via some syntax. I would like to propose two different syntaxes so that we can decide between them or choose both.

Proposed Syntax 1

Coming from a rust background and something that is syntactically satisfying is the From implementations. So in Leo, we would have something as follows:

let x = 1u32;
let b1 = Bytes::from(x);
let b2 = Bytes::from(1u8);
let b3 = Bytes::from(some_method_that_returns_applicable_value());

I also believe this method would be easy to integrate into our stack. It's simply adding another statement, then at compile-time, we try actually converting the type to bits/bits with the methods provided it implements the trait. If it doesn't implement the method/trait, we can throw an error during the compiler.

Proposed Syntax 2

In contrast(or in conjunction), we could have the method approach. The goal syntactically would be to allow the call on any data type, the syntax would be as follows:

let x = 1u32;
let b1 = x.bytes();
let b2 = 1u8.bytes();
let b3 = some_method_that_returns_applicable_value().bytes();

Not sure if we currently allow method calls on data types themselves, but that shouldn't be too bad. Similarly to above, we check at compile time the type can be properly converted and in the compiler, we throw an error if the type is not convertible.

collinc97 commented 3 years ago

Thanks for writing this up @gluax! I have a few comments for discussion:

Syntax 1 requires adding a Bytes circuit to the core module. At the top of the file, developers would need to import the circuit before using the method. In addition, we would have to add support for function overloading: Bytes::from(u64), Bytes::from(field) etc. Each one of these overloaded functions would have to be loaded into the program tree as well - adding to compilation time.

Syntax 2 requires no additional imports - since every datatype where to_bits/to_bytes is supported will have the method built in. We do not have to add function overloading support there either - since multiple datatypes can have method names that are the same. Lastly, the cost of each function is only added to the program tree when it is called. One drawback of this approach is that it may require more syntax to be added to the grammar file - as opposed to syntax 1 which merely requires an additional statement rule.

For these reasons I gravitate towards Syntax 2 but would like to hear others thoughts. @acoglio @howardwu

acoglio commented 3 years ago

Which types are we going to use for bits, bytes, and their sequences? A natural choice for bytes is u8. The closest (isomorphic, in fact) type to bits we have now is bool, but I think we'll want bits to be numbers somehow. If we adopt proposal https://github.com/AleoHQ/leo/issues/627, then u1 would be a natural choice for bits. I guess we would use arrays for sequences (of bits or bytes).

I agree with @collinc97 that Syntax 2 may be preferable to Syntax 1. Among other things, overloading adds a bit of complication to type checking and inference (still doable of course, just a bit more complicated).

I was thinking whether we might use the type cast syntax (https://github.com/AleoHQ/leo/issues/600) for converting from/to bits and bytes, since we'll want to add it anyways for other types (as exemplified in that proposal):

let x = 1u32;
let b1 = x as [u8; 4];
let b2 = 1u8 as [u8; 1];
let b4 = something as [u8; N];

Note that we would have to specify the length of the array, but the length must be known at compile time anyhow. If we want to spare the user, we could introduce a type notation like [T; ?] for an array of an unstated size, with the intent that the size must be uniquely determined statically (i.e. at compile time), during the constant folding/propagation phase.

This type notation [T; ?] should be a separate proposal, in fact, which may be useful regardless of conversions to/from bits and bytes. Just like now one can omit a whole type, this is omitting part of a type. We could go a little crazier and allow [?; N], and, well, in fact, ? anywhere, as a sort of wildcard. For example, (u8, ?, [field; (?, 4)]). I can make an RFC just for this wildcard idea (just popped up as I was commenting this bits/bytes syntax, clearly), unless everybody dislikes it right away.

collinc97 commented 3 years ago

The to_bits rust gadget returns a vector of booleans. The to_bytes rust gadget returns a vector of u8's. The types that have rust gadgets implemented are ints, uints, fields, groups, addresses.

The first version of the bits and bytes methods should only call the underlying gadgets on those types.

Once that is finalized, we can extend syntax to aggregate types such as arrays.

@acoglio I like the idea of extending type casting to arrays of bits and bytes but I do think that should be a separate feature.

Similarly, the generic array length [T; ?] could also be a separate feature but I think is out of scope for this proposal.

acoglio commented 3 years ago

I'll make a separate RFC for type wildcards. (Not sure there's a strong use case, but it seems worth considering, maybe in the future.)

Understood about the underlying Rust gadgets. But what's not clear to me is what is the Leo type of b1, say. Presumably it consists of a sequence of 4 bytes, because it comes from a u32, but is it (u8, u8, u8, u8) or [u8; 4], or am I missing something?

acoglio commented 3 years ago

I was also wondering about endianness. Would the bits/bytes be big endian or little endian? Should both options be provided?

E.g.:

258:u32 --> [0 0 1 2] // big endian bytes
258:u32 --> [2 1 0 0] // little endian bytes
collinc97 commented 3 years ago

I was also wondering about endianness. Would the bits/bytes be big endian or little endian? Should both options be provided?

E.g.:


258:u32 --> [0 0 1 2] // big endian bytes

258:u32 --> [2 1 0 0] // little endian bytes

This is a good point. The gadgets will return vectors in little endian form. Should we provide additional methods to specify target endianness?

gluax commented 3 years ago

I was also wondering about endianness. Would the bits/bytes be big endian or little endian? Should both options be provided? E.g.:


258:u32 --> [0 0 1 2] // big endian bytes

258:u32 --> [2 1 0 0] // little endian bytes

This is a good point. The gadgets will return vectors in little endian form. Should we provide additional methods to specify target endianness?

I would say a good approach is to have method default one, i.e. .bytes -> little-endian bytes and we document that. Then we can have another method .big_endian_bytes(). At least that's my thoughts on the matter!

acoglio commented 3 years ago

My first instinct would be to have "symmetry", e.g. big_endian_bytes() and little_endian_bytes(), or just bebytes() and lebytes(), but it is more verbose, especially if there is a more common case, so I'm also happy with @gluax's proposal.

acoglio commented 3 years ago

By the way, having two possible conversions (big vs. little endian) reinforces the fact that this should not use a type cast syntax (against my proposal earlier in this message thread), because the latter implies just one conversion.

acoglio commented 3 years ago

@collinc97 @gluax I'm still unclear about what, say, the type of b1 would be (see my earlier message). (Sorry if I'm being thick.)

gluax commented 3 years ago

@acoglio On the rust side it returns a vec of u8 for the bytes gadget. So I believe on the Leo side it would an array of the appropriately sized array for the type of u8s. So for b1 I think the correct choice would be to have it typed as [u8; 4].

collinc97 commented 3 years ago

@acoglio I think we should start a separate proposal for a type casting to bits/bytes syntax after #600 is implemented.

I also think that specifying endianness + reversing endianness should be a separate proposal and we can simply support little endian format as the default for now.

collinc97 commented 3 years ago

Implementation

@gluax To finalize, this proposal can be closed by implementing:

gluax commented 3 years ago

Having chatted with both @Protryon and @collinc97 about this topic, as it currently stands there isn't a good way to implement these into Leo. The ideal way would be to have the Leo types as circuits or related to circuits, then have static/instance methods on that circuit that calls to some rust sided gadget implementation.

However, I'd love to hear if anyone else has different ideas, but I will make this a separate proposal.

howardwu commented 3 years ago

Adding my thoughts on this thread, I think the .bits() and .bytes() proposal is better than the Bytes::from proposal.

However, from many first-hand encounters, I think the (slightly less attractive) convention of .bits_le(), .bits_be(), .bytes_le(), and .bytes_be() should be used to clearly delineate the endianness for the end user.

To add some evidence supporting this from other domains, we are adding this to our convention in snarkVM and Rust uses a similar delineation convention as well - to_le_bytes() and to_be_bytes()

howardwu commented 3 years ago

I'll add an additional proposal:

We could stick with the .bits() and .bytes() convention, strictly (and explicitly) enforcing only 1 form of endianness (e.g. all calls will always return little-endian), and in addition, we introduce a .rev() or .reverse() method to reverse the endian representation. The methods can then be chained to form .bits().reverse() for example.

In my opinion, this approach could lead to unnecessary verbosity and hurt readability if the chaining sequence becomes too long.

gluax commented 3 years ago

Blocked by #975.

gluax commented 3 years ago

Due to the essence of time, the two proposed ways of implementing this quickly are:

  1. Through typecasting.
  2. An operator that signifies to convert the type.
acoglio commented 3 years ago

Just to enhance the discussion, this would be a temporary solution for expediency. Just after DP3, we will revisit this.

I also cast my vote for 2, as it seems to provide a more natural transition towards the use of scalar type methods. Type cast would have to pick either little or big endian, and do so implicitly. So my preference is to be explicit and symmetric.

collinc97 commented 2 years ago

Will revisit after testnet3 refactor