Implement RLP decode for stdlib

jonathanxuu commented 2 years ago

Hi, Miden VM has a powerful ability to handle numerical mathematical problems, but many other types of data exist in real life, such as String types. We are thinking about ways to add String related libs in Miden VM. Here are some initial ideas.

The actual operations on String type data may include:

To check whether String A contains a sub-string String B
To concatenate two String A || B
etc.

It would be great if there is a way to extend Miden to handle non-numerical problems as well. Here are some ideas for dealing with String type data: RLP (Recursive Length Prefix) is the main encoding method used to serialize objects in Ethereum. I'm wondering if we could allow the above operations on Strings by first RLP encoding the non-numeric type data.

Here is one example: Alice wants to know if a sentence contains the word 'guitar' The sentence is : 'I can play guitar.' Its RLP encoded result is : [146, 73, 32, 99, 97, 110, 32, 112, 108, 97, 121, 32, 103, 117, 105, 116, 97, 114, 46] With RLP encoded result, It's much easier to tell if a sentence contains 'guitar': Alice can just compare the RLP encoded result.

Besides, for String type integer such as '21', we can perform RLP decode operations within the VM to convert them to their original values -- 21 for subsequent numerical mathematical operation.

Maybe we could add the implementation of a RLP decoder in stdlib. If you have any comments on this idea, please let me know.

bobbinth commented 2 years ago

Hi! I think having an RLP decoder could be very interesting. But I haven't thought about this too much yet - so, not sure what it will entail. A few questions that come to mind:

What a single field element could contain up to 7 bytes. Is that how we should think about binary data within the VM? In this case, would the RLP decoder take a sequence of such field elements and output one char per element? Or would it do something else?
Regarding strings:
- We could also say that a single char is 7 bits (ASCII encoding). In that case, as single field element could fit 9 chars. And we could actually have a notion of "short string" which would consist of 1 word (4 field element). Thus, a short string could have up to 36 chars.
- Should we try to support full UTF8. In this case we would probably do 1 char per element?

For the RLP decoder we probably don't need to answer the string questions - these could be handled separately.

Did you have any thoughts on what RLP procedures could look like in terms of inputs/outputs?

jonathanxuu commented 2 years ago

Hi, thanks for your reply! First, I would like to list the data formats that may exist throughout the Decoding Process:

Original RLP Code;
Binary RLP Code;
RLP Code in the form of field element;
Better representing format for String Operations, such as ASCII in form of field element(UTF-8 for further optimization)

About '3.RLP Code in the form of field element': To store RLP Code efficiently, we should store them as field elements. Each RLP Code element(hereinafter called 'R-byte') is between '0x00' and '0xff', which needs 8 bits to represent. So in the VM, we can use a field element to contain up to 7 R-bytes.

RLP Decoder

General idea

RLP decoder is used to decode RLP Code into some better form which can be stored and used for subsequent String Operations.

Implementation

Inputs: The RLP Code in the form of field element For example：

The RLP Code of String: 'rlp_code' is [136, 114, 108, 112, 95, 99, 111, 100, 101]

Its binary form is：[10001000, 01110010, 01101100, 01110000, 01011111, 01100011, 01101111, 01100100, 01100101]

Merge 7 R-bytes into 1 field element, little-endian, add leading zeros to make it u64:

[0000000001101111011000110101111101110000011011000111001010001000, 0000000000000000000000000000000000000000000000000110010101100100] The field elements [31352983974081160, 25956] are passed to the VM as inputs

The procedure does: RLP decoder aims to convert [31352983974081160, 25956] to field element form of ASCII. Steps are as follows：

Read 8 bits from each field element and convert it into ASCII: [114, 108, 112, 95, 99, 111, 100, 101]
Convert ASCII into binary form, 7 bits per ASCII, little-endian, add leading zeros to make it u64: 0000000011001011100100110111111000111011111111000011011001110010

Thus, the final decoding result is 57301590653810290, which has better performance when performing String Operations.

Outputs: The field element form of ASCII (Better representing format for String Operations)

For UTF-8, a field element can contain up to 2 UTF-8(a UTF-8 needs 1-4 bytes), which needs more space to store. For now, I think ASCII might be enough.

What do you think about the final representation of the decoding result? If you think this approach feasible and if there's anything you need, we would like to help.

bobbinth commented 2 years ago

This makes sense, and I think this would be a great addition to the standard library. A couple of questions/comments:

Should this procedure work with fixed-size inputs or variable-size inputs? Or should we have two different versions?
For fixed size inputs, what would be a good setting? e.g., should it work with single field elements or with words (4 field elements)? Or maybe something else?
For variable size inputs, how do we want to define the interface? e.g., does the procedure assume that there are two pointers at the top of the stack - one for where to read inputs from, and the other for where to write the output to?
- For output, we should probably define a way to specify the number of characters in the string (or maybe it should be the number of field elements?) It could be done by reserving 4 bytes in front for the length. Another approach is using "fat pointers" where say the lower 32 bits contain the actual memory pointer, and the upper 32 bits contain the length.

My thinking that the fixed sized version is going to be much simpler than the variable size version - so, maybe we should do it first?

jonathanxuu commented 2 years ago

Yes, we should start with 'fixed-size' inputs.

For 'fixed-size' inputs, I tend to start with 1 word (the word is provided directly by the stack).

Or use 5 words as 'fixed-size' which could contain up to 140 characters, or 10 words(280 chars like Twitter) which could contain more chars?

As mentioned before in the thread, 1 word can contain 28 characters (1 field element could hold 7 R-bytes), and the ‘fixed-size-input’ procedure converts this 1 word to ASCII in the form of field_element, with the number of characters in the string.

For example, the ‘fixed-size’ procedure's output is: [14, field_element_0, field_element_1] (Here, '14' means there are 14 characters in this 'word')

For further extension, maybe we should use 'variable-sized' to replace 'fixed-size', because fixed-size:

will cause a lot of memory waste (if the fixed-size is large, when dealing with short-string, there might be lots of '0's );

can not handle long-text which is longer than 'fixed-size'.

For 'variable-sized' inputs:

For Inputs, we need to put two elements on the top of the stack:

Fat pointer: low 32 bits indicates the start of the memory address where the procedure read from, high 32 bits indicates the start of the memory address where the procedure write into.

Number: the number of memories the procedure needs to read in total

For Output: also requires two elements:

Fat pointer: low 32 bits indicates the beginning of the memory address where the decoding result is stored, high 32 bits indicates the total number of memory used to store the decoding result.

Number: the number of characters in the string

For 'fixed-size' input, if this is used to test the RLP decoding, 1 word might be enough. Otherwise, it may be necessary to set a more general length, which one do you prefer?(5 words(140), 10 words(280), ...)
What do you think of the 'variable-size' input and output?(btw, the idea of 'fat pointer' is excellent)

bobbinth commented 2 years ago

As mentioned before in the thread, 1 word can contain 28 characters (1 field element could hold 7 R-bytes), and the ‘fixed-size-input’ procedure converts this 1 word to ASCII in the form of field_element, with the number of characters in the string.

I'd prefer the 1 word approach. Though I wonder if we should do 5 elements -> 4 elements conversion. If I understand things correctly, 5 field elements could contain up to 35 characters. Once decoded, these characters could be grouped into 4 field elements (assuming ASCII encoding) as 9 * 4 = 36. But if this complicates things, we can stick to 4 elements -> 4 elements conversion.

In the future, if we want to, we could extend this to 20 -> 16 element conversion which could contain up to 140 characters.

Regarding variable-size encoding, I agree with what you said - but I'd prefer to defer thinking about the specifics until the fixed-size decoding is more or less working. I think making fixed-size work will inform how we could approach variable-size decoding.

0xPolygonMiden / miden-vm

Implement RLP decode for stdlib #249

RLP Decoder

General idea

Implementation

For example, the ‘fixed-size’ procedure's output is: [14, field_element_0, field_element_1] (Here, '14' means there are 14 characters in this 'word')

can not handle long-text which is longer than 'fixed-size'.