0xPolygonMiden / miden-vm

STARK-based virtual machine
MIT License
630 stars 160 forks source link

Implement RLP decode for stdlib #249

Open jonathanxuu opened 2 years ago

jonathanxuu commented 2 years ago

Hi, Miden VM has a powerful ability to handle numerical mathematical problems, but many other types of data exist in real life, such as String types. We are thinking about ways to add String related libs in Miden VM. Here are some initial ideas.

The actual operations on String type data may include:

  1. To check whether String A contains a sub-string String B
  2. To concatenate two String A || B
  3. etc.

It would be great if there is a way to extend Miden to handle non-numerical problems as well. Here are some ideas for dealing with String type data: RLP (Recursive Length Prefix) is the main encoding method used to serialize objects in Ethereum. I'm wondering if we could allow the above operations on Strings by first RLP encoding the non-numeric type data.

Here is one example:   Alice wants to know if a sentence contains the word 'guitar'   The sentence is : 'I can play guitar.'   Its RLP encoded result is : [146, 73, 32, 99, 97, 110, 32, 112, 108, 97, 121, 32, 103, 117, 105, 116, 97, 114, 46] With RLP encoded result, It's much easier to tell if a sentence contains 'guitar': Alice can just compare the RLP encoded result.

Besides, for String type integer such as '21', we can perform RLP decode operations within the VM to convert them to their original values -- 21 for subsequent numerical mathematical operation.

Maybe we could add the implementation of a RLP decoder in stdlib. If you have any comments on this idea, please let me know.

bobbinth commented 2 years ago

Hi! I think having an RLP decoder could be very interesting. But I haven't thought about this too much yet - so, not sure what it will entail. A few questions that come to mind:

For the RLP decoder we probably don't need to answer the string questions - these could be handled separately.

Did you have any thoughts on what RLP procedures could look like in terms of inputs/outputs?

jonathanxuu commented 2 years ago

Hi, thanks for your reply! First, I would like to list the data formats that may exist throughout the Decoding Process:

  1. Original RLP Code;
  2. Binary RLP Code;
  3. RLP Code in the form of field element;
  4. Better representing format for String Operations, such as ASCII in form of field element(UTF-8 for further optimization)

About '3.RLP Code in the form of field element': To store RLP Code efficiently, we should store them as field elements. Each RLP Code element(hereinafter called 'R-byte') is between '0x00' and '0xff', which needs 8 bits to represent. So in the VM, we can use a field element to contain up to 7 R-bytes.

RLP Decoder

General idea

RLP decoder is used to decode RLP Code into some better form which can be stored and used for subsequent String Operations.

Implementation

Inputs: The RLP Code in the form of field element For example:

  • The RLP Code of String: 'rlp_code' is [136, 114, 108, 112, 95, 99, 111, 100, 101]
  • Its binary form is:[10001000, 01110010, 01101100, 01110000, 01011111, 01100011, 01101111, 01100100, 01100101]
  • Merge 7 R-bytes into 1 field element, little-endian, add leading zeros to make it u64:
  • [0000000001101111011000110101111101110000011011000111001010001000, 0000000000000000000000000000000000000000000000000110010101100100] The field elements [31352983974081160, 25956] are passed to the VM as inputs

The procedure does: RLP decoder aims to convert [31352983974081160, 25956] to field element form of ASCII. Steps are as follows:

  1. Read 8 bits from each field element and convert it into ASCII: [114, 108, 112, 95, 99, 111, 100, 101]
  2. Convert ASCII into binary form, 7 bits per ASCII, little-endian, add leading zeros to make it u64: 0000000011001011100100110111111000111011111111000011011001110010

Thus, the final decoding result is 57301590653810290, which has better performance when performing String Operations.

Outputs: The field element form of ASCII (Better representing format for String Operations)

For UTF-8, a field element can contain up to 2 UTF-8(a UTF-8 needs 1-4 bytes), which needs more space to store. For now, I think ASCII might be enough.

What do you think about the final representation of the decoding result? If you think this approach feasible and if there's anything you need, we would like to help.

bobbinth commented 2 years ago

This makes sense, and I think this would be a great addition to the standard library. A couple of questions/comments:

My thinking that the fixed sized version is going to be much simpler than the variable size version - so, maybe we should do it first?

jonathanxuu commented 2 years ago

Yes, we should start with 'fixed-size' inputs.

For 'fixed-size' inputs, I tend to start with 1 word (the word is provided directly by the stack).

Or use 5 words as 'fixed-size' which could contain up to 140 characters, or 10 words(280 chars like Twitter) which could contain more chars?

As mentioned before in the thread, 1 word can contain 28 characters (1 field element could hold 7 R-bytes), and the ‘fixed-size-input’ procedure converts this 1 word to ASCII in the form of field_element, with the number of characters in the string.

For example, the ‘fixed-size’ procedure's output is: [14, field_element_0, field_element_1] (Here, '14' means there are 14 characters in this 'word')

For further extension, maybe we should use 'variable-sized' to replace 'fixed-size', because fixed-size:

    1. will cause a lot of memory waste (if the fixed-size is large, when dealing with short-string, there might be lots of '0's );
    1. can not handle long-text which is longer than 'fixed-size'.

      For 'variable-sized' inputs:

  • For Inputs, we need to put two elements on the top of the stack:
    1. Fat pointer: low 32 bits indicates the start of the memory address where the procedure read from, high 32 bits indicates the start of the memory address where the procedure write into.
    1. Number: the number of memories the procedure needs to read in total
  • For Output: also requires two elements:
    1. Fat pointer: low 32 bits indicates the beginning of the memory address where the decoding result is stored, high 32 bits indicates the total number of memory used to store the decoding result.
    1. Number: the number of characters in the string
  1. For 'fixed-size' input, if this is used to test the RLP decoding, 1 word might be enough. Otherwise, it may be necessary to set a more general length, which one do you prefer?(5 words(140), 10 words(280), ...)
  2. What do you think of the 'variable-size' input and output?(btw, the idea of 'fat pointer' is excellent)
bobbinth commented 2 years ago

As mentioned before in the thread, 1 word can contain 28 characters (1 field element could hold 7 R-bytes), and the ‘fixed-size-input’ procedure converts this 1 word to ASCII in the form of field_element, with the number of characters in the string.

I'd prefer the 1 word approach. Though I wonder if we should do 5 elements -> 4 elements conversion. If I understand things correctly, 5 field elements could contain up to 35 characters. Once decoded, these characters could be grouped into 4 field elements (assuming ASCII encoding) as 9 * 4 = 36. But if this complicates things, we can stick to 4 elements -> 4 elements conversion.

In the future, if we want to, we could extend this to 20 -> 16 element conversion which could contain up to 140 characters.

Regarding variable-size encoding, I agree with what you said - but I'd prefer to defer thinking about the specifics until the fixed-size decoding is more or less working. I think making fixed-size work will inform how we could approach variable-size decoding.