Extending RSV to support Base64-encoded binary data out-of-the-box

CC007 commented 9 months ago

Idea

From a comment on your YouTube video by Rik Schaaf (me) (https://www.youtube.com/watch?v=tb_70o6ohMA&lc=Ugzsfj_OUAK4s_IYaNZ4AaABAg):

What about extending the RSV format to support Base64-encoded binary data by prefixing a value with \xFB? (I chose FB because B is easy to remember for Binary/Base64, while the byte is still invalid in UTF-8, which prevents collisions, according to your table at 4:41.) This would make it much cheaper to represent numbers (with more than 2 digits) and dates (for example as timestamps). It would also allow lossless transfer of floating-point values, which is a problem when using plain strings, since a decimal string representation doesn't necessarily map losslessly to the binary representation. It could even allow encoding an image as a bitmap, or any other binary data. This extension would turn the Array<Array<String | null>> data structure into Array<Array<String | Array | null>>, where the inner Array is an array of bytes. With this addition, you could even embed an RSV file within an RSV file, because the inner RSV file would be Base64-encoded, preventing any collisions with the special bytes.

Example

So:

[
 [1234567890, "Hello", "🌍", null]
]

Would translate to:

251 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 72, 101, 108, 108, 111 | 255 | 240, 159, 140, 141 | 255 | 254 | 255 | 253
\FB | B64: SZYC0g==                   | \FF |        "Hello"         | \FF |        "🌍"        | \FF | \FE | \FF | \FD
    | Hex: 499602D2                   |
    | Dec: 1234567890                 |

So in essence, without a prefix you get UTF-8 encoded data, and with the \FB prefix you get Base64-encoded data (which is ASCII and UTF-8 compatible, to my knowledge).
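The table above can be reproduced with a short encoder. This is only a sketch of the proposal, not part of the RSV specification: the name `encode_rows` is made up for illustration, and it assumes that `bytes` values are the ones that get the \FB prefix and that integers are serialized big-endian (which matches the hex value 499602D2 in the example).

```python
import base64

VALUE_END = b"\xFF"  # RSV end-of-value byte
NULL = b"\xFE"       # RSV null marker
ROW_END = b"\xFD"    # RSV end-of-row byte
BINARY = b"\xFB"     # prefix proposed in this issue (assumption, not in RSV today)

def encode_rows(rows):
    """Encode rows of str | bytes | None values (hypothetical helper)."""
    out = bytearray()
    for row in rows:
        for value in row:
            if value is None:
                out += NULL
            elif isinstance(value, (bytes, bytearray)):
                # Binary values: \xFB followed by the Base64 text (pure ASCII).
                out += BINARY + base64.b64encode(value)
            else:
                out += value.encode("utf-8")
            out += VALUE_END
        out += ROW_END
    return bytes(out)

# 1234567890 as a 4-byte big-endian integer, then a string, then null.
number = (1234567890).to_bytes(4, "big")
encoded = encode_rows([[number, "Hello", None]])
print(list(encoded[:9]))  # the \xFB prefix followed by the bytes of "SZYC0g=="
```

Because the Base64 alphabet is plain ASCII, the payload can never contain \FB, \FD, \FE or \FF, so the existing RSV terminators keep working unchanged.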

What is this addition trying to do

The advantage of this encoding addition is that non-Unicode data could also be represented without risk of collisions, including the RSV special bytes themselves.
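To illustrate the collision point, here is a rough decoder sketch that reads an RSV file embedded inside another one. The function name `decode_rows` is hypothetical, error handling is left out, and the \FB handling is the proposal from this issue rather than existing RSV behaviour.

```python
import base64

def decode_rows(data: bytes):
    """Decode RSV bytes into rows of str | bytes | None (hypothetical helper)."""
    rows, row = [], []
    start = 0
    for i, byte in enumerate(data):
        if byte == 0xFF:  # end of value
            raw = data[start:i]
            if raw == b"\xFE":
                row.append(None)
            elif raw[:1] == b"\xFB":  # proposed binary prefix
                row.append(base64.b64decode(raw[1:]))
            else:
                row.append(raw.decode("utf-8"))
            start = i + 1
        elif byte == 0xFD:  # end of row
            rows.append(row)
            row = []
            start = i + 1
    return rows

# An inner RSV document: one row containing "Hi" and null.
inner = b"Hi\xFF\xFE\xFF\xFD"
# Embed it as a single binary value of an outer document.
outer = b"\xFB" + base64.b64encode(inner) + b"\xFF\xFD"

embedded = decode_rows(outer)[0][0]
print(embedded == inner)      # the inner file comes back byte for byte
print(decode_rows(embedded))  # and can be decoded as RSV again
```

The inner terminators never appear literally in the outer file, so the outer parser cannot be confused by them.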

Another advantage is that some data types can be stored more efficiently, like numbers and dates.
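How much is saved depends on the value and on whether the Base64 padding is kept. The following sketch compares the decimal text size with the size of the \FB form; it assumes non-negative integers stored in the smallest whole number of bytes, and both helper names are made up for this comparison.

```python
import base64

def text_size(n: int) -> int:
    """Bytes needed for the plain decimal string."""
    return len(str(n).encode("utf-8"))

def binary_size(n: int, padded: bool = True) -> int:
    """Bytes needed for \\xFB + Base64 of the minimal big-endian integer."""
    raw = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
    b64 = base64.b64encode(raw)
    if not padded:
        b64 = b64.rstrip(b"=")
    return 1 + len(b64)  # 1 byte for the \xFB prefix

for n in (99, 255, 1234567890):
    print(n, text_size(n), binary_size(n), binary_size(n, padded=False))
```

For 1234567890 this gives 10 bytes as text against 9 bytes with padding and 7 without, while for very small numbers the decimal string is still shorter.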

What is this addition NOT trying to do (but what could be added in a separate issue)

This is not a change to add the data types themselves to RSV. This additional special character only signifies the encoding, not the datatype, so you wouldn't know if the data represents an integer, timestamp, float, etc., just like you wouldn't know this with the current implementation. This is still left to the program that is using the RSV file.

If the data type had to be derivable from the binary data itself, the Base64 value could be prefixed (after the \FB) by a string surrounded by non-Base64 characters to signify the data type, like (i32) for 32-bit integers. Example:

 251 | 40, 105, 51, 50, 41 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 253
 \FB |  type: 32-bit int   |         value: SZYC0g==         | \FF | \FD

...which would represent a single integer (int32) value that equals 1234567890. Or you could use a simpler but more restrictive typing system that uses a single non-Base64 character to define the type, followed by a single character for the size.

 251 | 35, 52 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 253
 \FB |   #4   |             SZYC0g==            | \FF | \FD

...where # defines an integer and 4 defines a size of 4 bytes (32 bit): 1234567890

 251 | 126, 52 | 81, 69, 107, 80, 50, 119, 61, 61 | 255 | 253
 \FB |   ~4    |             QEkP2w==             | \FF | \FD

...where ~ defines a floating point value and 4 defines a size of 4 bytes (32 bit): 3.141592... This is out of scope for this issue though.
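For what it's worth, the single-character variant could be decoded roughly like this. It is just a sketch: `decode_typed` is a made-up name, and big-endian byte order, signed integers and IEEE 754 floats are assumptions that the issue does not pin down (big-endian at least matches the examples above).

```python
import base64
import struct

def decode_typed(field: bytes):
    """Decode the bytes that follow the \\xFB prefix, e.g. b"#4SZYC0g=="."""
    kind = chr(field[0])       # '#' = integer, '~' = float (tags from this issue)
    size = int(chr(field[1]))  # single digit: payload size in bytes
    payload = base64.b64decode(field[2:])
    if len(payload) != size:
        raise ValueError("size tag does not match payload length")
    if kind == "#":
        # Assumption: signed, big-endian.
        return int.from_bytes(payload, "big", signed=True)
    if kind == "~":
        # Assumption: IEEE 754, big-endian; only 4 and 8 byte floats handled.
        fmt = {4: ">f", 8: ">d"}[size]
        return struct.unpack(fmt, payload)[0]
    raise ValueError(f"unknown type tag {kind!r}")

print(decode_typed(b"#4SZYC0g=="))  # the integer example
pi32 = decode_typed(b"~4QEkP2w==")  # the float example
# Packing the float again gives back the exact same 4 bytes (lossless).
print(struct.pack(">f", pi32) == base64.b64decode(b"QEkP2w=="))
```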

Considerations

With this addition, the name isn't really accurate anymore, so would this be RBSV (Rows of Binary or String Values)?

CC007 commented 3 months ago

To my knowledge, if you know that the decoded binary is 8-bit aligned (a whole number of bytes), you can also omit the = padding characters at the end of the Base64 string.

So you would get 83, 90, 89, 67, 48, 103 instead of 83, 90, 89, 67, 48, 103, 61, 61.
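A small sketch of how that could look in practice. Python's `base64.b64decode` rejects input whose length is not a multiple of four, so the decoder here re-adds the padding from the length first; the two helper names are made up.

```python
import base64

def encode_unpadded(raw: bytes) -> bytes:
    """Base64-encode and drop the trailing '=' padding."""
    return base64.b64encode(raw).rstrip(b"=")

def decode_unpadded(text: bytes) -> bytes:
    """Restore the padding implied by the length, then decode."""
    return base64.b64decode(text + b"=" * (-len(text) % 4))

raw = bytes.fromhex("499602D2")  # 1234567890
short = encode_unpadded(raw)
print(list(short))               # byte values of "SZYC0g"
print(decode_unpadded(short) == raw)
```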

CC007 commented 3 months ago

Base64 is a 6-bit encoding scheme, but since only F8-FF are reserved, you could get away with using a 7-bit encoding whose output bytes stay in the ASCII range (with the input padded to a multiple of 7 bits, just as Base64 pads its input for the 6-bit encoding).

The only thing would be that you can't cleanly view the characters, which also hinders the ability to copy. I don't know if that's an important consideration though.
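A rough sketch of such a 7-bit packing, to make the trade-off concrete. The names `encode7` and `decode7` are invented here, and the decoder assumes the original data was a whole number of bytes so that the zero padding bits can simply be dropped.

```python
def encode7(data: bytes) -> bytes:
    """Pack bytes into 7-bit groups; every output byte is in 0x00-0x7F."""
    nbits = len(data) * 8
    pad = -nbits % 7                       # zero bits appended at the end
    bits = int.from_bytes(data, "big") << pad
    ngroups = (nbits + pad) // 7
    return bytes((bits >> (7 * (ngroups - 1 - i))) & 0x7F for i in range(ngroups))

def decode7(packed: bytes) -> bytes:
    """Inverse of encode7, assuming the original was 8-bit aligned."""
    bits = 0
    for b in packed:
        bits = (bits << 7) | b
    nbytes = len(packed) * 7 // 8
    bits >>= len(packed) * 7 - nbytes * 8  # drop the padding bits
    return bits.to_bytes(nbytes, "big")

raw = bytes.fromhex("499602D2")
packed = encode7(raw)
print(len(packed))             # 5 bytes, versus 6 for unpadded Base64
print(decode7(packed) == raw)
```

The output can contain control characters such as 0x00, which is the readability and copy-paste concern mentioned above.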

Stenway / RSV-Specification