creationix / nibs

[Proposal] Add Text Format based on superset of JSON #3

Open creationix opened 2 years ago

creationix commented 2 years ago

Nibs is primarily a binary format to enable fast parsing and random access. But sometimes it's really nice to have a textual way to visualize or specify the data.

This text format is a superset of JSON, extended to support all of nibs's types.

An example document:

{
  // keys can be integers
  1: "yes",
  [ 1, 2, 3 ]: "How about complex structures for keys?",
  // some string keys don't need quotes
  isCool: true,
  binary: <e9cb1ffede0347ad7b15088dcad361caf5f2487e>,
  content-type: "text/nibs",
}

jjg commented 2 years ago

This would make it a lot easier to encourage adoption (especially around web tech). I'd like to explore ways to make the JSON output survive digestion by nibs-unaware JSON parsers but that could come in the form of additional post-processing or just optional generation flags.

creationix commented 2 years ago

So there could be a pure JSON output mode, with tradeoffs in its design between preserving all semantics and producing more vanilla/compact JSON.

If someone were to use the JSON tagging system they would need a way to also escape valid data that happens to collide with the tags.

For now I think the encoder should default to the lossy options when encoding as JSON to keep things simple.

A library could have a .toString() method that turns a nibs value into a nibs-text encoding. But it could also have a .toJson() method that emits valid JSON using the lossy methods.
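
As a rough illustration of what a lossy `.toJson()` might do, here is a minimal sketch. It assumes decoded nibs values are modelled in JavaScript as `Map` (for maps with arbitrary keys) and `Uint8Array` (for binary). The names `toJsonValue` and `toJson`, and the specific lossy choices (hex strings for binary, JSON-stringified non-string keys), are hypothetical and are not taken from the nibs codebase.

```javascript
// Hypothetical lossy conversion from decoded nibs values to plain JSON values.
// Assumption: maps are JS Maps and binary values are Uint8Arrays.
function toJsonValue(value) {
  if (value instanceof Uint8Array) {
    // Lossy: the binary type tag is dropped and only a hex string remains.
    return Buffer.from(value).toString("hex");
  }
  if (Array.isArray(value)) {
    return value.map(toJsonValue);
  }
  if (value instanceof Map) {
    const out = {};
    for (const [key, item] of value) {
      // Lossy: non-string keys are flattened to their JSON text.
      const name =
        typeof key === "string" ? key : JSON.stringify(toJsonValue(key));
      out[name] = toJsonValue(item);
    }
    return out;
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, toJsonValue(v)])
    );
  }
  return value;
}

function toJson(value) {
  return JSON.stringify(toJsonValue(value));
}

const doc = new Map([
  [1, "yes"],
  [[1, 2, 3], "complex key"],
  ["isCool", true],
  ["binary", Uint8Array.of(0xe9, 0xcb)],
]);
console.log(toJson(doc));
```

A consumer of this output cannot tell `"e9cb"` apart from an ordinary string, which is the kind of semantic loss discussed above.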

creationix commented 2 years ago

Another flag that might be useful is an ASCII mode for the text encoding. I recently learned that if you give AWS S3 meta fields non-ASCII data, it will be encoded using RFC 2047. This is best avoided, since JSON already has a method for escaping Unicode characters.

In this proposed ASCII-only encoding, any non-ASCII characters will be encoded as \uXXXX, as in JSON. Any code point above U+FFFF, which does not fit in 16 bits, can be encoded using a surrogate pair.

But by default the nibs text format should leave unicode characters in native utf-8 encoding and not use JSON escaping.
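
A minimal sketch of such an ASCII mode, assuming it is layered on top of `JSON.stringify`. The helper name `asciiStringify` is hypothetical. Because JavaScript strings are UTF-16, code points above U+FFFF are already stored as surrogate pairs, so escaping each code unit produces the two `\uXXXX` escapes described above.

```javascript
// Hypothetical helper: JSON-encode a value using only ASCII characters.
function asciiStringify(value) {
  // JSON.stringify already escapes quotes, backslashes, and control characters.
  // Every remaining non-ASCII UTF-16 code unit is rewritten as \uXXXX.
  return JSON.stringify(value).replace(
    /[\u0080-\uffff]/g,
    (unit) => "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const encoded = asciiStringify("café 😀");
console.log(encoded); // "caf\u00e9 \ud83d\ude00"
console.log(JSON.parse(encoded) === "café 😀"); // true
```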

reconbot commented 2 years ago

What's the difference between binary data and strings?

jjg commented 2 years ago

> Another flag that might be useful is ASCII mode for the text encoding.

I hadn't considered this but I completely agree. I have selfish reasons for wanting this but from an interop perspective I think it's very practical and I also think it aligns with the purpose of the text format itself.

jjg commented 2 years ago

> A library could have a .toString() method that turns a nibs value into a nibs-text encoding. But it could also have a .toJson() method that emits valid JSON using the lossy methods.

I think this is a good way to "nudge" consumers toward using the higher-fidelity text format without surprising anyone looking for parsable JSON.

creationix commented 2 years ago

> What's the difference between binary data and strings?

In the binary encoding, the only difference is a different type tag. In the text encoding they are very different. This type tag is very useful for languages that have different types for binary and strings since most languages have some sort of unicode capability in strings.

For example in JavaScript, strings can be normal String values, but binary can be represented as ArrayBuffer or Uint8Array or node Buffer depending on what's common to that library's usage.

Even in Lua, where strings are technically 8-bit binary data, it's a good convention to use strings only for textual data and always encode it as UTF-8 in the binary Lua string. Then binary data can be represented using a LuaJIT cdata like uint8_t[?], which is essentially a fixed byte array.

In both JS and Lua, strings are interned and immutable values, but binary is non-interned and mutable. They are very different types from the language's point of view.

creationix commented 2 years ago

Technically you could store arbitrary binary data in JSON strings by simply encoding the 8-bit values as matching unicode code points. In the early node.js days we called this hack "raw" encoding. You would just have to know that a unicode string is actually binary data and do the conversion when you need the raw bytes.

The extended values from 128 to 255 can be encoded as normal UTF-8 in the JSON string or, if you're encoding in ASCII mode, they can use \uXXXX escapes.

For example, the nibs-text value <deadbeef> can be encoded as a "raw" string, which can then be represented as either ASCII or UTF-8 JSON:

> rawAscii = '"\\u00de\\u00ad\\u00be\\u00ef"'
'"\\u00de\\u00ad\\u00be\\u00ef"'
> rawAscii.length
26
> a = JSON.parse(rawAscii)
'Þ­¾ï'
> a.length
4
> a.charCodeAt(0).toString(16)
'de'
> a.charCodeAt(1).toString(16)
'ad'
> a.charCodeAt(2).toString(16)
'be'
> a.charCodeAt(3).toString(16)
'ef'
> Buffer.from(a) // While its length is 4, it's actually 8 bytes when encoded as UTF-8
<Buffer c3 9e c2 ad c2 be c3 af>
> rawUtf8 = JSON.stringify(a) // by default, JSON.stringify leaves non-ASCII characters unescaped
'"Þ­¾ï"'
> rawUtf8.length
6
> Buffer.byteLength(rawUtf8)
10
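
The same round trip can be wrapped in a pair of helpers. This is only a sketch: the names `bytesToRaw` and `rawToBytes` are hypothetical, and it relies on Node's `latin1` Buffer encoding, which maps each byte to the code point with the same value.

```javascript
// Hypothetical helpers for the "raw" hack: one byte per UTF-16 code unit.
function bytesToRaw(bytes) {
  // latin1 maps each byte 0x00-0xff to the code point with the same value.
  return Buffer.from(bytes).toString("latin1");
}

function rawToBytes(raw) {
  // Reject strings that are not actually "raw" encoded binary.
  for (let i = 0; i < raw.length; i++) {
    if (raw.charCodeAt(i) > 0xff) {
      throw new RangeError(`code unit at index ${i} does not fit in one byte`);
    }
  }
  return Buffer.from(raw, "latin1");
}

const raw = bytesToRaw([0xde, 0xad, 0xbe, 0xef]);
const json = JSON.stringify(raw); // UTF-8 style JSON, non-ASCII left unescaped
const bytes = rawToBytes(JSON.parse(json));
console.log(bytes.toString("hex")); // deadbeef
```

The explicit range check matters because nothing in the JSON itself says the string is binary; the consumer has to know, and a stray character above U+00FF would otherwise be silently truncated.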

creationix commented 2 years ago

Note that the nibs-text encoding for binary is ASCII-safe, so it costs only 2x to encode (two hex digits per byte). The ASCII-safe version of JSON strings, if using raw encoding, costs up to 6x, since every byte from 128 to 255 becomes a six-character \uXXXX escape. The UTF-8 JSON encoding is slightly better than hex, since lower values cost only one byte and higher values cost two bytes in UTF-8, but the output is very ugly and dangerous to copy-paste.

Binary could also be encoded in JSON as hex strings or base64 strings. In all cases the consumer would be missing the type tag and would need to know if it's supposed to be interpreted as binary which is why the nibs-text format is preferred when possible.
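
To make the size comparison concrete, here is a small sketch that measures each option for a given byte sequence. The function name `encodingCosts` and its field names are hypothetical, and the counts include the surrounding `<>` or quote delimiters.

```javascript
// Byte cost of each candidate text encoding for a given binary value.
function encodingCosts(bytes) {
  const buf = Buffer.from(bytes);
  // "Raw" encoding: each byte becomes the code point with the same value.
  const raw = JSON.stringify(buf.toString("latin1"));
  const rawAscii = raw.replace(
    /[\u0080-\uffff]/g,
    (c) => "\\u" + c.charCodeAt(0).toString(16).padStart(4, "0")
  );
  return {
    nibsText: `<${buf.toString("hex")}>`.length,
    rawAsciiJson: rawAscii.length,
    rawUtf8Json: Buffer.byteLength(raw, "utf8"),
    hexJson: JSON.stringify(buf.toString("hex")).length,
    base64Json: JSON.stringify(buf.toString("base64")).length,
  };
}

console.log(encodingCosts([0xde, 0xad, 0xbe, 0xef]));
// { nibsText: 10, rawAsciiJson: 26, rawUtf8Json: 10, hexJson: 10, base64Json: 10 }
```

For this four-byte input every byte is above 127, so the raw UTF-8 form shows its worst case; inputs dominated by printable ASCII bytes would come out smaller in both raw forms.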

creationix commented 2 years ago

Extended string encoding as a railroad diagram: [image]

creationix commented 2 years ago

Float is the same as JSON, except the fractional part is not optional: [image]

Integer is any integer in decimal, hex, octal, or binary: [image]
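
One possible reading of this rule, sketched as regular expressions. This is an assumption based on the sentence above and the JSON number grammar, not on the diagram itself: a token with a fractional part is a float, a bare decimal is an integer, and a JSON-style number such as `1e5` (exponent without fraction) is rejected. The hex, octal, and binary integer forms are left out because their exact syntax is not spelled out in the thread. The name `classifyNumber` is hypothetical.

```javascript
// JSON number grammar with the fraction made mandatory (assumed reading).
const FLOAT = /^-?(?:0|[1-9]\d*)\.\d+(?:[eE][+-]?\d+)?$/;
// Decimal integers only; other bases are omitted from this sketch.
const INTEGER = /^-?(?:0|[1-9]\d*)$/;

function classifyNumber(token) {
  if (INTEGER.test(token)) return "integer";
  if (FLOAT.test(token)) return "float";
  return null; // not a number under this reading
}

console.log(classifyNumber("42")); // integer
console.log(classifyNumber("-1.5e3")); // float
console.log(classifyNumber("1e5")); // null
```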

creationix commented 2 years ago

Binary is simple: [image]
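
A minimal sketch of reading and writing the `<...>` binary syntax shown in the example document. It assumes the content is an even number of hex digits with no whitespace and accepts either letter case; the actual diagram may differ on those points. The names `parseBinary` and `formatBinary` are hypothetical.

```javascript
// Parse nibs-text binary such as <deadbeef> into bytes.
// Assumption: pairs of hex digits only, no whitespace between them.
function parseBinary(text) {
  const match = /^<((?:[0-9a-fA-F]{2})*)>$/.exec(text);
  if (!match) {
    throw new SyntaxError(`not a nibs-text binary literal: ${text}`);
  }
  return Uint8Array.from(Buffer.from(match[1], "hex"));
}

// Format bytes back into the same syntax using lowercase hex.
function formatBinary(bytes) {
  return `<${Buffer.from(bytes).toString("hex")}>`;
}

const parsed = parseBinary("<deadbeef>");
console.log(parsed.length); // 4
console.log(formatBinary(parsed)); // <deadbeef>
```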

creationix commented 2 years ago

Here is the full proposed syntax as a railroad diagram. I'm not happy about the amount of duplication between integer and float. This will require some state and/or lookahead in parsers.

Also note the change in list vs array: the JSON array syntax maps to nibs array, and nibs list is renamed tuple since it uses parentheses.

[images: value, whitespace]

creationix commented 2 years ago

Hmm, this is still not good enough. The strings without quotes conflict with the keyword-based values true, false, null, nan, and inf. Maybe they can use a $ prefix or just be dropped?

Also I forgot nan, inf, and -inf in the diagram.

creationix commented 2 years ago

This is what it looks like with the $ added in (and the missing floats added). At this point I don't see enough value in strings without quotes, and we should probably remove them. The other option is that the spec could be like JavaScript and allow any string that's not a keyword.

[image: value]

reconbot commented 2 years ago

String without quotes is a pita, half the yaml has quotes anyway

creationix commented 2 years ago

Yeah, let's just remove it. Less is more.

creationix commented 2 years ago

I also removed the hex/binary/octal encoding and was able to merge the two number types.

[image: combined]

creationix commented 2 years ago

Initial stab at the text spec here: https://github.com/creationix/nibs/blob/add-text-format/docs/text-format.md

creationix commented 2 years ago

Proposed spec in PR #8

creationix commented 2 years ago

We should also have a text format for disassembled nibs to enable tools like this https://geraintluff.github.io/cbor-debug/

The 3 formats should be able to be converted between each other.