FasterXML / smile-format-specification

New home for Smile format (https://en.wikipedia.org/wiki/Smile_(data_interchange_format))
BSD 2-Clause "Simplified" License

Clarification about WebSockets compatibility and avoidance of 0xFF byte marker #13

Closed by jviotti 3 years ago

jviotti commented 3 years ago

The Smile design goals document mentions that:

(Smile) SHOULD be usable with [[http://en.wikipedia.org/wiki/Web_socket | WebSockets]] to degree feasible. Means that effort should be made to avoid use of byte 0xFF (end marker) -- for some content (binary data), it may be necessary to introduce separate "safe" (or "framing"?) mode to use additional escaping?

I might be looking at the wrong place, but The WebSocket Protocol (RFC 6455) does not seem to mention a 0xFF end marker byte. Can you point me to where this is mentioned? It seems odd at first sight that a protocol such as WebSockets conditions the data encoding layer to not contain certain byte patterns, but I might be missing something.

cowtowncoder commented 3 years ago

I can try digging for the reference that I remember seeing, but note that 0xFF is not a byte that UTF-8 encoded Strings would ever contain, so it would naturally be absent from textual JSON or XML with UTF-8 encoding. So my comment was with respect to one particular subset of WebSocket content, namely textual content; and as you point out, it could not really be a limit in general for binary formats.
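The claim about UTF-8 can be checked directly. The following is a minimal sketch (the function name `utf8_bytes_used` is made up for illustration) that encodes every Unicode code point except the surrogates and collects the byte values that occur:

```python
def utf8_bytes_used():
    """Collect every byte value that appears in the UTF-8 encoding of any code point."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen


used = utf8_bytes_used()
print(0xFF in used)    # False: 0xFF never occurs in valid UTF-8
print(hex(max(used)))  # 0xf4: the lead byte of U+10FFFF is the largest byte used
```

This only says something about textual payloads; arbitrary binary content can of course contain any byte value.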

cowtowncoder commented 3 years ago

So, I think at the time I had seen references to this draft:

https://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76

which I found by searching, via this:

https://stackoverflow.com/questions/4846740/what-is-the-meaning-of-x00-and-xff-in-websockets

and to quote:

Once the client and server have both sent their handshakes, and if the handshake was successful, then the data transfer part starts. This is a two-way communication channel where each side can, independently from the other, send data at will.

   Data is sent in the form of UTF-8 text.  Each frame of data starts
   with a 0x00 byte and ends with a 0xFF byte, with the UTF-8 text in
   between.

It is quite possible that WebSocket specification evolved further and perhaps this framing mechanism is not used at all any more.
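For reference, the framing described in the quoted draft can be sketched as follows. This is an illustration of the legacy draft-76 text framing only, not of the framing in RFC 6455, and the helper names `frame_text` and `unframe_stream` are hypothetical:

```python
FRAME_START = 0x00  # per the quoted draft: each frame starts with 0x00
FRAME_END = 0xFF    # ... and ends with 0xFF


def frame_text(text):
    """Wrap UTF-8 text in a 0x00 ... 0xFF frame."""
    return bytes([FRAME_START]) + text.encode("utf-8") + bytes([FRAME_END])


def unframe_stream(data):
    """Split a byte stream of concatenated frames back into text messages."""
    messages = []
    pos = 0
    while pos < len(data):
        if data[pos] != FRAME_START:
            raise ValueError("expected 0x00 at offset %d" % pos)
        # Safe only because valid UTF-8 never contains 0xFF.
        end = data.index(FRAME_END, pos + 1)  # raises ValueError if unterminated
        messages.append(data[pos + 1:end].decode("utf-8"))
        pos = end + 1
    return messages


stream = frame_text("hello") + frame_text("héllo ☺")
print(unframe_stream(stream))
```

The scheme works without escaping only because the payload is restricted to UTF-8 text, which is why it mattered for a binary format that wanted to be usable over the same channel.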

cowtowncoder commented 3 years ago

I would be happy to get a PR that updates this part of the rationale since it looks like the framing was redone and does not really use this approach at all... :)

jviotti commented 3 years ago

@cowtowncoder Thanks for the response. I see. So it looks like Smile would be fine without the 7-bit floating-point approach currently used to avoid the 0xFF byte, and could use standard IEEE 754 instead, right? I don't remember if similar tricks are applied to other data types, but this looks like something that should be changed in a new major version of Smile, or somehow added with backwards compatibility in mind.

What do you think?
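To make the trade-off concrete, here is a minimal sketch of the 7-bits-per-byte idea for a 32-bit float. It assumes a big-endian, right-aligned layout in 5 bytes; the exact bit layout is defined by the Smile specification and should be checked there. The function names are hypothetical:

```python
import struct


def encode_float32_7bit(value):
    """Pack the 32 IEEE 754 bits of a float into 5 bytes, 7 bits per byte."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    out = bytearray(5)
    for i in range(4, -1, -1):  # fill from the last byte backwards
        out[i] = bits & 0x7F    # high bit stays clear, so 0xFF cannot occur
        bits >>= 7
    return bytes(out)


def decode_float32_7bit(data):
    """Inverse of encode_float32_7bit."""
    bits = 0
    for b in data:
        bits = (bits << 7) | (b & 0x7F)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


largest = 3.4028234663852886e38          # largest finite float32
raw = struct.pack(">f", largest)         # plain IEEE 754 bytes
safe = encode_float32_7bit(largest)
print(raw.hex(), 0xFF in raw)            # the raw form contains 0xFF bytes
print(safe.hex(), max(safe) < 0x80)      # the 7-bit form never sets the high bit
```

The cost is one extra byte per 32-bit float (5 instead of 4), plus the shifting on both the write and read side.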

cowtowncoder commented 3 years ago

If there was desire to allow this optimization -- similar to using straight-on raw binary data, instead of 7-bits/byte mode -- sure, it could be added, but I don't think it is worth changing this without backwards-compatibility. I suppose header bit 2 ("this document may contain 'raw' binary") could also indicate additional "may contain 4/8 byte floating-point" numbers... except that we'd need markers for values.

Note, however, that reserving 0xFF was not only desired for WebSocket; that was more of an example. Having reliable and efficient framing seems valuable for many use cases -- this allows finding document boundaries for things like splitting segments of Hadoop input without decoding content. So WS was just an example, not a major goal.
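The framing use case can be sketched as a plain byte scan. This assumes that every document in the stream is terminated by a 0xFF end marker and that no document uses raw binary, so 0xFF cannot appear inside a document. The payloads below are opaque placeholders rather than real Smile documents, and `split_documents` is a hypothetical helper:

```python
END_MARKER = b"\xff"


def split_documents(stream):
    """Split a byte stream on 0xFF without decoding the documents.

    Returns (complete_documents, trailing_bytes), where trailing_bytes is
    whatever follows the last end marker (for example a partial document).
    """
    docs = []
    start = 0
    while True:
        end = stream.find(END_MARKER, start)
        if end == -1:
            break
        docs.append(stream[start:end])
        start = end + 1
    return docs, stream[start:]


# Placeholder payloads; any bytes other than 0xFF would do.
stream = b"\x01\x02" + END_MARKER + b"\x7f\x00\x10" + END_MARKER + b"\x05"
docs, rest = split_documents(stream)
print(docs, rest)
```

A splitter like this can start from an arbitrary offset in a large file, skip to the next 0xFF, and begin reading whole documents from there, which is the kind of input splitting mentioned above.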

jviotti commented 3 years ago

@cowtowncoder That makes sense. What do you think about this clarification then: https://github.com/FasterXML/smile-format-specification/pull/14?

cowtowncoder commented 3 years ago

Looks good, merged, thanks!