Closed jviotti closed 3 years ago
I can try digging for the reference that I remember seeing, but note that 0xFF is not a byte that UTF-8 encoded strings would ever contain, so it would naturally be absent from textual JSON or XML with UTF-8 encoding. So my comment was about one particular subset of WebSocket content, namely textual content; and, as you point out, it could not really be a general limit for binary formats.
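The claim that 0xFF never occurs in UTF-8 can be checked exhaustively. The following is a minimal sketch, not part of the original discussion; the function name `utf8_byte_values` is made up for illustration. It encodes every Unicode scalar value and collects the byte values that appear.

```python
def utf8_byte_values():
    """Collect every byte value that occurs in the UTF-8 encoding of any code point."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen


if __name__ == "__main__":
    seen = utf8_byte_values()
    print("0xFF appears in UTF-8:", 0xFF in seen)
    print("highest byte value seen:", hex(max(seen)))
```

The highest byte that can appear is 0xF4 (the lead byte of the last four-byte sequences), so 0xFF is safe to use as a marker around UTF-8 text.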
So, I think at the time I had seen references to this draft:
https://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76
which I found by searching, via this question:
https://stackoverflow.com/questions/4846740/what-is-the-meaning-of-x00-and-xff-in-websockets
which quotes the draft:
Once the client and server have both sent their handshakes, and if the handshake was successful, then the data transfer part starts. This is a two-way communication channel where each side can, independently from the other, send data at will.
Data is sent in the form of UTF-8 text. Each frame of data starts
with a 0x00 byte and ends with a 0xFF byte, with the UTF-8 text in
between.
It is quite possible that WebSocket specification evolved further and perhaps this framing mechanism is not used at all any more.
I would be happy to get a PR that updates this part of the rationale, since it looks like the framing was redone and does not really use this approach at all... :)
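For reference, the draft-76 framing quoted above (0x00, UTF-8 text, 0xFF) can be sketched roughly as follows. This is only an illustration of the quoted paragraph, not of RFC 6455 framing, and the helper names `frame_text` and `split_frames` are hypothetical.

```python
from typing import List


def frame_text(message: str) -> bytes:
    """Wrap UTF-8 text in a draft-76 style frame: 0x00, payload, 0xFF."""
    return b"\x00" + message.encode("utf-8") + b"\xff"


def split_frames(stream: bytes) -> List[str]:
    """Split a byte stream of back-to-back frames into text messages."""
    messages = []
    pos = 0
    while pos < len(stream):
        if stream[pos] != 0x00:
            raise ValueError("expected 0x00 frame start at offset %d" % pos)
        # 0xFF cannot occur inside UTF-8 text, so the first 0xFF ends the frame.
        # bytes.index raises ValueError if the end marker is missing.
        end = stream.index(b"\xff", pos + 1)
        messages.append(stream[pos + 1:end].decode("utf-8"))
        pos = end + 1
    return messages
```

This only works because the payload is guaranteed never to contain the end byte, which is the property the discussion is about.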
@cowtowncoder Thanks for the response. I see. So it looks like Smile would be fine without the 7-bit floating-point approach currently used to avoid the 0xFF byte, and could use standard IEEE 754 instead, right? I don't remember if similar tricks are applied to other data types, but this looks like something that should be changed in a new major version of Smile, or somehow added with backwards compatibility in mind.
What do you think?
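To make the trade-off concrete, here is a rough sketch of a 7-bits-per-byte encoding of a 64-bit double next to the raw IEEE 754 bytes. It assumes a big-endian layout in 10 bytes and is meant to illustrate the idea only; the function names are hypothetical and the exact byte layout should be checked against the Smile specification.

```python
import struct


def encode_double_7bit(value: float) -> bytes:
    """Spread the 64 IEEE 754 bits over 10 bytes, 7 bits each, most significant first."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    out = bytearray(10)
    for i in range(9, -1, -1):
        out[i] = bits & 0x7F  # high bit of every output byte stays 0
        bits >>= 7
    return bytes(out)


def decode_double_7bit(data: bytes) -> float:
    """Inverse of encode_double_7bit."""
    bits = 0
    for b in data:
        bits = (bits << 7) | (b & 0x7F)
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


if __name__ == "__main__":
    raw = struct.pack(">d", float("-inf"))  # raw IEEE 754 bytes start with 0xFF
    safe = encode_double_7bit(float("-inf"))
    print(raw.hex(), safe.hex())
```

The raw form takes 8 bytes but can contain 0xFF (negative infinity is one such value); the 7-bit form takes 10 bytes and every byte stays below 0x80.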
If there was desire to allow this optimization -- similar to using straight-on raw binary data, instead of 7-bits/byte mode -- sure, it could be added, but I don't think it is worth changing this without backwards-compatibility. I suppose header bit 2 ("this document may contain 'raw' binary") could also indicate additional "may contain 4/8 byte floating-point" numbers... except that we'd need markers for values.
Note, however, that reserving 0xFF was not only desired for WebSocket; that was more of an example. Having reliable and efficient framing seems valuable for many use cases -- this allows finding document boundaries for things like splitting segments of Hadoop input without decoding content. So WS was just an example, not a major goal.
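As a sketch of the splitting use case: if every document is terminated by 0xFF and that byte never occurs inside a document, a large buffer can be cut into segments at document boundaries by scanning for the marker, without decoding anything. The function name `aligned_splits` and the `target_size` parameter are assumptions made for this example.

```python
from typing import List, Tuple

END_MARKER = b"\xff"  # assumed to terminate each document and never occur inside one


def aligned_splits(buf: bytes, target_size: int) -> List[Tuple[int, int]]:
    """Cut buf into (start, end) segments of roughly target_size bytes,
    moving each cut forward to just after the next end marker."""
    if target_size < 1:
        raise ValueError("target_size must be positive")
    splits = []
    start = 0
    while start < len(buf):
        tentative = min(start + target_size, len(buf))
        # Scan forward from the tentative cut to the next document boundary.
        idx = buf.find(END_MARKER, tentative - 1)
        end = len(buf) if idx == -1 else idx + 1
        splits.append((start, end))
        start = end
    return splits
```

Each segment then contains only whole documents, so workers can process segments independently.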
@cowtowncoder That makes sense. What do you think about this clarification then: https://github.com/FasterXML/smile-format-specification/pull/14?
Looks good, merged, thanks!
The Smile design goals document mentions that:
I might be looking at the wrong place, but The WebSocket Protocol (RFC 6455) does not seem to mention a 0xFF end marker byte. Can you point me to where this is mentioned? It seems odd at first sight that a protocol such as WebSockets conditions the data encoding layer to not contain certain byte patterns, but I might be missing something.