edn-format / edn

Extensible Data Notation

When to decode strings? #59

Closed. jml closed this issue 10 years ago.

jml commented 11 years ago

edn streams & elements are all UTF-8 encoded, which is great. However, there's no guidance in the spec for whether a reader should decode strings into Unicode.

This matters, since sometimes you need to send a sequence of literal bytes (which should not be decoded), and other times you need to send human-readable text (which most definitely should).

One solution would be to say that the reader should never decode strings, and add a built-in tagged element #text that decodes the string.
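A minimal sketch of that proposal, assuming a hypothetical reader that hands each string literal's raw bytes to a tag handler. The names `read_string`, `text_handler`, `TAG_HANDLERS`, and `read_tagged` are invented for illustration and do not come from any edn library.

```python
# Hypothetical reader hooks illustrating the proposal; not a real edn API.

def read_string(raw: bytes) -> bytes:
    """Proposed default: return the string literal's bytes undecoded."""
    return raw


def text_handler(raw: bytes) -> str:
    """Proposed #text tag: decode the string literal's bytes as UTF-8."""
    return raw.decode("utf-8")


# Registry mapping tag names (without the leading '#') to handlers.
TAG_HANDLERS = {"text": text_handler}


def read_tagged(tag, raw: bytes):
    """Dispatch on the tag; untagged or unknown tags fall back to raw bytes."""
    handler = TAG_HANDLERS.get(tag)
    return handler(raw) if handler else read_string(raw)


untagged = read_tagged(None, b"caf\xc3\xa9")    # stays as bytes
as_text = read_tagged("text", b"caf\xc3\xa9")   # decoded to a str
```

Under this scheme the caller opts in to decoding, so byte payloads pass through untouched unless they are tagged as text.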

jml commented 11 years ago

Of course, another solution would be to interpret all strings as text and require custom extensions for literal bytes, e.g.

#bytes (72 101 108 108 111 32 119 111 114 108 100)

or

#base64 "SGVsbG8gd29ybGQ="

benmosher commented 11 years ago

The #base64 option is how I am planning to implement raw byte data in my obj-c implementation.

I'm confused about your issue; are you suggesting you'd stash arbitrary bytes into a string (i.e. "[unreadable gibberish]")? Wouldn't this violate the spec, in that it could (and likely would frequently) contain byte sequences that are not valid UTF-8? (off the top of my head, 0xFF would be an invalid UTF-8 byte)
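The point about 0xFF can be checked with a short sketch using only Python's standard library. The sample payload `raw` is made up for illustration; the idea is that arbitrary bytes may fail UTF-8 decoding, while their base64 form is plain ASCII and fits in an edn string.

```python
import base64

# Made-up payload for illustration; 0xFF never appears in valid UTF-8.
raw = bytes([0xFF, 0x00, 0x10])

try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# base64 output uses only ASCII characters, so it is always valid UTF-8.
encoded = base64.b64encode(raw).decode("ascii")
round_tripped = base64.b64decode(encoded)
```

Here `is_valid_utf8` ends up `False`, while `round_tripped` equals the original payload.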

jml commented 11 years ago

It would indeed violate the spec. (TIL: Not all sequences of bytes are valid UTF-8).

Although I think the spec could perhaps be clearer about whether readers should decode strings to Unicode, I'm happy to consider this issue closed.

richhickey commented 10 years ago

There will be a separate tag for bytes/base64.

paxan commented 10 years ago

@richhickey any guidance on what the tag will be called?

My vote: #base64 "YW55ICsgb2xkICYgZGF0YQ=="
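For comparison, here is a rough sketch of how handlers for the two tag spellings floated in this thread might decode their payloads. Neither `#bytes` nor `#base64` is confirmed by the thread as the final tag name, and `bytes_handler` and `base64_handler` are hypothetical names, not part of any edn implementation.

```python
import base64

# Hypothetical handlers for the tag names suggested in this thread.

def bytes_handler(ints):
    """#bytes (72 101 ...): a list of integers in 0..255 becomes bytes."""
    return bytes(ints)


def base64_handler(text):
    """#base64 "...": a base64 string becomes bytes; reject stray characters."""
    return base64.b64decode(text, validate=True)


hello_from_ints = bytes_handler(
    [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
)
hello_from_b64 = base64_handler("SGVsbG8gd29ybGQ=")
vote = base64_handler("YW55ICsgb2xkICYgZGF0YQ==")
```

Both earlier examples decode to the same bytes, `b"Hello world"`, and the payload in the vote above decodes to `b"any + old & data"`.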