Reader flag to support unicode escapes in string literals

avodonosov commented 4 years ago

Despite the edn format doesn't specify unicode escapes in string literals (unlike for characters), in practice, it is very inconvenient sometimes. Optional support for unicode escapes in string literals, managed by reader config flag is very desirable.

Looking into the code I think unicode escapes are not even supported in characters. This is a bug - the EDN specification requires unicode escapes in characters: https://github.com/edn-format/edn#characters.

bpsm commented 4 years ago

A few things:

I totally missed the addition of \uXXXX for character literals way back in 2014. edn-java does not support that, but should do so. #60
As I'm sure you're aware, unicode escapes in strings are not, strictly speaking, necessary since "edn elements, streams and files should be encoded using UTF-8.".

What's the use case for unicode escapes in string literals? Are you getting edn that you have no control over that is broken in this way? Are you dumping raw bytes of some other format (JSON) directly into an edn stream, crossing your fingers and hoping everything will be fine? Do you need to persist or transmit edn in a character encoding other than UTF-8?

I'd be willing to add \uXXXX interpretation when reading string literals with edn-java, but it would help to better understand how it would be useful to you.
I wouldn't want printing to ever produce \uXXXX escapes, however, as this would produce output that's broken for other edn parsers.

avodonosov commented 4 years ago

Hello, thank you for the response.

The motivation are invisible, zero-width characters like \u200c.

I keep in .edn files certain configuration and unit test data for that configuration. One of the examples I must support includes text containing \u200c. If I simply save it as UTF-8, other people reading this file will have difficulty understanding what is going on.

As for printing, I also was thinking it should not produce unicode escapes, even for characters, because I don't know how to decide what characters should use the escaped representation.

bpsm commented 4 years ago

Yes, I can see the utility of that. I'll see what I can do about that over the weekend.

bpsm commented 4 years ago

Please have a look at what I just pushed to the develop branch.

avodonosov commented 4 years ago

@bpsm , thank you, that would suit my needs.

So you don't' see a need for a special flag to enable Unicode escapes in string literals, it's fine to be always enabled?

bpsm commented 4 years ago

Yes, I think I'll just have it always enabled. It seems to me like a sensible addition and inoffensive enough to leave always on.

It occurred to me (after I'd implemented it) that there's actually another approach you could have taken to address your use case. (Maybe you'd thought of it, but I hadn't.) EDN is, Extensible data notation, after all:

#my/unicode-escaped "\\u1234 mumble ~ mumble \\u4444 mumble"

And then install a corresponding handler for the tag #my/unicode-escaped which scans the string for things that look like \uXXXX and does the necessary conversion. Note, however that if you're married to \uXXXX as the concrete syntax this solution means using \\ in the string literal since them's the rules. Of course, you could invent your own convention that doesn't overload \ for representing arbitrary Unicode characters if you wanted to at this point.


#my/unicode-swizzled "~1234 mumble ~007E mumble ~4444 mumble"

avodonosov commented 4 years ago

If you keep this feature always on, it's less work for me - just update the version number in dependency.

But to be honest I have to tell you one pro-flag consideration: people who want only valid EDN to enter their system. So if they can parse it with edn-java and store somewhere, then any other component down the processing chain is guaranteed to read the file.

That's why I reported the issue mentioning the flag.

The workaround with tags is interesting, I haven't thought of it. If you reject this issue and not implement I maybe use it. But would prefer to avoid - my configs are complex enough already for users to edit, EDN is not as widely know as JSON or XML, the config includes regular expressions represented as EDN string literals, so users already have to deal with double escaping of \ (regex syntax + edn string literals), adding tags and custom Unicode escapes into the mix would not be friendly.

Ideally, IMHO, the spec would include unicode escapes for string literals. Missing them, while they are already specified for characters looks like the result of the spec being a bit drafty, not polished.

bpsm commented 4 years ago

I hadn't considered the use case of using edn-java to check incoming edn for well-formedness before persisting the original bytes. (Rather than, printing the just parsed edn back out to fresh bytes). I suppose that would argue in favor of making this extension off by default. (Maximum compatibility.)

I'll have a look at what adding this as an option to the parser configuration would look like.

bpsm commented 4 years ago

Please have a look at this change which makes \uXXXX an option to be configured by installing a TagHandler into the Parser.Config and tell me if you think that would be acceptable to you.

avodonosov commented 4 years ago

@bpsm , thank you, this solution would work for me.

FWIW, I'm not sure why you decided to express this option as a pseudo tag handler. Other pseudo tag handlers allow the user to customize the representation of EDN numbers in the object structure created by the parser. In the case of Unicode escapes not much room for custom behaviour, only one of the two predefined variants to chose: support or not support. Also, I noticed passing through the tag handler takes some redundant memory allocations: firs the Unicode escape text is put nnto the scanner buffer, then extracted as a substring of it and passed to the tag handler, the tag handler takes the substring of unicode escape to pass to parseInteger, then the resulting character is converted into a string and returned from the tag handler - 3 temporary string objects are created. When imagining how to implement the option I personally thought about simply adding an IF within the previous case 'u' : branch.

Of course, unicode escapes would usually be rare so these allocations will not be significant usually. If you like the code as is, I'm fine.

Thank you again for your support.

avodonosov commented 4 years ago

@bpsm, another aspect to consider is escapes support by the https://github.com/clojure/tools.reader from official clojure github project.

(Of course, double escapes are used here)

So it supports unicode and even octal escapes by default.

In this case, maybe edn-java does not need to make this feature optional, probably I was overthinking.

bpsm commented 4 years ago

Yea, I think you're right. I'll have edn-java follow clojure.tools.read.edn's lead on this.

So, I'll go back to making support for "\uXXXX" always on.

(This reminds me of another thing I have yet to create an issue for, Clojure's reader supports unicode characters in symbol and keyword names, though the spec makes it sound like it doesn't and edn-java does not currently support this. Do you have any thoughts on this?)

The garden path that lead me to (ab)using a TagHandler for in-string unicode escaping was something like this:

I didn't want to add a method to the Parser.Config interface just in case someone had chosen to write their own implementation of that interface, though that's not intended usage. (I had forgotten, since edn-java was originally Java 6 compatible that default methods are a thing since Java 8.)
I already use edn-specific TagHandlers at various points for extension purposes. (When all you have is a hammer, everything looks like a nail.)
I considered having the tag handler accept an Integer instead of a String, but thought it might be convenient to have the contract go String -> String since that would allow trivial Identity tag handler to just leave the string unchanged. I now think that's more "cute" than useful.

Shrug.

bpsm commented 4 years ago

This is implemented in 0.7.0

(But 0.7.0 is not currently on Maven central because of GPG issues I have been unable to resolve.)

avodonosov commented 4 years ago

Thank you!

bpsm / edn-java

Reader flag to support unicode escapes in string literals #59