Closed avodonosov closed 4 years ago
A few things:
What's the use case for unicode escapes in string literals? Are you getting edn that you have no control over that is broken in this way? Are you dumping raw bytes of some other format (JSON) directly into an edn stream, crossing your fingers and hoping everything will be fine? Do you need to persist or transmit edn in a character encoding other than UTF-8?
Hello, thank you for the response.
The motivation are invisible, zero-width characters like \u200c.
I keep in .edn files certain configuration and unit test data for that configuration. One of the examples I must support includes text containing \u200c. If I simply save it as UTF-8, other people reading this file will have difficulty understanding what is going on.
As for printing, I also was thinking it should not produce unicode escapes, even for characters, because I don't know how to decide what characters should use the escaped representation.
Yes, I can see the utility of that. I'll see what I can do about that over the weekend.
@bpsm , thank you, that would suit my needs.
So you don't' see a need for a special flag to enable Unicode escapes in string literals, it's fine to be always enabled?
Yes, I think I'll just have it always enabled. It seems to me like a sensible addition and inoffensive enough to leave always on.
It occurred to me (after I'd implemented it) that there's actually another approach you could have taken to address your use case. (Maybe you'd thought of it, but I hadn't.) EDN is, Extensible data notation, after all:
#my/unicode-escaped "\\u1234 mumble ~ mumble \\u4444 mumble"
And then install a corresponding handler for the tag #my/unicode-escaped which scans the string for things that look like \uXXXX
and does the necessary conversion. Note, however that if you're married to \uXXXX
as the concrete syntax this solution means using \\
in the string literal since them's the rules. Of course, you could invent your own convention that doesn't overload \
for representing arbitrary Unicode characters if you wanted to at this point.
#my/unicode-swizzled "~1234 mumble ~007E mumble ~4444 mumble"
If you keep this feature always on, it's less work for me - just update the version number in dependency.
But to be honest I have to tell you one pro-flag consideration: people who want only valid EDN to enter their system. So if they can parse it with edn-java and store somewhere, then any other component down the processing chain is guaranteed to read the file.
That's why I reported the issue mentioning the flag.
The workaround with tags is interesting, I haven't thought of it. If you reject this issue and not implement I maybe use it. But would prefer to avoid - my configs are complex enough already for users to edit, EDN is not as widely know as JSON or XML, the config includes regular expressions represented as EDN string literals, so users already have to deal with double escaping of \ (regex syntax + edn string literals), adding tags and custom Unicode escapes into the mix would not be friendly.
Ideally, IMHO, the spec would include unicode escapes for string literals. Missing them, while they are already specified for characters looks like the result of the spec being a bit drafty, not polished.
I hadn't considered the use case of using edn-java to check incoming edn for well-formedness before persisting the original bytes. (Rather than, printing the just parsed edn back out to fresh bytes). I suppose that would argue in favor of making this extension off by default. (Maximum compatibility.)
I'll have a look at what adding this as an option to the parser configuration would look like.
Please have a look at this change which makes \uXXXX an option to be configured by installing a TagHandler into the Parser.Config and tell me if you think that would be acceptable to you.
@bpsm , thank you, this solution would work for me.
FWIW, I'm not sure why you decided to express this option as a pseudo tag handler. Other pseudo tag handlers allow the user to customize the representation of EDN numbers in the object structure created by the parser. In the case of Unicode escapes not much room for custom behaviour, only one of the two predefined variants to chose: support or not support. Also, I noticed passing through the tag handler takes some redundant memory allocations: firs the Unicode escape text is put nnto the scanner buffer, then extracted as a substring of it and passed to the tag handler, the tag handler takes the substring of unicode escape to pass to parseInteger, then the resulting character is converted into a string and returned from the tag handler - 3 temporary string objects are created.
When imagining how to implement the option I personally thought about simply adding an IF within the previous case 'u' :
branch.
Of course, unicode escapes would usually be rare so these allocations will not be significant usually. If you like the code as is, I'm fine.
Thank you again for your support.
@bpsm, another aspect to consider is escapes support by the https://github.com/clojure/tools.reader from official clojure github project.
(Of course, double escapes are used here)
So it supports unicode and even octal escapes by default.
In this case, maybe edn-java does not need to make this feature optional, probably I was overthinking.
Yea, I think you're right. I'll have edn-java follow clojure.tools.read.edn's lead on this.
So, I'll go back to making support for "\uXXXX" always on.
(This reminds me of another thing I have yet to create an issue for, Clojure's reader supports unicode characters in symbol and keyword names, though the spec makes it sound like it doesn't and edn-java does not currently support this. Do you have any thoughts on this?)
The garden path that lead me to (ab)using a TagHandler for in-string unicode escaping was something like this:
Shrug.
This is implemented in 0.7.0
(But 0.7.0 is not currently on Maven central because of GPG issues I have been unable to resolve.)
Thank you!
Despite the edn format doesn't specify unicode escapes in string literals (unlike for characters), in practice, it is very inconvenient sometimes. Optional support for unicode escapes in string literals, managed by reader config flag is very desirable.
See also: https://github.com/edn-format/edn/pull/65
Looking into the code I think unicode escapes are not even supported in characters. This is a bug - the EDN specification requires unicode escapes in characters: https://github.com/edn-format/edn#characters.