clj-commons / clj-yaml

YAML encoding and decoding for Clojure
Other
120 stars 25 forks source link

Number-like strings not quoted #35

Closed grzm closed 2 years ago

grzm commented 2 years ago

It looks like there's a regression since 0.7.109 and still exists in 0.7.110 with respect to number-like strings.

% clj -Sdeps '{:deps {clj-commons/clj-yaml {:mvn/version "0.7.108"}}}' -M -e "(require '[clj-yaml.core :as yaml]) (doseq [x [\"083\" {:x \"083\"}]] (print (yaml/generate-string x)))"
'083'
{x: '083'}
 % clj -Sdeps '{:deps {clj-commons/clj-yaml {:mvn/version "0.7.109"}}}' -M -e "(require '[clj-yaml.core :as yaml]) (doseq [x [\"083\" {:x \"083\"}]] (print (yaml/generate-string x)))"
083
{x: 083}
 % clj -Sdeps '{:deps {clj-commons/clj-yaml {:mvn/version "0.7.110"}}}' -M -e "(require '[clj-yaml.core :as yaml]) (doseq [x [\"083\" {:x \"083\"}]] (print (yaml/generate-string x)))"
083
{x: 083}

Note that '083' in {x: '083'} is quoted in 0.7.108 but not quoted in 0.7.109 and 0.7.110.

I'm guessing it's an issue in the upstream snakeyaml library, but I haven't checked that yet.

dpsutton commented 2 years ago

With "good" 108 version:

❯ clj -Sdeps '{:deps { clj-commons/clj-yaml {:mvn/version "0.7.108"}}}'
Downloading: clj-commons/clj-yaml/0.7.108/clj-yaml-0.7.108.pom from clojars
Downloading: clj-commons/clj-yaml/0.7.108/clj-yaml-0.7.108.jar from clojars
Clojure 1.11.1
user=> (require '[clj-yaml.core :as yaml])
nil
user=>  (doseq [x ["083" {:x "083"}]] (print (yaml/generate-string x)))
'083'
{x: '083'}
nil
user=>

With "good" 108 but bumping the yaml lib to 1.32 (version used in 110)

 clj -Sdeps '{:deps {org.yaml/snakeyaml {:mvn/version "1.32"} clj-commons/clj-yaml {:mvn/version "0.7.108"}}}'
Downloading: org/yaml/snakeyaml/1.32/snakeyaml-1.32.pom from central
Downloading: org/yaml/snakeyaml/1.32/snakeyaml-1.32.jar from central
Clojure 1.11.1
user=> (require '[clj-yaml.core :as yaml])
nil
user=> (doseq [x ["083" {:x "083"}]] (print (yaml/generate-string x)))
083
{x: 083}
nil
grzm commented 2 years ago

Looks like the regression was introduced in snakeyaml 1.30. Using [org.yaml/snakeyaml "1.29"] shows the expected, quoted result.

grzm commented 2 years ago

Reported upstream: https://bitbucket.org/snakeyaml/snakeyaml/issues/550/regression-in-handling-number-like-strings

borkdude commented 2 years ago

@grzm I tested this with EDN input like this:

{:foo [123 "045" "083"]}

Output:

$ clj -Sdeps '{:deps {clj-commons/clj-yaml {:mvn/version "0.7.110"} org.yaml/snakeyaml {:mvn/version "1.31"}}}}' -M -e "(require '[clj-yaml.core :as yaml]) (print (yaml/generate-string (clojure.edn/read-string (slurp \"/tmp/test.edn\"))))"
foo: [123, '045', 083]

That doesn't make sense: strings should stay strings.

borkdude commented 2 years ago

It looks like we're have to going to fix it in the clj-yaml library instead of snakeyaml:

https://bitbucket.org/snakeyaml/snakeyaml/issues/550/regression-in-handling-number-like-strings

asomov commented 2 years ago

You can provide your own Resolver to SnakeYAML (and drop the octal numbers and timestamps) https://bitbucket.org/snakeyaml/snakeyaml/src/master/src/main/java/org/yaml/snakeyaml/resolver/Resolver.java

lread commented 2 years ago

I don't fully get it yet, so I'm trying to understand better by hitting snakeyaml directly.

Given snake_load.clj

(require '[clojure.pprint :as pprint])
(import '(org.yaml.snakeyaml Yaml))

(pprint/print-table
  (for [yaml ["'083'" "083" "043" "83"]]
    {:from (pr-str yaml) :to (pr-str (.load (Yaml.) yaml))}))

And snake_dump.clj

(require '[clojure.pprint :as pprint])
(import '(org.yaml.snakeyaml Yaml))

(pprint/print-table
  (for [in ["83" "043" "083" 83]]
    {:from (pr-str in) :to (pr-str (.dump (Yaml.) in))}))

And a little snaketest.clj bb driver (we were recently on 1.26, our issue seems to have occurred between 1.29 and 1.30 and we were recently on 1.31 so...)

(require '[babashka.tasks :as t])

(doseq [script ["snake_load.clj" "snake_dump.clj"]
        snake-version ["1.26" "1.29" "1.30" "1.31" "1.32"]]
  (println "snakeyaml version" snake-version script "--------")
  (t/clojure "-Sdeps"
             (format "{:deps {org.yaml/snakeyaml {:mvn/version \"%s\"}}}" snake-version)
             "-M" script)
  (println))

When I run bb snaketest.clj I get (edited down and annotated to what is interesting).

Loading from YAML:

snakeyaml version 1.26 and 1.29 snake_load.clj --------

|   :from |   :to |
|---------+-------|
| "'083'" | "083" |
|   "083" |  83.0 | <-- this seems wrong
|   "043" |    35 |
|    "83" |    83 |

snakeyaml version 1.30, 1.31 and 1.32 snake_load.clj --------

|   :from |   :to |
|---------+-------|
| "'083'" | "083" |
|   "083" | "083" | <<-- difference
|   "043" |    35 |
|    "83" |    83 |

So: 083 is not a valid YAML octal number so it is assumed to be a string. 043 is valid YAML octal, so it gets loaded as a numeric. (Octal 43 is 35 decimal).

Dumping to YAML:

snakeyaml version 1.26 and 1.29 snake_dump.clj --------

| :from |       :to |
|-------+-----------|
|  "83" |  "'83'\n" |
| "043" | "'043'\n" |
| "083" | "'083'\n" |
|    83 |    "83\n" |

snakeyaml version 1.30, 1.31 and 1.32 snake_dump.clj --------

| :from |       :to |
|-------+-----------|
|  "83" |  "'83'\n" |
| "043" | "'043'\n" |
| "083" |   "083\n" | <-- difference
|    83 |    "83\n" |

So: I don't get this one yet. The string "083" is being converted to what looks like an invalid YAML number? Does anybody understand the rationale for this new snakeyaml behaviour yet? After reading this snakeyaml issue, all I'm guessing is since string "083" looks like a number (starts with 0), but can't be number because it is not valid octal, it is safe to be dumped like as a YAML without quotes, because it won't be interpreted as YAML number when it is loaded? Brain twisty!

So maybe in YAML 1.1 a string that looks like a number but can't be a number can forego the quotes (but why? I've not noticed any shortage in the availability of quote characters).

Is this a bug in the YAML 1.1 spec (they did rework octal numbers for YAML 1.2) or a bug in snakeyaml or a bug in my noggin?

Anyhoo.... if my understanding above is all solid, or good enough, I think we can adapt to spitting out strings that look like numbers but are not numbers to include the YAML quotes.

@asomov are you suggesting there is a similar issue with timestamps?

asomov commented 2 years ago

you asked a few questions, but I think you know the answers

lread commented 2 years ago

Thank you for your reply @asomov, very much appreciated!

And also thank you so much for all the hard work on the snakeyaml libraries (sorry, I didn't make the connection until just now). I can't imagine that it is easy work!

lread commented 2 years ago

@borkdude I'll also take a peek at your roundtrip test from above, understand what is going on, and make sure we'll be good there.

lread commented 2 years ago

So @borkdude, on your example:

{:foo [123 "045" "083"]}

Output:

$ clj -Sdeps '{:deps {clj-commons/clj-yaml {:mvn/version "0.7.110"} org.yaml/snakeyaml {:mvn/version "1.31"}}}}' -M -e "(require '[clj-yaml.core :as yaml]) (print (yaml/generate-string (clojure.edn/read-string (slurp \"/tmp/test.edn\"))))"
foo: [123, '045', 083]

That doesn't make sense: strings should stay strings.

The non-intuitive thing (to me) that I learned is that 083 is actually a string in YAML because, although it looks like a number, it cannot be a number.

Under snakeyaml 1.3.1

(def x {:foo [123 "045" "083"]})
(def o (generate-string x))
o
;; => "foo: [123, '045', 083]\n"

(parse-string o)
;; => #ordered/map ([:foo (123 "045" "083")])

We can choose to explicitly exclude octal support with a custom snakeyaml Resolver. I'd rather just emit '083' instead of 083, but not sure if we can ask snakeyaml to do that yet.

But... we can also do nothing. It could be argued that we do not have a bug here.

borkdude commented 2 years ago

Is it safe to propose the following based on @grzm's latest comment at SnakeYAML's repo:

clj-yaml will preserve quotes around strings that look like numbers. This is not against the yaml 1.1 spec and will undo breaking changes experienced by reading output of this library by yaml 1.2 readers.

We could coordinate with @asomov to get this change back into SnakeYAML but I think SnakeYAML doesn't have very frequent releases. Until that happens, we could perhaps implement the proposal in clj-yaml, until a new SnakeYAML comes out that has the change and then undo it here.

lread commented 2 years ago

Yes thanks, I think we've finally gotten to the crux of this issue! 🎉

grzm commented 2 years ago

This is not against the yaml 1.1 spec

To be clear, I believe this to be true, but haven't read the spec closely enough to, say, provide expert testimony in a court of law.

lread commented 2 years ago

So should we recap our current options?

option 1 - do nothing

Let @grzm and others suffer! 😈

option 2 - Preserve quotes around strings that look like numbers

option 2a - PR snakeyaml

After it is approved maybe use a fork until it gets released?

option 2b - try to get snakeyaml 1.32 to do this

Requires some snakeyaml spelunking and might not be viable.

option 3 - tell snakeyaml to not recognize octals

Not sure about the effect of this yet, but can tell snakeyaml not to resolve octals. Maybe current octals would be parsed as strings? Would this affect emitting? Would need to try.

borkdude commented 2 years ago

Option 2a sounds promising, but this depends on what @asomov thinks about this.

Option 3: @asomov hinted at this here I believe. Option 3b would be: provide an option to drop octal parsing (enabled by default or disabled by default).

lread commented 2 years ago

Oh right forgot:

option 4 - migrate from snakeyaml to snakeyaml-engine

This avoids the broken YAML 1.1 octals by moving to YAML 1.2. But also could be a bit of work, and might introduce other issues. Have not looked into it at all. Might be worth an exploration.

lread commented 2 years ago

Trying option 2a: Submitted an issue and PR to snakeyaml.

borkdude commented 2 years ago

It was merged! Thanks you @lread and @asomov!

Now, we could either wait for 1.33 or depend on a fork until 1.33 comes out. Considering that this a breaking change to other Clojure users, I'd like to have it sooner than later.

asomov commented 2 years ago

there is another issue to fix. Once finished, we can release SnakeYAML 1.33

lread commented 2 years ago

Thanks @asomov! (and thanks for the variable name change, more clear for sure!).

If I can help with that final issue, just let me know.

borkdude commented 2 years ago

For clarity: what is that final issue?

asomov commented 2 years ago

https://bitbucket.org/snakeyaml/snakeyaml/issues/553/loaderoptionssetcodepointlimit-not-honored

borkdude commented 2 years ago

@asomov Don't mean to push you, so just take this as a question:

Is there a likelihood that there will be a new SnakeYAML release within the next few days?

I'd like to release a new version of https://babashka.org/ (scripting environment for Clojure) in which clj-yaml is a built-in dependency, so it'd be nice to know if I should wait, or just go ahead without the new SnakeYAML and bump clj-yaml in a future release.

Thanks for the ongoing work!

asomov commented 2 years ago

I would like to finish this

borkdude commented 2 years ago

Thanks for the heads up! I'll watch the issue

asomov commented 2 years ago

@lread SnakeYAML 1.33 has been released. It may take a few hours to promote the JARs to Maven central

borkdude commented 2 years ago

@asomov Thanks so much for your work!

lread commented 2 years ago

@asomov a big sincere thank you for SnakeYAML and for this new release!

grzm commented 2 years ago

Thank you, @asomov !