01mf02 / jaq

A jq clone focussed on correctness, speed, and simplicity
MIT License

Parsing zero-padded numbers #169

Open kklingenberg opened 4 months ago

kklingenberg commented 4 months ago

While parsing zero-padded numbers I came across this minor issue. This is a minimal example:

$ echo "0012" | jaq .
0
0
12

Whereas jq yields just 12.

This is serde_json at work, which in turn is probably following the JSON spec (that is my guess). Here is another view of the issue:

$ echo "0012" | jaq -R fromjson 
Error: cannot parse 0012 as JSON: end of file expected

The lexer rejects these numbers too, which is fine and consistent with jaq's JSON parser. jq is likewise consistent with its own, more lenient parser:

$ jaq -n '0012'
Error: Unexpected token, expected as, *, +=, /=, %=, >=, /, ?, %, and, =, or, +, -, |, [, end of input, ==, -=, |=, //, *=, <=, ., !=, ,, >, <
   ╭─[<unknown>:1:2]
   │
 1 │ 0012
   │  ┬  
   │  ╰── Unexpected token 0
───╯

$ jq -n '0012'
12

Anyway, when working with these numbers one might hope to use the tonumber filter, but that is also implemented in terms of fromjson, so no luck there.

My suggestion is to either:

kklingenberg commented 4 months ago

An example of another side-effect of the current implementation of tonumber:

$ echo '"{}"' | jaq tonumber
{}

wader commented 4 months ago

Related: https://github.com/jqlang/jq/pull/3055 (jq used to allow whitespace in the input of tonumber, but no longer does).

kklingenberg commented 4 months ago

Interesting. So yet another side effect of tonumber just being fromjson is that it tolerates whitespace:

$ echo ' 12 ' | jaq -Rc '[., tonumber]'
[" 12 ",12]
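
The side effects reported so far can be reproduced with any strict JSON parser. The following is only a sketch in Python, not jaq's code: `naive_tonumber` is a hypothetical name, and Python's `json` module stands in for serde_json on the assumption that both follow the JSON grammar in these cases.

```python
import json


def naive_tonumber(text):
    # Hypothetical stand-in for a tonumber that simply delegates to a JSON
    # parser: it returns whatever JSON value the text holds, number or not.
    return json.loads(text)


print(naive_tonumber(" 12 "))  # surrounding whitespace is accepted
print(naive_tonumber("{}"))    # a non-number passes straight through

try:
    naive_tonumber("0012")
except json.JSONDecodeError as err:
    # Zero-padded input is not a single JSON value, so it is rejected.
    print("rejected:", err.msg)
```

All three observations follow from the same design choice: the conversion never checks that the parsed value is a number.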

pkoppstein commented 4 months ago

@kklingenberg - Good catch re jaq -n '"{}"|tonumber'. That's a bug that needs fixing.

Since different dialects of jq have and will probably continue to have very different implementations of tonumber, I think it would be good if jaq could lead the way with respect to a non-strict version, and in that spirit I'd like to propose that tonumber(regex) be defined using match/1, perhaps along the following lines:

def tonumber(regex): match(regex).string | sub("^00*"; "0") | strict_tonumber;

it being understood that strict_tonumber is a strict version of tonumber, i.e. it would result in an error if its string input does not conform to the JSON specification of a number.
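
To make the split between a strict and a non-strict conversion concrete, here is a rough Python sketch of the idea. It is not jq or jaq code. The names `strict_tonumber` and `lenient_tonumber` are hypothetical, and the normalisation step is an assumption: it removes zeros that precede another digit, which may differ from what the proposed definition intends.

```python
import json
import re

# JSON number grammar (RFC 8259): optional minus, an integer part without
# redundant leading zeros, an optional fraction, an optional exponent.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")


def strict_tonumber(text):
    # Accept only text that is exactly one JSON number: no whitespace,
    # no zero padding, no other JSON values.
    if JSON_NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a JSON number: {text!r}")
    return json.loads(text)


def lenient_tonumber(text, pattern):
    # Take the first match of the caller-supplied pattern, normalise it,
    # then hand the result to the strict conversion.
    found = re.search(pattern, text)
    if found is None:
        raise ValueError(f"no match for {pattern!r} in {text!r}")
    # Assumption: drop zeros that are followed by another digit, keeping
    # an optional leading minus sign.
    normalised = re.sub(r"^(-?)0+(?=[0-9])", r"\1", found.group(0))
    return strict_tonumber(normalised)


print(lenient_tonumber("0012", r"-?[0-9]+"))  # 12
print(lenient_tonumber(" 12 ", r"-?[0-9]+"))  # 12
```

With this split, `strict_tonumber("{}")`, `strict_tonumber(" 12 ")` and `strict_tonumber("0012")` all raise an error, while the caller decides through the pattern how much leniency to allow.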

01mf02 commented 3 months ago

Regarding the "weird" number parsing behaviour for "0012": this is unfortunate, I agree, but it stems from the fact that sequences of JSON values are not standardised (I believe). First, JSON numbers cannot have redundant leading 0s, as the JSON spec shows, so as soon as a leading 0 is followed by anything other than [.eE], we know that we are dealing with just the number 0, and that everything after it belongs to a new value. Next, jq allows values to be concatenated without whitespace, such as [1][2]. I generalised this to allow concatenation of any JSON values without whitespace. That includes numbers, which is what produces the behaviour seen when parsing "0012". I'm not saying that this behaviour is very intuitive, but I think that it is consistent.
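
The rule described here can be sketched in a few lines of Python. This is an illustration only, not jaq's parser: `parse_concatenated` is a hypothetical helper, and it relies on `json.JSONDecoder.raw_decode`, which parses one value starting at a given offset and reports where that value ended.

```python
import json


def parse_concatenated(text):
    # Parse a sequence of JSON values that may follow each other directly,
    # with or without whitespace in between.
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if pos == len(text):
            return values
        # Parse exactly one value and continue right after it.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)


print(parse_concatenated("[1][2]"))  # [[1], [2]]
print(parse_concatenated("0012"))    # [0, 0, 12]
```

Each leading 0 is a complete number under the JSON grammar, so the parser emits it and starts a new value at the next character, which gives the three outputs reported at the top of this issue.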

01mf02 commented 3 months ago

Regarding tonumber, I still have to think a bit about how to do this best ...