cyberphone / json-canonicalization

JSON Canonicalization Scheme (JCS)

Consider arbitrary-precision decimal numbers #2

Closed tupelo-schneck closed 6 years ago

tupelo-schneck commented 6 years ago

The algorithm of section 7.1.12.1 of ES6 can be applied with no substantial changes to arbitrary-precision decimal numbers. I suggest using that approach for canonicalizing numbers in JCS.

I believe the result will be compatible with ES6 "ToString(Number)" (and thus the current JCS draft specification) for any JSON input N which has the property that ES6 "ToString(ToNumber(N))" is identical to N, notably numbers between 10^-308 and 10^308 with up to 15 significant decimal places.

To my mind this approach is more appropriate to JSON, which natively encodes arbitrary-precision decimal numbers. It will be more work to code a JavaScript canonicalizer, but I think not so bad. It will be easier in languages (like Java say) that have a ready-to-use arbitrary-precision decimal type.

cyberphone commented 6 years ago

Thanx. In my opinion, breaking I-JSON does not bring any advantages. As it stands, ES6 isn't able to deal with the recent BigInt addition itself: https://github.com/tc39/proposal-bigint/issues/24#issuecomment-404874878

Other languages have taken different directions here. Oracle uses a scheme where large numbers are serialized as strings and conforming ones as numbers, irrespective of the underlying data type. That is, we don't have compatibility for extended number types even at the JSON level 😥

The proposed extension method is effectively the de-facto standard for JSON data targeting multiple platforms.

tupelo-schneck commented 6 years ago

I-JSON only says "SHOULD NOT" and "RECOMMENDED" about limiting the precision of numbers, reserving its MUSTs for other issues. Thus supporting arbitrary-precision numbers is not breaking I-JSON; in fact the opposite: since I-JSON messages are allowed to have high-precision numbers, JCS as currently written rejects valid I-JSON messages.

I would still say "at the JSON level" that arbitrary-precision numbers are supported; JSON is a serialization format that can meaningfully serialize arbitrary-precision decimal numbers. Tooling may be limited in how it manages to serialize and deserialize various data to and from JSON; I think that's what you mean about both ES6 and Oracle. But that's about tooling, not about JSON itself.

Of course I agree that anyone wanting to produce JSON that is widely useful should limit numbers to those expressible with IEEE-754. But JCS is not producing JSON from some original data source---it is a JSON-serialization-to-JSON-serialization translation. It should, in my opinion, work with any valid JSON (well, at least I-JSON). Especially since I think there is no strong technical obstacle to doing so.

cyberphone commented 6 years ago

I may be interpreting your request wrongly, but it seems that you want JCS to respect higher precision while keeping the range limitation? My current JCS implementations do not reject higher-precision numbers; they just don't see any difference between
  1.3333333333333333
and
  1.333333333333333333333333333333
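
That collapse can be checked directly. A minimal Java illustration (using Double.parseDouble as a stand-in for an ES6 engine's number parsing):

```java
public class SameDouble {
    public static void main(String[] args) {
        // Both literals exceed double precision, so parsing yields the
        // identical IEEE-754 value; a double-based canonicalizer cannot
        // tell them apart.
        double a = Double.parseDouble("1.3333333333333333");
        double b = Double.parseDouble("1.333333333333333333333333333333");
        System.out.println(a == b); // prints "true"
    }
}
```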

I could surely add this text.

However, I don't feel particularly tempted to make JCS incompatible with JSON.parse() and JSON.stringify() when there is an already widely deployed solution/workaround that addresses both precision and range.

Another way of dealing with this issue would be my original 2013 scheme, which simply required that the textual values be kept, effectively reducing canonicalization of JSON primitives to a no-op.

JCS inarguably represents a compromise. There will always be complaints about "crippled JSON", but as far as I can tell nobody has defined an IETF standard using JSON outside of I-JSON/JS.

Or as somebody recently wrote in an ES forum: JSON = JavaScript Object Notation

tupelo-schneck commented 6 years ago

My intention was to support arbitrary precision and arbitrary range. Any number expressible in JSON should be transformable using JCS, as much as possible without changing the semantics.

I would not recommend a no-op, as it seems clear that 100 and 1.00e2 should be canonically equivalent.

Let me see if I can understand what you say about JSON.parse and JSON.stringify. Currently I suppose JCS is effectively the same as taking an input JSON text S, running JSON.stringify(JSON.parse(S)) in an ES6-compatible JavaScript engine, and then recursively reordering the properties of objects as needed. Is that correct?

I agree that that is nice. I'm still inclined to say though that you could have an algorithm based on section 7.1.12.1 of ES6 which works for numbers of arbitrary precision and range expressible in JSON serialization. This algorithm would be compatible with JSON.stringify(JSON.parse(S)) for numbers within the precision and range of JavaScript numbers, but would also transform any number expressible in JSON in a meaning-preserving way.

And do recall that I-JSON allows (if reluctantly) all such numbers. It's true that anyone who wants can produce JSON which encodes numbers of unusual size or precision via strings; it's true that it may be a good idea. But JCS is a JSON-to-JSON transformer... as much as possible it should work with any valid JSON (at least I-JSON) it is given.
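
The "recursively reordering the properties" step mentioned above can be sketched on its own. A hedged Java sketch, assuming a generic Map/List tree as produced by a typical JSON parser (not the JCS reference code); TreeMap's natural String ordering compares UTF-16 code units, which matches the property ordering JCS specifies:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortMembers {
    // Recursively rebuild a parsed JSON tree with object members in sorted
    // key order; arrays and primitives are passed through unchanged.
    @SuppressWarnings("unchecked")
    static Object sortMembers(Object node) {
        if (node instanceof Map) {
            Map<String, Object> sorted = new TreeMap<>();
            for (Map.Entry<String, Object> e : ((Map<String, Object>) node).entrySet()) {
                sorted.put(e.getKey(), sortMembers(e.getValue()));
            }
            return sorted;
        }
        if (node instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object e : (List<Object>) node) {
                out.add(sortMembers(e)); // array element order is preserved
            }
            return out;
        }
        return node; // strings, numbers, booleans, null
    }
}
```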

cyberphone commented 6 years ago

@tupelo-schneck What you are talking about is part of a much bigger issue, where one recent proposal (https://mail.mozilla.org/pipermail/es-discuss/2018-July/051384.html) suggests serializing BigInt as a string "/BigInt(10)/". Other people in the very same forum propose using the ES6-specific 'n' notation, 10n.

With respect to your proposal (and you are not alone...): Personally, I have yet to find a single advantage, beyond saving a couple of double quotes, in using a single type for everything from byte to BigNumber. Seen strictly from an interchange point of view, there is no reason for having explicit types for primitives; XML is proof of that.

Anyway, in addition to breaking ES6, there are other downsides to following the JSON standard to its fullest. Dynamic parsing would require using BigNumber for every JSON number in order not to lose precision or range. C#/.NET doesn't even have a BigNumber type.

In the unlikely event that the JSON community agrees on serialization, I would consider a JCS2 😀

It has also been claimed that JSON is unsuitable as interchange format and should be replaced by CBOR or protobuf.

tupelo-schneck commented 6 years ago

"Breaking ES6" isn't right---writing a full-number-preserving canonicalization function mapping JSON text to JSON text is still possible in ES6, and the JSON.parse results of input and output will still match, but you could no longer just use JSON.parse to write the canonicalization function, because JSON.parse is lossy.

I would summarize your stance as: by restricting the defined canonicalization to JSON where all numbers fit in IEEE-754, it becomes much easier to write canonicalizers in various languages---almost trivial in JavaScript, no need to worry about BigNumber types, etc. And since normal people don't use non-IEEE-754 numbers in their JSON, and anyone can always just use strings, there's no practical disadvantage. Correct me if I misrepresent you.

I guess that's fair.

cyberphone commented 6 years ago

What you requested is 100% correct from a canonicalization point of view, since it would properly deal with normalization of arbitrary numbers, which obviously won't happen if certain (large) numbers are put between quotes.

The naked truth is that JCS is only intended to facilitate a securely "hashable" version of a JSON object. As I wrote, my original proposal didn't even bother with normalizing numbers at all; it only preserved the textual representation, which accomplishes the same goal. The JSON WG folks slashed this idea, and in retrospect I'm thankful for that 😉.

There is a strong rationale for keeping JCS aligned with existing JSON tools because a typical (every?) signature scheme using JCS involves manipulation of JSON data during validation: https://github.com/cyberphone/jws-jcs#detailed-validation-operation

Although not stated in the draft, the ultimate goal is that JCS functionality should reside in the platform's JSON serializer as an output option. This is how I implemented it in my own Java based JSON tools.

cyberphone commented 6 years ago

I'm closing this issue because the JCS scope is currently limited to I-JSON.

Looking a bit more into the subject, I do not believe the JCS/ES6 number normalization algorithm would be practical, because it is actually closely tied to IEEE double precision. A canonicalization system for arbitrary numbers in JSON notation would, as far as I can tell, have to work on a textual level only. This is quite different from the current solution (defined by ECMA), which knows the "exact" (for IEEE double precision) representation of certain values: https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-01#appendix-B

A text-level canonicalizer would use a much simpler scheme like rewriting every number to exponential format and removing trailing and leading zeros. Keeping an integer range like in JCS/ES6 would IMO just be confusing.
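
Such a scheme would indeed fit in a few lines. A hedged sketch of the idea (an illustration only, not anything from the JCS draft), using exactly one digit before the decimal point and no '+' in the exponent:

```java
import java.math.BigDecimal;

public class ExponentialForm {
    // Hypothetical "always exponential" canonical form: strip trailing
    // zeros, keep one digit before the decimal point, bare exponent.
    static String exponential(BigDecimal v) {
        if (v.signum() == 0) return "0e0";
        BigDecimal stripped = v.stripTrailingZeros();
        String digits = stripped.unscaledValue().abs().toString();
        // Decimal exponent after normalizing to d.ddd form.
        int exp = digits.length() - 1 - stripped.scale();
        StringBuilder sb = new StringBuilder();
        if (stripped.signum() < 0) sb.append('-');
        sb.append(digits.charAt(0));
        if (digits.length() > 1) sb.append('.').append(digits, 1, digits.length());
        return sb.append('e').append(exp).toString();
    }
}
```

Under this sketch, 100 and 1.00e2 both canonicalize to 1e2, giving the equivalence a no-op would miss.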

The problem you will run into here is that you cannot use existing JSON parsers and get predictable results. For signature schemes, which are the primary application for JCS, that would be a showstopper.

Related: https://github.com/cyberphone/es6-bigint-json-support#json-support-for-bigint-in-es6

tupelo-schneck commented 6 years ago

Fine. I do think that being able to describe JCS as "get an ES6 engine, run JSON.stringify(JSON.parse(input)), but recursively reorder object attributes" seems compelling.

Do note however that strictly speaking I-JSON does allow numbers outside IEEE double precision, so JCS is introducing a further restriction.

It's true that ES6 number serialization is about IEEE double precision numbers, since that is what ES6 uses, but the algorithm as written (in ES6 section 7.1.12.1) doesn't deeply depend on that. It is easy to use that algorithm to write a textual JSON number converter that is fully compatible with JSON.stringify(JSON.parse(input)) for numbers which are within IEEE double precision bounds.

There are plenty of existing JSON parsers which don't assume IEEE double precision.... Java's Gson and Jackson certainly don't. But I don't know what the field is like for JavaScript, and JSON.parse is of course canonical.

cyberphone commented 6 years ago

Just for my curiosity: Why would you use a more complex algorithm than the (ultra-simple) one I suggested?

tupelo-schneck commented 6 years ago

To your suggestion you'd of course have to add that you use exponential format with exactly one digit to the left of the decimal point (so not 11e2 but 1.1e3). And you'd need to specify various details like using e instead of E, including + after e, and using 0 (or 0e0?) instead of -0.

At that point, all you need for ES6 7.1.12.1 compatibility (and thus JSON.stringify compatibility for "sensible" numbers) is to specify that if the exponent is between -6 and 20 inclusive, the number should be serialized without an exponent.

I believe that's not a great deal of complexity to add. Benefits include compatibility with existing ES6 serialization and also readability.

cyberphone commented 6 years ago

Your additions to the algorithm are of course entirely correct. It is still a simple algorithm, though.

However, creating an ES6-compatible number serializer is actually quite complex. Feel free to try; I have 100 million test values to offer: https://github.com/cyberphone/json-canonicalization/tree/master/testdata

Anyway, your system would (for compatibility declarations) still be divided in two:

This makes me wonder if you shouldn't consider a draft of your own which addresses the Full version. I'm not going there, because I see no advantage in using a single type for everything from a single bit to BigNumber. No other data interchange format uses that, and XML doesn't have explicit typing at all.

Regarding readability, it is IMO unimportant since you would probably never use canonicalized data on "the wire". Canonicalization is an internal operation.

Due to that, if I were to create a Full version requiring specific parsers (because it operates on text-level), I would not consider ES6 serialization.

cyberphone commented 6 years ago

Well, you could (for a Full version), at the cost of a few additional lines, let values that do not contain a fraction and use fewer than 21 digits be expressed as integers. That is, 1.456e+3 would be converted into 1456, while 1.4562e+3 would stay as is.

This scheme is similar to, but not compatible with, ES6, since the latter needs a lot of slow and difficult binary-level operations in order to find the shortest representation that is meaningful for IEEE 754, including support for non-normalized numbers having down to 1 digit of precision.
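
The integer rule described above can be sketched with BigDecimal (an illustration of the rule as stated, not code from this repository):

```java
import java.math.BigDecimal;

public class IntegerRule {
    // Render values with no fractional part and fewer than 21 digits as
    // plain integers; return null to signal "keep the exponential form".
    static String asIntegerOrNull(BigDecimal v) {
        BigDecimal stripped = v.stripTrailingZeros();
        // scale <= 0 means no fraction; precision - scale is the digit count.
        if (stripped.scale() <= 0 && stripped.precision() - stripped.scale() < 21) {
            return stripped.toBigIntegerExact().toString();
        }
        return null;
    }
}
```

For example, 1.456e+3 yields "1456", while 1.4562e+3 (which has a fraction) falls through to the exponential form.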

cyberphone commented 6 years ago

If you replaced the two places where "double" is referred to in this file (only) with a total of 100-200 lines of fairly mundane code, you would have a canonicalizer for the Full case: https://github.com/cyberphone/json-canonicalization/blob/master/java/canonicalizer/src/org/webpki/jcs/JsonCanonicalizer.java It would be blazingly fast compared to the JCS version as well 😃

tupelo-schneck commented 6 years ago

I don't quite understand.

Let me reiterate that I'm willing to accept a canonicalizer that is "JSON.stringify(JSON.parse(input)) and re-order object properties", and I certainly don't want the world to have multiple JSON canonicalizers. But for the sake of understanding:

I couldn't find 100 million test values, only a handful of small files at https://github.com/cyberphone/json-canonicalization/tree/master/testdata/input. Let me know where to look.

Meanwhile, my canonicalizer failed your test, since it canonicalizes 333333333.33333329 to 333333333.33333329, not to 333333333.3333333. But that is of course the point: this number has more than the promised 15 digits of precision (in the IEEE-754 normalized range) and so is a number on which my proposed canonicalizer doesn't agree with yours.

In fact, is this even an allowed input for JCS? The current spec draft states 'Data of the type "Number" MUST be expressible as IEEE-754 [IEEE754] double precision values.' That is not true of 333333333.33333329, is it?

The Java code for my proposed canonicalizer is pretty short:

import java.math.BigDecimal;

// Follows the shape of ES6 7.1.12.1: a digit string s of k digits (trailing
// zeros removed) and a decimal-point position n determine the layout.
static String canonicalize(BigDecimal m) {
    int signum = m.signum();
    if (signum == 0) return "0";
    if (signum < 0) return "-" + canonicalize(m.negate());
    int nMinusK = -m.scale();
    String s = m.unscaledValue().toString();
    // Strip trailing zeros from the digit string, adjusting n - k to match.
    for (int i = s.length() - 1; i >= 0; i--) {
        if (s.charAt(i) != '0') {
            nMinusK += s.length() - (i + 1);
            s = s.substring(0, i + 1);
            break;
        }
    }
    int k = s.length();
    int n = nMinusK + k;
    StringBuilder sb = new StringBuilder();
    if (k <= n && n <= 21) {
        // Integer: pad with n - k zeros.
        sb.append(s);
        for (int i = 0; i < n - k; i++) {
            sb.append("0");
        }
    } else if (0 < n && n <= 21) {
        // Decimal point falls inside the digit string.
        sb.append(s);
        sb.insert(n, ".");
    } else if (-6 < n && n <= 0) {
        // Small magnitude: "0." followed by -n zeros, then the digits.
        sb.append("0.");
        for (int i = 0; i < -n; i++) {
            sb.append("0");
        }
        sb.append(s);
    } else {
        // Exponential notation with exponent n - 1.
        sb.append(s);
        if (k > 1) {
            sb.insert(1, ".");
        }
        sb.append("e");
        if (n - 1 > 0) {
            sb.append("+");
        } else {
            sb.append("-");
        }
        sb.append(Math.abs(n - 1));
    }
    return sb.toString();
}
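
For what it's worth, the s/k/n decomposition used above comes straight from BigDecimal's unscaled-value/scale pair. For the 333333333.33333329 number discussed earlier:

```java
import java.math.BigDecimal;

public class Decompose {
    public static void main(String[] args) {
        // ES6 7.1.12.1 reasons about a digit string s of k digits and a
        // decimal-point position n; BigDecimal exposes both directly.
        BigDecimal m = new BigDecimal("333333333.33333329");
        String s = m.unscaledValue().toString(); // "33333333333333329"
        int k = s.length();                      // 17 significant digits
        int n = k - m.scale();                   // 17 - 8 = 9 digits before the point
        System.out.println(k + " " + n);         // prints "17 9"
    }
}
```

Since 0 < n <= 21, the canonical form inserts the decimal point after 9 digits, reproducing the input exactly.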
cyberphone commented 6 years ago

I can only reiterate that, for a canonicalizer operating at the text level, the ES6 number serializing algorithm (even in a "JS Crippled" version) doesn't seem like the right solution. I'm BTW not alone in thinking that: https://gibson042.github.io/canonicaljson-spec/

The thing is that JCS doesn't operate at the text level. That would have to go if I were to update the spec according to your wish. IMO, that's a brand-new spec!

https://github.com/cyberphone/json-canonicalization/tree/master/testdata#test-data

tupelo-schneck commented 6 years ago

I have been convinced that your solution is reasonable.

I think it's worth being clear though that JCS does define a JSON-text to JSON-text conversion, so in that sense it operates at the text level. JSON text in, JSON text out. But that text-to-text function is defined by reference to IEEE-754 and it is assumed and encouraged that implementors will use native support for double-precision floating point.

FWIW, my alternate arbitrary number canonicalizer does pass your 100-million test case (in the sense of canonicalizing each JSON number to itself). That's no surprise, as each of those numbers is, by construction, the ES6 stringification of a number representable in double-precision IEEE-754.

cyberphone commented 6 years ago

Thanx.

These internal versus text-level modes of operation have implications for the code. My goal (which may not be achievable...) is that canonicalization would eventually be featured (only) in JSON serializers as an output option.

Since this is currently not a reality I have created a number of free-standing "dumb" canonicalizers that indeed read JSON text. However, since they are "dumb" they are still meant to be used with existing JSON tools as shown in this "add on" specification: https://github.com/cyberphone/jws-jcs#detailed-validation-operation

A major problem is that the industry is divided with regard to number representation. If this changes (in ES6), I may need to "adjust" my vision 🤔