cms-nanoAOD / correctionlib

A generic correction library
https://cms-nanoaod.github.io/correctionlib/
BSD 3-Clause "New" or "Revised" License
16 stars 22 forks source link

Allowing Nan/Infinity breaks most standard JSON parsers #208

Closed clelange closed 5 months ago

clelange commented 1 year ago

I found that allowing NaN and Infinity breaks standard JSON parsers, e.g. web browsers, probably because it violates https://datatracker.ietf.org/doc/html/rfc8259#section-6 An example file is https://gitlab.cern.ch/cms-nanoAOD/jsonpog-integration/-/blob/master/POG/MUO/2018_UL/muon_Z.json.gz This was introduced in https://github.com/cms-nanoAOD/correctionlib/pull/85 and is actually a feature of rapidjson. It converts the value to std::numeric_limits<double>::infinity(). This is a nice feature, but I'm not sure it's worth the caveat of breaking JSON parsing...

Two questions come to my mind:

  1. Do we need infinity? I would say maybe, since one could also just use a large number such as 999999 or collider centre-of-mass energy, but I understand that "infinity" is nicer.
  2. Is it worth using rapidjson-specific features? I would rather say no.

I would propose to use infinity, but as a quoted, case-insensitive string instead and leave the parsing/value assignment to correctionlib. In this way, the JSON standard would be preserved and one could still open the file in RFC8259-compliant applications. What do you think?

nsmith- commented 1 year ago

It originally came up as a way to handle #83, and yes a large number would work similarly in practice. We could alter the schema and replace basically every float with float | "inf" | "-inf" or similar, but it looks like a pretty invasive change. It would be backwards-compatible, so a minor version schema update would be OK, with a deprecation of using the Infinity keyword in the JSON itself for the eventual v3. Given the effort, I'm actually somewhat inclined to recommend replacing current usage of Infinity with a large number.

nsmith- commented 10 months ago

@IzaakWN given you originally requested the asymmetric flow, do you have any opinions on using a "large number" instead of infinity?

IzaakWN commented 10 months ago

No, not really strong opinion either way. As long as it's large enough to cover all realistic physics use cases (like eta, pT, mass, ...), as well as technical limitations by different data types (mainly in python and C++ I guess) . :)

clelange commented 9 months ago

I think using a large number sounds good. We could recommend something that hopefully will never be reached such as 999999?

nsmith- commented 9 months ago

I was leaning towards implementing float | "inf" | "-inf" but the two approaches are not in conflict with each other.

clelange commented 9 months ago

For my understanding, what would this look like for:

"edges": [
    15.0,
    20.0,
    25.0,
    30.0,
    40.0,
    50.0,
    60.0,
    120.0,
    Infinity
],
nsmith- commented 9 months ago
"edges": [
    15.0,
    20.0,
    25.0,
    30.0,
    40.0,
    50.0,
    60.0,
    120.0,
    "inf"
],
clelange commented 9 months ago

OK, fine for me. Renders OK in browsers.

nsmith- commented 5 months ago

This has gotten more urgent since something changed in pydantic between 2.6.4 and 2.7 that makes the serialization of float('inf') now null, possibly related to https://github.com/pydantic/pydantic-core/issues/872. Regardless, I will add a change to support the proposed syntax of "inf", "+inf", "-inf"