FasterXML / jackson-core

Core part of Jackson that defines Streaming API as well as basic shared abstractions
Apache License 2.0
2.25k stars 774 forks source link

Add support to get canonical JSON output #1037

Open digulla opened 1 year ago

digulla commented 1 year ago

I often need to compare the output of some REST API endpoint against a "known correct result" file.

This allows me to share the file with business, QAs and customers to get early feedback. It also helps them to keep track of changes since they get to see the current result without having to deploy and run the application.

For this to be efficient, I need a way to update the file when the output changes.

Right now, I can use JsonNode.equals() to check a whole tree for any changes but that doesn't tell me exactly where and what those changes are.

I can create an ObjectMapper which sorts (most) keys and which serializes BigDecimal consistently but there are still several corner cases:

I think the ideal solution would be a factory method to build a CanonicalJsonGenerator. It must

Optionally, it would be great to have a DefaultPrettyPrinter where _objectFieldValueSeparatorWithSpaces is just colon-space instead of space-colon-space. Maybe Separators should be extended to handle this case there.

I also like to convert date&time to ISO to make it easier to understand the values.

My current setup for reference is below. With this setup, I have to do this before I can compare the JSON strings:

class DefaultPrettyPrinterForTests : DefaultPrettyPrinter() {
    override fun withSeparators(separators: Separators): DefaultPrettyPrinter {
        _separators = separators
        _objectFieldValueSeparatorWithSpaces = separators.objectFieldValueSeparator + " "
        return this
    }
}

private static DefaultIndenter STABLE_INDENTER = new DefaultIndenter("    ", "\n")
private static DefaultPrettyPrinter PRETTY_PRINTER = new DefaultPrettyPrinterForTests()
            .withObjectIndenter(STABLE_INDENTER)

public static ObjectMapper forTests() {
    return Jackson2ObjectMapperBuilder()
        .featuresToEnable(
            /* Avoid 1.23445678E8 instead of 123445678 in the output */
            JsonGenerator.Feature.WRITE_BIGDECIMAL_AS_PLAIN,
            SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS,
            MapperFeature.SORT_PROPERTIES_ALPHABETICALLY
        )
        .featuresToDisable(
            /** Serialize all dates and times in ISO format */
            SerializationFeature.WRITE_DURATIONS_AS_TIMESTAMPS,
            SerializationFeature.WRITE_DATES_AS_TIMESTAMPS,
            SerializationFeature.WRITE_DATE_KEYS_AS_TIMESTAMPS,
            SerializationFeature.WRITE_DATES_WITH_CONTEXT_TIME_ZONE
        )
        .indentOutput(true)
        .postConfigurer( mapper ->
            mapper.setDefaultPrettyPrinter(PRETTY_PRINTER)
        )
        .build<ObjectMapper>()
}
cowtowncoder commented 1 year ago

First of all, yes, there is need for some sort of Canonical output. Possibly (ideally?) using a well-defined, existing specification that outlines requirements -- I have vague recollection such exists (possible more than one, would need to choose).

I don't think this should be done at JsonGenerator level as its responsibility should be simple (?) encoding of token streams. It does not buffer anything (for JSON anyway). Rather, something feeding it should do this: probably using Tree Model (JsonNode). Could be either via ObjectMapper or add-on, no strong opinion.

Since ObjectNode does preserve ordering, could be something that simple transforms a JsonNode into canonical equivalent (creating new instances).

This could also be an extension module, possibly providing either new type(s), or simply overriding default serializer of JsonNode (etc) with custom one that transforms instances on the fly.

digulla commented 1 year ago

Re canonical JSON: I found this: https://gibson042.github.io/canonicaljson-spec/ Not sure how relevant it is.

I think this functionality doesn't belong inside of ObjectMapper since it's not related to Java <-> JSON mapping. It's a pure JSON functionality. Which is why I thought to return a new generator from JsonFactory since this isn't about JSON or JsonNode but about how the data is rendered.

So it's on top of JsonGenerator but below ObjectMapper (or to think of it in another way: It's useful without data binding).

I agree the sorting of the keys can be done outside of the generator in a preparation step. But the rest (handling of trailing zeros, pretty printing) is already handled in JsonGenerator.

As a developer, I would prefer something like JsonFactory.enable(CANONICAL_JSON).createGenerator(...) which then Just Works(TM). The goal here is to have a simple setup so when you have to change something because of bug reports, developers don't have to change their code. Also, this is the place where I would start looking for such a feature. From my point of view, this feels most natural.

If this is implemented this as a JsonGeneratorDelegate, developers can then add filtering and all the other good stuff.

The other option is a special CanonicalJsonFactory but I'm worried how much code duplication that would mean and how many other APIs would need changes.

Maybe a simple solution for the sorting: Let JsonNodeFactory build new children maps. That would allow to return a TreeMap or some custom implementation. Call the factory method in the constructor of ObjectNode. That would allow to create sorted copies by calling root.deepCopy().

If deepCopy() would delegate more to JsonNodeFactory, this would allow to strip trailing zeros for BigDecimals (unless you want to add a feature flag for this). Alternatively, deepCopy() could accept a NodePostProcessor but I'm not sure how well that would fit into your design.

All that's then missing is a CanonicalJsonPrettyPrinter.

cowtowncoder commented 1 year ago

Well, from my perspective it certainly does not belong to Streaming API as it cannot be implemented in streaming manner.

But I disagree on it not fitting in jackson-databind, in the sense that JsonNode is already included there -- and that is not a binding really, but a JSON-centric object representation. One could definitely argue if that should belong to either in jackson-core (loosening Streaming aspect) -- something many users would like, fwtw -- or in a new package. But from purely practical perspective JsonNode won't be moving into jackson-core; and the most likely place for canonical handler would either be in jackson-databind or as a separate "module" (whether proper Jackson Module or just general "module") and repo.

So: to make it simple -- I will not add canonical JSON handling at level of JsonParser/JsonGenerator/JsonFactory. But I'd be happy to help support another approach to providing support for Canonical JSON.

Another thing to consider, too, is whether Canonical handling would be useful for other dataformats. I would assume that for some (binary JSON ones, CBOR and Smile) it would make sense; and for most others (CSV, XML, Protobuf etc) not.

digulla commented 1 year ago

Ah, now I get where you're at. My ideas were purely from a developer perspective. I was assuming JsonNode is part of core, so I never checked. I don't want to move JsonNode either, I'm looking for the cheapest solution (least amount of new code) :-)

What do you think about my proposal to move more code from ObjectNode to JsonNodeFactory so sorting and BigDecimal processing can be done with the help of a special JsonNodeFactory? There are already seveal comments in the code for ObjectNode to this effect.

That might have the additional benefit that it might be useful for other use cases.

The other option would be a tree visitor which then delegates to a JsonGenerator. That would use less memory and it would be very clean solution. I feel that it would be a lot of new code, though.

cowtowncoder commented 1 year ago

Right, if JsonNode did exist in jackson-core things would be bit different.

Unfortunately no, JsonNodeFactory is purely for constructing "empty" instances; it is not designed to have any logic. Partly the issue is that it doesn't take in any configuration (which is probably and oversight). But more fundamentally I guess JsonNodeFactory is used on deserialization, not serialization.

One could probably register a custom JsonSerializer for JsonNode (and/or ObjectNode), but serialization of JsonNode is handled by methods in JsonNode and subtypes that implement JsonSerializable. So default logic (which is quite simple TBH) is not really modular. Or at least less so than deserialization.

Delegates for JsonGenerator sound similarly like non-optimal as they'd need to re-construct ordering, adding buffering; it seems like backwards approach as well (although to users would be modular; just tricky and inefficient to implement).

I may be missing some other approaches.

digulla commented 1 year ago

This sounds like the best option is then to preconfigure MapperBuilder.

Right now, MapperBuilder isn't opinionated. Would you prefer a

  1. new method in there,
  2. an extension class CanonicalMapperBuilder
  3. a configurer (which takes an existing builder and applies presets)

The first one will add a lot of code in the existing class which doesn't really fit into the existing desing.

The second one isn't very flexible since you have to change your existing code to use the new type.

The last one is very flexible, isolates concerns but (as far as I can tell) would add a new concept.

digulla commented 1 year ago

I've started with unit tests and a bit of code. It's more ugly than I hoped. You can have a look here: https://github.com/FasterXML/jackson-databind/pull/3963

I left TODOs in all places where something was unclear or I think the solution could be better.

One other thing that I just noticed: The pipeline tried to deploy the Java 8 build! See https://github.com/digulla/jackson-databind/actions/runs/5168255885/jobs/9309721753

I hope this happens in my repo and not in the one from Jackson. Can I turn this off or should I just ignore it?

cowtowncoder commented 1 year ago

@digulla Ok on deploy... that is weird. It should filter out PRs and not try deploy. Feel free to ignore it, it's not your fault; main pipeline does need to do better job.

cowtowncoder commented 1 year ago

@digulla One quick not on canonical JSON (from https://gibson042.github.io/canonicaljson-spec/) vs changes so far. It seems to me that:

MUST NOT include insignificant (i.e., inter-token) whitespace 

would work without Pretty Printer, and the additional configuration that adds single white-space after : in Objects would make output not-Canonical wrt spec? (but this is fine for purposes of "a" canonical representation purely for comparison purposes).

I suppose similarly the requirement to always force scientific notation (Spec) is different from your preferences.

This leads to something I realized: maybe Canonicalizer (however implemented) should have some configurability wrt specific commonly desired aspects?

digulla commented 1 year ago

I think there will be two main use cases:

For the former, we want a compact format. For the latter, we want pretty printing.

Ideally, there should be a builder which presets everything for the two cases and then configuration methods or support configurers so developers can tweak the standard.

For example, I would really like to be able to configure the number formatter. There are several useful ways to present numbers:

The first one is required by the standard. I'm not sure how useful that format is in real life. The format is also prone to rounding errors while parsing.

The second form is easy to understand for human and it's also prone to rounding errors during parsing.

The third form is a compromise between readability and keeping rounding errors in check since you can first read the number as (big)integer and then just multiply by a power of 10.

The last form represents the number perfectly, is very fast to parse but maybe not portable between, say, Java and JavaScript.

That's why I'm trying to add a generic API to format&parse numbers to Jackson.

cowtowncoder commented 1 year ago

Yeah I don't think there will ever be generic API for number formatting, since that inevitably would result in additional overhead and complexity.

Worse, since conversions between binary (float, double) and decimal (BigDecimal, textual JSON) representations are lossy, there are also problems with accuracy if attempting to unify handling.

If limiting canonical handling to be accessed using JsonNode things are slightly simpler as we can at least force use of BigDecimal and then probably force scientific/plain output.

Then again, if only using JsonNode it seems odd having to configure ObjectMapper.

I guess I am still not quite sure what a workable solution should look like.