apple / swift-protobuf

Plugin and runtime library for using protobuf with Swift
Apache License 2.0
4.56k stars 449 forks source link

Deterministic JSON serialization for `map` types #1477

Closed rebello95 closed 11 months ago

rebello95 commented 12 months ago

Problem

It's not currently possible to deterministically produce JSON data for types that include Protobuf map fields since the keys in those maps are not sorted. For example:

for _ in 0..<10 {
    var message = Some_Message.with { message in
        message.other = [
            "aaa": "abc",
            "111": "abc",
            "eee": "baz",
            "bbb": "baz",
            "ccc": "baz",
            "ddd": "baz",
        ]
    }
    message.other["0new"] = "abc"
    print(String(data: try message.jsonUTF8Data(), encoding: .utf8)!)
}

Outputs:

{"other":{"ddd":"baz","ccc":"baz","eee":"baz","111":"abc","0new":"abc","aaa":"abc","bbb":"baz"}}
{"other":{"111":"abc","bbb":"baz","ddd":"baz","0new":"abc","aaa":"abc","ccc":"baz","eee":"baz"}}
{"other":{"ddd":"baz","ccc":"baz","eee":"baz","111":"abc","0new":"abc","aaa":"abc","bbb":"baz"}}
{"other":{"111":"abc","bbb":"baz","ddd":"baz","0new":"abc","aaa":"abc","ccc":"baz","eee":"baz"}}
{"other":{"ddd":"baz","ccc":"baz","eee":"baz","111":"abc","0new":"abc","aaa":"abc","bbb":"baz"}}
{"other":{"111":"abc","bbb":"baz","ddd":"baz","0new":"abc","aaa":"abc","ccc":"baz","eee":"baz"}}
{"other":{"ddd":"baz","ccc":"baz","eee":"baz","111":"abc","0new":"abc","aaa":"abc","bbb":"baz"}}
{"other":{"111":"abc","bbb":"baz","ddd":"baz","0new":"abc","aaa":"abc","ccc":"baz","eee":"baz"}}
{"other":{"ddd":"baz","ccc":"baz","eee":"baz","111":"abc","0new":"abc","aaa":"abc","bbb":"baz"}}
{"other":{"111":"abc","bbb":"baz","ddd":"baz","0new":"abc","aaa":"abc","ccc":"baz","eee":"baz"}}

Use case

When sending Connect Unary GET requests, the serialized payload passed in the URL's query parameters must be deterministic in order to properly enable effective caching. Related: https://github.com/connectrpc/connect-swift/issues/196

Proposal

Add a useDeterministicOrdering: Bool = false field to JSONEncodingOptions (similar to what Protobuf for Go has) which, when set, causes JSONEncodingVisitor.visitMapField(...) to sort keys before iterating through and serializing them.

thomasvl commented 12 months ago

From the spec, https://protobuf.dev/programming-guides/proto3/#maps -

Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order.

rebello95 commented 12 months ago

I agree that the default implementation should not incur the extra cost of producing deterministically ordered map keys, but there is value in providing the option to produce deterministic outputs, which is why this can be configured in several other Protobuf implementations:

If desired, libraries can go further than the spec in providing customization options. The proposed option is not in conflict with the spec.

thomasvl commented 12 months ago

Sorry, meant to hold that and finish filling it in later, but accidentally posted just the reference.

Just wanted to include that it shouldn't be the default. TextFormat also has the need to sort, so that might be a reference for doing it (haven't looked at the PR).

rebello95 commented 12 months ago

Sorry, meant to hold that and finish filling it in later, but accidentally posted just the reference. Just wanted to include that it shouldn't be the default.

No worries. I agree with disabling by default.

TextFormat also has the need to sort, so that might be a reference for doing it (haven't looked at the PR).

Thanks, this is a helpful tip. I updated the PR to use a similar strategy.

rebello95 commented 12 months ago

As an aside, I'd like to follow this change up with a similar one for BinaryEncodingVisitor which I believe has the same limitation.

tbkka commented 11 months ago

I'm curious: Why is this order so important to you?

Is this because of how you are testing? Or do you have some other concern?

(The reason I ask: I've been considering a change to Binary Encoding that would provide a large performance improvement, but would change the order of fields.)

tbkka commented 11 months ago

Ah. I re-read the original comment. That makes sense. For testing, I've long wanted someone to write a function that does actual correct comparisons of JSON blobs, independent of order. That would simplify some of SwiftProtobuf's testing logic and make it easier to add new tests for more complex scenarios.

rebello95 commented 11 months ago

Yep, it would be good for testing, and the primary use case I have is to unlock caching when using unary HTTP GET requests with the Connect protocol (spec). For caching to actually work consistently, the request query for a semantically identical request message needs to be the same between requests.

tbkka commented 11 months ago

Yep, it would be good for testing ...

I disagree, which is why I outlined my desire for a better way to compare two distinct JSON blobs.

rebello95 commented 11 months ago

I disagree, which is why I outlined my desire for a better way to compare two distinct JSON blobs.

Fair enough, I misinterpreted your initial statement about testing.

Are we all in agreement that adding this proposed option is reasonable given the use case for deterministic cacheable requests?

cburrows commented 11 months ago

There is an alternative for your use case: to layer a JSON "canonicalizer" over the output. I love the separation of concerns there but it would be nice to know if that works for you.

rebello95 commented 11 months ago

There is an alternative for your use case: to layer a JSON "canonicalizer" over the output. I love the separation of concerns there but it would be nice to know if that works for you.

I assume you mean post-processing the outputs provided by SwiftProtobuf's serializer to make them deterministic? My thoughts on that are:

  1. This would be nontrivial for binary-serialized payloads (which I'd also like to add a matching option for)
  2. If we were able to get that to work, it'd likely incur significant processing overhead
  3. Many other implementations provide this option, so it feels strange to require bespoke logic specifically for Swift
fowles commented 11 months ago

Putting on my hat of "google lead for protobuf" for a moment, I think that deterministic options are reasonable. I think we should very much not make them the default and should even randomize things that we intend to have non-deterministic (to the extent practical, sometimes other concerns make this not work). But having well named options to provide determinism are good.

tbkka commented 11 months ago

The "canonicalizer" idea would work fine for testing, but it doesn't sound like a good idea for @rebello95's use case. Deterministic options sound like a good way forward for JSON, as long as "deterministic" is appropriately caveated:

In particular, as mentioned above, we've been discussing changing the order of fields in binary encoding to make things faster.

My fear: someone who thinks that a hash of a "deterministic" serialized form is a good database key. ☹️

rebello95 commented 11 months ago

Thank you @fowles 👍🏽

Deterministic options sound like a good way forward for JSON, as long as "deterministic" is appropriately caveated:

  • It does not mean "sorted"
  • It promises stability in the medium term, but is not a long-term guarantee (we can change the "deterministic" order next year if we think it would help us meet other goals)

+1. Other languages have explicit caveats in their documentation (for example, C++) which speak to both of your points. I think these warnings are probably also reasonable discouragement against doing things like using them as database keys. The PR I proposed includes this disclaimer.

akshayjshah commented 11 months ago

The Go implementation's documentation also clarifies the guarantees provided in a way that should meet everyone's needs:

// Deterministic controls whether the same message will always be // serialized to the same bytes within the same binary. // // Setting this option guarantees that repeated serialization of // the same message will return the same bytes, and that different // processes of the same binary (which may be executing on different // machines) will serialize equal messages to the same bytes. // It has no effect on the resulting size of the encoded message compared // to a non-deterministic marshal. // // Note that the deterministic serialization is NOT canonical across // languages. It is not guaranteed to remain stable over time. It is // unstable across different builds with schema changes due to unknown // fields. Users who need canonical serialization (e.g., persistent // storage in a canonical form, fingerprinting, etc.) must define // their own canonicalization specification and implement their own // serializer rather than relying on this API. // // If deterministic serialization is requested, map entries will be // sorted by keys in lexographical order. This is an implementation // detail and subject to change.

https://pkg.go.dev/google.golang.org/protobuf@v1.31.0/proto#MarshalOptions

fowles commented 11 months ago

@rebello95's fear is totally real and people even do it for the non-canonical forms. When talking about stability it is important to quantify several distinct types:

  1. non-deterministic
  2. deterministic with a single run of a binary
  3. deterministic across multiple runs of the same binary
  4. deterministic across different binaries built against the same version of protobuf
  5. deterministic within a single major version of protobuf
  6. deterministic across all time (sometimes called canonical)

For wireformat, there is a notional "canonical" encoding that produces the last one but you can only achieve it if you know the exact schema of the data.

For json, there could be a "canonical" encoding defined around sorting, but I don't think it has been explicitly written out.

I suspect json wants 4 or 5 to provide useful caching.

cburrows commented 11 months ago

There is a json rfc that makes an attempt to define "canonical," unrelated to protobuf.

Anyway I asked this question mostly because it's good to call out rejected alternatives.

rebello95 commented 11 months ago

Thank you all for working through this!

@thomasvl do you know when the next release cut will be?

thomasvl commented 11 months ago

Thank you all for working through this!

@thomasvl do you know when the next release cut will be?

This lands things on main, which is working towards 2.0. There would need to be cherry picks to the 1.x branch if the desire is to have it in a release sooner.

rebello95 commented 11 months ago

Ah okay, yes it'd be nice to have it included in an earlier release. Would you like me to open cherry pick PRs to 1_x_release_branch or are you able to add the commits?

thomasvl commented 11 months ago

If you have the time, go for it (I'm not sure when I'll have the time (or any of the Apple folks will)). There likely will be minor edits needed because of how things have changed on main.

rebello95 commented 11 months ago

https://github.com/apple/swift-protobuf/pull/1487