Closed rebello95 closed 11 months ago
From the spec, https://protobuf.dev/programming-guides/proto3/#maps -
Wire format ordering and map iteration ordering of map values are undefined, so you cannot rely on your map items being in a particular order.
I agree that the default implementation should not incur the extra cost of producing deterministically ordered map keys, but there is value in providing the option to produce deterministic outputs, which is why this can be configured in several other Protobuf implementations:
If desired, libraries can go further than the spec in providing customization options. The proposed option is not in conflict with the spec.
Sorry, meant to hold that and finish filling it in later, but accidentally posted just the reference.
Just wanted to include that it shouldn't be the default. TextFormat also has the need to sort, so that might be a reference for doing it (haven't looked at the PR).
Sorry, meant to hold that and finish filling it in later, but accidentally posted just the reference. Just wanted to include that it shouldn't be the default.
No worries. I agree with disabling by default.
TextFormat also has the need to sort, so that might be a reference for doing it (haven't looked at the PR).
Thanks, this is a helpful tip. I updated the PR to use a similar strategy.
As an aside, I'd like to follow this change up with a similar one for BinaryEncodingVisitor
which I believe has the same limitation.
I'm curious: Why is this order so important to you?
Is this because of how you are testing? Or do you have some other concern?
(The reason I ask: I've been considering a change to Binary Encoding that would provide a large performance improvement, but would change the order of fields.)
Ah. I re-read the original comment. That makes sense. For testing, I've long wanted someone to write a function that does actual correct comparisons of JSON blobs, independent of order. That would simplify some of SwiftProtobuf's testing logic and make it easier to add new tests for more complex scenarios.
Yep, it would be good for testing, and the primary use case I have is to unlock caching when using unary HTTP GET requests with the Connect protocol (spec). For caching to actually work consistently, the request query for a semantically identical request message
needs to be the same between requests.
Yep, it would be good for testing ...
I disagree, which is why I outlined my desire for a better way to compare two distinct JSON blobs.
I disagree, which is why I outlined my desire for a better way to compare two distinct JSON blobs.
Fair enough, I misinterpreted your initial statement about testing.
Are we all in agreement that adding this proposed option is reasonable given the use case for deterministic cacheable requests?
There is an alternative for your use case: to layer a JSON "canonicalizer" over the output. I love the separation of concerns there but it would be nice to know if that works for you.
There is an alternative for your use case: to layer a JSON "canonicalizer" over the output. I love the separation of concerns there but it would be nice to know if that works for you.
I assume you mean post-processing the outputs provided by SwiftProtobuf's serializer to make them deterministic? My thoughts on that are:
Putting on my hat of "google lead for protobuf" for a moment, I think that deterministic options are reasonable. I think we should very much not make them the default and should even randomize things that we intend to have non-deterministic (to the extent practical, sometimes other concerns make this not work). But having well named options to provide determinism are good.
The "canonicalizer" idea would work fine for testing, but it doesn't sound like a good idea for @rebello95's use case. Deterministic options sound like a good way forward for JSON, as long as "deterministic" is appropriately caveated:
In particular, as mentioned above, we've been discussing changing the order of fields in binary encoding to make things faster.
My fear: someone who thinks that a hash of a "deterministic" serialized form is a good database key. ☹️
Thank you @fowles 👍🏽
Deterministic options sound like a good way forward for JSON, as long as "deterministic" is appropriately caveated:
- It does not mean "sorted"
- It promises stability in the medium term, but is not a long-term guarantee (we can change the "deterministic" order next year if we think it would help us meet other goals)
+1. Other languages have explicit caveats in their documentation (for example, C++) which speak to both of your points. I think these warnings are probably also reasonable discouragement against doing things like using them as database keys. The PR I proposed includes this disclaimer.
The Go implementation's documentation also clarifies the guarantees provided in a way that should meet everyone's needs:
// Deterministic controls whether the same message will always be // serialized to the same bytes within the same binary. // // Setting this option guarantees that repeated serialization of // the same message will return the same bytes, and that different // processes of the same binary (which may be executing on different // machines) will serialize equal messages to the same bytes. // It has no effect on the resulting size of the encoded message compared // to a non-deterministic marshal. // // Note that the deterministic serialization is NOT canonical across // languages. It is not guaranteed to remain stable over time. It is // unstable across different builds with schema changes due to unknown // fields. Users who need canonical serialization (e.g., persistent // storage in a canonical form, fingerprinting, etc.) must define // their own canonicalization specification and implement their own // serializer rather than relying on this API. // // If deterministic serialization is requested, map entries will be // sorted by keys in lexographical order. This is an implementation // detail and subject to change.
https://pkg.go.dev/google.golang.org/protobuf@v1.31.0/proto#MarshalOptions
@rebello95's fear is totally real and people even do it for the non-canonical forms. When talking about stability it is important to quantify several distinct types:
For wireformat, there is a notional "canonical" encoding that produces the last one but you can only achieve it if you know the exact schema of the data.
For json, there could be a "canonical" encoding defined around sorting, but I don't think it has been explicitly written out.
I suspect json wants 4 or 5 to provide useful caching.
There is a json rfc that makes an attempt to define "canonical," unrelated to protobuf.
Anyway I asked this question mostly because it's good to call out rejected alternatives.
Thank you all for working through this!
@thomasvl do you know when the next release cut will be?
Thank you all for working through this!
@thomasvl do you know when the next release cut will be?
This lands things on main
, which is working towards 2.0. There would need to be cherry picks to the 1.x branch if the desire is to have it in a release sooner.
Ah okay, yes it'd be nice to have it included in an earlier release. Would you like me to open cherry pick PRs to 1_x_release_branch
or are you able to add the commits?
If you have the time, go for it (I'm not sure when I'll have the time (or any of the Apple folks will)). There likely will be minor edits needed because of how things have changed on main
.
Problem
It's not currently possible to deterministically produce JSON data for types that include Protobuf
map
fields since the keys in those maps are not sorted. For example:Outputs:
Use case
When sending Connect Unary GET requests, the serialized payload passed in the URL's query parameters must be deterministic in order to properly enable effective caching. Related: https://github.com/connectrpc/connect-swift/issues/196
Proposal
Add a
useDeterministicOrdering: Bool = false
field toJSONEncodingOptions
(similar to what Protobuf for Go has) which, when set, causesJSONEncodingVisitor.visitMapField(...)
to sort keys before iterating through and serializing them.