apple / swift-protobuf

Plugin and runtime library for using protobuf with Swift
Apache License 2.0
4.57k stars 449 forks source link

JSON Serialization with default value #861

Open seanliu1 opened 5 years ago

seanliu1 commented 5 years ago

Developing using iOS12, Swift 5, proto3 . I am about to add an extension which can support to output fields with their default values. I just want to check whether it is already implemented.

Based on proto doc, it looks like

JSON options
A proto3 JSON implementation may provide the following options:

Emit fields with default values: Fields with default values are omitted by default in proto3 JSON output. An implementation may provide an option to override this behavior and output fields with their default values.

I wonder does swift version has option to output fileds with their default values. I found python version has it MessageToJson(message, including_default_value_fields=False)

https://developers.google.com/protocol-buffers/docs/reference/python/google.protobuf.json_format-module

thomasvl commented 5 years ago

The supported JSON encoding options: https://github.com/apple/swift-protobuf/blob/master/Sources/SwiftProtobuf/JSONEncodingOptions.swift

thomasvl commented 5 years ago

p.s. - adding this as an option would be tricky since JSON encoding is built on the visitor pattern which is where it is skipping default values.

seanliu1 commented 5 years ago

@thomasvl thanks for your quick reply. Yes, I realized that default value is skipped in the generated code.

  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    if !self.cards.isEmpty {
      try visitor.visitRepeatedMessageField(value: self.homeCards, fieldNumber: 1)
    }
    if self.test != false {
      try visitor.visitSingularBoolField(value: self.test, fieldNumber: 2)
    }
    try unknownFields.traverse(visitor: &visitor)
  }
tbkka commented 5 years ago

Is there a way to do this without breaking the API between generated code and the library?

I see how old generated code could continue to work with an updated library, but I don't see how new generated code (with modified default handling) could work with an old library.

If we can't preserve this compatibility, we would probably want to increment the major version. It's been a while since 1.0, so this is probably okay. We should certainly consider whether there are other issues that might require API breakage that should be adopted along with this.

seanliu1 commented 5 years ago

What I found is that current we do the check in codegen xx.pb.swift.

  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    if self.showID != 0 {
      try visitor.visitSingularUInt32Field(value: self.showID, fieldNumber: 1)
    }
    try unknownFields.traverse(visitor: &visitor)
  }

What if we add an option to JSONEncodingOptions, then

we change

  mutating func visitSingularUInt32Field(value: UInt32, fieldNumber: Int) throws {

     if value != 0 || options.includeDefaultParameters {
            try startField(for: fieldNumber)
            encoder.putUInt32(value: value)
        }
}

Then API will remain same, and then we do check inside actual visitSingularUInt32Field or visitSingularUInt64Field or visitSingularStringField

tbkka commented 5 years ago

This is a good idea. I'd need to think about how this works for proto2, though.

thomasvl commented 5 years ago

The visitor api shouldn't know about the JSONEncodingOptions, we'd need to make a some other (internal) "options" to expose what ever subset makes sense for all possible visitors.

seanliu1 commented 5 years ago

I think JSONEncodingVisitor currently owns JSONEncodingOptions, and we are using JSONEncodingOptions.alwaysPrintEnumsAsInts in the JSONEncodingVisitor Implementation. If this is something we want to support across all visitors, you are right, it probably should some other internal options.

 mutating func visitSingularEnumField<E: Enum>(value: E, fieldNumber: Int) throws {
    try startField(for: fieldNumber)
    if !options.alwaysPrintEnumsAsInts, let n = value.name {
      encoder.appendQuoted(name: n)
    } else {
      encoder.putEnumInt(value: value.rawValue)
    }
  }
thomasvl commented 5 years ago

The call to JSONEncodingVisitor is happening on the generic visitor interface and that call is blocked on the value not being the default, so all the JSON specific types are removed from where the change is needed.

seanliu1 commented 5 years ago

The call to JSONEncodingVisitor is happening on the generic visitor interface and that call is blocked on the value not being the default, so all the JSON specific types are removed from where the change is needed.

Oh, I see. Yes, the original approach I proposed required all visitor to include default value check, which is not really flexible. Another way in my mind.

We add a function into Visitor protocol

protocol Visitor {
    func shouldInlcludeDefault() ->Bool
    ....
}

We provide default implementations as other methods in Visitor protocol

extension Visitor {
    func shouldInlcludeDefault() ->Bool {
        return false
    }
    ...
}

In JSONEncodingVisitor, we can implement it by checking shouldIncludeDefaultValue from options.

The last change will be codegen.

All traverse function will require to call

  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    if self.id != 0 || visitor.shouldIncludeDefaultValue() {
      try visitor.visitSingularUInt32Field(value: self.showID, fieldNumber: 1)
    }
}

I guess this can be also used by formatTextString.

@tbkka @thomasvl let me know what do you think.

tbkka commented 5 years ago

@seanliu1's idea would have:

I'll have to think about whether this is enough compatibility to avoid the major version bump.

seanliu1 commented 5 years ago

@tbkka see this comments. https://github.com/apple/swift-protobuf/issues/861#issuecomment-479680877

People wont notice any change, unless shouldIncludeDefaultValue is provided. So

Old generated code with new library: default values would always be omitted
New generated code with old library: default values would always be omitted
thomasvl commented 5 years ago

Likely should make a sweep of the options the other languages provide to TextFormat, JSON, and Binary to double checks if there are potentially any other things we might want to support that also would reach to this level of the Swift impl.

Performance wise, we could call this once within each generated visitor method, but it is still a lot of calls if the graph has a lot of sub messages, and the call can't be optimized since it is on a generic Visitor.

seanliu1 commented 5 years ago

Let me know if I can help anything.

seanliu1 commented 5 years ago

I managed to make some changes based on my initial idea.

  1. I moved all default check from codegen to the visitor. Currently I only put default check for json encoding visitor, if this is a viable approach I will add check to all visitors
  2. I add new JsonEncodingOption to include default value
  3. I also changed code gen, so that proto2 still works fine. It will still check nil for sub message for proto3.

https://github.com/seanliu1/swift-protobuf/pull/1

I also have the chance to look at

Python and Java all use descriptor to loop through all fields, and insert default value to the dictionary or map. Python version does not support output Singular message fields and oneof fields default value.

thomasvl commented 5 years ago

Taking a quick look, the visitor seems to have special cased zero/false as the default. That's only true for a message declared in a file with proto3 syntax. A message in proto2 syntax can have any default the .proto file author chooses, so I'm not sure your flow would work out correctly when a proto2 syntax message is used as a field of a proto3 syntax message.

seanliu1 commented 5 years ago

Thanks for taking a look. I forgot proto2 can have any default value. I can think of three workarounds

  1. In order to access stored property, it looks like we need to pass it to all visiting methods. Signature will change to

mutating func visitSingularEnumField<E: Enum>(value: E, default E, fieldNumber: Int) throws

  1. Change signature for traverse, add include_default

func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, includeDefault:Bool == false) throws

Only JsonEncoderVisitor will call try traverse(visitor: &visitor, includeDefault: options.includeDefault)

Other callsites will remain same.

  1. change private let options: JSONEncodingOptions to internal let options: JSONEncodingOptions in JSONEncodingVisitor, traverse can directly access it.

  2. As I mentioned before, add a new protocol to Visitor. func shouldIncludeDefault()->Bool Default implementation is return false, JsonEncodingVisitor provides its own implementation.Here is a PR for it https://github.com/seanliu1/swift-protobuf/pull/2

seanliu1 commented 5 years ago

I also looked at other options provided by Java, Python, C++

JSON

Looks like all languages are on the same page, Swift just does not have includeDefault

TextFormat

Binary Looks pretty similar across those languages.

seanliu1 commented 5 years ago

Would love some feedback for this https://github.com/seanliu1/swift-protobuf/pull/2 Thanks

thomasvl commented 5 years ago

Left some quick comments, the repeated calls to the visitor also have the potential to slow things down when most fields are the defaults. It might make sense to fetch it once and reuse it in the conditionals (you could also check the flag first since when it is on, wether the value is set or not, you are calling the visit method).

But this approach goes back to @tbkka comments before, there are behavior issues with old generated code against the new runtime (and new generated code against the old). So the question of a version change still has to be sorted out. And if there is a version change, then making this a method on visitor vs. doing something more generic like passing options for it into the generated visited code might make things easier to add others in the future (although the same interaction problems would likely exist).

seanliu1 commented 5 years ago

@thomasvl Thanks, I addressed your comment on repeated calls to the visitor, currently it will only call once at beginning of traverse. Since it involves API change and version management, I will leave the PR open until version bump is sorted out.

tbkka commented 5 years ago

I think I might have figured out part of how to solve the versioning issue. We could generate two different traverse methods, one that provides the old behavior for use with the old libraries, and one that supports the new behavior. This might look like:

  // Called by old libraries.  Never visits default values.
  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    traverse(visitor: &visitor, visitDefault: false)
  }

  // Called by new libraries.  The caller specifies whether default values are visited.
  func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, visitDefault: Bool) throws {
    if self.id != 0 || visitDefault {
      try visitor.visitSingularUInt32Field(value: self.showID, fieldNumber: 1)
    }
  }

The expected behavior: If you're using an old library or old generated code, defaults are not visited. Defaults can only be controlled if you have a new library and new generated code. The above solves three of the four cases:

  1. Old library, old generated code. This already works.
  2. Old library, new generated code. The old library calls traverse(visitor:) in new generated code. The result never visits default values, but that's the expected behavior when using the old library.
  3. New library, new generated code. The new library calls traverse(visitor:visitDefault:) in new generated code. That allows the new library code to select whether defaults are visited.

The remaining case is for the new library used with old generated code. This requires a way for the new library to call the old traverse(visitor:) method when the new traverse(visitor:visitDefault:) method isn't available. Is there a way to solve this with generics trickery or maybe adding a new protocol?

thomasvl commented 5 years ago

Hm, so we'd have to generate a travers shim in ever message class, which on all new library uses is wasted code (and if declared public, won't dead strip).

For your last case, could the updated library provide an extension to Message with a default impl of the new traverse method that calls the old? Or would that always get used for any generic operation on Messages directly?

tbkka commented 5 years ago

... shim in every message class, which on all new library uses is wasted code ...

At this moment, I'm happy if we can find any solution that works. I agree it would be nice to avoid this overhead. Certainly, that extra shim could be omitted the next time we change the major version.

... could the updated library provide an extension to Message with a default impl of the new traverse method that calls the old? Or would that always get used for any generic operation on Messages directly?

Good idea! I think that would do it. I think generic uses and protocol-typed variables will work correctly as long as we declare both traverse methods on the Message protocol.

tbkka commented 5 years ago

So I think here's the full set of changes that would allow us to add this feature without a major version change. When we do change the major version someday, we can remove the compatibility shims at that time.

Caution: I haven't actually tested this, so there may be a problem with this approach that I've overlooked.

  1. Change traverse method signature to traverse(visitor:visitDefault:). This requires changing the generated code, adding the new method to Message.swift, updating callers in the library, and adding the new option to JSONEncodingOptions. Once this is done, new generated code with the new library should have the new capability.

  2. Add shim to generated code so that new generated code can be called by old library. This just requires adding the following to the generated output:

    // Called by old libraries.  Never visits default values.
    func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V) throws {
    try traverse(visitor: &visitor, visitDefault: false)
    }
  3. Add a Message extension to Message.swift so that the new library can use old generated code. Make sure that the protocol Message declares both versions of traverse.

    /// Shim to support old generated code.
    ///
    /// If this library is used with old generated code that lacks the new `traverse(visitor:visitDefault:)` method, we want to just use the old `traverse` method.  This will ignore the `visitDefault` parameter, but we cannot do better with old generated code.
    extension Message {
    func traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, visitDefault: Bool) throws {
    try traverse(visitor: visitor)
    }
thomasvl commented 5 years ago

Couple random thoughts on this from this weekend -

tbkka commented 5 years ago

For oneof: What does C++ do?

For proto2, I agree that visitDefault should never have an effect. The easiest answer might be to just continue generating the old traverse method for proto2.

albidev commented 4 years ago

Hello there! Any update on this issue?

CarlosNano commented 2 years ago

Is there any update in this, it would be interesting to have this option the JSONEncodingOptions

thomasvl commented 2 years ago

We're currently working towards a 2.0 release, so now would be the time take one something like this since it could be done with a breaking change to the generated code.

CarlosNano commented 2 years ago

That's great, thanks for the info. Do you have any estimated date when 2.0 will be released?

JscLee commented 1 year ago

Is there any update on this issue?

thomasvl commented 1 year ago

No specific update on exactly when 2.0 will be; and to this specific issue, no one seems to have taken up making a PR for this feature request.

CarlosNano commented 12 months ago

Any update on this for 2.0? Thanks

thomasvl commented 12 months ago

Same as last check, no one has proposed a PR for this.

CarlosNano commented 12 months ago

Thanks @thomasvl

I've seen @seanliu1 achieved something in here: https://github.com/seanliu1/swift-protobuf/pull/2

Is there any documentation of how to do the setup to get default values in the JSON?

Thanks

thomasvl commented 12 months ago

I've seen @seanliu1 achieved something in here: seanliu1#2

Is there any documentation of how to do the setup to get default values in the JSON?

The default values would be on the FieldDescriptor. I haven't looked at that change, but if the have something working, it likely would just need to be update to the current main here, i.e. - it should be basically what is needed.

mrabiciu commented 11 months ago

Are we okay with making breaking changes to the Message type in a pr since we're working towards a 2.0 version? If so I can work on something that looks like traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, options: [VisitorOptions])

mrabiciu commented 11 months ago

Actually after playing with this a bit I think an options argument to the traverse method would fall into the same backwards compatibility pitfalls that we encountered in this thread. In order to future proof this I propose we define a type-erased VisitorNode type

public struct VisitorNode {

    private let _visit: (inout Visitor, VisitorOptions) throws -> Void

    internal func visit(visitor: inout Visitor, options: VisitorOptions) throws {
        try _visit(&visitor, options)
    }

    // Factories

    public static func singularFloat(_ value: Float, fieldNumber: Int, defaultValue: Float) -> Self {
        .init { visitor, options in
            if options.visitDefaults || value != defaultValue {
                try visitor.visitSingularFloatField(value: value, fieldNumber: fieldNumber)
            }
        }
    }

    public static func singularEnum<E: SwiftProtobuf.Enum>(value: E, fieldNumber: Int, defaultValue: E) -> Self {
        .init { visitor, options in
            if options.visitDefaults || value != defaultValue {
                try visitor.visitSingularEnumField(value: value, fieldNumber: fieldNumber)
            }
        }
    }

    public static func singularMessage<M: Message & Equatable>(value: M, fieldNumber: Int, defaultValue: M) -> Self {
        .init { visitor, options in
            if options.visitDefaults || value != defaultValue {
                try visitor.visitSingularMessageField(value: value, fieldNumber: fieldNumber)
            }
        }
    }
    // etc ...
}

And have the Message protocol provide visitor nodes instead of providing the traverse implementation.

extension SomeProto {
    var visitorNodes: [VisitorNode] {
        [
            .singularEnum(value: someEnum, fieldNumber: 0 defaultValue: .unspecified),
            .singularFloat(value: someFloat, fieldNumber: 1, defaultValue: 0),
            .singularMessage(value: someMessage, fieldNumber: 2, defaultValue: .init())
        ]
    }
}

This allows us to have a common shared traverse implementation

func traverse<V: Visitor>(inout visitor: V, options: VisitorOptions) throws {
    for node in visitorNodes {
        try node.visit(visitor: &visitor, options: options)
    }
}

this allows the library to have more flexibility to change the internals of the visitor pattern regardless of which version of the library was used to generate the type conforming to Message

I'm still pretty new to this project so I'm not sure if there are potential pitfalls with this approach. I'm not sure if what I'm proposing would trigger issues like this one

thomasvl commented 11 months ago

Are we okay with making breaking changes to the Message type in a pr since we're working towards a 2.0 version? If so I can work on something that looks like traverse<V: SwiftProtobuf.Visitor>(visitor: inout V, options: [VisitorOptions])

Yup, 2.0 is allowed breaking changes.

As to your other questions - I'll defer to the Apple folks on the general api changes. @tbkka @FranzBusch @Lukasa

I do worry a little about how much the compiler will be able to inline compared to the current code and it also seems like you concern about stack size might be justified.

Having said that, I do sorta like the idea of only having to generate something more Array like that might hopefully be less actual code, reducing the overall size of the generated code.

mrabiciu commented 11 months ago

Sounds good I'll put together a PR this week and we can discuss more there, thanks for the quick reply!

tbkka commented 11 months ago
func traverse<V: Visitor>(inout visitor: V, options: VisitorOptions) throws {
    for node in visitorNodes {
        try node.visit(visitor: &visitor, options: options)
    }
}

I developed the current API after performance-testing a lot of different alternatives. But I must confess, I never tried something with quite this shape, so I'm very curious to see how well it would work. Like Thomas, I'd be willing to accept a modest performance loss if this could significantly reduce code size. Certainly worth the experiment!

I suspect the first performance challenge will be to ensure the array of nodes is statically allocated just once for each message type.

mrabiciu commented 11 months ago

I suspect the first performance challenge will be to ensure the array of nodes is statically allocated just once for each message type.

I'll have to think about this a bit. On the surface level static allocation doesn't really make sense since each node is capturing the value of a given field. Maybe if nodes capture key paths this could work πŸ€”

I'd be willing to accept a modest performance loss if this could significantly reduce code size.

Do you have any recommendations for ways I can capture the relative performance of my solution? Are there some good sample protos I could use or some benchmark test cases I could run?

tbkka commented 11 months ago

On the surface level static allocation doesn't really make sense since each node is capturing the value of a given field. Maybe if nodes capture key paths this could work πŸ€”

Oh. I missed that. Hmmm.... That means this array is getting built every time you try to walk the structure? That's going to be hard for binary serialization, in particular, since we actually walk the structure several times (to compute size information that's used to allocate the necessary buffers). And building this array will require a bunch of allocations: a couple for the array itself, and likely one more for each captured existential.

Key paths would be worth experimenting with, I think.

Do you have any recommendations for ways I can capture the relative performance of my solution? Are there some good sample protos I could use or some benchmark test cases I could run?

We have a Performance suite that Tony Allevato wrote and which I've relied on a lot in the past. It's a collection of shell scripts that can generate standard protos and then compiles and runs benchmarks for both C++ and Swift. Most importantly, it makes some very pretty graphs of the results. 😁 Pretty handy, though the shell scripting for compiling the C++ benchmarks keeps breaking as the protobuf folks keep churning their build system.

(Note: A few years back, Thomas suggested we switch the binary encoder to serialize from back-to-front in the buffer; that would only require a single walk to size the entire output, then a walk to actually write the data into the buffer. I keep meaning to set aside time to try implementing this approach, which would also address a more serious performance cliff we currently have in the binary encoder.)

mrabiciu commented 11 months ago

I played around with keypaths and I think they should work! I've got a rough prototype have been able to make this work for every field type.

Here is the general shape I've got so far

struct FieldNode<M: Message> {
    private let fieldNumber: Int
    private let subtype: Subtype

    private enum Subtype {
        case singularFloat(keypath: KeyPath<M, Float>, defaultValue: Float)
        // ... 
    }

    func traverse<V: Visitor>(message: M, using visitor: inout V, options: VisitorOptions) throws {
        switch subtype {
        case .singularFloat(let keypath, let defaultValue):
            let value = message[keyPath: keypath]
            if options.contains(.visitDefaults) || value != defaultValue {
                try visitor.visitSingularFloatField(value: value, fieldNumber: fieldNumber)
            }
        // ...
        }
    }

    // MARK: - Factories

    public static func singularFloat(_ keyPath: KeyPath<M, Float>, fieldNumber: Int, defaultValue: Float) -> Self {
        Self(fieldNumber: fieldNumber, subtype: .singularFloat(keypath: keyPath, defaultValue: defaultValue))
    }
    // ...
}

extension FooType: Message {    
    static let fieldNodes: [FieldNode<Self>] = [
        .singularFloat(\.someFloat, fieldNumber: 1, defaultValue: 0)
    ]
}

Message/Enum/Map fields are a little trickier but I do have something working for those too.

Accessing the fields through a keypath doesn't seem to have a significant performance impact compared to accessing fields directly. I think swift switch statements over enums are O(1) but I need to confirm this as that could be a performance bottleneck.

If we push this concept a little further we could actually use this to get rid of _protobuf_nameMap and potentially save even more space by consolidating everything into fieldNodes by including the naming information in the field node with an API like this.

static let fieldNodes: [Int: FieldNode<Self>] = [
    1: .singularFloat(\.someFloat, defaultValue: 0, name: .same("some_float"))
]

We'd probably need to use an ordered dictionary though since I assume its important that visitor visit fields in order and we wouldn't want to do any sorting at runtime.

tbkka commented 11 months ago

This is very promising! We'd need to see some measurements (performance and code size) to see how to best take advantage of this, of course. For example, if it's a big code size win but also a significant performance regression, then we might need to give people the option of generating either old-style (fast) or new-style (compact) output. But we'd have to see some actual numbers to inform any such planning. Of course, we all hope that🀞 it's a code size win without being a performance regression -- that would be truly wonderful!

This might also give us a way to drop the generated Equatable conformances, which is another source of code bloat; walking this list of field info would let us compare two messages with a general iterator instead of bespoke code.

tbkka commented 11 months ago

You might not need a dictionary to replace _protobuf_nameMap, actually. That name map is a dictionary today because the current traversal provides a field number that needs to get translated. If the encoding is being driven by an array of field nodes, then those nodes would give both field number and name, so you would not need a translation per se. And arrays are more compact and faster to traverse than dictionaries. Hmmm.... There is the question of decoding, though. Hmmm...

mrabiciu commented 11 months ago

Yes exactly, you'd still need to the ability to do a lookup for decoding purposes from what I can tell.

mrabiciu commented 11 months ago

Next step for me is to get this code out of my custom plugin I've written and into a fork of swift-protobuf and get started with some performance testing. I'm not looking forward to having to re-generate all the pb.swift files in this repo πŸ˜…

I'll report back my findings when I have some 😊, thank you for your feedback so far!

thomasvl commented 11 months ago

One catch for visit is the intermixing of extension ranges. Not sure if you want markers for that in the list of fields, or if we capture the info in another list and then walk the two in parallel to mix the data? (unknown fields just get put at the end, not in order).

As far are regeneration goes, once the generator is updated, the Makefile has targets to do that.