envoyproxy / go-control-plane

Go implementation of data-plane-api
Apache License 2.0
1.5k stars 504 forks source link

Expose vtprotobuf generation #824

Closed howardjohn closed 8 months ago

howardjohn commented 9 months ago

vtprotobuf is a protoc plugin that enables faster proto operations. See blog for introduction.

The tl;dr is that it is gogo protobuf, but built on top of protobuf v2, instead of replacing it. vtprotobuf works by creating strictly new files/methods, and enhancing existing types (similar to protoc-gen-validate). Users who wish to utilize these fast functions can call them, or chose not to. Generally, the presence of vtprotobuf would not even be strictly required, and the same binary would compile fine with and without (albeit slower in the latter case), as calling pattern is typically like https://github.com/vitessio/vitess/blob/40f314c1d052db3fa5a52e5fb2ddd2bddaefde44/go/vt/servenv/grpc_codec.go#L40-L49. Additionally, its actively supported by Vitess (CNCF graduation project).

There are a few risks:

However, the gains are tremendous. Below shows benchmarks from our control plan. Note these are not really synthetic microbenchmarks, but rather exercise our entire application; the CPU profiles of the benchmarks very closely match production profiles:

name                                              old time/op        new time/op        delta
RouteGeneration/gateways-shared-48                      77.9ms ± 2%        51.8ms ± 5%  -33.46%  (p=0.008 n=5+5)
RouteGeneration/knative-gateway-48                      3.94ms ± 1%        2.92ms ± 2%  -25.93%  (p=0.008 n=5+5)
ClusterGeneration/gateways-48                           10.7ms ± 1%         5.6ms ± 2%  -47.69%  (p=0.008 n=5+5)
ClusterGeneration/gateways-shared-48                    10.7ms ± 1%         5.6ms ± 1%  -47.21%  (p=0.008 n=5+5)
ClusterGeneration/knative-gateway-48                    24.0µs ± 1%        13.2µs ± 1%  -45.14%  (p=0.008 n=5+5)
ClusterGeneration/http-48                               1.57ms ± 1%        0.86ms ± 1%  -45.42%  (p=0.008 n=5+5)
ListenerGeneration/gateways-48                          15.4ms ± 2%        10.0ms ± 3%  -35.43%  (p=0.008 n=5+5)
ListenerGeneration/gateways-shared-48                   35.1µs ± 3%        24.7µs ± 1%  -29.49%  (p=0.008 n=5+5)
ListenerGeneration/knative-gateway-48                   31.2µs ± 2%        21.1µs ± 2%  -32.22%  (p=0.008 n=5+5)
ListenerGeneration/http-48                               125µs ± 1%          94µs ± 0%  -24.94%  (p=0.008 n=5+5)
EndpointGeneration/1/100-48                              958µs ± 1%         628µs ± 1%  -34.41%  (p=0.008 n=5+5)
EndpointGeneration/10/10-48                              597µs ± 2%         338µs ± 1%  -43.39%  (p=0.008 n=5+5)
EndpointGeneration/100/10-48                            5.64ms ± 2%        3.18ms ± 3%  -43.69%  (p=0.008 n=5+5)
EndpointGeneration/1000/1-48                            5.67ms ± 1%        3.21ms ± 3%  -43.34%  (p=0.008 n=5+5)

name                                              old alloc/op       new alloc/op       delta
RouteGeneration/gateways-shared-48                      37.9MB ± 0%        36.0MB ± 0%   -5.07%  (p=0.008 n=5+5)
RouteGeneration/knative-gateway-48                      1.81MB ± 0%        1.74MB ± 0%   -3.56%  (p=0.008 n=5+5)
ClusterGeneration/gateways-48                           4.94MB ± 0%        4.38MB ± 0%  -11.35%  (p=0.008 n=5+5)
ClusterGeneration/gateways-shared-48                    4.94MB ± 0%        4.38MB ± 0%  -11.35%  (p=0.008 n=5+5)
ClusterGeneration/knative-gateway-48                    11.3kB ± 0%        10.2kB ± 0%   -9.90%  (p=0.008 n=5+5)
ClusterGeneration/http-48                                700kB ± 0%         648kB ± 0%   -7.47%  (p=0.008 n=5+5)
ListenerGeneration/gateways-48                          8.53MB ± 0%        8.53MB ± 0%   -0.00%  (p=0.016 n=4+5)
ListenerGeneration/gateways-shared-48                   19.1kB ± 0%        19.0kB ± 0%   -0.67%  (p=0.008 n=5+5)
ListenerGeneration/knative-gateway-48                   17.2kB ± 0%        17.1kB ± 0%     ~     (p=0.079 n=4+5)
ListenerGeneration/http-48                              62.6kB ± 0%        62.6kB ± 0%     ~     (p=0.167 n=5+5)
EndpointGeneration/1/100-48                              397kB ± 0%         361kB ± 0%   -9.26%  (p=0.008 n=5+5)
EndpointGeneration/10/10-48                              269kB ± 0%         232kB ± 0%  -13.71%  (p=0.008 n=5+5)
EndpointGeneration/100/10-48                            2.53MB ± 0%        2.16MB ± 0%  -14.54%  (p=0.008 n=5+5)
EndpointGeneration/1000/1-48                            2.53MB ± 0%        2.16MB ± 0%  -14.55%  (p=0.008 n=5+5)

name                                              old allocs/op      new allocs/op      delta
RouteGeneration/gateways-shared-48                        490k ± 0%          394k ± 0%  -19.58%  (p=0.008 n=5+5)
RouteGeneration/knative-gateway-48                       24.9k ± 0%         21.7k ± 0%  -12.93%  (p=0.000 n=5+4)
ClusterGeneration/gateways-48                            75.2k ± 0%         53.2k ± 0%  -29.31%  (p=0.029 n=4+4)
ClusterGeneration/gateways-shared-48                     75.2k ± 0%         53.2k ± 0%  -29.31%  (p=0.008 n=5+5)
ClusterGeneration/knative-gateway-48                       166 ± 0%           122 ± 0%  -26.51%  (p=0.008 n=5+5)
ClusterGeneration/http-48                                9.23k ± 0%         6.98k ± 0%  -24.36%  (p=0.008 n=5+5)
ListenerGeneration/gateways-48                           66.9k ± 0%         66.9k ± 0%   -0.00%  (p=0.008 n=5+5)
ListenerGeneration/gateways-shared-48                      180 ± 0%           178 ± 0%   -1.11%  (p=0.008 n=5+5)
ListenerGeneration/knative-gateway-48                      163 ± 0%           161 ± 0%   -1.23%  (p=0.008 n=5+5)
ListenerGeneration/http-48                                 530 ± 0%           530 ± 0%     ~     (all equal)
EndpointGeneration/1/100-48                              7.91k ± 0%         6.31k ± 0%  -20.23%  (p=0.008 n=5+5)
EndpointGeneration/10/10-48                              5.37k ± 0%         3.77k ± 0%  -29.81%  (p=0.008 n=5+5)
EndpointGeneration/100/10-48                             49.6k ± 0%         33.6k ± 0%  -32.27%  (p=0.008 n=5+5)
EndpointGeneration/1000/1-48                             49.1k ± 0%         33.1k ± 0%  -32.60%  (p=0.008 n=5+5)

Overall we see a 40% improvement across the board, which is too good to ignore IMO

A branch of go-control-plane with these changes can be found here: https://github.com/howardjohn/go-control-plane/tree/exp/vt-2


One alternative idea I had was to simply not use go-control-plane and build our own codegen. This is not viable, however. Because go-control-plane has creeped into other core projects, such as grpc and prometheus, its ~impossible to import those packages and our own alternative generated protos. Additionally, we cannot just store the vtproto parts out of band, as methods cannot be attached to structs in other packages. As a result, our only option (AFAIK) is to include them in this repo directly.

howardjohn commented 9 months ago

Did a few discussions with folks about various risks and tradeoffs

The biggest concerns are around a gogo 2.0 that can't be ripped out easily. This library is fundamentally different, and intentionally avoids this issue. Calling looks like this: https://github.com/vitessio/vitess/blob/40f314c1d052db3fa5a52e5fb2ddd2bddaefde44/go/vt/servenv/grpc_codec.go#L40-L49, so there is no code changes required if its removed later. Theoretically you could break yourself, but you would have to use the library in a poor way to do so. The original proto Go struct is unmodified.

From the internal protobuf side, I am not sure how much of the details I can discuss publicly, but there was not concerns in adoption of vtprotobuf.

dfawley commented 9 months ago

Calling looks like this: vitessio/vitess@40f314c/go/vt/servenv/grpc_codec.go#L40-L49, so there is no code changes

That's what calling code should look like. And what happens when an important project writes the equivalent of:

func (vtprotoCodec) Marshal(v any) ([]byte, error) {
    return v.(vtprotoMessage).MarshalVT() // panics if the proto doesn't support vtprotoMessage
}

Removing vtproto would be a breaking change.

howardjohn commented 9 months ago

Calling looks like this: vitessio/vitess@40f314c/go/vt/servenv/grpc_codec.go#L40-L49, so there is no code changes

That's what calling code should look like. And what happens when an important project writes the equivalent of:

func (vtprotoCodec) Marshal(v any) ([]byte, error) {
  return v.(vtprotoMessage).MarshalVT() // panics if the proto doesn't support vtprotoMessage
}

Removing vtproto would be a breaking change.

You can go down this rabbit hole with any API really. Someone can always use reflect or unsafe to pull out data from private fields, or other crazy shenanigans.

If someone goes out of there way to shoot themselves in the foot, I don't see any issue with breaking them -- it would be far less of a breaking change than go-control-plane has made numerous times over the years as well (not reason to do it again, but still).

Also see build tag discussion in https://github.com/envoyproxy/envoy/pull/31172#issuecomment-1852644505

dfawley commented 9 months ago

These aren't private fields, and my example does not use reflection or unsafe. These are exported methods, and removing them is going to be seen as less "okay".

The issue with "just break people who do bad things" is: when you work for a big company (e.g. Google) and someone imports the project doing bad things into your codebase, you might not be able to upgrade your thing without fixing the people that did the bad things first.

If this can go in with a build flag that is opt-in instead of opt-out, that would probably remove all of my concerns.

howardjohn commented 9 months ago

If this can go in with a build flag that is opt-in instead of opt-out, that would probably remove all of my concerns.

This is my plan, need to merge https://github.com/planetscale/vtprotobuf/pull/122 first though

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 8 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.