proto3 optional for all scalar fields

htuch commented 3 years ago

As discussed in https://github.com/envoyproxy/envoy/pull/14880#issuecomment-772863659, we would like to both:

Move to using proto3 optional support over WKTs for scalar fields. This should have both wire, memory, text proto and other overhead wins.
Use optional for all scalar fields, since this future proofs client default choices. We can switch from default true to false in Envoy without a breaking API change at any point. It's hard to anticipate always which fields this is needed for, so we can do it everywhere.

There are some things we need to do to make this happen:

Validate that all known code generators support optional in released versions; probably want to make sure this covers C++/Java/Go/Python at least.
Potentially update https://github.com/envoyproxy/protoc-gen-validate so that WKT wrapper flow also works for optional. E.g. if the default value is 3 and the PGV annotation is [(validate.rules).int32.gt = 3], today this would fail PGV validation as PGV is unaware of optional and treats the field as default 0. This would require explicitly providing defaults in an annotation, which is something we have talked about in https://github.com/envoyproxy/envoy/issues/10517 and could fold into this work.
Add a check_format and api/STYLE.md update to enforce the property "all new scalar fields are optional".
Validate that the JSON/YAML value for "unset" makes sense with optional. JSON/YAML stability and usability is a 1st class consideration for the xDS APIs.

@envoyproxy/api-shepherds

htuch commented 3 years ago

For (2), this raises the quesiton on whether we can make unilateral client-side default changes.

markdroth commented 3 years ago

I think we may want to take a step back here and consider the overall strategy for how we handle defaults in xDS. I have two concerns here.

My first concern is that changing a default is technically always a breaking change. It's always possible that a client was relying on the default without knowing it, in which case the deployment would not know that it needs to explicitly set a field in the control plane before upgrading the client to a version that changes the default. It seems like we've been willing to live with this so far, but I'm not convinced that it will actually continue to be viable as the ecosystem gets larger.

My second concern is that I think the idea of having the default be controlled by the client will cause problems as more clients are added to the ecosystem. One concrete example of this is the circuit breaking max_requests field, which defaults to 1024 concurrent requests per cluster. gRPC is currently in the process of adding support for this field. Existing gRPC clients that don't support this field will not enforce this default, which means that an existing client that may have more than 1024 concurrent requests in a cluster will break as soon as they upgrade to a version of gRPC that does support this field.

At a high level, I think we have two competing goals here:

We want control planes to have to do the minimum amount of work to get the right default behavior.
We want multple client implementations to be supported.

It seems like until now, we've been focusing on goal (1), which has led to a strategy where clients should provide the right defaults and control planes should not need to set fields unless they want to override those defaults. But if we want to make goal (2) a priority, that would imply a strategy where all new features are off by default, and control planes must explicitly configure them to get the right functionality. That way, upgrading a client to a new version that adds functionality can never break existing clients unless the control plane is already explicitly setting the field. However, this would basically be exactly the opposite of what we want for goal (1).

I think we need to think through how we can find some balance between these two goals.

htuch commented 3 years ago

For this point

My first concern is that changing a default is technically always a breaking change. It's always possible that a client was relying on the default without knowing it, ...

Do you mean the server relying on the default? The client just does what it's told to do :)

On the second concern, you are assuming your management server isn't setting the default. I think the idea of multiple client implementations is possible if the server is able to provide guarantees around how it sets default; the heavy lifting is on the server side, not the client.

The one place that I agree that we have an insufficient story is around new fields added that the server isn't aware of. One argument might be that servers are taking on a contract to track the API closely and set explicit defaults where needed. This could actually be enforced programatically, even with proto reflection, if a server were to scan for fields that are optional and it's not setting in responses sent.

antoniovicente commented 3 years ago

Marking proto3 fields optional does not prevent the control planes from being more explicit about config and relying less on default behavior. I hear you about the issue of multiple client implementations in multiple languages. proto3 lack of support for defaults encoded into the proto definition itself means that consistent implementation of default values for bool and integer fields is rather difficult.

Recommending that control planes fill in all config fields seems like the best possible advice going forward.

htuch commented 3 years ago

@yanjunxiang-google can you update this issue with your findings? Thanks!

markdroth commented 3 years ago

I'm really not convinced that it's actually practical for control planes to always set defaults for all new fields. That seems like a lot of work for control plane maintainers to have do -- they'd have to constantly stay at the leading edge of the xDS protos themselves, effectively rebuilding their control plane with the latest version of the xDS protos every time any xDS client releases a new version that has a newer version of the xDS proto files. And now that there are multiple xDS clients, it's not even clear how the control plane maintainers would even know when a new client has been released with a newer version of the proto files.

IMHO, this strategy basically amounts to hoping that people will do something that they will almost certainly not do, because the cost of doing it is high and the benefit for them is low. I think this approach is doomed to fail.

htuch commented 3 years ago

I think this is overstating the problem. Control planes only need to set defaults for behaviors they actually are expecting to see from clients. If a new defaultable field is added and ignored, it's fine; it probably doesn't matter. E.g. let's say a max headers size limit is set and it varies across clients. That's fine until the control plane starts to make guarantees about the actual max headers that are supported across its data planes.

markdroth commented 3 years ago

The problem here is Hyrum's Law: Just because a control plane is not making a guarantee about something does not mean that clients are not actually using a given behavior. Users are not likely to think about a given knob unless they have some reason to know that a knob is changing in a way that is going to break their existing usage. And the only way that we are actually giving users to discover this problem is by causing breakage.

antoniovicente commented 3 years ago

If you want something to behave in a certain way, you have to configure it as such. We are have problems with bad defaults or defaults that are not safe from a security standpoint.

Deployments that rely on defaults need to accept that those defaults may change over time and those changes like any other change in proxy behavior has potential to break them. Runtime features are similar, they may break behavior and you can temporarily opt out of the change by flipping the runtime feature off.

htuch commented 3 years ago

@markdroth in the proposed model, let's say an Envoy-side change in a default happens, it will only affect Envoy clients. So, only Envoy clients subject to Hyrum's Law suffer. We generally handle this by using runtime and allowing folks to override defaults as needed when making these changes. Other clients, e.g. gRPC, see no change.

So, there are really multiple issues:

Is it fine to have different defaults in each client? Sure, why not. If the control plane doesn't set it and things are working great.
Is it fine to have a given client change its default? Sure, if it knows how to navigate the pain of Hyrum's Law, as we do in Envoy.

This works fine for things like buffer sizes, e.g. they might be legitimately different as default in different client implementations.

Then, for defaults where you really want to override and enforce consistently, you need to form an opinion on these as a control plane, e.g. header underscore rejection.

Generally, clients should pick sensible secure defaults (I think we're at the point that we're all in agreement on this). If it turns out they want to change this later (e.g. because they made the wrong choice out of N options), they can have the flexibility to do this.

I think what would persuade me differently would be a concrete example showing how this doesn't work in practice.

yanjunxiang-google commented 3 years ago

@yanjunxiang-google can you update this issue with your findings? Thanks!

I did memory usage and CPU performance benchmarking for bool vs GWK-BoolValue, int32 vs GWK-Int32Value, and int64 vs GWK-Int64Value. Here is the summary ( with 1 million repeated elements for above type in each protobuf message):

Memory:
1.1 ) For memory in wire, GWK consumes 2 extra bytes for each boolean, and 3 extra bytes for each integer. 1.2 ) For memory in local: GWK boolean : raw boolean 26:1 (GWK boolean consumes 26 times space than raw boolean) GWK int32 : raw int32 6:1
GWK int64 : raw int64 4:1

CPU performance: 2.1) Encoding Encoding GWK boolean costs 300 times more CPU than raw boolean. Encoding GWK integers costs 2-3 times more CPU than raw integers. 2.2) Decoding Decoding GWK boolean costs 10 times more CPU than raw boolean. Decoding GWK integers costs 2-3 times more CPU than raw integers.

htuch commented 11 months ago

This came up again today in #29877. @markdroth putting aside the rethinking of defaults, any issue supporting optional scalar?

envoyproxy / envoy

proto3 optional for all scalar fields #14973