gRPC/JSON transcoding using reflection

mattklein123 commented 7 years ago

In https://github.com/lyft/envoy/pull/1079 @lizan has implemented transcoding that requires the proto definitions to be available to envoy and up to date. This is a great start, but as discussed, the real killer feature here IMO is to do it transparently using reflection. Theoretically this should be possible.

cc @louiscryan

jwfang commented 7 years ago

great work +1

what happen if different upstream instances have different versions of the same protocol ? or even complete different service due to misconfigure.

currently the protobuf descriptor is shared for all request for the same listener, right?

mattbailey commented 5 years ago

While figuring out #6608, adding OPTIONS method reflection might be something to consider for the transcoder, since either with reflection or possibly the descriptor, the transcoder is in a unique position to determine the availability of http methods on an endpoint.

Richard1ybb commented 2 years ago

Are there any updates now ? I am very interested in this.

luoxiner commented 2 years ago

I'm interested in that, too !

jspeiser commented 2 years ago

I'm also very interested in this!

juliaogris commented 2 years ago

In a Kubernetes and Istio context we have built the Kubernetes Operator transflect to automate gRPC/JSON transcoding updates via the gRPC server reflection API. For annotated deployments transflect creates Istio EnvoyFilter resources, which update the gRPC JSON transcoder filter configuration with the new FileDescriptorSet and service list.

hq6 commented 2 years ago

I'm interested in possibly working on this, if nobody else has already started work on it.

However, I haven't yet read the contribution guidelines, so it'll take me a bit of time to get up to speed on procedures.

qiwzhang commented 2 years ago

@hq6 you can start to work on it. I can help to review the change. But I like to hear your design first.

the key is to build this descriptorpool by calling backend grpc server reflection.

I assume you will do that in the main thread when JsonTranscoderConfig is constructing, since there are multiple async calls, you may need to register with init_manager to wait for them.

Thanks

hq6 commented 2 years ago

@qiwzhang Thanks for offering to review my design! I've outlined my proposed design below.

Yes, my plan is to add another case to the switch on descriptor_set_case. Specifically, I intend to add a proto_descriptor_channel_uri which corresponds to the URI passed to grpc::CreateChannel.

To keep the design simple, I intend to use grpc::InsecureChannelCredentials() rather than dealing with SSL or other channel authentication in the initial implementation. This will make reflection work when Envoy is deployed as a sidecar and terminates SSL, and so can talk to its upstream over plaintext.

With respect to waiting for multiple async calls with init_manager, that is a part of Envoy I am less familiar with, so I had been planning to make the Reflection Rpc synchronously. Again, assuming the sidecar model, the latency to the upstream should be on the order of microseconds or single digit milliseconds. Moreover, if this code only runs at initialization, it should be one-time cost If you have a strong preference for me to do async calls, please let me know. In this case, I'd also appreciate a pointer to example code or documentation that uses init_manager for async calls.

With respect to actual implementation, I had intended to mostly wholesale copy in this class from the gRPC source and wrap the pool around it. This needs to be copied in because the gRPC maintainers are currently unwilling to make the code part of their library, and it seems like a waste of effort to rewrite it. They seem to use the same open source license as Envoy, so I believe the licenses are compatible.

Please let me know if the proposed design seems reasonable to you! I am very much interested in feedback on both the proposed design and the clarity of exposition, since this is my first attempt at an upstream contribution to Envoy.

hq6 commented 2 years ago

Another area where guidance would be greatly appreciated is whether there is prior art for and a requirement to write automated E2E tests involving a real gRPC service.

I will certainly run manual tests, ensure that existing unit tests pass, and attempt to write new unit tests by mocking out the reflection service, but not yet sure where to write the code or script to spin up a real service.

With enough spelunking, I'll probably find it, but if you already have a pointer, that would be convenient. :)

hq6 commented 2 years ago

@qiwzhang I have a working prototype of the design above (no unit tests, and not importing the correct way).

Would it be helpful to see that in a Draft PR (not sure if Envoy community does that) as part of the design review, or do you prefer to offer comments first based on my description above, and then I can polish more before putting up a PR?

qiwzhang commented 2 years ago

Envoy is all about async calls. I believe Envoy is strongly against blocking calls, even it is on the init phase. Envoy supports two ways to make async gRPC calls.

envoy grpc_call: async_client_impl. It is not calling any grpc c++ library, it is using http2 with grpc frame. You have to define a envoy cluster config for the grpc server.
use google grpc_call, google_async_client_impl. It is calling grpc c++ library. you don't need to define a cluster config for it.

For using init_manager, here is an example code you can follow

qiwzhang commented 2 years ago

There are some extension codes using these async grpc calls. e.g. rate_limiting, and ext_authz. Please take a look.

Still prefer to use async call if it is possible.

hq6 commented 2 years ago

Excellent, thanks for the pointers! I'll look into how hard it is to use async calls for this usage, although my current suspicion is that it might require rewriting substantial portions of code I had intended to copy from gRPC.

hq6 commented 2 years ago

Also, is there a specific reason for Envoy to be against blocking calls during initialization?

I totally get why we don't want to consume limited threads while actually serving requests, but the reasoning for avoiding blocking calls during initialization is less clear to me. It's particularly puzzling because code that is written in a blocking style is usually easier to read, and a few ms of extra loading time doesn't seem super high compared to the total cost of deploying a service.

If there's a philosophy doc somewhere, I'd appreciate a link as well.

qiwzhang commented 2 years ago

Hi @htuch @mattklein123

what is your take on using blocking calls in the main thread when building up a listener?

Agreed that the task will be much easier, and the code is easier to read.

mattklein123 commented 2 years ago

what is your take on using blocking calls in the main thread when building up a listener?

Not a good idea. Blocking the main thread will have follow on consequences for metrics output, admin requests, etc. We really can't block so anything like this would need to be async and then use completions as already suggested.

hq6 commented 2 years ago

Just want to be clear, the question is not around blocking the main thread during Envoy's actual execution, only during initialization of the config which should happen exactly once when Envoy loads, and not on a recurring basis.

That is, I know the listener is constructed for every request, but the factory is constructed only once, as far as I can tell.

Do those consequences still apply in this case?

Thanks!

On Tue, Jan 25, 2022, 10:24 AM Matt Klein @.***> wrote:

what is your take on using blocking calls in the main thread when building up a listener?

Not a good idea. Blocking the main thread will have follow on consequences for metrics output, admin requests, etc. We really can't block so anything like this would need to be async and then use completions as already suggested.

— Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/1182#issuecomment-1021481124, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEG5IG4TME22EWMMFKOM3TUX3TFTANCNFSM4DRAL66Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

mattklein123 commented 2 years ago

Do those consequences still apply in this case?

Yes, as this can happen when the server is already operating due to an xDS update, etc. Even during initialization this can block health checking and other important things.

hq6 commented 2 years ago

Thanks for the clarification about the xDS updates, that makes sense!

hq6 commented 2 years ago

@qiwzhang I have examined the provided example code and links. Here's a new, non-blocking design proposal.

New logic flow for JsonTranscoderConfig constructor
- The logic after the switch() statement gets extracted into a function called processDescriptorPool.
- If services are empty, then we return immediately as before.
- If we are not using reflection, then just read the files, call processDescriptorPool directly, and return from the constructor.
- Otherwise (we are using reflection), set JsonTranscoderConfig._disabled to true, and register a target with the InitManager that will use Envoy::Grpc::AsyncClient to kick off ServerReflection Rpcs for each service listed in config.services() serially. After all the Rpcs have completed successfully, do the following:
  1. Call processDescriptorPool
  2. Set JsonTranscoderConfig._disabled to false
  3. Call target.ready()
- If any of the Rpcs fail or timeout, throw an EnvoyException.
Mechanism for serializing Rpc handling.
- Create an std::vector<ServerReflectionRequest> reflection_rpc_requests and populate it with one request for each service in config.services().
- Create a size_t current_rpc_Index, initialized to 0.
- Launch the first Rpc.
- In the callback for onSuccess:
  - Put the parsed FileDescriptorProtos into a FileDescriptorSet
  - Increment current_rpc_Index.
  - If current_rpc_Index < reflection_rpc_requests.size(), fire off the next request.
  - Otherwise, assume we have finished all the reflection we need.
- In the callback for onFailure, throw an EnvoyException.

Please let me know when you've had a chance to review. Thanks!

qiwzhang commented 2 years ago

For JsonTranscoderConfig change. I will like to suggest another approach; move descriptor_pool_ building code outside of JsonTranscoderConfig. it is already complicated, don't want to add async_fetching code into it. Use separate classes for reflection_fetching and descriptor_pool_ loading. then pass descriptor_pool_ to JsonTranscoderConfig
for fetching part, you can issue parallel requests for each service in config.services. create one grpc_async_client for each service. they will be processed in parallel.

hq6 commented 2 years ago

The first point I agree with.

The second point I believe might cause problems, if the servers that are hit have a mixture of builds, as described in the gRPC reflection doc.

According to that doc, the only reason why the ServerReflection Rpc is structured as a streaming Rpc is to ensure that all requests go to the same backend, so that we do not get a mixture of builds. If we run multiple requests in parallel by creating multiple async clients, we will allow hitting multiple different backends in the same cluster, which defeats the purpose of the Rpc being streaming.

One potential issue with naive reverse proxies is that, while any individual server will have a consistent and valid picture of >the proto DB which is sufficient to handle incoming requests, incompatibilities will arise if the backend servers have a mix >of builds. For example, if a given message is moved from foo.proto to bar.proto, and the client requests foo.proto from an >old server and bar.proto from a new server, the resulting database will have a double definition.

To solve this problem, the protocol is structured as a bidirectional stream, ensuring all related requests go to a single >server. This has the additional benefit that overlapping recursive requests don’t require sending a lot of redundant >information, because there is a single stream to maintain context between queries.

Thus, I think we need to go with my currently proposed serial Rpc design.

Does this concern make sense? Or do you think this concern does not apply to Envoy for some reason?

qiwzhang commented 2 years ago

There is not concern for Envoy. Most likely, these multiple requests will share the same TCP connection.

I assume you will not use google_grpc_async_client, you will use envoy grpc_async_client: which is not calling c++ grpc client library, you will use envoy upstream code to talk to grpc server. You need to create a envoy cluster config for your grpc server. Even for this case, envoy will use the same Http2 connection for the multiple requests.

hq6 commented 2 years ago

Yes, I plan to use the the Envoy version of the client, not the Google version. The concern is that if someone tries to use this filter when Envoy is not a sidecar, could there be more than one backend in the specified cluster?

qiwzhang commented 2 years ago

I believe envoy will try to re-use same connection for http2. It should be fine.

hq6 commented 2 years ago

@qiwzhang As of yesterday, I got the non-blocking variation working E2E; there's still a fair bit of cleanup, as well as making it comply with Envoy style guides and updating and adding unit tests.

Given that this is my first nontrivial change to open source Envoy, and my knowledge of the norms are incomplete, are you open to offering early feedback (not a full review) on the shape of this change before it's entirely polished?

By default, I intend to fully polish it to the best of my knowledge before putting it up, but I think we might be able to course correct and ultimately finish this sooner with feedback before full polish.

Please let me know what you prefer. Thanks!

qiwzhang commented 2 years ago

Sure, I can help to review your pr. I suggest you just send a preliminary pr to solicit feedback even it did not pass compiling. You can prefix the pr title as [WIP], so other reviewers will not jump in.

BTW, you may send me an email at qiwzhang@google.com directly as I may not monitor envoy issues or pr so often.

hq6 commented 2 years ago

Hi @qiwzhang, I'm put up a preliminary PR, but it seems that I lack permissions to add you as a reviewer and Lizhan was added by default.

https://github.com/envoyproxy/envoy/pull/20053

thejerrybao commented 1 year ago

Hello! Are there any updates on this? I'm very interested in the reflection feature

hq6 commented 1 year ago

@thejerrybao Thanks for asking! The restructuring that Wayne requested to decouple descriptor pool building from the main config effectively requires rewriting or at least modifying many of the existing unit tests in the filter as well as adding new unit tests.

I am still slowly working my way through the unit tests. This is a personal project for me, rather than something I'm doing for my employer, so there's no real timeline, but I am making continuous if plodding progress.

hq6 commented 1 year ago

@qiwzhang I have just put up a candidate PR and look forward to your review.

I will update as needed for test coverage based on the results of CI, but the core code, documentation and existing tests are ready for review.

Also, I know it's been a long time and the PR is fairly large due to the restructuring we discussed. If you're no longer working with the Envoy project or do not currently have bandwidth to review, please recommend a different reviewer.

envoyproxy / envoy

gRPC/JSON transcoding using reflection #1182