elixir-grpc / grpc

An Elixir implementation of gRPC
https://hex.pm/packages/grpc
Apache License 2.0
1.36k stars 210 forks source link

feat: add telemetry #298

Closed polvalente closed 1 year ago

polvalente commented 1 year ago

closes #229 related to #299

"minimal event set" here means the set of events needed to publish the same Prometheus data that is published by :grpc_prometheus, which is being deprecated due to incompatibilities with Elixir 1.14 and in favor of adding the more portable :telemetry events.

It's worth noting that we aren't measuring exactly the same times because the previous approach would actually depend on where the interceptors were placed in the pipeline.

That can still be mimicked by a user by placing their own interceptor + telemetry event there, following the same ideas that this PR introduces. However, for better documentation and clarity of expectations, we're not providing optional interceptors in favor of having these events always be published.

polvalente commented 1 year ago

Current results from the telemetry+prometheus companion livebook:

# HELP grpc_server_handled_latency_seconds Histogram of response latency of rpcs handled by the server, in seconds.
# TYPE grpc_server_handled_latency_seconds histogram
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.005"} 1
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.01"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.025"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.05"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.1"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.25"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.5"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="1"} 5
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="2.5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="10"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="+Inf"} 6
grpc_server_handled_latency_seconds_sum{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 1.7203967910000002
grpc_server_handled_latency_seconds_count{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.005"} 1
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.01"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.025"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.05"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.1"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.25"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.5"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="1"} 5
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="2.5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="10"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="+Inf"} 6
grpc_server_handled_latency_seconds_sum{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 1.71165725
grpc_server_handled_latency_seconds_count{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.005"} 1
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.01"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.025"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.05"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.1"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.25"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.5"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="1"} 5
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="2.5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="10"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="+Inf"} 6
grpc_server_handled_latency_seconds_sum{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 1.713500874
grpc_server_handled_latency_seconds_count{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_total{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_total{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
# HELP grpc_server_started_total Total number of RPCs started on the server
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 18

Some validation from users would be nice, but it looks like we have most if not full retrocompatibility, aside from the unsupported summary metric

polvalente commented 1 year ago

To minimize the size of this PR, I'll add client events to the livebook in a follow-up PR :)

polvalente commented 1 year ago

@beligante @wingyplus reviews are welcome (for anyone else reading this as well!) :)

beligante commented 1 year ago

@polvalente Great Job! Code LGTM, tho, I have some considerations.

You may know this behavior, but server and client codes behave differently for streams.

Server

For server_stream and bidirectional_stream on the server side. The apply/3 that runs on the server needs to stay blocked til the end of the server stream. So, for long-running streams, the duration will be a significant number, but it's not the case for client.

Note: the function needs to be blocked until the stream has ended because, in the current behavior, if the apply function returns, the connection with the client will be brutally closed. Aka: no proper messages will be exchanged.

Client

For clients, the behavior for streams, in any case, is different from the server. Before, I mentioned that the server needs a blocking code. On the other hand, clients have non-blocking/async behavior. Dealing with streams on the client side is transparent because the interactions happen inside the adapters asynchronously or are wrapped inside an Elixir.Stram.


I want to ensure that this behavior is okay with you. Otherwise, we could add a to-do to improve that.

polvalente commented 1 year ago

@beligante regarding the duration for clients, I think that one could aggregate events based on the stream metadata somehow to provide a meaningful number. I need to check what's the current behavior with :grpc_prometheus. If it's the same as the current one in this PR we can move forward as-is, otherwise we can re-work things accordingly.

beligante commented 1 year ago

@beligante regarding the duration for clients, I think that one could aggregate events based on the stream metadata somehow to provide a meaningful number. I need to check what's the current behavior with :grpc_prometheus. If it's the same as the current one in this PR we can move forward as-is, otherwise we can re-work things accordingly.

Got it! Makes Sense!

polvalente commented 1 year ago

Merging this, with #315 left as a pendency.