envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Request timeout metric not getting emitted for chunked responses #19829

Closed: agrawroh closed this issue 2 years ago

agrawroh commented 2 years ago

Title

Request timeout metric is not getting emitted from clusters or virtual clusters for chunked responses

Description

We realized that the request timeout metrics are not being emitted for clusters and virtual clusters when responses are chunked: the upstream server immediately returns a 200 OK with all the other headers but keeps sending the data in chunks. If the response times out later, we do get a UT in the response flags, but none of the other metrics like upstream_rq_timeout are emitted.

I was looking at the code and saw that there is a condition which only emits these metrics for requests that are still awaiting headers, so it won't cover the scenario described above.

// Don't do work for upstream requests we've already seen headers for.
if (upstream_request->awaitingHeaders()) {
  cluster_->stats().upstream_rq_timeout_.inc();
  if (request_vcluster_) {
    request_vcluster_->stats().upstream_rq_timeout_.inc();
  }
...

Is this intentional? Just trying to understand if it's possible to also record these metrics for such cases or if there is any other metric we can look at for such cases.

Repro Steps

Implement a backend API which immediately returns a 200 OK with all the headers, streams the data in chunks, and then make it time out somehow.

htuch commented 2 years ago

@alyssawilk @yanavlasov

alyssawilk commented 2 years ago

Yeah, I guess this was written without taking bidirectional streaming into account, assuming that once the response had started we didn't want to time out requests. We do support bidi streaming though (MultiplexedUpstreamIntegrationTest.LargeBidirectionalStreamingWithBufferLimits), so I think the correct thing to do in that case is continue to allow request timeouts until the request is fully complete. cc @snowp @mattklein123 for a second opinion

mattklein123 commented 2 years ago

so I think the correct thing to do in that case is continue to allow request timeouts until the request is fully complete. cc @snowp @mattklein123 for a second opinion

I'm pretty sure we still do the request timeouts, but I think the issue is that the stat is not incremented in that case? Assuming that is true, to be perfectly honest I don't remember the history here and why we only increment the stat if nothing has been received. I think it's probably OK to increment the stat in all cases but we should definitely runtime guard the fix.

agrawroh commented 2 years ago

@mattklein123 Would the right fix here be to just remove this condition which looks for awaitingHeaders() before increasing these stats?

// Don't do work for upstream requests we've already seen headers for.
if (upstream_request->awaitingHeaders()) { ...

mattklein123 commented 2 years ago

Yes I think that is right. I can't really see any reason to guard that, but it must be that way historically for some reason. I would definitely guard this change behind a runtime flag.