johnsonjthomas opened this issue 1 month ago
cc @htuch who looked at the original issue.
@htuch, appreciate any feedback regarding this issue. Hope the additional information/clarification is useful. Please let us know if you need any other information.
I spent a bit of time looking through the code paths and don't spot anything obvious, so if I was observing this in a repeatable way, I'd probably start instrumenting and trying to figure out the exact sequence responsible.
Thanks @htuch for looking into this issue. The scenario is simulated with the following unit test, added to `test/common/grpc/async_client_impl_test.cc`:
```cpp
TEST_F(EnvoyAsyncClientImplTest, InvokeCallbacksAfterCancel) {
  Http::AsyncClient::StreamCallbacks* http_callbacks;
  Http::MockAsyncClientStream http_stream;
  EXPECT_CALL(http_client_, start(_, _))
      .WillOnce(
          Invoke([&http_callbacks, &http_stream](Http::AsyncClient::StreamCallbacks& callbacks,
                                                 const Http::AsyncClient::StreamOptions&) {
            http_callbacks = &callbacks;
            return &http_stream;
          }));
  EXPECT_CALL(http_stream, sendHeaders(_, _));
  EXPECT_CALL(http_stream, sendData(_, _));
  EXPECT_CALL(http_stream, reset());

  helloworld::HelloRequest request_msg;
  Tracing::MockSpan active_span;
  Tracing::MockSpan* child_span{new Tracing::MockSpan()};
  EXPECT_CALL(active_span, spawnChild_(_, "async helloworld.Greeter.SayHello egress", _))
      .WillOnce(Return(child_span));
  EXPECT_CALL(*child_span, setTag(_, _)).Times(testing::AnyNumber());
  EXPECT_CALL(*child_span, injectContext(_, _));
  EXPECT_CALL(*child_span, finishSpan()).Times(testing::AnyNumber());

  MockAsyncRequestCallbacks<helloworld::HelloReply> grpc_callbacks;
  EXPECT_CALL(grpc_callbacks, onCreateInitialMetadata(_)).Times(1);
  auto* grpc_request = reinterpret_cast<Envoy::Grpc::AsyncRequestImpl*>(grpc_client_->send(
      *method_descriptor_, request_msg, grpc_callbacks, active_span,
      Http::AsyncClient::RequestOptions()));
  grpc_request->cancel();

  // After cancel(), no further callbacks are expected.
  EXPECT_CALL(grpc_callbacks, onCreateInitialMetadata(_)).Times(0);
  EXPECT_CALL(grpc_callbacks, onFailure(_, _, _)).Times(0);
  EXPECT_CALL(grpc_callbacks, onSuccess_(_, _)).Times(0);
  grpc_request->onTrailers(Envoy::Http::ResponseTrailerMapPtr{
      new Envoy::Http::TestResponseTrailerMapImpl{{"some", "trailer"}}});
}
```
The test fails in `onTrailers()` with an expectation failure on `onFailure(_, _, _)`. In our application, the callbacks are already freed at that point, so we hit a crash when either `onFailure` or `onSuccess` is invoked. Unfortunately, we are not able to reproduce the issue in house; we hit it mostly in production, so the instrumentation approach might be difficult.
This test has the HTTP async client ignore the reset and injects `onTrailers` directly. I think the contract is that when the HTTP stream is reset, it should prevent any further callbacks from the HTTP stream to the gRPC stream object, so `onTrailers` should never be invoked on the gRPC stream.
@htuch, thanks for your response. Yes, the unit test is just a simulation of the crash (skipping many layers), not an actual reproduction of the issue.

I took a look at `GoogleAsyncStreamImpl::cleanup()`, and it does not use deferred deletion when removing the stream from the active stream list. Wondering whether moving to Google gRPC will help with this issue.
```cpp
void GoogleAsyncStreamImpl::cleanup() {
  ENVOY_LOG(debug, "Stream cleanup with {} in-flight tags", inflight_tags_);
  // We can get here if the client has already issued resetStream() and, while
  // this is in progress, the destructor runs.
  if (draining_cq_) {
    ENVOY_LOG(debug, "Cleanup already in progress");
    return;
  }
  draining_cq_ = true;
  ctxt_.TryCancel();
  if (LinkedObject<GoogleAsyncStreamImpl>::inserted()) {
    // We take ownership of our own memory at this point.
    LinkedObject<GoogleAsyncStreamImpl>::removeFromList(parent_.active_streams_).release();
    if (inflight_tags_ == 0) {
      deferredDelete();
    }
  }
}
```
Entirely possible - as I mentioned above, the issue isn't the use of deferred delete per se, but something else that needs chasing down. Given that the Google gRPC implementation (including its cleanup code) is completely different from Envoy gRPC, I'd try it out and see if it works for you. I'd still leave this issue open in case anyone else encounters the same problem.
Thanks @htuch for your feedback and time looking into this issue. We are evaluating switching to Google gRPC and will verify this issue as part of that. However, the switch will take some time, so I don't think I can provide feedback soon. I am fine with keeping the issue open; hopefully it won't get closed due to inactivity.
Marked as "help wanted" to avoid closing.
This is a follow-up to https://github.com/envoyproxy/envoy/issues/27999, which was closed due to inactivity. I am unable to reopen that issue, so I am raising a new one.
We are hitting the crash mentioned in https://github.com/envoyproxy/envoy/issues/27999 more frequently after upgrading Envoy from v1.20.7 to v1.29.0. To add a bit more context and detail, the sequence of events is as follows:

1. We create an `Envoy::Grpc::AsyncRequestImpl` object and an `Envoy::Grpc::AsyncRequestCallbacks` for handling the callbacks.
2. We call `cancel()` on the `Envoy::Grpc::AsyncRequestImpl` in the destructor of the `Envoy::Grpc::AsyncRequestCallbacks` object. The `cancel()` call chain appends the request to the deferred deletion list. Everything on our side, including the callbacks, is deleted after this point, as we don't expect Envoy to invoke the callback on a gRPC request whose response is still pending.
3. A gRPC response arrives before Envoy has deleted the `AsyncRequestImpl` from the deferred deletion list, so the callback is invoked by Envoy and we hit a crash with the call stack below, since everything is freed on our side.

Essentially, there is a race between Envoy deleting the deferred `AsyncRequestImpl` object and the gRPC response arriving for the cancelled `AsyncRequestImpl` request. In the core, we can see that the `AsyncRequestImpl` is still in the deferred deletion list and is treated as an active stream.

From frame `"message": "#9: Envoy::Http::Http2::ConnectionImpl::onFrameReceived()"`,