envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Race condition between Envoy's deferred deletion and cancelled gRPC response #34489

Open johnsonjthomas opened 1 month ago

johnsonjthomas commented 1 month ago

This is a follow-up to https://github.com/envoyproxy/envoy/issues/27999, which was closed due to inactivity. I am unable to reopen that issue, so I am raising a new one.

We are hitting the crash mentioned in https://github.com/envoyproxy/envoy/issues/27999 more frequently after upgrading Envoy from v1.20.7 to v1.29.0. To add a bit more context and detail, the sequence of events is as follows:

  1. We issue a gRPC request using an Envoy::Grpc::AsyncRequestImpl object, with an Envoy::Grpc::AsyncRequestCallbacks object handling the callbacks.
  2. Based on certain criteria, we decide to destroy the gRPC request while its response is still pending by calling cancel() on the Envoy::Grpc::AsyncRequestImpl from the destructor of the Envoy::Grpc::AsyncRequestCallbacks object. The cancel() call chain is shown below; it ends with the request being appended to the deferred deletion list. Everything on our side, including the callbacks, is deleted after this point, since we don't expect Envoy to invoke the callbacks for a request whose response is still pending.
void AsyncRequestImpl::cancel() {
  current_span_->setTag(Tracing::Tags::get().Status, Tracing::Tags::get().Canceled);
  current_span_->finishSpan();
  this->resetStream();
}

void AsyncStreamImpl::resetStream() { cleanup(); }

void AsyncStreamImpl::cleanup() {
  if (!http_reset_) {
    http_reset_ = true;
    stream_->reset();
  }

  // This will destroy us, but only do so if we are actually in a list. This does not happen in
  // the immediate failure case.
  if (LinkedObject<AsyncStreamImpl>::inserted()) {
    ASSERT(dispatcher_->isThreadSafe());
    dispatcher_->deferredDelete(
        LinkedObject<AsyncStreamImpl>::removeFromList(parent_.active_streams_));
  }
}
  3. We receive the gRPC response to the request issued in step 1, but Envoy has not yet deleted the AsyncRequestImpl sitting on the deferred deletion list, so Envoy still invokes the callback and we crash with the following call stack, since everything on our side has already been freed.
      "message": "#27: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke()",
      "message": "#26: Envoy::Server::WorkerImpl::threadRoutine()",
      "message": "#25: event_base_loop"
      "message": "#24: event_process_active_single_queue"
      "message": "#23: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke()",
      "message": "#22: std::_Function_handler<>::_M_invoke()"
      "message": "#21: Envoy::Network::ConnectionImpl::onFileEvent()",
      "message": "#20: Envoy::Network::ConnectionImpl::onReadReady()",
      "message": "#19: Envoy::Network::FilterManagerImpl::onContinueReading()",
      "message": "#18: Envoy::Http::CodecClient::CodecReadFilter::onData()",
      "message": "#17: Envoy::Http::CodecClient::onData()"
      "message": "#16: Envoy::Http::Http2::ConnectionImpl::dispatch()",
      "message": "#15: Envoy::Http::Http2::ConnectionImpl::dispatch()",
      "message": "#14: http2::adapter::NgHttp2Adapter::ProcessBytes()"
      "message": "#13: nghttp2_session_mem_recv",
      "message": "#12: http2::adapter::callbacks::OnFrameReceived()",
      "message": "#11: http2::adapter::CallbackVisitor::OnEndHeadersForStream()",
      "message": "#10: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_29::__invoke()"
      "message": "#9: Envoy::Http::Http2::ConnectionImpl::onFrameReceived()",
      "message": "#8: Envoy::Http::ResponseDecoderWrapper::decodeTrailers()",
      "message": "#7: Envoy::Router::UpstreamRequest::decodeTrailers()"
      "message": "#6: Envoy::Http::AsyncStreamImpl::encodeTrailers()",
      "message": "#5: Envoy::Grpc::AsyncStreamImpl::onTrailers()",
      "message": "#4: Envoy::Grpc::AsyncRequestImpl::onRemoteClose()"
      "message": "#3: Envoy::Grpc::AsyncRequestCallbacks<>::onSuccessRaw()"

Essentially, there is a race between Envoy deleting the deferred AsyncRequestImpl object and the arrival of the gRPC response for the cancelled request. In the core dump, we can see that the AsyncRequestImpl is still on the deferred deletion list and is still treated as an active stream.
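To make the ordering concrete, here is a minimal, self-contained sketch of the race as described above. This is not Envoy's actual dispatcher API and all names are illustrative: deferring deletion only queues the object for destruction later in the event-loop iteration, so an event that is already ready in the same iteration can still call into the cancelled stream.

// Illustrative only: deferredDelete() in this sketch merely queues the object
// for destruction at the end of the current event-loop iteration, so an event
// that is already ready in that iteration can still call back into it.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct FakeStream {
  std::function<void()> on_trailers; // points back into owner state that the owner may have freed
};

int main() {
  std::vector<std::unique_ptr<FakeStream>> deferred_delete_list;
  std::vector<std::function<void()>> ready_events;

  auto stream = std::make_unique<FakeStream>();
  stream->on_trailers = [] { std::cout << "callback into already-freed owner state\n"; };
  FakeStream* raw = stream.get();

  // The response is already queued for dispatch in this iteration...
  ready_events.emplace_back([raw] { raw->on_trailers(); });

  // ...and then cancel() runs: the stream moves onto the deferred deletion list.
  deferred_delete_list.push_back(std::move(stream));

  // The loop drains the already-ready events before it clears the deferred
  // deletion list, so the cancelled stream receives one more callback.
  for (auto& event : ready_events) {
    event();
  }
  deferred_delete_list.clear(); // actual destruction only happens here
}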

From frame "message": "#9: Envoy::Http::Http2::ConnectionImpl::onFrameReceived()",

(gdb) p *frame             
$96 = {                  
  hd = {
    length = 1, 
    stream_id = 31928399,
    type = 1 '\001',       <——— NGHTTP2_HEADERS
    flags = 5 '\005',      <——— NGHTTP2_FLAG_END_STREAM, NGHTTP2_FLAG_END_HEADERS
    reserved = 0 '\000'
  },                   
  data = {               
    hd = {
      length = 1,                               
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },                     
    padlen = 0        
  },                                                            
  headers = {            
    hd = {    
      length = 1,      
      stream_id = 31928399,
      type = 1 '\001',     
      flags = 5 '\005', 
      reserved = 0 '\000'
    },                   
    padlen = 0,    
    pri_spec = {
      stream_id = 0,
      weight = 0,          
      exclusive = 0 '\000'
    },                 
    nva = 0x0,           
    nvlen = 0,             
    cat = NGHTTP2_HCAT_HEADERS
  },                   
  priority = {           
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',     
      flags = 5 '\005',    
      reserved = 0 '\000'
    },                   
    pri_spec = {
      stream_id = 0,
      weight = 0,
      exclusive = 0 '\000' 
    }                 
},
rst_stream = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    error_code = 0
  },
  settings = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    niv = 0,
    iv = 0x0
  },
  push_promise = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    padlen = 0,
    nva = 0x0,
    nvlen = 0,
    promised_stream_id = 0,
    reserved = 0 '\000'
  },
  ping = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    opaque_data = "\000\000\000\000\000\000\000"
  },
  goaway = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    last_stream_id = 0,
    error_code = 0,
    opaque_data = 0x0,
    opaque_data_len = 0,
    reserved = 0 '\000'
  },
  window_update = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    window_size_increment = 0,
    reserved = 0 '\000'
  },
  ext = {
    hd = {
      length = 1,
      stream_id = 31928399,
      type = 1 '\001',
      flags = 5 '\005',
      reserved = 0 '\000'
    },
    payload = 0x0
  }
}

(gdb) p active_streams_
$97 = {
  <std::__cxx11::_List_base<std::unique_ptr<Envoy::Http::Http2::ConnectionImpl::StreamImpl, std::default_delete<Envoy::Http::Http2::ConnectionImpl::StreamImpl> >, std::allocator<std::unique_ptr<Envoy::Http::Http2::ConnectionImpl::StreamImpl, std::default_delete<Envoy::Http::Http2::ConnectionImpl::StreamImpl> > > >> = {
    _M_impl = {
      <std::allocator<std::_List_node<std::unique_ptr<Envoy::Http::Http2::ConnectionImpl::StreamImpl, std::default_delete<Envoy::Http::Http2::ConnectionImpl::StreamImpl> > > >> = {
        <__gnu_cxx::new_allocator<std::_List_node<std::unique_ptr<Envoy::Http::Http2::ConnectionImpl::StreamImpl, std::default_delete<Envoy::Http::Http2::ConnectionImpl::StreamImpl> > > >> = {<No data fields>}, <No data fields>}, 
      members of std::__cxx11::_List_base<std::unique_ptr<Envoy::Http::Http2::ConnectionImpl::StreamImpl, std::default_delete<Envoy::Http::Http2::ConnectionImpl::StreamImpl> >, std::allocator<std::unique_ptr<Envoy::Http::Http2::ConnectionImpl::StreamImpl, std::default_delete<Envoy::Http::Http2::ConnectionImpl::StreamImpl> > > >::_List_impl:
      _M_node = {
        <std::__detail::_List_node_base> = {
          _M_next = 0x55ddcb3d4f40,
          _M_prev = 0x55dda734eb60
        }, 
        members of std::__detail::_List_node_header:
        _M_size = 32
      }
    }
  }, <No data fields>}

(gdb) set $n = active_streams_._M_impl._M_node._M_next
(gdb) set $i = 0
(gdb) while $i < 32
 >set $p = ((Envoy::Http::Http2::ConnectionImpl::StreamImpl*)((void**)$n)[2])
 >p $p->stream_id_
 >set $n = $n->_M_next 
 >set $i = $i + 1
 >end
$98 = 31928467
$99 = 31928465
$100 = 31928463
$101 = 31928461
$102 = 31928459
$103 = 31928457
$104 = 31928455
$105 = 31928453
$106 = 31928451
$107 = 31928449
$108 = 31928447
$109 = 31928445
$110 = 31928443
$111 = 31928441
$112 = 31928439
$113 = 31928437
$114 = 31928435
$115 = 31928433
$116 = 31928431
$117 = 31928429
$118 = 31928427
$119 = 31928425
$120 = 31928423
$121 = 31928421
$122 = 31928419
$123 = 31928417
$124 = 31928415
$125 = 31928413
$126 = 31928411
$127 = 31928409
$128 = 31928399          <—————— stream that we got the trailer on
$129 = 31928395
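
For reference, the ((void**)$n)[2] dereference in the gdb loop above relies on the libstdc++ std::list node layout: two link pointers followed by the stored element, which here is a pointer-sized unique_ptr. A small standalone check of that assumption (libstdc++-specific, illustrative only):

#include <cassert>
#include <list>
#include <memory>

int main() {
  std::list<std::unique_ptr<int>> streams;
  streams.push_back(std::make_unique<int>(42));
  // A libstdc++ list node is [_M_next, _M_prev, element...]; for a
  // pointer-sized element such as a unique_ptr, slot 2 holds the raw pointer,
  // which is exactly what ((void**)$n)[2] reads in the gdb session above.
  void** node = reinterpret_cast<void**>(streams.begin()._M_node);
  assert(node[2] == streams.front().get());
  return 0;
}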
adisuissa commented 1 month ago

cc @htuch who looked at the original issue.

johnsonjthomas commented 1 month ago

@htuch, appreciate any feedback regarding this issue. Hope the additional information/clarification is useful. Please let us know if you need any other information.

htuch commented 1 month ago

I spent a bit of time looking through the code paths and don't spot anything obvious, so if I were able to observe this in a repeatable way, I'd probably start instrumenting and trying to figure out the exact sequence responsible.

johnsonjthomas commented 1 month ago

Thanks @htuch for looking into this issue. The scenario can be simulated with the following unit test, added to test/common/grpc/async_client_impl_test.cc:

TEST_F(EnvoyAsyncClientImplTest, InvokeCallbacksAfterCancel) {
  Http::AsyncClient::StreamCallbacks* http_callbacks;
  Http::MockAsyncClientStream http_stream;
  EXPECT_CALL(http_client_, start(_, _))
      .WillOnce(
          Invoke([&http_callbacks, &http_stream](Http::AsyncClient::StreamCallbacks& callbacks,
                                                 const Http::AsyncClient::StreamOptions&) {
            http_callbacks = &callbacks;
            return &http_stream;
          }));
  EXPECT_CALL(http_stream, sendHeaders(_, _));
  EXPECT_CALL(http_stream, sendData(_, _));
  EXPECT_CALL(http_stream, reset());
  helloworld::HelloRequest request_msg;
  Tracing::MockSpan active_span;
  Tracing::MockSpan* child_span{new Tracing::MockSpan()};
  EXPECT_CALL(active_span, spawnChild_(_, "async helloworld.Greeter.SayHello egress", _))
      .WillOnce(Return(child_span));
  EXPECT_CALL(*child_span, setTag(_, _)).Times(testing::AnyNumber());
  EXPECT_CALL(*child_span, injectContext(_, _));
  EXPECT_CALL(*child_span, finishSpan()).Times(testing::AnyNumber());

  MockAsyncRequestCallbacks<helloworld::HelloReply> grpc_callbacks;
  EXPECT_CALL(grpc_callbacks, onCreateInitialMetadata(_)).Times(1);
  auto* grpc_request = reinterpret_cast<Envoy::Grpc::AsyncRequestImpl*>(grpc_client_->send(
      *method_descriptor_, request_msg, grpc_callbacks, active_span,
      Http::AsyncClient::RequestOptions()));
  // Cancel the request; from this point on no callbacks should reach grpc_callbacks.
  grpc_request->cancel();
  EXPECT_CALL(grpc_callbacks, onCreateInitialMetadata(_)).Times(0);
  EXPECT_CALL(grpc_callbacks, onFailure(_, _, _)).Times(0);
  EXPECT_CALL(grpc_callbacks, onSuccess_(_, _)).Times(0);
  // Simulate the late gRPC response (trailers) arriving after the cancel.
  grpc_request->onTrailers(Envoy::Http::ResponseTrailerMapPtr{
      new Envoy::Http::TestResponseTrailerMapImpl{{"some", "trailer"}}});
}

The test fails in onTrailers() with an unmet expectation on onFailure(_, _, _). In our application the callbacks are already freed, so we crash when either onFailure or onSuccess is invoked. Unfortunately, we have not been able to reproduce the issue in-house; we mostly hit it in production, so the instrumentation approach may be difficult.

htuch commented 1 month ago

This test has the HTTP async client ignore the reset and injects onTrailers directly. I think the contract is that once the HTTP stream is reset, no further callbacks should flow from the HTTP stream to the gRPC stream object, so onTrailers should never be invoked on the gRPC stream.
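
To illustrate that contract, here is a minimal, self-contained sketch with hypothetical names (not Envoy's actual classes): once reset() has been called, the bridge between the HTTP stream and the gRPC callbacks drops any later events instead of forwarding them.

#include <iostream>

// Hypothetical stand-ins for the gRPC-level callbacks and the HTTP/gRPC bridge.
struct GrpcCallbacks {
  void onTrailers() { std::cout << "gRPC onTrailers delivered\n"; }
};

class HttpToGrpcBridge {
public:
  explicit HttpToGrpcBridge(GrpcCallbacks& callbacks) : callbacks_(callbacks) {}

  void reset() { reset_called_ = true; }

  // Events arriving after reset() must not reach the (possibly destroyed) owner.
  void onTrailers() {
    if (reset_called_) {
      return;
    }
    callbacks_.onTrailers();
  }

private:
  GrpcCallbacks& callbacks_;
  bool reset_called_{false};
};

int main() {
  GrpcCallbacks callbacks;
  HttpToGrpcBridge bridge(callbacks);
  bridge.reset();
  bridge.onTrailers(); // dropped: nothing is printed
  return 0;
}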

johnsonjthomas commented 1 month ago

@htuch, thanks for your response. Yes, the unit test is just a simulation of the crash (skipping many layers), not the actual reproduction of the issue.

I took a look at GoogleAsyncStreamImpl::cleanup(), and it does not use deferred deletion to remove the stream from the active-stream list. I'm wondering whether moving to Google gRPC would help with this issue.

void GoogleAsyncStreamImpl::cleanup() {
  ENVOY_LOG(debug, "Stream cleanup with {} in-flight tags", inflight_tags_);
  // We can get here if the client has already issued resetStream() and, while
  // this is in progress, the destructor runs.
  if (draining_cq_) {
    ENVOY_LOG(debug, "Cleanup already in progress");
    return;
  }
  draining_cq_ = true;
  ctxt_.TryCancel();
  if (LinkedObject<GoogleAsyncStreamImpl>::inserted()) {
    // We take ownership of our own memory at this point.
    LinkedObject<GoogleAsyncStreamImpl>::removeFromList(parent_.active_streams_).release();
    if (inflight_tags_ == 0) {
      deferredDelete();
    }
  }
}
htuch commented 1 month ago

Entirely possible - as I mentioned above, the issue isn't the use of deferred delete per se, but something else that needs chasing down. Given the Google gRPC implementation (including the cleanup code) is completely different from Envoy gRPC, I'd try it out and see if it works for you. I'd still leave this issue open in case anyone else encounters the same problem.

johnsonjthomas commented 2 weeks ago

Thanks @htuch for your feedback and time looking into this issue. We are evaluating switching to Google gRPC and will verify this issue as part of that. However, the switch will take some time, so I don't think I can provide feedback soon. I am fine with keeping the issue open; hopefully it won't get closed due to inactivity.

htuch commented 2 weeks ago

Marked as "help wanted" to avoid closing.