TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0
1.19k stars 110 forks source link

Gracefully handle GoAway #320

Closed chrisstaite-menlo closed 9 months ago

chrisstaite-menlo commented 1 year ago

When being hammered a GrpcStore can end up encountering many transport errors:

[2023-10-19T08:43:40.895Z ERROR cas_server] find_missing_blobs Resp: 0.039009552 Err(Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected (b\\\"[p]req HEADERS: max concurrency reached\\\")\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::find_missing_blobs : In find_missing_blobs : Failed on find_missing_blobs() command", source: None })

It should be restricted in the number of simultaneous requests it can hammer at the upstream.

allada commented 1 year ago

I cant find where that error could have came from in from tonic/hyper. Any idea if this could have come from another service (or even client)?

chrisstaite commented 1 year ago

Yeah, I couldn't find it either, but it's only turbo-cache talking to the CAS everything else is proxied through a GrpcStore, so I don't see how a headers message could come from anywhere else.

allada commented 1 year ago

Interesting, are you sure it's turbo-cache talking to turbo-cache? The reason I ask is because find_missing_blobs is never called inside TC except in GRPC forwarding store.

Is it possible that the client (ReClient?) sent this error over h2?

chrisstaite commented 1 year ago

Actually, I have an nginx in front that forwards the CAS gRPC requests directly for legacy reasons. It may actually be Reclient sending it.

chrisstaite-menlo commented 11 months ago

It seems like this is still an issue and is discussed in hyperium/hyper#2500. I get this issue with nativelink talking to nativelink directly GrpcStore and sometimes also stream session alloc failed and No stream session found. Getting one of these errors can often lead to a hang up in the build.

chrisstaite-menlo commented 9 months ago

Just looking and it appears that tonic provides both concurrency_limit and rate_limit for Endpoint. It would be good to expose these to config. It would be better if we could handle the GoAway a little nicer to perform concurrency detection, but this would at least allow us to add some kind of cap.

chrisstaite-menlo commented 9 months ago

Adding the rate limiting doesn't seem to help much. This is the logs from the GrpcStore server:

  2024-01-24T11:15:04.891584Z ERROR nativelink_service::bytestream_server: Write Resp: 0.004535798 None Status { code: Unavailable, message: "status: Unavailable, message: \"h2 protocol error: http2 error: stream error received: refused stream before processing any application logic\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:15:04.891842Z ERROR nativelink_service::bytestream_server: Write Resp: 0.004686835 None Status { code: Unavailable, message: "status: Unavailable, message: \"h2 protocol error: http2 error: stream error received: refused stream before processing any application logic\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)

  2024-01-24T11:15:05.038798Z ERROR nativelink_service::bytestream_server: Write Resp: 0.051602904 None Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(3)

  2024-01-24T11:15:05.038955Z ERROR nativelink_service::bytestream_server: Write Resp: 0.05187966 None Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:15:05.099769Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000070117 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 967 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)

  2024-01-24T11:15:05.102512Z ERROR nativelink_service::bytestream_server: Write Resp: 0.0001212 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(16)

  2024-01-24T11:15:05.185536Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000094612 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 813 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:15:05.244820Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000124144 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)`

And these are the logs from the storage server itself:

  2024-01-24T11:15:04.957840Z ERROR nativelink: Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"", PROTOCOL_ERROR, Library) })
    at src/bin/nativelink.rs:605 on tokio-runtime-worker ThreadId(4)

This is with the above concurrency limit change and the concurrency set to 200, I'll try reducing it to 100.

chrisstaite-menlo commented 9 months ago

With it reduced to 100 I got:

GrpcStore:

  2024-01-24T11:18:40.013285Z ERROR nativelink_service::bytestream_server: Write Resp: 0.57577896 None Status { code: Unknown, message: "status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:18:40.013300Z ERROR nativelink_service::bytestream_server: Write Resp: 0.575339 None Status { code: Unknown, message: "status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)

  2024-01-24T11:18:40.169492Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000058742 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 1286 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)

  2024-01-24T11:18:40.232436Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000067092 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
    at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(16)

Storage server:

  2024-01-24T11:18:39.933106Z ERROR nativelink: Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"[p]CONTINUATION: No stream session found", PROTOCOL_ERROR, Remote) })
    at src/bin/nativelink.rs:605 on tokio-runtime-worker ThreadId(3)