Closed chrisstaite-menlo closed 9 months ago
I cant find where that error could have came from in from tonic/hyper. Any idea if this could have come from another service (or even client)?
Yeah, I couldn't find it either, but it's only turbo-cache talking to the CAS everything else is proxied through a GrpcStore, so I don't see how a headers message could come from anywhere else.
Interesting, are you sure it's turbo-cache
talking to turbo-cache
? The reason I ask is because find_missing_blobs
is never called inside TC except in GRPC forwarding store.
Is it possible that the client (ReClient?) sent this error over h2?
Actually, I have an nginx in front that forwards the CAS gRPC requests directly for legacy reasons. It may actually be Reclient sending it.
It seems like this is still an issue and is discussed in hyperium/hyper#2500. I get this issue with nativelink talking to nativelink directly GrpcStore and sometimes also stream session alloc failed
and No stream session found
. Getting one of these errors can often lead to a hang up in the build.
Just looking and it appears that tonic provides both concurrency_limit
and rate_limit
for Endpoint
. It would be good to expose these to config. It would be better if we could handle the GoAway a little nicer to perform concurrency detection, but this would at least allow us to add some kind of cap.
Adding the rate limiting doesn't seem to help much. This is the logs from the GrpcStore server:
2024-01-24T11:15:04.891584Z ERROR nativelink_service::bytestream_server: Write Resp: 0.004535798 None Status { code: Unavailable, message: "status: Unavailable, message: \"h2 protocol error: http2 error: stream error received: refused stream before processing any application logic\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)
2024-01-24T11:15:04.891842Z ERROR nativelink_service::bytestream_server: Write Resp: 0.004686835 None Status { code: Unavailable, message: "status: Unavailable, message: \"h2 protocol error: http2 error: stream error received: refused stream before processing any application logic\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)
2024-01-24T11:15:05.038798Z ERROR nativelink_service::bytestream_server: Write Resp: 0.051602904 None Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(3)
2024-01-24T11:15:05.038955Z ERROR nativelink_service::bytestream_server: Write Resp: 0.05187966 None Status { code: Internal, message: "status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)
2024-01-24T11:15:05.099769Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000070117 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 967 : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)
2024-01-24T11:15:05.102512Z ERROR nativelink_service::bytestream_server: Write Resp: 0.0001212 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(16)
2024-01-24T11:15:05.185536Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000094612 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 813 : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)
2024-01-24T11:15:05.244820Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000124144 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)`
And these are the logs from the storage server itself:
2024-01-24T11:15:04.957840Z ERROR nativelink: Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"", PROTOCOL_ERROR, Library) })
at src/bin/nativelink.rs:605 on tokio-runtime-worker ThreadId(4)
This is with the above concurrency limit change and the concurrency set to 200, I'll try reducing it to 100.
With it reduced to 100 I got:
GrpcStore:
2024-01-24T11:18:40.013285Z ERROR nativelink_service::bytestream_server: Write Resp: 0.57577896 None Status { code: Unknown, message: "status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)
2024-01-24T11:18:40.013300Z ERROR nativelink_service::bytestream_server: Write Resp: 0.575339 None Status { code: Unknown, message: "status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} } : in GrpcStore::write : in GrpcStore::update() : Error updating inner store : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(14)
2024-01-24T11:18:40.169492Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000058742 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 1286 : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(13)
2024-01-24T11:18:40.232436Z ERROR nativelink_service::bytestream_server: Write Resp: 0.000067092 None Status { code: InvalidArgument, message: "Received out of order data. Got 0, expected 335 : In ByteStreamServer::write()", source: None }
at nativelink-service/src/bytestream_server.rs:535 on tokio-runtime-worker ThreadId(16)
Storage server:
2024-01-24T11:18:39.933106Z ERROR nativelink: Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"[p]CONTINUATION: No stream session found", PROTOCOL_ERROR, Remote) })
at src/bin/nativelink.rs:605 on tokio-runtime-worker ThreadId(3)
When being hammered a GrpcStore can end up encountering many transport errors:
It should be restricted in the number of simultaneous requests it can hammer at the upstream.