JamesNK opened 3 years ago
Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.
| Author: | JamesNK |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Net.Http` |
| Milestone: | - |
We have logic to handle this exact case here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http2Stream.cs#L904
It sounds like you only see this once in a while? Then it's likely we have a race here. It's possible we've introduced this in 6.0, but it's also possible it's always existed and is just rare to see in practice.
If I'm reading the log above correctly, it looks like there is a small request body (12 bytes) and Kestrel has received the entire request body before sending any response. Correct?
We should investigate for 6.0.
> It sounds like you only see this once in a while?
It's very rare. I've seen it once in .NET 6 (this time).
> If I'm reading the log above correctly, it looks like there is a small request body (12 bytes) and Kestrel has received the entire request body before sending any response. Correct?
Yes. A normal POST then response (with trailers).
The only interesting bit is the server sends RST_STREAM after END_STREAM. The END_STREAM is on the trailers frame.
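For illustration, the sequence being described is roughly the following (a sketch of the frame order only, not the actual capture):

```
server → client:  HEADERS                  (response headers)
server → client:  DATA                     (response body)
server → client:  HEADERS + END_STREAM     (trailers; stream complete)
server → client:  RST_STREAM (NO_ERROR)    (arrives after the stream is already done)
```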
Looking at the code I linked, this seems funny to me:
When this situation happens (RST_STREAM with NO_ERROR after END_STREAM), we check if the request stream is still being sent. If it is, we will set `_requestBodyAbandoned = true`, then set `requestBodyCancellationSource = _requestBodyCancellationSource`, and fall out of the lock statement, at which point we check if `requestBodyCancellationSource != null` and then cancel the request body.
But if the request body has already completed sending, then we don't do this and we will end up calling Cancel(), which will cancel the response, including discarding any buffered data. That seems bad. We should not call Cancel in this case.
Actually, ignore my comment above. The logic I pointed at seems fine. Since we have already received END_STREAM on the response, we must still be sending the request stream; otherwise the stream would be fully completed and we'd just ignore the RST_STREAM (because the stream would no longer be in our active stream table).
(Edit: We should probably Assert that the request body hasn't completed yet, instead of testing for it; but that's not critical.)
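For reference, here is a minimal sketch of the flow described above, including the Assert from the edit. Only `_requestBodyAbandoned` and `_requestBodyCancellationSource` are real field names from the linked Http2Stream.cs; everything else is paraphrased or hypothetical, not the runtime's actual code:

```csharp
using System.Diagnostics;
using System.Threading;

class Http2StreamSketch
{
    private readonly object _syncObject = new();
    private bool _requestBodyStillSending = true;  // hypothetical state flag
    private bool _requestBodyAbandoned;            // real field name in Http2Stream.cs
    private CancellationTokenSource? _requestBodyCancellationSource = new(); // real field name

    // Called when RST_STREAM(NO_ERROR) arrives after the response's END_STREAM.
    public void OnResetAfterEndStream()
    {
        CancellationTokenSource? requestBodyCancellationSource;
        lock (_syncObject)
        {
            // The response is complete, so the stream can only still be in the
            // active table because the request body is still being sent.
            // Per the edit above: assert that instead of testing for it.
            Debug.Assert(_requestBodyStillSending);
            _requestBodyAbandoned = true;
            requestBodyCancellationSource = _requestBodyCancellationSource;
        }
        // Cancel outside the lock so cancellation callbacks don't run under it.
        requestBodyCancellationSource?.Cancel();
    }
}
```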
The issue seems to be related to PING handling in the RttEstimator. We are throwing an Http2ConnectionException because we think there's something wrong with the PING ACK that Kestrel has sent -- either we aren't expecting it, or it has the wrong PING value.
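Conceptually, the validation that can throw here looks something like the sketch below; the names and shapes are hypothetical, not the RttEstimator's actual internal API. The point is that a PING ACK is only valid when one is outstanding and its 8-byte payload echoes what was sent:

```csharp
using System;

class RttEstimatorSketch
{
    private long _sentPingPayload;   // opaque 8-byte payload we sent
    private bool _pingOutstanding;   // whether we are waiting for an ACK

    public void SendPing(long payload)
    {
        _sentPingPayload = payload;
        _pingOutstanding = true;
        // ... write the PING frame to the connection ...
    }

    public void OnPingAckReceived(long payload)
    {
        // RFC 7540 §6.7: a PING ACK must echo the opaque payload of the PING it
        // acknowledges. An ACK we never asked for, or one with the wrong payload,
        // is treated as a connection-level protocol error. (The real code throws
        // an internal Http2ConnectionException here.)
        if (!_pingOutstanding || payload != _sentPingPayload)
            throw new InvalidOperationException("PROTOCOL_ERROR: unexpected or mismatched PING ACK");
        _pingOutstanding = false;
        // ... update the RTT estimate from the measured round-trip time ...
    }
}
```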
@JamesNK Any chance you could add the ping value to the log info here?
@antonfirsov Any ideas?
Kestrel doesn't log the content of pings. I could add it to a custom Kestrel build and rerun the test, but I've only ever seen this once so it is unlikely to reproduce.
What I will do is enable capturing event source logging for this test. If I see it again then that info will be available.
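For reference, one way to capture that telemetry in-process is a minimal `EventListener` subscribed to the public `System.Net.Http` EventSource (a sketch; routing the events into the test output is left to the test harness):

```csharp
using System;
using System.Diagnostics.Tracing;

// Minimal sketch: subscribe to HttpClient's telemetry EventSource so its
// events show up alongside the test's own logs.
internal sealed class HttpEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Net.Http")
            EnableEvents(source, EventLevel.LogAlways);
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        var payload = e.Payload is null ? "" : string.Join(", ", e.Payload);
        Console.WriteLine($"{e.EventName}({e.EventId}): {payload}");
    }
}
```

Keeping an instance alive for the duration of the test is enough; `EventListener` sees both existing and future EventSources through `OnEventSourceCreated`.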
@antonfirsov -- we probably should have some log events associated with these failure cases so we can more easily determine what's going on here.
Unfortunately, we are not logging the received ping content either (only sends), so it's very hard to reason about the error with the information we can get with our existing logging.
I assume this happened on the CI. @JamesNK any chance you can try running the test in a loop or do some other trick to get a local repro? If we could somehow get a deterministic repro, we could experiment with extended logging and hunt down the bug.
@geoffkizer I think we should improve our logging for RC2 (or even RC1 if still possible), so if we get similar reports we would be able to react more easily. Will it be hard to get it through the backports?
Triage: If there is really an issue, it is bad -- it may be in the client or on the server. So far it has happened only once. Without a repro we cannot take action :( We need more logs -- we should add logging and try to get it into 6.0 so we at least have a chance to root-cause it the next time it happens somewhere (e.g. in customer production).
I saw this again from another gRPC test:
SDK info:
```
.NET SDK (reflecting any global.json):
 Version:   6.0.100-rc.1.21421.6
 Commit:    77703ed54e
```
I haven't been able to repro it by putting the test in a loop.
Saw it again. This time on the CI server. Also, this test has your HTTP event source logs:
It's long because the test is retrying multiple requests. Probably only the last request is relevant.
The CI is using a slightly older SDK: 6.0.100-rc.1.21380.19
It looks like the error in https://github.com/dotnet/runtime/issues/57617#issuecomment-903193621 is different from the one in the opening post, this time HttpClient complains about the SETTINGS frame.
@JamesNK is there a client stack trace for https://github.com/dotnet/runtime/issues/57617#issuecomment-903620983? I don't see any indication of PROTOCOL_ERROR in that log. Is HttpClient replying with RST_STREAM to an RST_STREAM there?
I think we are dealing with 3 different issues here. /cc @geoffkizer
It's very strange that we'd get an error related to processing the SETTINGS frame... if this was common it would blow up all sorts of things.
I wonder if we have some sort of buffer corruption issue, either on the client or the server... this would explain the PING issue as well. More detailed logs would help track this down.
> @JamesNK is there a client stack trace for #57617 (comment)? I don't see any indication of PROTOCOL_ERROR in that log. Is HttpClient replying with RST_STREAM to an RST_STREAM there?
~It's at the top of the log:~ Never mind, different post.
Yes, I think the 3rd stack trace is different.
Logging added in 7.0 in PR #57747 and in 6.0 RC1 in PR #57998
@JamesNK for clarity: you never managed to reproduce these failures locally, only on CI, right? Are the failures bound to particular OS or machine?
@JamesNK the key question is: Now that we have logging in, how can we get logs from a case when it happens to look at the logs? Any way to reproduce it on our side, or on your side? Thanks!
> @JamesNK for clarity: you never managed to reproduce these failures locally, only on CI, right? Are the failures bound to particular OS or machine?
The opposite: I have seen it twice, both times on my local machine (Windows 11 Insider Preview). I tried putting the test logic in a loop and I couldn't recreate it.
> @JamesNK the key question is: Now that we have logging in, how can we get logs from a case when it happens to look at the logs? Any way to reproduce it on our side, or on your side? Thanks!
Unit tests - both locally and on CI - gather logs from Kestrel, gRPC, and HttpClient. I will add them here if I see another failure.
> I tried putting the test logic in a loop and I couldn't recreate it.
This often indicates that the issue is not independent of the rest of the test suite. What happens if you run all the tests together (several times), not just a copy of the flaky one?
The trick I typically use:
```csharp
// xUnit's [MemberData] source must be IEnumerable<object[]>, not int[],
// so wrap each index in an object array.
public static IEnumerable<object[]> DummyData =>
    Enumerable.Range(0, 1000).Select(i => new object[] { i });

[Theory]
[MemberData(nameof(DummyData))]
public void FlakyTest(int dummy) { /* ... */ }
```
@JamesNK if you hit it and can gather the logs, that would be great, thanks!
@JamesNK did you happen to catch it with logs? It is not actionable for us now.
Triage: Not actionable, moving to Future for now.
The server is completing the response stream and then sending RST_STREAM frame.
The client should be able to read the response to the end without error. Instead, it gets an error.
This is from a flaky test. This is the only time I've seen it fail.
Important bit:
The server is sending END_STREAM flag, then sending RST_STREAM frame.