dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

SocketsHttpHandler.SendAsync reads request Content stream twice #53914

Open davidni opened 3 years ago

davidni commented 3 years ago

Description

Under very specific conditions, SocketsHttpHandler.SendAsync when using HTTP/2 attempts to copy the HttpRequestMessage.Content twice.

In Island Gateway (AKA what eventually became YARP), we have code similar to the following, and this invariant that we do a best-effort check is being violated:

https://github.com/microsoft/reverse-proxy/blob/961608d01adc9ab07d4cd3f5adf7d4d3dfff3f0f/src/ReverseProxy/Proxy/StreamCopyHttpContent.cs#L118-L125

The following criteria seem to be necessary to trigger this bug, though it may be broader than this:

* Request specifies `Expect: 100-continue`
* Destination does not produce a `100` response within the first 1 second
* HttpConnectionPool.SendWithRetryAsync decides to retry (just a hypothesis)

It seems like the SocketsHttpHandler machinery decides it is safe to retry because the destination service provably did not start processing the first attempt, but it then makes the wrong assumption that it is okay to read the HttpRequestMessage.Content again on the next attempt. This is not a valid assumption, and it leads to issues in applications such as YARP, where the Content is a wrapper around Kestrel's incoming request body stream, which cannot be read twice.
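
For illustration, a minimal sketch of the kind of single-use content wrapper involved (class and member names are hypothetical; this is not the actual YARP StreamCopyHttpContent): it exposes a forward-only stream such as Kestrel's request body as an HttpContent, and its best-effort check throws if the handler tries to serialize it a second time.

using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch only -- not the actual YARP StreamCopyHttpContent.
// Wraps a forward-only source stream (e.g. Kestrel's request body) as HttpContent.
public sealed class SingleUseStreamContent : HttpContent
{
    private readonly Stream _source;
    private bool _started;

    public SingleUseStreamContent(Stream source) => _source = source;

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        // Best-effort guard (not thread safe): the source cannot be replayed, so a
        // second call -- e.g. one triggered by a handler retry -- must fail the request.
        if (_started)
        {
            throw new InvalidOperationException("Stream was already consumed.");
        }
        _started = true;

        await _source.CopyToAsync(stream);
    }

    protected override bool TryComputeLength(out long length)
    {
        length = 0;    // length is unknown up front; the body is streamed
        return false;
    }
}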

Configuration

* .NET Core 3.1.15
* Windows Server Datacenter 2019

Other information

Filing this as a result of an offline discussion with @halter73 and @davidfowl. Thanks folks for helping investigate.

@karelz FYI.

ghost commented 3 years ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

geoffkizer commented 3 years ago

Can you try with a recent .NET 6 build?

There are a bunch of situations that can cause us to try to read the content stream twice. If the content doesn't support this, it can throw and we should fail the request. There's not really anything else we can do at that point.

geoffkizer commented 3 years ago

More info:

Things that can cause the content stream to be read twice include redirects and authentication. We will handle specific status codes like 302 or 401, issue another request, and try to send the content again. Expect: 100-continue can be used to try to avoid having to read the content twice in cases like this, because hopefully we will get the 302 or 401 before proceeding to send the content; but obviously this only works if the expect timeout doesn't expire.

Are you using redirects or authentication here?

We do have logic to retry requests on connection shutdown -- that's what HttpConnectionPool.SendWithRetryAsync is about. But this shouldn't cause the content stream to be read twice. Once we start sending any content, the request is no longer retryable.

davidni commented 3 years ago

Thanks Geoff.

1: Unfortunately I do not have a minimal repro of this at the moment, and I am unable to verify on .NET 6 (we will not transition our production workloads before GA).

2: There are no redirects nor authentication at play here. Connection shutdown is likely the trigger -- during the incident where we had this happen, the destination service was overwhelmed and couldn't handle the load. This meant many of our requests were hitting the 1-second timeout for Expect:100-continue. And in at least a few cases, this did cause the content stream to be read twice. We have unambiguous evidence of this based on a few occurrences of the error I mentioned in the issue above ("Stream was already consumed"). We have indirect evidence of this based on Kestrel logs with the following stack trace, indicating two concurrent calls to read the request content (look for StreamCopyHttpContent.SerializeToStreamAsync below):

System.NotSupportedException:  The ReadAsync method cannot be called when another read operation is pending.
   at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.IO.Pipelines.StreamPipeReader.ReadAsync(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1ContentLengthMessageBody.ReadAsyncInternal(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.ReadAsyncInternal(Memory`1 buffer, CancellationToken cancellationToken)
   at IslandGateway.Core.Service.Proxy.StreamCopier.CopyAsync(Stream source, Stream destination, CancellationToken cancellation)
   at IslandGateway.Core.Service.Proxy.StreamCopyHttpContent.SerializeToStreamAsync(Stream stream, TransportContext context)
   at System.Net.Http.HttpContent.CopyToAsyncCore(ValueTask copyTask)
   at System.Net.Http.Http2Connection.Http2Stream.SendRequestBodyAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at IslandGateway.Core.Service.Proxy.HttpProxy.<>c__DisplayClass13_0.<<ExecuteRequestAsync>b__0>d.MoveNext()
geoffkizer commented 3 years ago

Ahhhh, this is HTTP2. I had assumed HTTP/1.1.

Yes, with HTTP2 there are additional scenarios. When we receive a GOAWAY (server-initiated connection shutdown), the server tells us which streams will be processed and which won't; the ones that won't be processed are retryable, even if we already started sending the content. So in this case you will definitely see us try to read the content twice.

That said, this is expected behavior. If you can't support reading the content again, then you should throw and the request should be failed.

davidni commented 3 years ago

@Tratcher for your take re. YARP. If this is confirmed to be acceptable, we would at least want to make sure we are handling it correctly and emitting the right data to telemetry. E.g. perhaps this should count as a Transport error (?)

geoffkizer commented 3 years ago

You can also consider increasing the Expect100ContinueTimeout so that you'll hit this less frequently.
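
For reference, a minimal sketch of that mitigation on a plain HttpClient; the 5-second value is an arbitrary illustration and should be tuned for the destination services involved.

using System;
using System.Net.Http;

// Raise the Expect: 100-continue timeout from its 1-second default so slow
// destinations get more time to answer before the client gives up waiting
// and starts sending the request body.
var handler = new SocketsHttpHandler
{
    Expect100ContinueTimeout = TimeSpan.FromSeconds(5) // illustrative value
};
var client = new HttpClient(handler);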

Tratcher commented 3 years ago

We have indirect evidence of this based on Kestrel logs with the following stack trace, indicating two concurrent calls to read the request content

This part's concerning: it means SerializeToStreamAsync is being called concurrently on two threads and the Started check in YARP isn't thread safe. The two concurrent threads make sense if the retry is happening at the same time as the 100-continue timeout expires. @geoffkizer that sounds like a potential race in the client logic. If the 100-continue timeout expired and content is being sent, then the retry should wait for that to abort.

This could be mitigated by making the Started check in YARP thread safe. I'll file an issue for that (https://github.com/microsoft/reverse-proxy/issues/1068).
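
A thread-safe variant of that check could look roughly like the sketch below (illustrative names, not the actual YARP change): Interlocked.Exchange makes the check-and-set atomic, so at most one caller ever reaches the wrapped stream and every other caller fails fast.

using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: same single-use wrapper idea as above, but with an atomic "started" flag
// so two concurrent SerializeToStreamAsync calls cannot both slip past the check.
public sealed class ThreadSafeSingleUseContent : HttpContent
{
    private readonly Stream _source;
    private int _started; // 0 = not started, 1 = started

    public ThreadSafeSingleUseContent(Stream source) => _source = source;

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        // Interlocked.Exchange makes the check-and-set a single atomic step: exactly
        // one caller proceeds, every other (concurrent or later) caller throws.
        if (Interlocked.Exchange(ref _started, 1) == 1)
        {
            throw new InvalidOperationException("Stream was already consumed.");
        }

        await _source.CopyToAsync(stream);
    }

    protected override bool TryComputeLength(out long length)
    {
        length = 0;
        return false;
    }
}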

As for the retry scenario in general, yes, that's expected to fail in YARP and it should be reported accordingly. What type of failure is it currently reported as? It might depend on whether it sent the whole request body before retrying, or whether it was aborted part way through.

Tratcher commented 3 years ago

Raising the 100-continue timeout is a valid mitigation and shouldn't affect responsive servers.

geoffkizer commented 3 years ago

@geoffkizer that sounds like a potential race in the client logic. If the 100-continue timeout expired and content is being sent, then the retry should wait for that to abort.

Looking at the code, we should be waiting for SerializeToStreamAsync to complete before we actually fail the request. So I don't see how we could end up with concurrent SerializeToStreamAsync calls here. (I'm looking at 6.0 code, but I don't think this has changed much since 3.1.)

Note that in cases like this, where we get a GOAWAY or similar and need to cancel sending the request body, we will do two things:

(1) Cancel the cancellation token passed to SerializeToStreamAsync
(2) Throw on any attempt to use the request body stream passed to SerializeToStreamAsync

davidni commented 3 years ago

(1) Cancel the cancellation token passed to SerializeToStreamAsync

@geoffkizer there's no cancellation parameter here, can you clarify what you're referring to? This may be the issue at hand. We really are making no effort to stop the first SerializeToStreamAsync, because there doesn't seem to be any signal we should be reacting to.

public abstract class HttpContent : IDisposable
{
  //...
  // .NET Core 3.1: this is the only serialization hook -- there is no CancellationToken parameter.
  protected abstract Task SerializeToStreamAsync(Stream stream, TransportContext context);
}

(2) Throw on any attempt to use the request body stream passed to SerializeToStreamAsync

The failure we are experiencing happens before we attempt to write to the request body stream. We were still waiting to read bytes from the other side before proxying to the destination.

Tratcher commented 3 years ago

SerializeToStreamAsync added a new cancellable overload in 5.0. YARP hasn't implemented it yet since we're still supporting 3.1. https://github.com/dotnet/runtime/blob/c14ac488cc5f7d2ad71e7f81d6a05a1da8c76237/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L354
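
For reference, a content implementation that wants to observe the handler's cancellation signal could override the 5.0+ overload roughly as sketched below (class name and the forwarding choice are illustrative; as the next comment notes, whether YARP should honor this particular token is a separate question).

using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch targeting .NET 5+.
public sealed class CancellableStreamContent : HttpContent
{
    private readonly Stream _source;

    public CancellableStreamContent(Stream source) => _source = source;

    // Older callers land on the non-cancellable overload; forward with no token.
    protected override Task SerializeToStreamAsync(Stream stream, TransportContext context)
        => SerializeToStreamAsync(stream, context, CancellationToken.None);

    // .NET 5+ overload: the handler cancels this token when it aborts the body send.
    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context, CancellationToken cancellationToken)
    {
        await _source.CopyToAsync(stream, cancellationToken);
    }

    protected override bool TryComputeLength(out long length)
    {
        length = 0;
        return false;
    }
}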

MihaZupan commented 3 years ago

Wouldn't we throw it away in YARP anyway? Right now we have a comment in the code about it:

Note we do not need to implement the SerializeToStreamAsync(Stream stream, TransportContext context, CancellationToken token) overload because the token it provides is the short request token, not the long body token passed from the constructor. Using the short token would break bidirectional streaming scenarios.

Tratcher commented 3 years ago

Note that in cases like this, where we get a GOAWAY or similar and need to cancel sending the request body, we will do two things:

(1) Cancel the cancellation token passed to SerializeToStreamAsync

@geoffkizer @MihaZupan is this a linked token? What are the possible triggers for it?

karelz commented 3 years ago

@MihaZupan did you get a chance to look at it? It is not clear why the untriaged label was removed.

MihaZupan commented 3 years ago

From reading the code, it looks like it is possible that the request content will be read multiple times concurrently on HTTP/1.1.

This could cause the exception seen in https://github.com/dotnet/aspnetcore/issues/33409#issuecomment-859829081.

Judging from the stack trace seen by David, I suspect a similar race exists for HTTP/2.

@geoffkizer If we're saying that the content may be read multiple times, are we also saying that those reads may happen concurrently? If so, we should fix cases like StreamContent that aren't using a thread-safe check. If not, we should fix races in Http*Connection.

geoffkizer commented 3 years ago

@geoffkizer If we're saying that the content may be read multiple times, are we also saying that those reads may happen concurrently?

No, the read should never happen concurrently.

geoffkizer commented 3 years ago

Your analysis seems correct to me.

Seems like we possibly need to do a couple of things here.

The primary one is, we should wait for the sendRequestContentTask to complete before we propagate the exception upward. That should ensure we don't read concurrently from the request content.

However, doing that may introduce some new issues. In order to ensure the sendRequestContentTask completes promptly (or indeed at all), we need to cancel it. Unlike HTTP2, we don't seem to have logic to do this for HTTP1.1; I think we should.

Judging from the stack trace seen by David, I suspect a similar race exists for HTTP/2.

It's certainly possible, but the HTTP2 code is pretty different, so we should investigate and confirm.

One thing I noticed from the stack trace you linked is that the derived HttpContent class doesn't implement the SerializeToStreamAsync overload with a cancellation token. That means that even if we try to cancel sending the request body, it won't necessarily be cancelled promptly. I'm not sure whether this matters in terms of reproducing the issue -- I can't see how it would, but maybe I'm missing something.

As an aside, this all makes me wish we had common Expect: 100-Continue logic across different HTTP versions. We now have this implemented in three different places, unfortunately.

karelz commented 3 years ago

Triage: There is a workaround (a lock in the custom HttpContent), so this is not critical for 6.0. But we should fix it -- i.e. make sure we don't start the 2nd read before fully cancelling the 1st one.

geoffkizer commented 3 years ago

Actually I think there's an even more basic issue here:

* Thread A reads an EOF from the server, sees `_startedSendingRequestBody == false`, sets `_canRetry = true`, and throws an IO exception.
* The timer wakes up thread B, which signals `allowExpect100ToContinue` with `true` and starts reading the request content.

We shouldn't be both setting _canRetry = true and actually starting to send the body. We either need to prevent the body from sending when _canRetry is set to true, or if we can't (i.e. we lose the race), then don't set _canRetry to true.

Edit: And if we did this, we would prevent the concurrent read because we'd never retry if we started sending the content.
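
A minimal illustration of that idea as a generic pattern (not the actual Http2Stream code): make "mark retryable" and "start sending the body" race on a single atomic state so only one of them can ever win.

using System.Threading;

// Illustrative pattern only -- not the runtime's internal implementation.
// State values: 0 = idle, 1 = body send started, 2 = marked retryable.
internal sealed class SendState
{
    private int _state;

    // Called by the path that is about to start reading/sending the request content.
    public bool TryStartSendingBody()
        => Interlocked.CompareExchange(ref _state, 1, 0) == 0;

    // Called by the path that sees EOF/GOAWAY and wants to mark the request retryable.
    public bool TryMarkRetryable()
        => Interlocked.CompareExchange(ref _state, 2, 0) == 0;
}

Whichever side loses the CompareExchange backs off: if the body send already started, the request is not marked retryable; if the request was already marked retryable, the timer path never starts reading the content.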

davidni commented 3 years ago

We recently started seeing this issue on HTTP/2 even without Expect:100-continue. Interestingly, it happened for both GETs and POSTs. It seems likely that this could be related to timing issues between the HttpClient stack and the HTTP/2 server it is talking to. See e.g. https://github.com/microsoft/reverse-proxy/issues/711#issuecomment-801103475

An example call stack from the recent occurrences is below (we had >400 hits of this in a few hours):

System.Net.Http.HttpRequestException: An error occurred while sending the request.
 ---> System.InvalidOperationException: Stream was already consumed.
   at IslandGateway.Core.Service.Proxy.StreamCopyHttpContent.SerializeToStreamAsync(Stream stream, TransportContext context)
   at System.Net.Http.HttpContent.CopyToAsyncCore(ValueTask copyTask)
   at System.Net.Http.Http2Connection.Http2Stream.SendRequestBodyAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at Microsoft.PowerApps.CoreFramework.Communication.Http.HttpClientCorrelationHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at IslandGateway.Core.Service.Proxy.HttpProxy.<>c__DisplayClass13_0.<<ExecuteRequestAsync>b__0>d.MoveNext()
geoffkizer commented 3 years ago

As I mentioned above -- there are various scenarios where we will try to send the content twice, and if the content doesn't support this, it will fail.

It should never try to send the content concurrently, but as long as it's not, then trying to send the content twice is expected in the scenarios discussed above.

geoffkizer commented 3 years ago

BTW I assume this is still .NET Core 3.1? I.e. no code has changed on the client here?

davidni commented 3 years ago

Correct, still .NET Core 3.1. Likely the only things that changed were request timing patterns and the destination services we are routing to.

shawncx commented 2 years ago

Hi @geoffkizer, @Tratcher, we are hitting the exact same issue. We are using .NET Core 3.1 and HTTP/2. May I ask whether there is any recommended action to mitigate this issue? And what kind of error is it recommended to return to the user in this case?

karelz commented 2 years ago

Triage: This problem already exists in 6.0, so it is not a regression. Only one customer report, which has a workaround in YARP. We should look at it from a reliability perspective, but given the low impact on customers, moving to 8.0.

ManickaP commented 1 year ago

Triage: still limited customer impact, moving to future.

MihaZupan commented 1 year ago

I think I figured out how concurrent calls to SerializeToStreamAsync can occur with HTTP/2 as well:

  1. Send a request with Expect 100-Continue set.
  2. The Http2Connection sends the request headers and calls Http2Stream.SendRequestBodyAsync, storing the requestBodyTask.
  3. SendRequestBodyAsync starts waiting on WaitFor100ContinueAsync and yields.
  4. The Http2Connection starts waiting on the request body or response headers tasks to complete.
  5. After some time, the Expect100ContinueTimeout timer in WaitFor100ContinueAsync fires. SendRequestBodyAsync sees that the request cancellation hasn't been requested yet and is about to call HttpContent.SerializeToStreamAsync in the background.
  6. At about the same time in a different thread, the server sends us a GOAWAY frame or a REFUSED_STREAM RST. Http2Connection decides that the request is retryable and calls Http2Stream.OnReset with canRetry: true.
  7. OnReset() sets Http2Stream._canRetry to true and calls Http2Stream.Cancel(), which signals the waiter.
  8. The waiter getting signaled wakes up the ReadResponseHeadersAsync task, which sees that the stream was reset and throws the retryable exception.
  9. The Http2Connection.SendAsync which was waiting on the request body or response headers to complete now wakes up and sees that the response headers task completed first.
  10. It doesn't wait for the requestBodyTask to complete, but sets up a continuation that will observe any exceptions.
  11. The retryable exception from responseHeadersTask is propagated out of SendAsync and into HttpConnectionPool.SendWithVersionDetectionAndRetryAsync which will now attempt to send the request again.
  12. The request is sent again, and HttpContent.SerializeToStreamAsync is called.
  13. In the background, the original SendRequestBodyAsync task (from step 3) may only now be calling into SerializeToStreamAsync, at the same time as the retry.

The gist is that we aren't waiting for requestBodyTask to complete before returning a retryable exception from Http2Connection.SendAsync, allowing SerializeToStreamAsync to be called multiple times concurrently.
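
A simplified sketch of that fix direction (illustrative only, not the actual Http2Connection code): before letting a retryable exception escape, await the in-flight request body task so the first SerializeToStreamAsync call has fully unwound by the time the retry begins.

using System.Net.Http;
using System.Threading.Tasks;

// Illustrative pattern only -- names and structure do not match the runtime internals.
internal static class RetryableSendSketch
{
    public static async Task<HttpResponseMessage> SendWithBodyDrainAsync(
        Task requestBodyTask, Task<HttpResponseMessage> responseHeadersTask)
    {
        try
        {
            return await responseHeadersTask;
        }
        catch
        {
            // Let the in-flight SerializeToStreamAsync call finish unwinding (its own
            // failure is ignored here) before the retryable exception propagates and
            // the connection pool sends the request again.
            try { await requestBodyTask; } catch { }
            throw;
        }
    }
}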

sberube commented 2 months ago

Triage: This problem already exists in 6.0, so it is not a regression. Only one customer report, which has a workaround in YARP. We should look at it from a reliability perspective, but given the low impact on customers, moving to 8.0.

@karelz we seem to observe the same behavior using YARP 2.2 and .NET 8. You were mentioning that one customer has a workaround -- what is this workaround and how can we add it to our system?

Thanks!

MihaZupan commented 2 months ago

@sberube I take it you're seeing the "Stream was already consumed." error message? See https://github.com/microsoft/reverse-proxy/issues/1683

That may happen during normal proxy operation if HttpClient decides to retry a request, something that YARP doesn't support as we don't buffer the request body.

This issue is tracking a bug in SocketsHttpHandler where those reads may happen concurrently. The workaround mentioned is mitigating just that aspect, and it's being used by YARP internally.