dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.94k stars 4.64k forks source link

HttpRequestException when using SocketsHttpHandler #26629

Closed ppekrol closed 4 years ago

ppekrol commented 6 years ago

Hi all,

we had to drop the SocketsHttpHandler (https://github.com/ravendb/ravendb/commit/9ad7c41233fca1e3f16617ad0eadf191251c0e88) because for quite some time we started receiving random test failures on our CI (Jenkins) Agent (Hyper-V machines, 6 vCPU, 6GB RAM on WS2012R2 and WS2016 - fully updated). We tried to reproduce this on our local dev machines, but without any luck. After turning off the SocketsHttpHandler, the random failures stopped showing up.

The exception we are getting is:

System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\\Jenkins\\workspace\\PR_Tests_v4.1\\s\\src\\Raven.Client\\Http\\RequestExecutor.cs:line 738

We are using latest .NET Core 2.1.1 (2.1.0 was also affected).

Do you know if there are some problems in SocketsHttpHandler that could cause this? What can we do to help you solve this issue? Any pointers?

ayende commented 6 years ago

Most / all of these requests were using POST operations (or other methods that have method bodies). We suspect, but aren't sure, that this might be related to connection pooling in some manner.

stephentoub commented 6 years ago

What can we do to help you solve this issue?

A repro :smile:

stephentoub commented 6 years ago

We suspect, but aren't sure, that this might be related to connection pooling in some manner.

That error message is for SocketError.OperationAborted, which likely means the socket was closed while the send or receive was in progress.

ayende commented 6 years ago

@stephentoub Is there anything here that we can look at? To reproduce this we need to schedule > 5 runs of 4,000+ tests that takes an hour plus. In such scenario, we typically see 1 - 3 such failures using SocketsHttpHandler.

Note that we never use SocketsHttpHandler directly, so I don't think that we improperly using it. We got through HttpClient, a shared instance that is re-used among threads, as recommended.

The code is actually doing something like:

try
{
    return await httpHandler.SendAsync(...);
}
catch(HttpRequestException e)
{
}

So we can catch the exception and then check state / logging / something.

Anything you can recommend that we'll look at to get to the root cause?

stephentoub commented 6 years ago

Is it possible you're hitting a timeout? It shouldn't manifest this way, but it would give us a place to start looking. For example, if you change your HttpClient's Timeout to Timeout.InfiniteTimeSpan, does this still happen?

Or if it's potentially related to connection pooling, what happens if you change the SocketHttpHandler's PooledConnectionLifetime and PooledConnectionIdleTimeout to infinite... does it still repro?

ayende commented 6 years ago

It is all running locally, and we timed the error and it was about 11 ms from calling SendAsync. We'll try with the connection lifetimes.

stephentoub commented 6 years ago

Actually, I have a theory for how this could manifest... now I just need to go look through the code for where the cause could be. Is it possible that before these errors occur for you, some other request timed out or could have otherwise been canceled (e.g. via the token you pass in)?

stephentoub commented 6 years ago

Or... are you potentially disposing of the response stream from a request while it's concurrently being read? e.g. something buggy like:

Task t = responseStream.ReadAsync(...);
responseStream.Dispose();

?

ayende commented 6 years ago

It is entirely possible that the previous request can time out, yes. In fact, that is probably guaranteed. We have a few tests that test what happens when a request is timing out.

ayende commented 6 years ago

I don't think that I have anything like the second option, but it is possible. We might be in the middle of a read and need to abort the connection.

ayende commented 6 years ago

And... I have a repro. On my machine, this reproduce the scenario instantly. You were correct about the timeout causing this.

using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace ConsoleApp15
{
    class Program
    {
        static async Task Main(string[] args)
        {

            // http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk

            var client = new HttpClient(new SocketsHttpHandler
            {
            })
            {
                Timeout = TimeSpan.FromMilliseconds(250)
            };

            var tasks = new[]
            {
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
                Run(client),
            };

            await Task.WhenAll(tasks);

            Console.WriteLine("Done");

        }

        public static async Task Run(HttpClient client)
        {
            for (int i = 0; i < 25; i++)
            {
                try
                {
                    await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/3000/url/http://www.google.co.uk");
                    Console.WriteLine("Should timeout");
                }
                catch (Exception e)
                {
                }
                await client.GetStringAsync("http://slowwly.robertomurray.co.uk/delay/100/url/http://www.google.co.uk");
            }

        }
    }
}
stephentoub commented 6 years ago

And... I have a repro. On my machine, this reproduce the scenario instantly.

That's hitting a timeout and throwing an OperationCanceledException... that's expected.

What's not expected is getting an IOException (not wrapped in an OperationCanceledException) out of SendAsync and when nothing on the request is being canceled / timing out.

ayende commented 6 years ago

@stephentoub There are two exceptions thrown here. The first is expected, because we are hitting the timeout. The second exception (which will actually kill the process) is thrown even though the second request shouldn't be timing out. In fact, you can replace the slowwly url in the second one entirely and still get the same error.

I think that this shows that the timeout on the request indeed poison the pool

stephentoub commented 6 years ago

The second exception (which will actually kill the process) is thrown even though the second request shouldn't be timing out.

Why shouldn't the second request time out? It takes a very long time to access and the timeout is set at 250ms. (I'm trying to access it now and it's just sitting waiting.)

ayende commented 6 years ago

Hm... looks like I was too excited.

var sp = Stopwatch.StartNew();
try
{
    await client.GetStringAsync("http://www.google.co.uk");
}
catch (Exception)
{
    Console.WriteLine(sp.ElapsedMilliseconds);
    throw;
}

Always gives > 250 ms. I assumed that this was because of the previous run, not that it is (probably) waiting there.

ayende commented 6 years ago

Okay, I think that I'm on a better track here. See the following code. The key here is that we need to be POSTing data., I think.

I was able to reproduce the HttpRequestException when posting data and getting a timeout. We are going to try to build a better repro for this.

ppekrol commented 6 years ago

We have some more information (still no direct repro).

We have managed to find out exactly which particular test is causing those errors. This is RavenDB_7787 - a test that is spawning a separate process with our Server and doing some basic CRUD operations on it.

The server is started with random port (we are passing http://127.82.86.78:0) and from our CI logs we can see that the address is generated correctly (Embedded URL: http://127.82.86.78:60101). What is worth noting is that we are killing this Server and whole external process in a bit rude way - using process.Kill.

After this test completes, 20-30 seconds later, few tests are failing with the exception that I wrote about in the issue. Those tests are connecting to global server that is available for whole test suite (and this server is working on a different url and port). After few (2-3) of those failures the tests are starting to work in a normal way.

It looks like something is going on after 20-30 seconds after process was killed and for some time it affects other socket connections.

This is our CI log (the 7787 turns on a switch that writes on console url's and http methods for sanity check - to know if we are hitting appropriate server). The code is available here: https://github.com/ppekrol/ravendb/commits/v4.1-temp4

FYI, skipping this test resolves the issue.

01:02:45.307     SlowTests.Tests.Sorting.AlphaNumericSorting.order_by_two_parameters_first_long_than_alphanumeric
01:02:45.579     SlowTests.Tests.Sorting.AlphaNumericSorting.basic_alphanumeric_sort
01:02:45.859     SlowTests.Tests.Sorting.AlphaNumericSorting.number_and_decimal_alphanumeric_sort
01:02:46.465     SlowTests.Tests.Sorting.AlphaNumericSorting.random_words(seed: 1144883738)
01:02:49.064     SlowTests.Issues.RavenDB_7787.TestEmbedded
01:02:50.064 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:02:54.406 Request: http://127.82.86.78:60101/cluster/topology. Type: GET
01:02:54.688 Request: http://127.82.86.78:60101/admin/databases?name=Test&replicationFactor=1. Type: PUT
01:02:56.117 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:02:58.727 Embedded URL: http://127.82.86.78:60101
01:02:58.727 Request: http://127.82.86.78:60101/topology?name=Test. Type: GET
01:02:58.727 Request: http://127.82.86.78:60101/databases/Test/bulk_docs. Type: POST
01:02:59.334 Request: http://127.82.86.78:60101/databases/Test/configuration/client. Type: GET
01:03:01.920 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:03:03.338 Request: http://127.82.86.78:60109/cluster/topology. Type: GET
01:03:03.926 Request: http://127.82.86.78:60109/admin/databases?name=Test&replicationFactor=1. Type: PUT
01:03:04.196 Embedded URL: http://127.82.86.78:60109
01:03:04.197 Request: http://127.82.86.78:60109/topology?name=Test. Type: GET
01:03:04.197 Request: http://127.82.86.78:60109/databases/Test/docs?&id=people%2F1. Type: GET
01:03:04.197 Request: http://127.82.86.78:60109/databases/Test/configuration/client. Type: GET
01:03:05.178     SlowTests.Issues.RavenDB_6414.Should_unload_db_and_send_notification_on_catastrophic_failure
01:03:05.470 Request: http://127.0.0.1:60113/cluster/topology. Type: GET
01:03:05.470 Request: http://127.0.0.1:60113/admin/databases?name=Should_unload_db_and_send_notification_on_catastrophic_failure_2534&replicationFactor=1. Type: PUT
01:03:05.470 Request: http://127.0.0.1:60113/info/tcp?tag=Supervisor. Type: GET
01:03:06.068 Request: http://127.0.0.1:60113/rachis/waitfor?index=4. Type: GET
01:03:08.060 Request: http://127.0.0.1:59136/databases/Round_robin_load_balancing_should_workbe797fd1-f48b-4772-9bde-6ed4d45afd71/stats?failure=check. Type: GET
01:03:08.060 Request: http://127.0.0.1:60113/topology?name=Should_unload_db_and_send_notification_on_catastrophic_failure_2534. Type: GET
01:03:08.060 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/hilo/next?tag=users&lastBatchSize=0&lastRangeAt=0001-01-01T00:00:00.0000000&identityPartsSeparator=/&lastMax=0. Type: GET
01:03:08.334 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/configuration/client. Type: GET
01:03:08.334 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/bulk_docs. Type: POST
01:03:08.334 Request: http://127.0.0.1:60113/databases/Should_unload_db_and_send_notification_on_catastrophic_failure_2534/hilo/return?tag=users&end=32&last=1. Type: PUT
01:03:08.334     SlowTests.Issues.RavenDb1962.CanExecuteLazyQueriesInAsyncSession
01:03:08.334 Request: http://127.0.0.1:58232/cluster/topology. Type: GET
01:03:08.334 Request: http://127.0.0.1:58232/admin/databases?name=CanExecuteLazyQueriesInAsyncSession_2535&replicationFactor=1. Type: PUT
01:03:08.610 Request: http://127.0.0.1:58232/rachis/waitfor?index=10760. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/topology?name=CanExecuteLazyQueriesInAsyncSession_2535. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/bulk_docs. Type: POST
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/configuration/client. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/stats. Type: GET
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/multi_get. Type: POST
01:03:08.610 Request: http://127.0.0.1:58232/admin/databases. Type: DELETE
01:03:08.610 Request: http://127.0.0.1:58232/databases/CanExecuteLazyQueriesInAsyncSession_2535/stats?failure=check. Type: GET
01:03:08.610     SlowTests.Issues.RavenDb1962.CanExecuteLazyQueriesInAsyncSession [FAIL]
01:03:08.610       Raven.Client.Exceptions.RavenException : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
01:03:08.610          --- End of inner exception stack trace ---
01:03:08.610          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
01:03:08.610          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
01:03:08.610          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.610          --- End of inner exception stack trace ---
01:03:08.610          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.610          at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
01:03:08.612          at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.612          at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.612          at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
01:03:08.612          at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs:line 738
01:03:08.612       Response.StatusCode - InternalServerError
01:03:08.612       Stack Trace:
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(835,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(756,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(116,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteLazyOperationsSingleStep(ResponseTimeInformation responseTimeInformation, List`1 requests, CancellationToken token)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(88,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteAllPendingLazyOperationsAsync(CancellationToken token)
01:03:08.612         c:\Jenkins\workspace\PR_Tests_v4.1\s\test\SlowTests\Issues\RavenDb1962.cs(83,0): at SlowTests.Issues.RavenDb1962.CanExecuteLazyQueriesInAsyncSession()
01:03:08.612         --- End of stack trace from previous location where exception was thrown ---
01:03:08.612     SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession_CheckSingleCall
01:03:08.612 Request: http://127.0.0.1:58232/cluster/topology. Type: GET
01:03:08.612 Request: http://127.0.0.1:58232/admin/databases?name=CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536&replicationFactor=1. Type: PUT
01:03:08.891 Request: http://127.0.0.1:58232/rachis/waitfor?index=10764. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/topology?name=CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536/bulk_docs. Type: POST
01:03:08.891 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536/configuration/client. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_CheckSingleCall_2536/multi_get. Type: POST
01:03:08.891 Request: http://127.0.0.1:58232/admin/databases. Type: DELETE
01:03:08.891     SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession_CheckSingleCall [FAIL]
01:03:08.891       Raven.Client.Exceptions.RavenException : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
01:03:08.891          --- End of inner exception stack trace ---
01:03:08.891          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
01:03:08.891          at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
01:03:08.891          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          --- End of inner exception stack trace ---
01:03:08.891          at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
01:03:08.891          at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
01:03:08.891          at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs:line 738
01:03:08.891       Response.StatusCode - InternalServerError
01:03:08.891       Stack Trace:
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(835,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Http\RequestExecutor.cs(756,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(116,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteLazyOperationsSingleStep(ResponseTimeInformation responseTimeInformation, List`1 requests, CancellationToken token)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\src\Raven.Client\Documents\Session\AsyncDocumentSession.Lazy.cs(88,0): at Raven.Client.Documents.Session.AsyncDocumentSession.ExecuteAllPendingLazyOperationsAsync(CancellationToken token)
01:03:08.891         c:\Jenkins\workspace\PR_Tests_v4.1\s\test\SlowTests\Issues\RavenDb1962.cs(52,0): at SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession_CheckSingleCall()
01:03:08.891         --- End of stack trace from previous location where exception was thrown ---
01:03:08.891     SlowTests.Issues.RavenDb1962.CanExecuteLazyLoadsInAsyncSession
01:03:08.891 Request: http://127.0.0.1:58232/cluster/topology. Type: GET
01:03:08.891 Request: http://127.0.0.1:58232/admin/databases?name=CanExecuteLazyLoadsInAsyncSession_2537&replicationFactor=1. Type: PUT
01:03:09.161 Request: http://127.0.0.1:58232/rachis/waitfor?index=10767. Type: GET
01:03:09.162 Request: http://127.0.0.1:58232/topology?name=CanExecuteLazyLoadsInAsyncSession_2537. Type: GET
01:03:09.162 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_2537/bulk_docs. Type: POST
01:03:09.162 Request: http://127.0.0.1:58232/databases/CanExecuteLazyLoadsInAsyncSession_2537/configuration/client. Type: GET
01:03:09.162 Request: http://127.0.0.1:58232/admin/databases. Type: DELETE
01:03:09.162     SlowTests.Issues.RavenDB_11046.FacetRqlShouldSupportAliasNotation
vitalybibikov commented 6 years ago

I have this behavior on nodes in Service Fabric Cluster 1. In SF Cluster 2 it works OK I'll try to investigate it further, still have no clues, can't reproduce it on my local machine.

ayende commented 6 years ago

We have a statistical repro. Unfortunately, it require us to run about 3,400 tests to run into this.

karelz commented 6 years ago

We can try our logging on it. Do you have steps we can "easily" replicate locally?

ayende commented 6 years ago

Get the code from this PR https://github.com/ravendb/ravendb/pull/6879

Go to test/slowtests and run dotnet xunit

This reproduce in one of five runs or so. Typically more on lower end machines. If there is a way to turn on the logging configuration, I'll happily do so and provide the results.

karelz commented 6 years ago

Most of our logging went in AFTER 2.1 :( ... I am not even sure if it is going to cover this particular scenario ... If we find logging useful, we can get it in 2.1.x servicing. Is there easy way to run your test suite against master? (if you don't know, we will figure it out)

ppekrol commented 6 years ago

@karelz

I'm not sure what is needed to run from 'master', but if you want to update <RuntimeFrameworkVersion> in csprojs (all of them) then best option to do so would be to run:

.\scripts\updateFrameworkVersion.ps1 2.1.1

Also our CI is executing tests in 'Release'.

karelz commented 6 years ago

@rmkerr will take a look on our side - I asked him to try the repro in house as first step. Just to clarify:

ppekrol commented 6 years ago

It started to happening all of a sudden. First we thought that it might be related to 2.0 -> 2.1 migration, but at the end we think it is related to SocketsHttpHandler and RavenDB_7787.cs only.

Check my explanation here: https://github.com/dotnet/corefx/issues/30691#issuecomment-401775334

rmkerr commented 6 years ago

I'm in the process of setting up the repro now. I think that the connection management logging in 2.1 actually isn't bad, so we can hopefully get some useful info there.

rmkerr commented 6 years ago

@ayende are there any additional steps I need to follow to get the tests working? I'm currently: 1) checking out the code from the linked PR 2) running dotnet restore in /tests/SlowTests 3) running dotnet xunit in /tests/SlowTests

As output I see a large number of tests skipped, and then several failures that occur every time I try to run the tests. This is running on Windows 10 RS2

Logs: D:\ravendb\test\SlowTests>dotnet xunit Detecting target frameworks in SlowTests.csproj... Building for framework netcoreapp2.1... Sparrow -> D:\ravendb\src\Sparrow\bin\Debug\netcoreapp2.1\Sparrow.dll Sparrow -> D:\ravendb\src\Sparrow\bin\Debug\netstandard2.0\Sparrow.dll Voron -> D:\ravendb\src\Voron\bin\Debug\netstandard2.0\Voron.dll Raven.Client -> D:\ravendb\src\Raven.Client\bin\Debug\netstandard2.0\Raven.Client.dll Raven.Embedded -> D:\ravendb\src\Raven.Embedded\bin\Debug\netstandard2.0\Raven.Embedded.dll Raven.Server -> D:\ravendb\src\Raven.Server\bin\Debug\netcoreapp2.1\Raven.Server.dll Tests.Infrastructure -> D:\ravendb\test\Tests.Infrastructure\bin\Debug\netcoreapp2.1\Tests.Infrastructure.dll FastTests -> D:\ravendb\test\FastTests\bin\Debug\netcoreapp2.1\FastTests.dll Raven.TestDriver -> D:\ravendb\src\Raven.TestDriver\bin\Debug\netstandard2.0\Raven.TestDriver.dll SlowTests -> D:\ravendb\test\SlowTests\bin\Debug\netcoreapp2.1\SlowTests.dll Running .NET Core 2.1.1 tests for framework netcoreapp2.1... xUnit.net Console Runner v2.4.0-beta.1.build3958 (64-bit .NET Core 4.6.26606.02) Discovering: SlowTests Discovered: SlowTests Starting: SlowTests Max number of concurrent tests is: 2 To attach debugger to test process (x64), use proc-id: 8640. SlowTests.Issues.RavenDB_3365.index_pretty_printer_ignores_whitespaces [SKIP] RavenDB-4185 To attach debugger to test process (x64), use proc-id: 8640. Url http://127.0.0.1:50061 SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromAsyncSessionTwiceInEmbeddedDocumentStore [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromSessionTwiceInShardingDocumentStore [SKIP] RavenDB-5918, RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromAsyncSessionTwiceInShardingDocumentStore [SKIP] RavenDB-5918, RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromAsyncSessionInShardingDocumentStore [SKIP] RavenDB-5918, RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromSessionInShardingDocumentStore [SKIP] RavenDB-5918, RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromSessionInEmbeddedDocumentStore [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadMultipleUsersWithIncludesFromAsyncSessionInEmbeddedDocumentStore [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_5256.ShouldLoadUserFromSessionTwiceInEmbeddedDocumentStore [SKIP] RavenDB-6283 SlowTests.MailingList.ShardedFacets.FacetTest [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_3465.get_metadata_for_async_sharded [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_3465.get_metadata_for_sharded [SKIP] RavenDB-6283 SlowTests.Tests.Queries.CanQueryOnCustomClass.UsingConverter [SKIP] RavenDB-6263 SlowTests.MailingList.StreamingHalfWay.ShouldWork [SKIP] Missing feature: /docs/stream SlowTests.Issues.RavenDB_4904.can_create_side_by_side_index_to_replace_index_with_errors [SKIP] RavenDB-5919 SlowTests.Issues.RavenDB_4904.can_create_side_by_side_index_with_errors_to_replace_index_with_errors [SKIP] RavenDB-5919 SlowTests.Issues.RavenDB_4904.can_create_side_by_side_index_with_errors_to_replace_index [SKIP] RavenDB-5919 SlowTests.Server.Documents.SqlMigration.MySQLSchemaTest.CanFetchSchema [SKIP] Test requires MySQL database SlowTests.Issues.RavenDB_4708.CanUseSingleSyncSharded [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_4708.CanUseFirstAsyncSharded [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_4708.CanUseFirstSyncSharded [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_4708.CanUseLazilySyncShaded [SKIP] RavenDB-6283 SlowTests.MailingList.JohanNilsson.WithCustomizedTagNameAndIdentityProperty [SKIP] RavenDB-6124 SlowTests.Issues.RavenDB_3609.Test [SKIP] RavenDB-6283 SlowTests.Authentication.AuthenticationChangesTests.ChangesWithAuthentication [SKIP] RavenDB-9580: Does not work on current version .NET core. SlowTests.Issues.RavenDB_3420.bulk_insert_sharded [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_3420.bulk_insert_sharded_when_specifying_default_database [SKIP] RavenDB-6283 SlowTests.MailingList.DeserializationToObjectTests.Query_GivenDbWithComplexObjects_ShouldDeserializePropertiesToOriginalType [SKIP] RavenDB-6124 SlowTests.Bugs.Indexing.IndexingEachFieldInEachDocumentSeparately.ForIndexing [SKIP] Missing feature : RavenDB-6152 SlowTests.Bugs.Indexing.WiseShrek.Isolated [SKIP] Missing features SlowTests.Issues.RavenDB_579.OneShardPerSessionStrategy [SKIP] RavenDB-6283 SlowTests.Issues.RavenDB_579.OneShardPerSessionStrategyAsync [SKIP] RavenDB-6283 SlowTests.Cluster.ClusterModesForRequestExecutorTest.Fastst_node_should_choose_the_node_without_delay [SKIP] RavenDB-9020 SlowTests.Issues.RavenDB_6596.Run [SKIP] Culture tests are disabled. Please set 'RAVEN_ENABLE_CULTURE_TESTS' environment variable to 'true' to enable them. SlowTests.MailingList.Lindblom.Test [SKIP] RavenDB-6124 SlowTests.MailingList.Kushnir.SortOnMetadata [SKIP] RavenDB-6264 SlowTests.Issues.RavenDB_2944.CanCreateTestMapReduceIndexes [SKIP] RavenDB-6572 SlowTests.Issues.RavenDB_2944.CanCreateTestMapIndexes [SKIP] RavenDB-6572 SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.upload_archive [SKIP] Requires Amazon AWS Credentials SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_blob_in_folder [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.can_get_correct_error_s3 [SKIP] Requires Amazon AWS Credentials SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.upload_archive_multipart [SKIP] Requires Amazon AWS Credentials SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_blob [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.can_get_correct_error_glacier [SKIP] Requires Amazon AWS Credentials SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_object_multipart [SKIP] Requires Amazon AWS Credentials SlowTests.Server.Documents.PeriodicBackup.RavenDB_2181.put_object [SKIP] Requires Amazon AWS Credentials SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_70MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_64MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_256MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_100MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_500MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_765MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_500MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.can_get_container_not_found [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_765MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_64MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_256MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_70MB [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.can_get_and_delete_container [SKIP] Azure Storage Emulator is not installed SlowTests.Server.Documents.PeriodicBackup.RavenDB_4163.put_blob_into_folder_100MB [SKIP] Azure Storage Emulator is not installed SlowTests.MailingList.ScriptedPatchBug.Test [SKIP] Missing feature: Tasks (operations) and their results SlowTests.Issues.RavenDB_3726.Test [SKIP] RavenDB-6283 SlowTests.Server.Documents.SqlMigration.RecursiveMigrationTest.CanEmbedOnParent(provider: MsSQL) [FAIL] System.InvalidOperationException : Use a valid connection Stack Trace: D:\ravendb\test\SlowTests\Server\Documents\ETL\SQL\SqlEtlTests.cs(73,0): at SlowTests.Server.Documents.ETL.SQL.SqlEtlTests.<>c.<.cctor>b__32_0() at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode) at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor) at System.Lazy`1.CreateValue() D:\ravendb\test\SlowTests\Server\Documents\SqlMigration\SqlAwareTestBase.cs(113,0): at SlowTests.Server.Documents.SqlMigration.SqlAwareTestBase.WithMsSqlDatabase(String& connectionString, String& databaseName, String dataSet, Boolean includeData) D:\ravendb\test\SlowTests\Server\Documents\SqlMigration\SqlAwareTestBase.cs(104,0): at SlowTests.Server.Documents.SqlMigration.SqlAwareTestBase.WithSqlDatabase(MigrationProvider provider, String& connectionString, String& schemaName, String dataSet, Boolean includeData) D:\ravendb\test\SlowTests\Server\Documents\SqlMigration\RecursiveMigrationTest.cs(119,0): at SlowTests.Server.Documents.SqlMigration.RecursiveMigrationTest.CanEmbedOnParent(MigrationProvider provider) --- End of stack trace from previous location where exception was thrown --- SlowTests.Server.Documents.SqlMigration.RecursiveMigrationTest.CanEmbedOnParent(provider: MySQL) [SKIP] Test requires MySQL database SlowTests.Voron.Bugs.RavenDB_6971.Applying_new_diff_requires_to_zero_destination_bytes_first [FAIL] Raven.Client.Exceptions.Database.DatabaseLoadTimeoutException : Raven.Client.Exceptions.Database.DatabaseLoadTimeoutException: Database Applying_new_diff_requires_to_zero_destination_bytes_first_42 after 00:01:00 is still loading, try again later. Database initialization log: [Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Starting database initialization [Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Creating db.lock file [Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Initializing NotificationCenter [Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Initializing DocumentStorage [Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Starting Recovery [Load Database] 7/11/2018 7:41:01 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Recovering journal 6 (upto last journal 7) [Load Database] 7/11/2018 7:46:27 PM :: Database 'Applying_new_diff_requires_to_zero_destination_bytes_first_42' : Recovering journal 7 (upto last journal 7) at Raven.Server.Routing.RouteInformation.ThrowDatabaseLoadTimeoutWithLog(StringSegment databaseName, TimeSpan timeout, String log) in D:\ravendb\src\Raven.Server\Routing\RouteInformation.cs:line 143 at Raven.Server.Routing.RouteInformation.UnlikelyWaitForDatabaseToLoad(RequestHandlerContext context, Task`1 database, DatabasesLandlord databasesLandlord, StringSegment databaseName) in D:\ravendb\src\Raven.Server\Routing\RouteInformation.cs:line 122 at Raven.Server.Routing.RouteInformation.WaitForDb(Task databaseLoading) in D:\ravendb\src\Raven.Server\Routing\RouteInformation.cs:line 162 at Raven.Server.Routing.RequestRouter.HandlePath(HttpContext context, String method, String path) in D:\ravendb\src\Raven.Server\Routing\RequestRouter.cs:line 73 at System.Threading.Tasks.ValueTask`1.get_Result() at Raven.Server.RavenServerStartup.RequestHandler(HttpContext context) in D:\ravendb\src\Raven.Server\RavenServerStartup.cs:line 162 Response.StatusCode - ServiceUnavailable Stack Trace: D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(839,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request) D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(795,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(828,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) D:\ravendb\src\Raven.Client\Documents\Operations\OperationExecutor.Operations.cs(22,0): at Raven.Client.Documents.Operations.OperationExecutor.SendAsync(IOperation`1 operation, SessionInfo sessionInfo, CancellationToken token) D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(104,0): at Raven.Client.Util.AsyncHelpers.<>c__DisplayClass3_0`1.<b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(124,0): at Raven.Client.Util.AsyncHelpers.RunSync[T](Func`1 task) D:\ravendb\src\Raven.Client\Documents\Operations\OperationExecutor.Operations.cs(13,0): at Raven.Client.Documents.Operations.OperationExecutor.Send(IOperation`1 operation, SessionInfo sessionInfo) D:\ravendb\test\SlowTests\Voron\Bugs\RavenDB_6971.cs(75,0): at SlowTests.Voron.Bugs.RavenDB_6971.Applying_new_diff_requires_to_zero_destination_bytes_first() SlowTests.Server.Documents.ETL.RavenDB_8866.CanResetEtl [FAIL] Raven.Client.Exceptions.Commercial.LicenseLimitException : Raven.Client.Exceptions.Commercial.LicenseLimitException: Your current license doesn't include the RavenDB ETL feature at Raven.Server.Commercial.LicenseManager.AssertCanAddRavenEtl() in D:\ravendb\src\Raven.Server\Commercial\LicenseManager.cs:line 1269 at Raven.Server.Web.System.OngoingTasksHandler.AssertCanAddOrUpdateEtl(String databaseName, BlittableJsonReaderObject& etlConfiguration, JsonOperationContext context) in D:\ravendb\src\Raven.Server\Web\System\OngoingTasksHandler.cs:line 719 at Raven.Server.Documents.DatabaseRequestHandler.DatabaseConfigurations(Func`4 setupConfigurationFunc, String debug, RefAction beforeSetupConfiguration, Action`3 fillJson, HttpStatusCode statusCode) in D:\ravendb\src\Raven.Server\Documents\DatabaseRequestHandler.cs:line 64 at Raven.Server.Web.System.OngoingTasksHandler.AddEtl() in D:\ravendb\src\Raven.Server\Web\System\OngoingTasksHandler.cs:line 685 at Raven.Server.Routing.RequestRouter.HandlePath(HttpContext context, String method, String path) in D:\ravendb\src\Raven.Server\Routing\RequestRouter.cs:line 123 at System.Threading.Tasks.ValueTask`1.get_Result() at Raven.Server.RavenServerStartup.RequestHandler(HttpContext context) in D:\ravendb\src\Raven.Server\RavenServerStartup.cs:line 162 State for A-term1: Passive=>LeaderElect at 2018-07-11T19:34:45.5363246Z because I'm the only one in the cluster, so I'm the leader Stack Trace: D:\ravendb\src\Raven.Client\Exceptions\ExceptionDispatcher.cs(117,0): at Raven.Client.Exceptions.ExceptionDispatcher.Throw(JsonOperationContext context, HttpResponseMessage response, Action`1 additionalErrorInfo) D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(1036,0): at Raven.Client.Http.RequestExecutor.HandleUnsuccessfulResponse[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, HttpRequestMessage request, HttpResponseMessage response, String url, SessionInfo sessionInfo, Boolean shouldRetry, CancellationToken token) D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(786,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) D:\ravendb\src\Raven.Client\Http\RequestExecutor.cs(828,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) D:\ravendb\src\Raven.Client\Documents\Operations\MaintenanceOperationExecutor.cs(61,0): at Raven.Client.Documents.Operations.MaintenanceOperationExecutor.SendAsync[TResult](IMaintenanceOperation`1 operation, CancellationToken token) D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(104,0): at Raven.Client.Util.AsyncHelpers.<>c__DisplayClass3_0`1.<b__0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- D:\ravendb\src\Raven.Client\Util\AsyncHelpers.cs(124,0): at Raven.Client.Util.AsyncHelpers.RunSync[T](Func`1 task) D:\ravendb\src\Raven.Client\Documents\Operations\MaintenanceOperationExecutor.cs(43,0): at Raven.Client.Documents.Operations.MaintenanceOperationExecutor.Send[TResult](IMaintenanceOperation`1 operation) D:\ravendb\test\SlowTests\Server\Documents\ETL\EtlTestBase.cs(21,0): at SlowTests.Server.Documents.ETL.EtlTestBase.AddEtl[T](DocumentStore src, EtlConfiguration`1 configuration, T connectionString) D:\ravendb\test\SlowTests\Server\Documents\ETL\RavenDB_8866.cs(43,0): at SlowTests.Server.Documents.ETL.RavenDB_8866.CanResetEtl() SlowTests.Server.Documents.ETL.ConnectionStringTests.CanGetConnectionStringByName [FAIL] System.InvalidOperationException : Use a valid connection Stack Trace: D:\ravendb\test\SlowTests\Server\Documents\ETL\SQL\SqlEtlTests.cs(73,0): at SlowTests.Server.Documents.ETL.SQL.SqlEtlTests.<>c.<.cctor>b__32_0() at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode) --- End of stack trace from previous location where exception was thrown --- at System.Lazy`1.CreateValue() D:\ravendb\test\SlowTests\Server\Documents\ETL\SQL\SqlEtlTests.cs(1132,0): at SlowTests.Server.Documents.ETL.SQL.SqlEtlTests.GetConnectionString(DocumentStore store) D:\ravendb\test\SlowTests\Server\Documents\ETL\ConnectionStringTests.cs(171,0): at SlowTests.Server.Documents.ETL.ConnectionStringTests.CanGetConnectionStringByName()
ppekrol commented 6 years ago

@rmkerr

This is expected. There are over 3500 tests there and it takes around 30-40min for them to complete.

I suggest running dotnet xunit -c Release or dotnet xunit -c Release -verbose if you want to see which test is currently running.

Some of them might fail due to the missing license, or can be skipped because of missing SQL Server, but you can safely ignore that fact.

ppekrol commented 6 years ago

@rmkerr

I've pushed 1 additional commit to the branch https://github.com/ppekrol/ravendb/commits/v4.1-temp4 that disables test parallelization.

After that I've ran in test/SlowTests:

dotnet xunit -c Release -verbose > tests.txt

Following two tests have failed with the exception that we are trying to trace:

    SlowTests.Tests.Sorting.AlphaNumericSorting.random_words_using_document_query_async(seed: 1472672940)
Request: http://127.0.0.1:54525/cluster/topology. Type: GET
Request: http://127.0.0.1:54525/admin/databases?name=random_words_using_document_query_async_1557&replicationFactor=1. Type: PUT
Request: http://127.0.0.1:54525/rachis/waitfor?index=6223. Type: GET
Request: http://127.0.0.1:54525/topology?name=random_words_using_document_query_async_1557. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=0&lastRangeAt=0001-01-01T00:00:00.0000000&identityPartsSeparator=/&lastMax=0. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/configuration/client. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/operations/next-operation-id. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/bulk_insert?id=1. Type: POST
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=32&lastRangeAt=2018-07-11T21:13:52.7935007Z&identityPartsSeparator=/&lastMax=32. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=64&lastRangeAt=2018-07-11T21:13:52.8037418Z&identityPartsSeparator=/&lastMax=96. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=128&lastRangeAt=2018-07-11T21:13:52.8135658Z&identityPartsSeparator=/&lastMax=224. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=256&lastRangeAt=2018-07-11T21:13:52.8216998Z&identityPartsSeparator=/&lastMax=480. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/next?tag=tracks&lastBatchSize=512&lastRangeAt=2018-07-11T21:13:52.8433858Z&identityPartsSeparator=/&lastMax=992. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/admin/indexes. Type: PUT
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/queries?queryHash=1369170765125549859. Type: POST
Request: http://127.0.0.1:54525/admin/databases. Type: DELETE
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats?failure=check. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/stats?failure=check. Type: GET
Request: http://127.0.0.1:54525/databases/random_words_using_document_query_async_1557/hilo/return?tag=tracks&end=2016&last=1024. Type: PUT
      Raven.Client.Exceptions.RavenException : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
         --- End of inner exception stack trace ---
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
         at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs:line 738
      Response.StatusCode - InternalServerError
      Stack Trace:
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs(835,0): at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs(756,0): at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs(847,0): at Raven.Client.Documents.Session.AsyncDocumentQuery`1.ExecuteActualQueryAsync(CancellationToken token)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs(839,0): at Raven.Client.Documents.Session.AsyncDocumentQuery`1.InitAsync(CancellationToken token)
        F:\Workspaces\HR\ravendb_2\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs(797,0): at Raven.Client.Documents.Session.AsyncDocumentQuery`1.ExecuteQueryOperation(Nullable`1 take, CancellationToken token)
        F:\Workspaces\HR\ravendb_2\test\SlowTests\Tests\Sorting\AlphaNumericSorting.cs(454,0): at SlowTests.Tests.Sorting.AlphaNumericSorting.random_words_using_document_query_async(Int32 seed)
        --- End of stack trace from previous location where exception was thrown ---
    SlowTests.Server.Documents.ModifyExistingDocument.ShouldThrowIfChangeCollection
Request: http://127.0.0.1:54525/cluster/topology. Type: GET
Request: http://127.0.0.1:54525/admin/databases?name=ShouldThrowIfChangeCollection_1623&replicationFactor=1. Type: PUT
Request: http://127.0.0.1:54525/rachis/waitfor?index=6497. Type: GET
Request: http://127.0.0.1:54525/topology?name=ShouldThrowIfChangeCollection_1623. Type: GET
Request: http://127.0.0.1:54525/databases/ShouldThrowIfChangeCollection_1623/docs?id=users%2F1. Type: PUT
Request: http://127.0.0.1:54525/databases/ShouldThrowIfChangeCollection_1623/configuration/client. Type: GET
Request: http://127.0.0.1:54525/databases/ShouldThrowIfChangeCollection_1623/docs?id=users%2F1. Type: PUT
Request: http://127.0.0.1:54525/admin/databases. Type: DELETE
      Assert.Contains() Failure
      Not found: Changing 'users/1' from 'Users' to 'UserAddresses' via update is not supported.
      Delete it and recreate the document users/1.
      In value:  System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
         --- End of inner exception stack trace ---
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
         at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in F:\Workspaces\HR\ravendb_2\src\Raven.Client\Http\RequestExecutor.cs:line 738
      Response.StatusCode - InternalServerError
      Stack Trace:
           at SlowTests.Server.Documents.ModifyExistingDocument.ShouldThrowIfChangeCollection()
        --- End of stack trace from previous location where exception was thrown ---

Log is available here: https://drive.google.com/file/d/1syhM_yR2KwtWvzp7LGABag5_duU8cKLZ/view?usp=sharing

rmkerr commented 6 years ago

Thanks for the additional info. I'll try running the tests a few times to see if I hit the same error. You mentioned I should expect some failures without a license -- does this look about right?

=== TEST EXECUTION SUMMARY ===
   SlowTests  Total: 3349, Errors: 0, Failed: 518, Skipped: 96, Time: 113039.155s
rmkerr commented 6 years ago

I've been able to collect some interesting logs while running the test suite. The logging is missing some info that we added in 2.1, but it's still informative.

I initially thought this issue might be caused by a closed connection somehow ending up back in the connection pool, but the logs indicate that is not the case. The socket appears to be healthy until ~800ms after the request is initiated, at which point it gets closed for reasons that are still unclear.

I don't have any other conclusions yet, but there is a lot of data in the logs that I haven't been able to look at in detail yet. I'll update here if I find anything interesting.

Selected Logs: | | | | | |-------------------------------------------|--------------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Event Name                                |  Time MSec | Process Name | Rest   | | HandlerMessage | "204,123.38" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" handlerId=""9288003"" workerId=""5429267"" requestId=""46227566"" memberName=""SendAsyncCore"" message=""HttpConnection(HttpConnectionPoolhttps://127.0.0.1:63912): Sending request: Method: PUT RequestUri: 'https://127.0.0.1:63912/admin/databases?name=AskServerForClientCertificate_138&replicationFactor=1' Version: 2.0 Content: Raven.Client.Json.BlittableJsonContent Headers: { Raven-Client-Version: 4.1.0.41 Topology-Etag: ""0"" Transfer-Encoding: chunked Accept-Encoding: gzip Accept-Encoding: deflate Content-Encoding: gzip }"" ActivityID=""/#3076/1/3710/""" | | HandlerMessage | "204,123.49" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" handlerId=""9288003"" workerId=""5429267"" requestId=""46227566"" memberName=""WriteToStreamAsync"" message=""HttpConnection(HttpConnectionPoolhttps://127.0.0.1:63912): Writing 747 bytes."" ActivityID=""/#3076/1/3710/""" | | Enter        | "204,123.50" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#34540026"" memberName=""SendAsync"" parameters=""(AwaitableSocketAsyncEventArgs#7711490)"" ActivityID=""/#3076/1/3710/""" | | DumpBuffer   | "204,123.53" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""AwaitableSocketAsyncEventArgs#7711490"" memberName=""LogBuffer"" buffer=""17030303030000000000000003162400..."" ActivityID=""/#3076/1/3710/""" | | Exit         | "204,123.53" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#34540026"" memberName=""SendAsync"" result=""False"" ActivityID=""/#3076/1/3710/""" | | Exit            | "204,123.66" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""HttpClient#55553870"" memberName=""SendAsync"" result=""System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[System.Net.Http.HttpResponseMessageSystem.Net.Http.DiagnosticsHandler+d__2]""" | | Enter        | "204,944.19" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""ReceiveAsync"" parameters=""(AwaitableSocketAsyncEventArgs#11480187)"" ActivityID=""/#3076/1/3722/1/""" | | Exit         | "204,944.23" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""ReceiveAsync"" result=""True"" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,944.30" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""SendAsync"" parameters=""(AwaitableSocketAsyncEventArgs#60314028)"" ActivityID=""/#3076/1/3722/1/""" | | Exit         | "204,944.40" | dotnet (3076) | "ThreadID=""7084"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""SendAsync"" result=""False"" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,949.57" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""InternalShutdown"" parameters=""how:Both"" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Close"" parameters=""(-1)"" ActivityID=""/#3076/1/3722/1/""" | | Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Close"" message=""timeout = -1"" ActivityID=""/#3076/1/3722/1/""" | | Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" message=""timeout = -1"" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" parameters="""" ActivityID=""/#3076/1/3722/1/""" | | Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" message=""disposing:true CleanedUp:False"" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" parameters="""" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""SetToDisconnected"" parameters="""" ActivityID=""/#3076/1/3722/1/""" | | Info         | "204,949.64" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" message=""Calling _handle.CloseAsIs()"" ActivityID=""/#3076/1/3722/1/""" | | Enter        | "204,949.66" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""UpdateStatusAfterSocketError"" parameters=""""" | | ErrorMessage | "204,949.67" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""UpdateStatusAfterSocketError"" message=""errorCode:OperationAborted""" | | Exit         | "204,950.00" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Dispose"" result="""" ActivityID=""/#3076/1/3722/1/""" | | Exit         | "204,950.00" | dotnet (3076) | "ThreadID=""7388"" ProcessorNumber=""0"" thisOrContextObject=""Socket#7946953"" memberName=""Close"" result=""-1"" ActivityID=""/#3076/1/3722/1/""" | | HandlerMessage | "204,954.88" | dotnet (3076) | "ThreadID=""9076"" ProcessorNumber=""0"" handlerId=""66602504"" workerId=""65572011"" requestId=""65860881"" memberName=""SendAsyncCore"" message=""HttpConnection(HttpConnectionPoolhttps://127.0.0.1:63918): Error sending request: System.IO.IOException: Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request --- End of inner exception stack trace --- at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error) at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token) at System.Net.Security.SslStreamInternal.g__InternalFillBufferAsync|38_0[TReadAdapter](TReadAdapter adap ValueTask`1 task Int32 min Int32 initial) at System.Net.Security.SslStreamInternal.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter Memory`1 buffer) at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request CancellationToken cancellationToken)"" ActivityID=""/#3076/1/3722/1/""" |
stephentoub commented 6 years ago

@rmkerr, thanks. Are you able to get a call stack that shows where the close on the socket is coming from? My hypothesis has been that we're putting a connection back into the pool while there's still a CancellationToken from a previous request registered to close it; the connection gets put back into the pool, another request takes it out to use it, and then the previous request's CancellationToken gets a cancellation signal and cancels the wrong connection. But I've been unable to find where in the code that could be happening; I still think it's the most likely explanation, but it seems we've handled this correctly, so getting a stack for the closing would be super useful.

rmkerr commented 6 years ago

I can't currently see the call stack, but here's what I was able to piece together from the logs:

HttpConnection.RegisterCancellation HttpConnection.Dispose NetworkStream.Dispose Socket.InternalShutdown Socket.Close Socket.Dispose

That seems consistent with your theory. If it would be helpful I can try to capture another trace with more info. I can also share the existing trace if you'd like to take a look.

halter73 commented 6 years ago

@stephentoub @rmkerr I'm wondering if this might be some sort of issue with managed Sockets instead of SocketsHttpHandler. Starting in early July, we started noticing flakiness in a number of Kestrel tests on Windows, and failing with the same exception. Specifically "System.IO.IOException : Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request."

We see the same failure when testing against both Kestrel's libuv and Socket-based transports which points to this being a client issue. I have not seen this failure when running tests on the desktop framework or on other OSs.

The main difference from the original issue is there's no HttpClient involved. Since Kestrel tests involve a bunch of invalid requests and weird client behavior, Kestrel uses it's own Socket-based TestConnection for a large number of its functional tests. I can file a new issue if that's helpful, but I think the Kestrel test failures likely have the same root cause as the ravendb test failures.

There are a number of Kestrel functional tests that fail with this exception, but the test that reproduces this failure most reliably on my Windows 10 workstation is ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders. This test verifies that Kestrel doesn't incorrectly close the connection due to a response body data rate timeout. Our logs indicate that there is indeed no timeout when this test fails.

Despite being the most reliable way to reproduce the failure, it's still fairly hard to reproduce. I haven't seen the issue when running the test in isolation. The best way to repro the failure is to clone the Kestrel repo, navigate to test\Kestrel.Transport.Sockets.FunctionalTests and run dotnet test -f netcoreapp2.2. The entire test project takes aproximately 40-50 seconds, but you might need to run it several times to see the failure. The output for the repro looks like the following:

G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.Sockets.FunctionalTests [release/2.2 ≡ +0 ~3 -0 !]> dotnet test -f netcoreapp2.2
Microsoft (R) Build Engine version 15.7.179.6572 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 54.82 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Core\Kestrel.Core.csproj.
  Restore completed in 54.64 ms for G:\dev\aspnet\KestrelHttpServer\src\Connections.Abstractions\Connections.Abstractions.csproj.
  Restore completed in 55.6 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel\Kestrel.csproj.
  Restore completed in 54.8 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Https\Kestrel.Https.csproj.
  Restore completed in 54.65 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Transport.Abstractions\Kestrel.Transport.Abstractions.csproj.
  Restore completed in 54.64 ms for G:\dev\aspnet\KestrelHttpServer\src\Kestrel.Transport.Sockets\Kestrel.Transport.Sockets.csproj.
  Restore completed in 64.34 ms for G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.Sockets.FunctionalTests\Kestrel.Transport.Sockets.FunctionalTests.csproj.
Build started, please wait...
Build completed.

Test run for G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.Sockets.FunctionalTests\bin\Debug\netcoreapp2.2\Sockets.FunctionalTests.dll(.NETCoreApp,Version=v2.2)
Microsoft (R) Test Execution Command Line Tool Version 15.7.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
[xUnit.net 00:00:22.14]     RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:True, Sends a dynamic table size update at the end of header block) [SKIP]
Skipped  RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:True, Sends a dynamic table size update at the end of header block)
[xUnit.net 00:00:25.20]     RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:True, closed: Sends a DATA frame after sending RST_STREAM frame) [SKIP]
Skipped  RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:True, closed: Sends a DATA frame after sending RST_STREAM frame)
[xUnit.net 00:00:26.05]     RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:False, closed: Sends a DATA frame after sending RST_STREAM frame) [SKIP]
Skipped  RunIndividualTestCase(testCase: http2/5.1/8, HTTPS:False, closed: Sends a DATA frame after sending RST_STREAM frame)
[xUnit.net 00:00:33.79]     RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:False, Sends a dynamic table size update at the end of header block) [SKIP]
Skipped  RunIndividualTestCase(testCase: hpack/4.2/1, HTTPS:False, Sends a dynamic table size update at the end of header block)
[xUnit.net 00:00:34.99]     ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders [FAIL]
Failed   ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders
Error Message:
 System.IO.IOException : Unable to read data from the transport connection: The I/O operation has been aborted because of either a thread exit or an application request..
---- System.Net.Sockets.SocketException : The I/O operation has been aborted because of either a thread exit or an application request.
Stack Trace:
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests.AssertStreamCompleted(Stream stream, Int64 minimumBytes, Int32 targetBytesPerSecond) in G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.FunctionalTests\ResponseTests.cs:line 901
   at Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests.ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders() in G:\dev\aspnet\KestrelHttpServer\test\Kestrel.Transport.FunctionalTests\ResponseTests.cs:line 856
--- End of stack trace from previous location where exception was thrown ---
----- Inner Stack Trace -----

Standard Output Messages:
 | [0.001s] TestLifetime Information: Starting test ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders at 2018-08-14T02:10:59
 | [0.006s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Hosting starting
 | [0.007s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Hosting started
 | [0.007s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Loaded hosting startup assembly Sockets.FunctionalTests, Version=2.2.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60
 | [0.008s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Listening on 5000
 | [0.008s] Microsoft.AspNetCore.Server.Kestrel Debug: Listening on 5000
 | [0.009s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Connection id "0HLG1HCD3P866" started.
 | [0.009s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLG1HCD3P866" started.
 | [0.010s] Microsoft.AspNetCore.Hosting.Internal.WebHost Information: Request starting HTTP/1.1 GET http:///
 | [0.957s] Microsoft.AspNetCore.Hosting.Internal.WebHost Debug: Hosting shutdown
 | [0.970s] Microsoft.AspNetCore.Hosting.Internal.WebHost Information: Request finished in 960.4081ms 200
 | [0.970s] Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets Debug: Connection id "0HLG1HCD3P866" reset.
 | [0.970s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Connection id "0HLG1HCD3P866" disconnecting.
 | [0.970s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLG1HCD3P866" disconnecting.
 | [0.971s] Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets Debug: Connection id "0HLG1HCD3P866" sending FIN.
 | [0.974s] Microsoft.AspNetCore.Server.Kestrel.FunctionalTests.ResponseTests Debug: Connection id "0HLG1HCD3P866" stopped.
 | [0.974s] Microsoft.AspNetCore.Server.Kestrel Debug: Connection id "0HLG1HCD3P866" stopped.
 | [0.977s] TestLifetime Information: Finished test ConnectionNotClosedWhenClientSatisfiesMinimumDataRateGivenLargeResponseHeaders in 0.9759387s

Total tests: 363. Passed: 358. Failed: 1. Skipped: 4.
Test Run Failed.
Test execution time: 36.2769 Seconds

I also managed to collect a wireshark network trace which shows that the connection was indeed reset by the client despite there being no attempt in the managed test code to abort the connection. I can provide the pcap file containing the network trace upon request.

image

The above repro was run with version 2.2.0-preview1-26618-02 of the core runtime.

geoffkizer commented 6 years ago

(copying notes from email)

I’ve done some simple tracing on this but nothing obvious pops out. I think we will need a more involved investigation here, unfortunately.

Some thoughts on how to make progress here:

The test suite is doing lots of connection resets, and none of them look suspicious in and of themselves. To narrow this down I think we should try just running this single test, but run it many times simultaneously, to try to induce the repro. If we can, then we can be pretty sure that any connection reset that happens is unexpected and we can investigate further from there.

Alternatively, we could try to reproduce this in standalone code, which might help reduce the variables here but is more work and may be hard to reproduce.

rmkerr commented 6 years ago

Thanks for the advice Geoff. I'll try running the failing test in a loop to see if I can hit it more consistently.

geoffkizer commented 6 years ago

If running in a loop doesn't do it, perhaps running multiple concurrent instances will help.

rmkerr commented 6 years ago

Interestingly, I haven't even been able to get a failure from running the whole Kestrel.Transport.Sockets.FunctionalTests suite. I'm going to see if I can get things working on a separate machine, as it probably isn't worth diagnosing what is different about my current setup.

halter73 commented 6 years ago

@rmkerr One think that I think improves the reliability of the repro in the Kestrel.Transport.Sockets.FunctionalTests suite is running the H2SpecTests in parallel. These tests won't be enabled unless you are running Windows 8.1 (or the server equivalent) or higher.

It should be obvious whether or not the H2Spec tests are being run because you'll see about 359 passed tests in that project if they're being run and only about 71 passed tests if they're not.

rmkerr commented 6 years ago

I'm finally able to repro this consistently. Thanks for the help :)

I'm going to start with collecting System.Net logs, and see what I can work out from there.

geoffkizer commented 6 years ago

For timing issues like this, I find it's often helpful to run some CPU intensive operation in the background, like a corefx build or HttpClient test suite.

daharmon commented 6 years ago

For what its worth, we had a production site fail with the same error today on await httpClient.SendAsync(request), and once it failed, it snowballed (error showed up on every httprequest) until I restarted the app. I'm interested to see it resolved.

rmkerr commented 6 years ago

That's a really interesting datapoint. Were the requests all to the same endpoint?

ppekrol commented 6 years ago

@rmkerr Have you seen https://github.com/aspnet/KestrelHttpServer/issues/2789 ?

This might be related and there is a very small repro with 'UseLibuv' on kestrel-side as a workaround. It suggests that the issue is in managed sockets. Maybe it will be easier to trace it there and see if it might help with this issue? Just an idea.

daharmon commented 6 years ago

To the same host but not the same path

On Sep 7, 2018, at 5:05 PM, Max Kerr notifications@github.com<mailto:notifications@github.com> wrote:

That's a really interesting datapoint. Were the requests all to the same endpoint?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/dotnet/corefx/issues/30691#issuecomment-419577879, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AHCPi-zs1qA9ELygOCyntc_tDEk2xRwpks5uYu1ogaJpZM4U5dWk.

geoffkizer commented 6 years ago

I don't see how https://github.com/aspnet/KestrelHttpServer/issues/2789 is related. Maybe I am missing something.

ppekrol commented 6 years ago

@geoffkizer

My thinking was that the issue here could be 'easily' (had to run 1h test suite) reproduced by me by spawning an external process with Kestrel and executing few HTTP requests from the parent application. At the end after killing external process, all connections (not related ones, to different ports) were poisoned. Check here: https://github.com/dotnet/corefx/issues/30691#issuecomment-404318426

And the https://github.com/aspnet/KestrelHttpServer/issues/2789 is doing similar thing, but the error is different, yet from the same category of errors. Was thinking that they could be related and have a similar root cause. I might be wrong, this is a pure speculation.

rmkerr commented 6 years ago

It's definitely possible that the issues are related, but looks like aspnet/KestrelHttpServer#2789 manifests very differently than the issue tracked here. If I'm not able to get any useful data from the current repros it may be worth revisiting though.

I've been trying to figure out whether the Kestrel test failure repros on its own while the CPU is under load, or if it also requires some interaction with one of the other test classes that run in parallel. I don't have an answer yet, but I have time to focus on this issue today and will report back when I know more.

rmkerr commented 6 years ago

I was finally able to collect some useful System.Net traces using perfview. So far I've only been able to confirm the observations made earlier in this thread: It's a socket abort, but the socket is not aborted in the test code, and the socket has not been open long enough for it to be a timeout.

I'm going to dig into the logs in more depth to see if I can gather any new info, but may end up having to enable some additional logging in order to learn anything new.

JohanSpannare commented 5 years ago

We are hitting this bug as well, errors occurs randomly after lots of concurrent tasks and requests at PutAsync on the same client. Below is some exception stacks traces.

Have tries to change settings on SocketsHttpHandler without success.

            var handler = new SocketsHttpHandler
            {
                AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
                PooledConnectionLifetime = TimeSpan.FromMinutes(1),
                ConnectTimeout = TimeSpan.FromSeconds(10),
            };

After this error occures, the next request on the httpclient hangs on GetProxyForUrl. Visual Studio with process dump attached showing stack and hanging threads/tasks

An theory: This is happening in 3 different processes at the same time on the same server. We run the application inside 3 containers, and all gets the same error almost at the same time. Can this error occur if we hit the maximum number of allowed socket on the host OS?

Example 1

2:45:40 PM/00:00:12.7154296 System.OperationCanceledException: The operation was canceled.
   at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Sing.Price.Actor.Gateways.StoriesGateway.PutWorkAsync(ExpositionWork expositionWork, Price price) in D:\BA\1\_work\833\s\src\Sing.Price.Actor\Price.Actor\Gateways\StoriesGateway.cs:line 41
   at Sing.Price.Actor.Application.Processor.Process(ExpositionWork story) in D:\BA\1\_work\833\s\src\Sing.Price.Actor.Application\Processor.cs:line 65
   at Sing.Price.Actor.Program.<>c__DisplayClass12_0.<<Main>b__3>d.MoveNext() in D:\BA\1\_work\833\s\src\Sing.Price.Actor\Price.Actor\Program.cs:line 121

Example 2

4:48:09 AM/00:00:18.6366218 System.Threading.Tasks.TaskCanceledException: The operation was canceled. ---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.NetworkStream'.
   at System.Net.Sockets.NetworkStream.WriteAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.WriteToStreamAsync(ReadOnlyMemory`1 source)
   at System.Net.Http.HttpConnection.FlushAsync()
   at System.Net.Http.HttpConnection.SendRequestContentAsync(HttpRequestMessage request, HttpContentWriteStream stream, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Sing.Price.Actor.Gateways.StoriesGateway.PutWorkAsync(ExpositionWork expositionWork, Price price) in D:\BA\2\_work\1117\s\src\Sing.Price.Actor\Price.Actor\Gateways\StoriesGateway.cs:line 41
   at Sing.Price.Actor.Application.Processor.Process(ExpositionWork story) in D:\BA\2\_work\1117\s\src\Sing.Price.Actor.Application\Processor.cs:line 60
   at Sing.Price.Actor.Program.<>c__DisplayClass11_0.<<Main>b__3>d.MoveNext() in D:\BA\2\_work\1117\s\src\Sing.Price.Actor\Price.Actor\Program.cs:line 128

Example 3

4:48:11 AM/00:00:10.0411834 System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Sing.Price.Actor.Gateways.StoriesGateway.GetWorkAsync() in D:\BA\2\_work\1117\s\src\Sing.Price.Actor\Price.Actor\Gateways\StoriesGateway.cs:line 26
   at Sing.Price.Actor.Program.<>c__DisplayClass11_0.<<Main>b__3>d.MoveNext() in D:\BA\2\_work\1117\s\src\Sing.Price.Actor\Price.Actor\Program.cs:line 127