dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.91k stars 4.63k forks source link

[HttpListener] Hang of System.Net.HttpListener.Tests in CI #21940

Open danmoseley opened 7 years ago

danmoseley commented 7 years ago

https://ci.dot.net/job/dotnet_corefx/job/master/job/ubuntu14.04_debug_prtest/8257/consoleFull#130661499679494335-f7bd-47d0-8771-8661e00c2db2

22:05:43   Finished running tests. End time=05:05:43. Return value was 0
23:44:38 
Build timed out (after 120 minutes). Marking the build as aborted.
23:44:38 
Build was aborted
23:44:38 [xUnit] [INFO] - Starting to record.
23:44:38 [xUnit] [INFO] - Processing xUnit.Net-v2 (default)
23:44:38 
MSBUILD : error MSB4166: Child node "2" exited prematurely. Shutting down. Diagnostic information may be found in files in the temporary files directory named MSBuild_*.failure.txt.
23:44:38 [xUnit] [INFO] - [xUnit.Net-v2 (default)] - 155 test report file(s) were found with the pattern 'bin/**/testResults.xml' relative to '/mnt/j/workspace/dotnet_corefx/master/ubuntu14.04_debug_prtest' for the testing framework 'xUnit.Net-v2 (default)'.
23:44:38 /mnt/j/workspace/dotnet_corefx/master/ubuntu14.04_debug_prtest/Tools/tests.targets(345,5): warning MSB5021: "sh" and its child processes are being terminated in order to cancel the build. [/mnt/j/workspace/dotnet_corefx/master/ubuntu14.04_debug_prtest/src/System.Net.HttpListener/tests/System.Net.HttpListener.Tests.csproj]
23:44:39 Command execution failed with exit code 143.

Not actionable based on this information, but starting an issue.

danmoseley commented 7 years ago

fyi @hughbe @stephentoub note these are in master.

danmoseley commented 7 years ago

Debian & Ubuntu: https://mc.dot.net/#/user/mmitche/pr~2Fjenkins~2Fdotnet~2Fcorefx~2Fmaster~2F/test~2Ffunctional~2Fportable~2Fcli~2F/6c6eb50345329c470db608fef70710e52270c507/workItem/System.Net.HttpListener.Tests

karelz commented 7 years ago

Questions:

mellinoe commented 7 years ago

If we see this actively hung in any CI jobs, please let me know and I will try to debug it. Bonus points to anyone who logs into a machine and attaches a debugger before the process terminates 😄

danmoseley commented 7 years ago

All the time now. https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fportable~2Fcli~2F/build/20170525.01/workItem/System.Net.HttpListener.Tests/wilogs

stephentoub commented 7 years ago

Managed to catch one in CI. At least in the one I saw, it appears to be hanging here: https://github.com/dotnet/corefx/blob/bbb63de0aac837895542e1b126ef71b3ba1a51aa/src/System.Net.HttpListener/tests/HttpRequestStreamTests.cs#L68

@geoffkizer, looks like this might be the same hang you're investigating locally...

geoffkizer commented 7 years ago

Yep, that's exactly what I see locally.

danmoseley commented 7 years ago

https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fportable~2Fcli~2F/build/20170525.02/workItem/System.Net.HttpListener.Tests/wilogs

danmoseley commented 7 years ago

@stephentoub @geoffkizer does this mean if we disable Read_FullLengthAsynchronous_Success it will avoid this hang? If so could one of you please disable it? There is no need to go through CI for this, as it's so impactful I suggest hitting merge immediately.

stephentoub commented 7 years ago

does this mean if we disable Read_FullLengthAsynchronous_Success it will avoid this hang?

I don't think so; I think there are several (not sure how many) tests that hit the same underlying issue or issues. I don't believe it's a problem with a specific test, but rather with something in the implementation (or something systemic to many tests via code that's shared between them). It's possible we could find and disable enough tests to avoid the hangs.

@geoffkizer has been investigating...

stephentoub commented 7 years ago

But I'll see if I can temporarily disable enough to minimize the impact...

stephentoub commented 7 years ago

A bunch of tests have been temporarily disabled. Just search for 20246 to find them. Let's try to get to the bottom of this asap. Hopefully I disabled enough to prevent the hangs in CI.

danmoseley commented 7 years ago

Helix run failed again but I don't know whether it contained that change. https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170526.01/workItem/System.Net.HttpListener.Tests/wilogs

KristinXie1 commented 7 years ago

Failed again here: https://mc.dot.net/#/product/netcore/200/source/official~2Fcorefx~2Frelease~2F2.0.0~2F/type/test~2Ffunctional~2Fportable~2Fcli~2F/build/20170531.01/workItem/System.Net.HttpListener.Tests/wilogs

karelz commented 7 years ago

@mellinoe was there any progress on tracking down the root cause of the hangs?

mellinoe commented 7 years ago

@karelz I think @stephentoub had caught this (or a related test) under a debugger at one point. I was not able to successfully get a local repro despite running the tests in a loop for a long time.

stephentoub commented 7 years ago

@geoffkizer and I fixed the issues we knew about. It's possible there's another lurking.

KristinXie1 commented 7 years ago

Failed again: https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170601.05/workItem/System.Net.HttpListener.Tests/wilogs

danmoseley commented 7 years ago

Hung again. OSX. Last 3 were all OSX.

https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170601.06/workItem/System.Net.HttpListener.Tests/wilogs

stephentoub commented 7 years ago

Might be worth running the tests in a loop on macOS for a while and seeing if we can get something to hang.

karelz commented 7 years ago

@mellinoe can you please help? (if you don't have Mac, we could ask @wfurt to run it on his Mac ...)

mellinoe commented 7 years ago

I borrowed a MacBook and will try running the tests in a loop here, once I've got my environment sync'd up.

danmoseley commented 7 years ago

This is now the most impactful of all corefx test issues.

mellinoe commented 7 years ago

@stephentoub and I spent several hours debugging this earlier today, and we weren't able to identify the exact problem yet. We saw some strange things that made us doubt the validity of the info coming from lldb/SOS, and ended up stumped for now.

At the very least, the MacBook I'm using is able to reproduce the issue almost every single time I run the tests. So further investigations should be easy to do.

KristinXie1 commented 7 years ago

Failed again: https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170606.01/workItem/System.Net.HttpListener.Tests/wilogs

KristinXie1 commented 7 years ago

This issue is repro on Core Tests (2.0.0), detail: https://mc.dot.net/#/product/netcore/200/source/official~2Fcorefx~2Frelease~2F2.0.0~2F/type/test~2Ffunctional~2Fcli~2F/build/20170605.01/workItem/System.Net.HttpListener.Tests/wilogs

KristinXie1 commented 7 years ago

Failed again on Core Tests (Master) in build 20170607.01, detail: https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170607.01/workItem/System.Net.HttpListener.Tests/wilogs

KristinXie1 commented 7 years ago

Failed again, detail: https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170612.01/workItem/System.Net.HttpListener.Tests

Priya91 commented 7 years ago

@mellinoe Can you share the repro that you used for this.. Did you run the httplistener tests in loop or full test run? Were you able to identify the test that's hanging?

mellinoe commented 7 years ago

@Priya91 There are several tests that hang. It repros on your laptop every time, just doing a regular run of the test project by itself.

KristinXie1 commented 7 years ago

Failed again: https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170614.01/workItem/System.Net.HttpListener.Tests/wilogs

KristinXie1 commented 7 years ago

This issue is repro on Core Tests (2.0.0), detail: https://mc.dot.net/#/product/netcore/200/source/official~2Fcorefx~2Frelease~2F2.0.0~2F/type/test~2Ffunctional~2Fcli~2F/build/20170615.01/workItem/System.Net.HttpListener.Tests/wilogs

KristinXie1 commented 7 years ago

This issue is repro on OSX.1013, detail: https://mc.dot.net/#/product/netcore/master/source/official~2Fcorefx~2Fmaster~2F/type/test~2Ffunctional~2Fcli~2F/build/20170630.01/workItem/System.Net.HttpListener.Tests/wilogs

danmoseley commented 7 years ago

Need an update from @mellinoe who has been investigating this.

mellinoe commented 7 years ago

I've debugged through a few repros at this point, but I haven't identified a root cause. One repro was hung in this method, https://github.com/dotnet/corefx/blob/master/src/System.Net.HttpListener/src/System/Net/Managed/HttpListenerRequest.Managed.cs#L302, in what could potentially be an infinite loop. That's the closest we've gotten to the bottom of this.

danmoseley commented 7 years ago

Per @karelz moving to 2.0 • OSX only • Not clear it's a product bug. • HttpListener is compat API only (we do not plan to evolve it) • ASP.NET does not depend on it • If it is test bug, we can fix it in servicing branch. • If it is product bug we can service.

KristinXie1 commented 7 years ago

Failed again: https://mc.dot.net/#/product/netcore/200/source/official~2Fcorefx~2Frelease~2F2.0.0~2F/type/test~2Ffunctional~2Fcli~2F/build/20170717.01/workItem/System.Net.HttpListener.Tests/wilogs

2017-07-16 17:48:30,946: INFO: proc(54): run_and_log_output: Output:    System.Net.Tests.HttpListenerResponseHeadersTests.ProtocolVersion_SetInvalid_ThrowsArgumentNullException(major: 2, minor: 0) [FAIL]
2017-07-16 17:48:30,947: INFO: proc(54): run_and_log_output: Output:       System.Net.Sockets.SocketException : Too many open files in system
2017-07-16 17:48:30,950: INFO: proc(54): run_and_log_output: Output:       Stack Trace:
2017-07-16 17:48:30,952: INFO: proc(54): run_and_log_output: Output:    [FATAL] System.IO.FileLoadException
2017-07-16 17:48:30,953: INFO: proc(54): run_and_log_output: Output:       System.IO.FileLoadException : Could not load file or assembly 'System.Text.RegularExpressions, Version=4.1.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. A device attached to the system is not functioning.\n (Exception from HRESULT: 0x8007001F)
2017-07-16 17:48:35,471: INFO: proc(54): run_and_log_output: Output:    System.Net.Tests.HttpListenerResponseTests.CloseResponseEntity_SendMoreThanContentLength_ThrowsInvalidOperationException [SKIP]
2017-07-16 17:48:35,471: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: \"IsWindowsImplementationAndNotUap\"
2017-07-16 17:48:36,261: INFO: proc(54): run_and_log_output: Output:    System.Net.Tests.HttpListenerResponseTests.CloseResponseEntity_AllContentLengthAlreadySent_DoesNotSendEntity [SKIP]
2017-07-16 17:48:36,261: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: \"IsWindowsImplementationAndNotUap\"
2017-07-16 17:48:36,796: INFO: proc(54): run_and_log_output: Output:    System.Net.Tests.HttpListenerResponseTests.Abort_Invoke_ForciblyTerminatesConnection [SKIP]
2017-07-16 17:48:36,796: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: \"IsWindowsImplementationAndNotUap\"
2017-07-16 17:48:52,172: INFO: proc(54): run_and_log_output: Output:    System.Net.Tests.HttpResponseStreamTests.Write_TooMuch_ThrowsProtocolViolationException [SKIP]
2017-07-16 17:48:52,172: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: \"IsWindowsImplementationAndNotUap\"
2017-07-16 17:48:52,350: INFO: proc(54): run_and_log_output: Output:    System.Net.Tests.HttpResponseStreamTests.Write_TooLittleSynchronouslyAndClose_ThrowsInvalidOperationException [SKIP]
2017-07-16 17:48:52,350: INFO: proc(54): run_and_log_output: Output:       Condition(s) not met: \"IsWindowsImplementationAndNotUap\"
wfurt commented 6 years ago

I think the last association is incorrect: [UPDATE] The referenced comment was hidden as off-topic by @karelz

2017-07-16 17:48:30,947: INFO: proc(54): run_and_log_output: Output: System.Net.Sockets.SocketException : Too many open files in system

This seems to be clearly environmental issues. As far as I can tell, we have no seen this one in long time. I suggest to close it @karelz and we can open it back (or create new one) if it ever happens again.

karelz commented 6 years ago

Closing per @wfurt's analysis above.

stephentoub commented 4 years ago

There are two tests disabled against this issue: https://github.com/dotnet/runtime/blob/04f22268efb4434b7cd81e8ae2c3354be99fd4a7/src/libraries/System.Net.HttpListener/tests/HttpResponseStreamTests.cs#L380-L385 https://github.com/dotnet/runtime/blob/04f22268efb4434b7cd81e8ae2c3354be99fd4a7/src/libraries/System.Net.HttpListener/tests/HttpResponseStreamTests.cs#L419-L424