ahsonkhan closed this issue 3 years ago
This is happening during a GC STW when Mono is trying to stop a thread. Evidently `pthread_kill` is returning `EAGAIN` (11), which is an indication that the kernel's global real-time signal queue has overflowed. It's not clear whether this is a transient condition that we could retry. In mono/mono@e30184752b7047007352e7d384cc471875168a15 we added the ability to abort a preemptive suspend request (at which point GC STW will loop back around and try again), so we could try to recover from this `EAGAIN` on Linux.
What do you think @vargaz @lateralusX ?
Additional reference: here's how Boehm picks the suspend signal https://github.com/ivmai/bdwgc/blob/b44088c81ce1e864b4c1d05a523c0150beeb7b8b/include/private/gc_priv.h#L2635
On Windows we follow these semantics (I believe Mac does something similar): if suspend fails (or `GetThreadContext` fails), the thread is left running, `mono_threads_suspend_begin_async_suspend` returns FALSE, and the state machine is reset to a state reflecting that the thread runs as normal (fixed by mono/mono@e301847). On Linux, as you said, you currently get an error that takes down the process before any of that could happen (resetting the thread to the running state and returning FALSE to the caller of `mono_threads_suspend_begin_async_suspend`).

When STW is driving the suspend, returning FALSE marks that thread to be skipped. So even if the thread state machine correctly reflects the current state of the thread (preventing future asserts), I believe our STW implementation is not fully implemented to support such a scenario: STW mainly views those threads as currently shutting down or detaching from the runtime, not expecting them to continue running managed code.

The fix that resets the state machine on failure had a different use case: threads calling suspend/resume on other threads could trigger this failure, and the next GC would then trigger an assert in the state machine, even if the thread was currently running and could successfully be suspended by STW.
So if this is hit during normal STW, I believe the current machinery won't trigger a retry in STW when `mono_threads_suspend_begin_async_suspend` returns FALSE. If we are going to rely on that, it needs to be fixed in STW.
I know we had some sporadic failures calling `SuspendThread` that on some downstream Mono implementations were resolved with a retry loop (with a limited number of retries); if all retries were exhausted, it would still return FALSE and let the caller handle it. Maybe we could do something similar on Linux. There you also have a specific error code that you should retry on, but it still makes sense to have a limited number of retries.
I also believe we should look into fixing STW to better distinguish failures due to suspending a thread from threads shutting down, detaching from the runtime, etc., or we will always risk exiting a STW with running threads due to suspend failures.
Fails again in #33716
Different set of assemblies fail in three runs:
Failing in https://github.com/dotnet/runtime/pull/33816
Saw this in different tests upon rerun:
- Microsoft.Extensions.FileProviders.Physical.Tests
- System.Xml.XPath.Tests
- Microsoft.Extensions.DependencyInjection.Tests
- System.Xml.RW.ReaderSettings.Tests

https://github.com/dotnet/runtime/pull/33915
- Microsoft.CSharp.Tests
- System.Net.Ping.Functional.Tests
- System.Xml.RW.XmlReader.Tests

https://github.com/dotnet/runtime/pull/33917
- System.Xml.RW.XmlReader.Tests

https://github.com/dotnet/runtime/pull/33852
- System.Text.Encoding.Tests

Rolling build https://dev.azure.com/dnceng/public/_build/results?buildId=568429
- System.Net.Primitives.Functional.Tests
- System.Resources.ResourceManager.Tests
- System.Security.Cryptography.Algorithms.Tests
- System.Xml.XmlSchema.XmlSchemaValidatorApi.Tests
I just sampled the top failing builds for dotnet/runtime and all of them had the mono leg failing with this error. Seems like we need to have someone looking at this soon or disable the mono test leg. @marek-safar
@jaredpar, can you run your tool (or remind me how) to get the complete data on this?
All failing tests have console output with `mono_threads_pthread_kill: pthread_kill failed with error 11 - potential kernel OOM or signal queue overflow`
Here are the failures over the last 100 builds
Command: `runfo search-helix -d runtime -c 100 -pr -v "mono_threads_pthread_kill: pthread_kill failed with error 11"`
> Seems like we need to have someone looking at this soon or disable the mono test leg.
Could the default action be to disable the failing tests, followed by a draft PR re-enabling them that someone can be assigned to push fix commits to? That way only one PR would be red instead of all of them.
@lambdageek could you please take care of this?
Ok, we're trying to fix this with https://github.com/dotnet/runtime/pull/33966
The issue is: we're overflowing the kernel's queue of real-time (`SIGRTMIN`) signals.

How we're addressing the issue:
- Detect the `EAGAIN` error from `pthread_kill` and sleep and retry instead of failing. That will not be necessary with full coop. (But full coop requires an embedding API break, which is not something we're doing right now.)

Update: Another possibility is that the stop-the-world procedure is going haywire and just sending out a ton of `pthread_kill`s for some reason - I'm going to look over the console logs from Jared's comment to see if there's evidence of that.
Update 2: I was reading things about the signal queue wrong, I think - it's a per-thread queue. So more likely that we're flooding it somehow.
Closing as the issue looks like it's been addressed. Haven't seen this in recent runs.
@lambdageek it seems like this is happening again: https://github.com/dotnet/runtime/pull/34976 https://helix.dot.net/api/2019-06-17/jobs/45edf365-4514-43c6-a782-5a696070625c/workitems/System.Composition.Convention.Tests/console
https://github.com/dotnet/runtime/pull/35040 https://helix.dot.net/api/2019-06-17/jobs/4a9ad8fc-dd75-4b7f-ace9-401d8d08c4cd/workitems/System.Net.Http.Json.Functional.Tests/console
Ok, on a hunch I checked what `RLIMIT_SIGPENDING` is on the helix bots: `ulimit -i` is 31805, which is reasonable.
But nonetheless somehow we're flooding the realtime signal queue. I don't really have any good ideas for how it might be happening, though. If it's a mono problem, it doesn't seem like just bubbling up the pthread_kill error to the GC STW loop is a good plan (lots of effort and unclear that it'll fix the problem).
Going to try running some libraries test suites like `System.Net.Sockets.Tests` on a local Ubuntu VM to see if I can reproduce this locally.
Looking at the stack trace from Viktor's comment, I see a lot of threads in `mono_coop_cond_timedwait`, so I did an experiment where I send a realtime signal to a thread waiting on a condition variable. Unfortunately, unless I block the signal with `pthread_sigmask` or `sigprocmask`, it gets delivered to the thread without overflowing the queue, even while it's in the call.
The only uses of `pthread_sigmask` or `sigprocmask` in dotnet/runtime/src/mono and dotnet/runtime/src/libraries don't appear to block our GC suspend signal, so I don't think this is the problem.
There are a few other threads in other system calls that might be suspicious:
- There's Thread 25, the System.Native `SignalHandlerLoop` function, which is in a `read` call - this thread sometimes calls managed callbacks, which means that Mono can see it and would try to preemptively suspend it if it was in GC Safe mode. So next I'm going to see if `read` on a pipe is uninterruptible by realtime signals. If that's the case, we might need to deal with this thread specially (or maybe refactor it in some way).
- There's a thread in `epoll_wait`, but that is explicitly documented to return `EINTR`, so I would be surprised if the signal handler doesn't run here for some reason.
- There's a thread in `pthread_mutex_lock` - I expect that this is interruptible since the condvar example was interruptible; it would be surprising if different pthread primitives worked differently.

Hm. `read` from a pipe and `pthread_mutex_lock` are both interruptible. I don't think this line of investigation is going to pan out.
Will focus on `pthread_sigmask` - there's a place in `pal_process.c` where all the signals are blocked while the process forks. I wonder if under load we bombard this thread with signals while it doesn't get a chance to run.

Update:
- Couldn't confirm that the `pthread_sigmask` in `pal_process.c` was relevant.
- Couldn't reproduce even with `ulimit -i 2` (a real-time signal queue limit of 2 events) before running a testsuite.

Current plan:
Despite not having a reliable reproduction, the right thing to do is to extend the thread suspend machinery to deal with transient thread suspension failures. This comes up on non-POSIX-signals backends too - for example the Win32 `SuspendThread` function can sometimes return `-1`, in which case we call `mono_threads_transition_abort_async_suspend` to try to back out of suspending the thread. If we succeed, the thread is still in a running state and we can try again.
Currently, however, `mono_threads_suspend_begin_async_suspend` returns a boolean on all platforms, so if suspending doesn't work we don't know whether to try again. (Currently we mark the thread as skipped and exclude it from the remainder of STW. This is a bad idea when the thread is interacting with the managed heap.)
So the right solution is to return a more refined return code and give STW a chance to try again if multiple threads are in a transient failure state.
I can't find any cases of this happening in the last 2 weeks. Of course, it's possible that's just my tooling misbehaving, but let's close for now, given that it's quite plausibly no longer an issue.
From https://github.com/dotnet/runtime/pull/32364 on:
netcoreapp5.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open
See https://dev.azure.com/dnceng/public/_build/results?buildId=522989&view=logs&j=b4344b0d-0f92-5d69-ccaf-e0b24fbf14a2&t=555ea487-e6f6-5e9e-ac96-b1fa34f2540d&l=98
Test areas that fail:
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.IO.FileSystem.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Linq.Queryable.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.Primitives.Functional.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Net.ServicePoint.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Threading.Timer.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.Linq.xNodeReader.Tests/console
- https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Xml.RW.XmlWriterApi.Tests/console
For example: https://helix.dot.net/api/2019-06-17/jobs/049bba5a-3a50-4308-8d4f-043b35a7a750/workitems/System.Composition.AttributeModel.Tests/console