Resiliency problem: unable to abort hanging threads

munichmule commented 10 months ago

Description

Hi Guys,

First of all, I know that .NET Core uses cooperative task cancellation by design. Second, I checked these similar issues, discussions, and PRs in dotnet/runtime: #69622, #66480, #71661, #65556, #41291, #39145, #2315. Third, I tried to use ControlledExecution.

Still, the problem persists: even ControlledExecution isn't a remedy for certain scenarios (examples below).

I totally understand why you don't want to support un-cooperative abortion, but what are the options? There could easily be 3rd party code which goes into an infinite loop or simply eats too many resources. Code REPL is an obvious example, but there could be so many more. The problem is bigger for web apps, because one request can easily eat all the resources and take down the whole process/pod. How am I supposed to implement a safety valve and recover from such a situation?

That's not a matter of taste, that's a matter of resiliency: there must be a way to stop bad tasks, one way or another. If you don't want to support old-fashioned thread abortion - fine, just give us a workaround which works for the 3rd party code (REPL script, or a 3rd party library etc). I can't believe it remains unsolved after so many requests over years...

Reproduction Steps

This seems to work on Windows and Linux:

        ControlledExecution.Run(() =>
        {
            while (true)
            {

            }
        }, new CancellationTokenSource(500).Token);

This doesn't work on Linux (Ubuntu):

        ControlledExecution.Run(() =>
        {
            var task = Task.Run(() =>
            {
                while (true)
                {

                }
            });
            Task.WaitAll(task);
        }, new CancellationTokenSource(500).Token);

There are more examples. The key thing is that there should be a way to stop hanging tasks (sync, async, thread-pooled etc).

Expected behavior

Should be able to kill hanging tasks/threads without killing the process

Actual behavior

One hanging thread can easily take down the whole process/container

Regression?

Yes, since netfx

Known Workarounds

ControlledExecution class (obsolete, doesn't cover all cases)

Configuration

Tested on Linux Ubuntu 22.04

Other information

No response

jkotas commented 10 months ago

there must be a way to stop bad tasks

The only 100% reliable way to stop bad tasks is to kill the process.

ghost commented 10 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.


Author:	munichmule
Assignees:	-
Labels:	`area-System.Threading`, `untriaged`, `needs-area-label`
Milestone:	-

munichmule commented 10 months ago

@jkotas Yeah, but in the case of a web app there is no guarantee that all your processes/pods don't get constantly bombarded with malicious requests etc. Code/math REPL is a good example, but there could be many more (like a general 3rd party dependency which doesn't support cooperative cancellation etc). I mean killing the process / restarting a pod isn't a remedy for hanging threads in a web app. This sounds like a super-standard problem: to measure cpu/memory per-request and throttle/drop bad requests to keep the app resilient. I'm surprised I can't find a standard solution, maybe I'm looking in the wrong place, dunno...

If there is no idiomatic solution, I would appreciate an idiomatic workaround. :)

CC @AntonLapounov @stephentoub

jkotas commented 10 months ago

If you have a hanging thread or run-away task in your web app, your web app has a bug. The remedy is to fix the bug.

ControlledExecution is a best-effort solution for REPLs. It was not designed to be used as a remedy for run-away threads or tasks in production. It is understood that it does not handle all situations. We do not plan to fix that. We do not know how to fix that even if we wanted to.

MichalPetryka commented 10 months ago

If there is no idiomatic solution, I would appreciate an idiomatic workaround.

I'd guess the idiomatic workaround would be to have a pool of worker processes you'd be delegating work to. AFAIR no mainstream operating systems have been designed to be fully resilient against sudden thread abortion so using it can break more than just your app.
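As an illustration of the worker-process idea, here is a minimal sketch in C#. It assumes the risky work can be packaged as a separate executable; the `IsolatedWorker` class and its `Run` method are hypothetical names, not part of any .NET API. Only `Process.Start`, `WaitForExit(int)`, and `Kill(bool)` are real framework calls.

```csharp
using System;
using System.Diagnostics;

public static class IsolatedWorker
{
    // Runs a command in a child process and kills it if it exceeds the timeout.
    // Returns the exit code, or null when the worker had to be killed.
    public static int? Run(string fileName, string arguments, TimeSpan timeout)
    {
        var startInfo = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
        };

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException($"Could not start '{fileName}'.");

        if (process.WaitForExit((int)timeout.TotalMilliseconds))
        {
            return process.ExitCode;
        }

        try
        {
            // The OS reclaims the worker's threads and memory; the host's state is untouched.
            process.Kill(entireProcessTree: true);
        }
        catch (InvalidOperationException)
        {
            // The worker exited between the timeout check and the kill.
        }

        process.WaitForExit();
        return null;
    }
}
```

A call might look like `IsolatedWorker.Run("dotnet", "Worker.dll", TimeSpan.FromSeconds(5))`, where `Worker.dll` stands in for whatever executable hosts the untrusted code. A real pool would keep workers alive between jobs and pass work over pipes or sockets, which this sketch leaves out.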

davidfowl commented 10 months ago

This sounds like a super-standard problem: to measure cpu/memory per-request and throttle/drop bad requests to keep the app resilient. I'm surprised I can't find a standard solution, maybe I'm looking in a wrong place, dunno...

That's because it's solved in the operating system and language runtimes don't really want to re-implement it.

You can consider using WASM in the future for this sort of stuff, it's being designed with this in mind.

hez2010 commented 10 months ago

A workaround would be obtaining the Thread ID and killing it using the system API. But still, it cannot handle those run-away threads.

munichmule commented 10 months ago

A workaround would be obtaining the Thread ID and killing it using the system API.

Yeah, I would be happy to do that, but here comes the challenge: I need to somehow track managed thread enter/exit, maintain a mapping from managed threads to OS tasks (which is not 1:1 and is unreliable), and then force-kill tasks without root access.

I implemented it in some silly way, but I'm not satisfied. So it would be nice to get some guidance from dotnet team. 😺

PS: Not sure how wasm can help with asp.net thread starvation though...

jkotas commented 10 months ago

it would be nice to get some guidance from dotnet team.

There is no good reliable way to do what you are trying to do in .NET.

Not sure how wasm can help with asp.net thread starvation though...

Wasm helps only if you can isolate the untrusted code (and its state) from the trusted code (and its state). You can then run the untrusted code in a wasm VM and kill it if it starts misbehaving. The trusted code and its state stay intact.

If you are not able to isolate the untrusted code/state, wasm does not help much. You can kill the wasm vm, but it is not that much different from killing a regular process - all your state is gone.

hez2010 commented 10 months ago

By using the wasm approach we will have to face severe performance issues (blazor is ridiculously slow even when compared with JavaScript, no matter whether AOT-ed or not). Wondering if it is feasible to provide an API to host a coreclr on top of coreclr so that the hosted coreclr becomes the isolated execution environment.

jkotas commented 10 months ago

the wasm approach we will have to face severe performance issues

Yes, the execution isolation is relatively expensive.

Wondering if it is feasible to provide an API to host a coreclr on top of coreclr so that the hosted coreclr becomes the isolated execution environment.

coreclr does not provide execution isolation. I do not see how stacking two coreclr on top of each other creates execution isolation.

hez2010 commented 10 months ago

coreclr does not provide execution isolation

The "isolation" here means we can take the entire host down and reload it when needed, such as killing an out-of-control operation in the REPL scenario.

jkotas commented 10 months ago

The "isolation" here means we can take the entire host down

There is no way to unload coreclr and the runtime libraries from a process cleanly. The only way to unload coreclr is by killing the process.

munichmule commented 10 months ago

Let's say I want to measure resource consumption per asp.net request (for logging/troubleshooting purposes etc), and - optionally - kill runaway threads (if possible). So there are 3 problems here:

1) from the async operation entry point, observe how control flows across managed threads (async operation -> thread1 enter -> thread2 enter -> thread2 exit -> thread1 exit -> ...)
2) for threads visited in #1, read thread stats (cpu time, memory etc) and be able to map them to the underlying os task in a reliable way
3) detect runaway threads (and kill them if possible)

So far I understand that you can't help me with #3, but how about #1-2? For #1 I tried a hack using AsyncLocal with callback to track context switching, but I'm not sure if it's reliable, and it doesn't look like a production-grade approach.
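For readers unfamiliar with the AsyncLocal hack mentioned here, the following is a rough sketch of what it could look like; it is not the poster's actual code. `RequestFlowTracker`, `Begin`, and `Events` are hypothetical names. The `AsyncLocal<T>` constructor that takes a change handler and the `ThreadContextChanged` flag are real APIs, but the notifications only show where the execution context flowed, so treat the output as a debugging aid and not an exact trace.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class RequestFlowTracker
{
    // Collected "enter"/"exit" notes; a real app would write to its logger instead.
    public static readonly ConcurrentQueue<string> Events = new();

    private static readonly AsyncLocal<string?> CurrentRequest = new(OnChanged);

    // Call at the request entry point to tag the current execution context.
    public static void Begin(string requestId) => CurrentRequest.Value = requestId;

    private static void OnChanged(AsyncLocalValueChangedArgs<string?> args)
    {
        // ThreadContextChanged is true when the runtime swaps execution contexts
        // on a thread, and false when Value is assigned explicitly.
        if (!args.ThreadContextChanged) return;

        int thread = Environment.CurrentManagedThreadId;
        if (args.CurrentValue is not null)
        {
            Events.Enqueue($"{args.CurrentValue} enter thread {thread}");
        }
        else if (args.PreviousValue is not null)
        {
            Events.Enqueue($"{args.PreviousValue} exit thread {thread}");
        }
    }
}
```

After `RequestFlowTracker.Begin("req-1")`, any continuation or thread that inherits the execution context should enqueue an entry when it starts running under that context. The handler runs on the hot path of every context switch, so it has to stay cheap, which is one reason this approach is hard to call production-grade.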

For #2 I tried Process.Threads, but it adds a significant slowdown (because it reads all threads' data at once). I also tried a pure pinvoke approach (e.g. gettid/getrusage or GetCurrentThreadId/GetThreadTimes etc), but I noticed the tid can change for the same managed thread, so I can't map them 1:1, and I can't track the tid change from managed code. And it's even worse for OSX.

It would be nice to have these things solved with the default sdk, e.g. by having a richer managed thread interface and async state machine events. But I totally understand if you don't want to support that. Could you please advise then on how to achieve what I want?

davidfowl commented 10 months ago

Maybe it's still unclear from the previous comments but this isn't something that the .NET runtime solves or wants to solve. You cannot reliably measure resource usage of a unit of work, nor can you control the effect arbitrary user code can have on the process. If you want isolation, you need to use a process.

munichmule commented 10 months ago

Maybe it's still unclear from the previous comments

@davidfowl I understand that non-cooperative abortion and true isolation is not possible. But it is still unclear from the previous comments what runtime can offer to detect problematic async code paths. How am I supposed to detect/log requests that cause cpu/thread starvation in my web app? Why do you say runtime doesn't want to solve that? I don't ask for the Moon.

If you have a hanging thread or run-away task in your web app, your web app has a bug. The remedy is to fix the bug.

@jkotas Awesome point! It would be nice to find the code/conditions that cause the bug first. I guess I need to measure and log something... but how? 😄

PS: Process per request is way too expensive for a web app with a decent load, clearly not an option.

jkotas commented 10 months ago

I guess I need to measure and log something... but how?

The runtime, libraries and ASP.NET provide a number of perf counters and telemetry events. It is possible to detect hung requests or unusually high CPU consumption using these perf counters and telemetry events. It is typically done by APM software that monitors your service.
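To make the perf-counter suggestion concrete, here is a small in-process sketch that subscribes to the runtime's `System.Runtime` event counters with an `EventListener`. `RuntimeCounterListener` and its `Sample` event are hypothetical names; the one-second interval is an arbitrary choice. The numbers are process-wide, not per request, which matches what the comment above describes.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

public sealed class RuntimeCounterListener : EventListener
{
    // Raised once per counter per interval with the counter name and its value.
    public event Action<string, double>? Sample;

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the runtime to publish its counters once per second.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload is null) return;

        foreach (var item in eventData.Payload)
        {
            // Each payload entry describes one counter; "Mean" is used for gauges
            // and "Increment" for rate counters.
            if (item is IDictionary<string, object> counter
                && counter.TryGetValue("Name", out var name)
                && (counter.TryGetValue("Mean", out var value)
                    || counter.TryGetValue("Increment", out value)))
            {
                Sample?.Invoke(Convert.ToString(name) ?? "", Convert.ToDouble(value));
            }
        }
    }
}
```

A host could create one listener at startup and watch counters such as `cpu-usage` or `threadpool-queue-length`, logging the requests in flight whenever a value crosses a threshold it picks. Out-of-process tools such as `dotnet-counters` read the same counters without any code changes.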

munichmule commented 10 months ago

Yes, per process. I would like to do some measurements per request. Problems are marginal: maybe only 1 in 10,000 requests causes high cpu or leaves run-away threads. Those requests are not necessarily long-lasting; they may end in a timely fashion but eat resources.

I can't just restart the process every N seconds, that's not an acceptable solution. I need to be able to deal with those requests, to identify them for analysis at least (to "fix the bug", as you suggested).

It's not about non-cooperative abortion anymore - this part is clear from comments above. But even for cooperative cancellation, I need to measure something in code, to log the problem, cancel the token, fail gracefully etc.
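The cooperative pattern described here (measure, log, cancel the token, fail gracefully) could be sketched roughly as follows. `BudgetedWork.TryRun` is a hypothetical helper, and it only works if the code being run polls the token; it cannot stop code that ignores cancellation, which is the limitation discussed throughout this thread.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class BudgetedWork
{
    // Runs 'work' with a time budget. Returns true if it finished within the budget
    // and false if it observed the cancellation and stopped.
    public static bool TryRun(Action<CancellationToken> work, TimeSpan budget, Action<string> log)
    {
        // The token is cancelled automatically once the budget has elapsed.
        using var cts = new CancellationTokenSource(budget);
        var stopwatch = Stopwatch.StartNew();

        try
        {
            work(cts.Token);
            return true;
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            log($"Work cancelled after {stopwatch.ElapsedMilliseconds} ms " +
                $"(budget {budget.TotalMilliseconds} ms).");
            return false;
        }
    }
}
```

For example, `BudgetedWork.TryRun(token => { while (true) token.ThrowIfCancellationRequested(); }, TimeSpan.FromMilliseconds(500), Console.WriteLine)` returns `false` and logs one line once the budget runs out. In ASP.NET Core the same idea usually means linking a timeout token with the request's own cancellation token, which this sketch does not show.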

davidfowl commented 10 months ago

@davidfowl I understand that non-cooperative abortion and true isolation is not possible.

Excellent.

I can't just restart the process every N seconds, that's not an acceptable solution. I need to be able to deal with those requests, to identify them for analysis at least (to "fix the bug", as you suggested).

We have great docs on the options for diagnosing problems with .NET https://learn.microsoft.com/en-us/dotnet/core/diagnostics/.

As you can imagine, people build high scale services daily with our stack and run into these sorts of problems with code (within their control) that often doesn't scale on the backend.

But even for cooperative cancellation, I need to measure something in code, to log the problem, cancel the token, fail gracefully etc.

They aren't measuring the resource usage of individual requests, that's not a thing, they are looking at process health and metrics exported by the runtime.

You can also use tools like https://github.com/benaadams/Ben.BlockingDetector to find blocking code in the application that might be problematic.

There's no bulletproof solution here.
