cyanfish / grpc-dotnet-namedpipes

Named pipe transport for gRPC in C#/.NET
Apache License 2.0

parallelism challenges? #48

Open zachrybaker opened 10 months ago

zachrybaker commented 10 months ago

Ben, great work here, appreciate the investment of your time.

I'm seeing what appear to be client-server serviceability issues when stress testing. I'm using this package as a way to process-separate some data services that sit on an old synchronous codebase of mainframe code replatformed to C#. Trying not to literally run the mainframe inside of .NET web apps :)

I set up a scenario where I have 3 clients utilizing 40 threads each, where each thread keeps one data request outstanding at a time with the server. So at any one point, there may be 120 requests outstanding against the data service, calling one of about 350 different methods that are code-generated.

The client side reuses a single static NamedPipeChannel and a static client service. The server's NamedPipeServer is also singular and reused, as are the server services bound to it, and the server does not explicitly create a thread for fulfilling each request; it just utilizes the thread pool. Over time it will grow to have north of a hundred threads involved.
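
For concreteness, a minimal sketch of that wiring, following the library's README-style API; `DataService`, `DataServiceImpl`, and the pipe name are hypothetical placeholders for the code-generated services:

```csharp
using GrpcDotNetNamedPipes;

// Client side: one channel and one generated client, shared by every worker thread.
public static class Rpc
{
    public static readonly NamedPipeChannel Channel =
        new NamedPipeChannel(".", "data-service-pipe"); // hypothetical pipe name

    public static readonly DataService.DataServiceClient Client =
        new DataService.DataServiceClient(Channel);
}

// Server side: a single NamedPipeServer with the service implementation bound once.
public static class ServerHost
{
    public static void Run()
    {
        var server = new NamedPipeServer("data-service-pipe");
        DataService.BindService(server.ServiceBinder, new DataServiceImpl());
        server.Start(); // handlers run on the default .NET thread pool
    }
}
```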

I have set the timeout to infinite, because the data services unfortunately are pretty unpredictable in terms of their run times when executed in parallel, due to blocking and locking on the SQL Server side.

What I'm seeing is that over time these static services stop completing, and even stop servicing, requests.

Your note here on long-running tasks and the .NET task scheduler is interesting. In my case I was also really hoping that the thread pool servicing the pipe could be used to run the work of the service implementation, but I see you suggesting that one should probably use newly created threads for long-running tasks.

In my case creating threads is sizeable overhead under typical loads, as many of the requests will run to completion in less than 50ms. But there are some that take seconds. I'm going to test generating code with background threads to run the data load, just to see if anything behaves differently, and perhaps create a second thread pool for the work itself and ConcurrentQueue my way out of the mess. But I was curious whether you've seen a scenario like the one I'm describing solved in a way I haven't described here?

cyanfish commented 10 months ago

A few things. First, I would verify there's not an actual deadlock issue.

Second, long-running tasks should in general be avoided. For something like SQL, you should be using async/await so that the call yields the thread. If you're using EF, it supports this. Though I see you mention it's an old codebase, so maybe that's not feasible, but it is the "correct" solution.
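
For reference, a minimal sketch of what that async/await shape looks like with plain ADO.NET; the service, message types, and connection string are hypothetical placeholders:

```csharp
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.Data.SqlClient;

public class DataServiceImpl : DataService.DataServiceBase
{
    public override async Task<CustomerReply> GetCustomer(
        GetCustomerRequest request, ServerCallContext context)
    {
        await using var conn = new SqlConnection("...connection string...");
        await conn.OpenAsync(context.CancellationToken);

        await using var cmd = new SqlCommand(
            "SELECT Name FROM Customers WHERE Id = @id", conn);
        cmd.Parameters.AddWithValue("@id", request.Id);

        // The await yields the thread-pool thread back while SQL Server works,
        // instead of blocking it for the duration of the query.
        var name = (string?)await cmd.ExecuteScalarAsync(context.CancellationToken);
        return new CustomerReply { Name = name ?? "" };
    }
}
```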

Failing all that, you can use a custom ThreadPool/TaskScheduler to dispatch the work. Or perhaps you could create multiple server processes and load-balance between them.
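
A minimal sketch of the custom-scheduler route, assuming you wrap each blocking data call in a TaskFactory bound to a scheduler with its own dedicated threads; this is the standard BlockingCollection pattern, not an API of this library:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class DedicatedThreadTaskScheduler : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _queue = new();
    private readonly List<Thread> _threads;

    public DedicatedThreadTaskScheduler(int threadCount)
    {
        _threads = new List<Thread>(threadCount);
        for (int i = 0; i < threadCount; i++)
        {
            var t = new Thread(() =>
            {
                // Each dedicated thread drains the queue for its lifetime.
                foreach (var task in _queue.GetConsumingEnumerable())
                    TryExecuteTask(task);
            }) { IsBackground = true };
            t.Start();
            _threads.Add(t);
        }
    }

    protected override void QueueTask(Task task) => _queue.Add(task);

    // Never inline onto an arbitrary caller thread; keep work on our own pool.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
        => false;

    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

    public void Dispose() => _queue.CompleteAdding();
}

// Usage: size the factory for the expected concurrency, then wrap each
// blocking data call, e.g.:
// var factory = new TaskFactory(new DedicatedThreadTaskScheduler(120));
// var reply = await factory.StartNew(() => RunBlockingDataCall(request));
```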

zachrybaker commented 10 months ago

Ben,

Thanks for your responsiveness.

Regarding should vs. doing... for sure. Unfortunately the converted mainframe code is 10k files of code that work closely with a closed-source framework using an ORM of sorts that is not EF. And while some of the IP that I can't see is probably async internally, the 10k code-converted subprograms/programs are synchronous and will probably always be that way. If it were feasible to change or just walk away from the code, it would have been done. The interface that marshals into the mainframe code is synchronous as well, so I have to manage that gap in my gRPC code separation.

I have been able to verify that it is definitely possible to create locks/blocks in SQL without gRPC in play, but it recovers relatively well when everything is in-process. And I have just verified that newing up a Thread for each data call is a great way to make the system unresponsive while accomplishing nothing, very quickly. The OS will kill the service/process, but stay 100% busy servicing the deferred calls (or so it thinks).

I was actually just reading about the ability in the TPL to use a different TaskFactory, configured differently. I am working on understanding the many enumerations and their significance. It's surprising how easy it seems to be to take the default scheduler and mess things up.

cyanfish commented 10 months ago

Yeah, if you have a custom TaskFactory/TaskScheduler, that might be enough to make it work (e.g. spawn a fixed 120 threads or however many you need to get the level of concurrency you require).

I'll add that if you do find that works, I would be open to a PR that adds a custom TaskFactory option to the server dispatching so you wouldn't need to use it manually in each RPC method.

zachrybaker commented 10 months ago

@cyanfish I'll try to get you a PR then, thanks!

cyanfish commented 10 months ago

You need to fork the project first and then push to your fork. https://opensource.com/article/19/7/create-pull-request-github

zachrybaker commented 10 months ago

> You need to fork the project first and then push to your fork. https://opensource.com/article/19/7/create-pull-request-github

I've been living under a rock, and forgot most projects are set up like this. Got the CLA signed and ready for you.

zachrybaker commented 10 months ago

https://github.com/cyanfish/grpc-dotnet-namedpipes/pull/49

zachrybaker commented 10 months ago

@cyanfish I went ahead and exposed a ThreadPoolSize option on NamedPipeServerOptions. Four threads seemed inadequate in my use case, even when I asked the scheduler to prefer FIFO via TaskFactory. I left four as the default pool size.
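
For anyone following along, usage would look something like this, assuming the option lands as proposed in the PR:

```csharp
// Hypothetical usage of the proposed option: raise the listener pool from the
// default of 4 to better handle sustained parallel load.
var server = new NamedPipeServer("data-service-pipe", new NamedPipeServerOptions
{
    ThreadPoolSize = 16
});
```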

Granted this case is probably not the norm:

When replaying two weeks of calls recorded against a mainframe (synchronous code) by running 3 test harnesses with 40 threads each, each thread keeping a request outstanding, some interesting things show up.

At first, I couldn't complete the test at all, ever. It takes several hours to run, but now it at least completes!

Before, it was possible to make the whole thing grind to a halt. Switching to FIFO preference seemed to help, but even still... I would sometimes see the CPU go off the rails and (I believe) the syscalls go through the roof because the pipes were not being read. Even if I killed the process, the CPU would be 100% utilized by the system until the clients themselves were killed!

Bumping the listener thread count seemed to keep things flowing and able to hold up well to sustained hammering.

However... there's a bug I think I've found with remote pipe servicing; I'll create a separate issue for it.

zachrybaker commented 10 months ago

https://github.com/cyanfish/grpc-dotnet-namedpipes/pull/51 is an alternative PR. It goes one step further, enhancing logging ever so slightly by separating error logging from trace logging and exposing constructors that utilize said logging.

cyanfish commented 6 months ago

I've discovered and fixed (in 3.0.0) some specific issues with Windows stalling under heavy parallel connections. Assuming you're using a single (or just a few) NamedPipeChannel objects, ff33ad10f443610f5ecead94c36f0302df5884d0 should fix that stall by coordinating connection attempts across threads.
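
For context, the pattern that fix assumes is roughly the following: one shared NamedPipeChannel issuing calls from many threads. The service and message names here are hypothetical placeholders:

```csharp
using System.Threading.Tasks;
using GrpcDotNetNamedPipes;

var channel = new NamedPipeChannel(".", "data-service-pipe");
var client = new DataService.DataServiceClient(channel);

// 40 concurrent workers sharing the single channel, each keeping one
// request outstanding at a time.
var workers = new Task[40];
for (int i = 0; i < workers.Length; i++)
{
    workers[i] = Task.Run(async () =>
    {
        for (int n = 0; n < 1000; n++)
            await client.GetCustomerAsync(new GetCustomerRequest { Id = n });
    });
}
await Task.WhenAll(workers);
```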