codex-storage / nim-codex

Decentralized Durability Engine
https://codex.storage
Apache License 2.0

Offload long-running tasks to other OS threads #117

Open Bulat-Ziganshin opened 2 years ago

Bulat-Ziganshin commented 2 years ago

Here I collect the ideas from our discussion on this subject. There are three things we should implement:

Resources:

Communication

Since Chronos and the task-executing threads run in different OS threads, we need communication channels that work across thread boundaries.

We have several candidates:

Depending on compilation options, we may only be able to pass non-GC'd objects between threads; possible solutions are:

To identify tasks, we can send a randomly generated id with each task and return that id with the task's answer.
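
As a rough illustration of the id-matching idea, here is a minimal sketch (all names are hypothetical, nothing here exists in the codebase yet): the Chronos side keeps a table from task id to pending future, and completes the matching future when an answer with the same id comes back.

import std/[tables, random]
import chronos

type TaskId = uint64

var pending: Table[TaskId, Future[string]]   # id -> future the caller is awaiting

proc submit(task: string): Future[string] =
  # attach a randomly generated id to the task and remember the future
  # (the string payload is just a stand-in for whatever a task really is)
  let id = TaskId(rand(int.high))
  result = newFuture[string]("submit")
  pending[id] = result
  # here: send (id, task) to a worker thread over the chosen channel

proc onAnswer(id: TaskId, answer: string) =
  # called when (id, answer) arrives back from a worker
  pending[id].complete(answer)
  pending.del(id)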

Thread pool for CPU-heavy tasks

It can be implemented with nim-taskpools, Weave, or internally:

proc cpuThread(pool: CPUThreadPool) =
  while pool.notFinished:
    let (id, task) = pool.taskQueue.recv()
    let res = task()
    pool.resultsQueue.send((id, res))
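
For context, a minimal sketch of what the CPUThreadPool used above could look like if built on standard Nim channels (hypothetical types, not existing code):

import std/atomics

type
  TaskId = uint64
  TaskProc = proc (): string {.gcsafe, nimcall.}   # plain function pointer, safe to pass across threads

  CPUThreadPoolObj = object
    taskQueue: Channel[(TaskId, TaskProc)]         # Chronos side sends (id, task) here
    resultsQueue: Channel[(TaskId, string)]        # workers send (id, result) back here
    finished: Atomic[bool]
  CPUThreadPool = ptr CPUThreadPoolObj             # shared via a raw pointer rather than the GC

proc notFinished(pool: CPUThreadPool): bool =
  not pool.finished.load()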

Thread pool for I/O tasks

It can be implemented internally. The main difference compared to the CPU-heavy task pool is that we can run many more threads and create them on demand:

proc ioThread(pool: IOThreadPool) =
  while pool.notFinished:
    let (id, task) = pool.taskQueue.recv()
    # `-=` is a statement in Nim, so decrement first and then test
    pool.freeTasks -= 1
    if pool.freeTasks == 0 and pool.runningTasks <= 100:
      pool.runningTasks += 1
      startThread(ioThread, pool)   # pseudocode; real code would use createThread with a Thread slot
    let res = task()
    pool.freeTasks += 1
    pool.resultsQueue.send((id, res))
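
One caveat with the sketch above: freeTasks and runningTasks are modified from several worker threads at once, so in a real implementation they would have to be atomics (or protected by a lock), and the decrement-and-test would have to be a single atomic operation. A hypothetical sketch:

import std/atomics

type
  IOThreadPoolObj = object
    freeTasks: Atomic[int]       # workers currently idle, waiting in recv()
    runningTasks: Atomic[int]    # worker threads started so far
  IOThreadPool = ptr IOThreadPoolObj

proc tookLastFreeWorker(pool: IOThreadPool): bool =
  # fetchSub returns the previous value; if it was 1, this worker took the last free slot
  # and a new thread should be started (two separate reads could race and start two threads)
  pool.freeTasks.fetchSub(1) == 1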

Integrate them into the Chronos async event loop

This code:

await ioOperation()
step = 1

is translated by the async macro into a closure iterator with code like:

let fut = ioOperation()
yield fut
step = 1

which Nim compiles into something like:

let fut = ioOperation()
fut.then:
  step = 1
return fut

So, when ioOperation finishes, the future completes and runs its continuation, which holds all the code after the await ioOperation() line.
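
A tiny self-contained Chronos example of that mechanism, completing a future from one async proc and resuming the code awaiting it in another (the names are just for illustration):

import chronos

proc producer(fut: Future[int]) {.async.} =
  await sleepAsync(10.milliseconds)
  fut.complete(42)                  # completing the future schedules its continuation

proc consumer() {.async.} =
  let fut = newFuture[int]("example")
  asyncSpawn producer(fut)
  let v = await fut                 # everything after this line is the continuation
  echo "got ", v                    # runs only once fut has been completed

waitFor consumer()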

WIP: code sample

var ctx                                   # WIP: context holding the in-flight tasks

var table: Table[ThreadId, Future[T]]     # maps task/thread ids to pending futures
proc threadPoll() =
  for fut in ctx.futures:
    let flow = spawn fut.task             # WIP: spawning should really happen once, at submission time
    if flow.isReady:
      let val = ^flow                     # read the result of the finished task
      let future = val.fut
      future.complete(val.value)          # resumes the code awaiting this future
emizzle commented 2 years ago

We implemented some of this in https://github.com/status-im/nim-task-runner, including a thread pool. Check out the tests in https://github.com/status-im/nim-task-runner/tree/test/use-cases/test/use_cases.

mratsim commented 2 years ago

allocShared, i.e. manual memory management

Don't use allocShared; with --threads:on it triggers a lock that is quite expensive. Using import system/ansi_c + c_malloc is better. It also avoids issues with reclaiming memory and is easier to instrument in valgrind/sanitizers.
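
For illustration, a minimal sketch of that approach, assuming the internal system/ansi_c module exposes c_malloc/c_free with the signatures below (check your Nim version):

import system/ansi_c

type Msg = object
  id: uint64
  size: int
  data: ptr UncheckedArray[byte]

proc newMsg(id: uint64, size: int): ptr Msg =
  # allocate with the C allocator: no GC involvement, no shared-heap lock
  result = cast[ptr Msg](c_malloc(csize_t(sizeof(Msg))))
  result.id = id
  result.size = size
  result.data = cast[ptr UncheckedArray[byte]](c_malloc(csize_t(size)))

proc freeMsg(msg: ptr Msg) =
  # the receiving thread frees the message when done with it
  c_free(msg.data)
  c_free(msg)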

arnetheduck commented 2 years ago

These discussions tend to end up in fairly extensive and complex solutions that take a lot of time, have a high potential for issues, and end up failing because of this.

There's a trivial solution that already works, which is basically to use a socket (or pipe) with chronos - this works without any changes to anything, without new libraries or compiler options and has numerous advantages: guarantees memory isolation between the threads, is "fast enough" for any reasonable long-term workloads etc - down the line, if ever a fancier solution is needed, it can easily be replaced as well - either by increasing isolation (separate processes instead of threads) or decreasing it (shared memory solutions).

Bulat-Ziganshin commented 2 years ago

Thanks everyone for the help, it's invaluable!

@arnetheduck Is this a correct example of the solution you are proposing: start.nim?

I have never programmed with pipes or sockets, but quick googling tells me that both are 1:1 connections. We have one Chronos thread and N threads doing tasks. We can return answers over N separate pipes, but handing out tasks is necessarily an SPMC operation.

We can use a single thread dedicated to communication with Chronos, and then classic MPMC channels between this thread and all the worker threads. I have done that (MPMC channels for a tasking thread pool) in C++ and believe it can be reproduced in Nim.

Another solution, proposed by dryajov, is to use classic MPMC channels directly in the Chronos loop with code like this:

while true:
  let (ok, task) = channel.tryRecv()   # standard thread-based channel
  if ok:
    task()
    yield                              # let other futures run before polling again
  else:
    await sleepAsync(1.milliseconds)

I.e. don't try to integrate an AsyncChannel into the Chronos event loop, but use polling with a standard thread-based channel.

So, my question is: which of these approaches are wrong? This isn't the place for a long discussion, so I'm asking everyone which solution(s) they consider reliable, and which ones they don't trust.

I would also be glad for any estimates of the speed of each solution, in messages/second (each message is just a pointer).

The solutions we have found so far are:

  1. AsyncChannel from nim-task-runner
  2. named pipe or socket + dispatcher thread with two MPMC thread-based channels
  3. polling with backpressure in Chronos loop + two MPMC thread-based channels

New solutions are also highly appreciated. We are looking for solutions that can handle up to 10K-100K messages per second while minimizing our coding effort.

zah commented 2 years ago

These discussions tend to end up in fairly extensive and complex solutions that take a lot of time, have a high potential for issues, and end up failing because of this.

There's a trivial solution that already works, which is basically to use a socket (or pipe) with chronos - this works without any changes to anything, without new libraries or compiler options and has numerous advantages: guarantees memory isolation between the threads, is "fast enough" for any reasonable long-term workloads etc - down the line, if ever a fancier solution is needed, it can easily be replaced as well - either by increasing isolation (separate processes instead of threads) or decreasing it (shared memory solutions).

I find this fear of tackling the AsyncChannel problem puzzling. The expertise in this organisation is surely enough to develop a correct async channel implementation over time. Once we do, it would benefit the entire Nim community (async channels are one of the last missing pieces in Chronos that would turn it into a truly general-purpose library).

The "just use a socket" advice is somewhat problematic because the socket will be exposed to arbitrary software running on the same machine (potentially by less privileged users) and this has various security implications. Using unix domain sockets is better as these can be secured by file system rights, but then you still might have a problem around the fact that multiple clients can try to connect to the same socket (should the software run an accept loop or not).

An in-process AsyncChannel could be a much simpler facility that is cheap to create and doesn't bring any of these problems into the picture.

arnetheduck commented 2 years ago

I find this fear of tackling the AsyncChannel problem puzzling.

It's not so much fear as a rational approach to product development: in this particular case, it's a comment about the fact that it has derailed several projects already, which were at a stage in their development where it would have been more valuable to use a simpler existing approach to prove the concept of that project, and later, if deemed necessary, develop more complex solutions.

If we expand on the causes, Nim in general has a very poor track record of threading: it is off by default, meaning that there is a high risk of running into core / GC issues that take a significant amount of time to resolve at the expense of not shipping useful features. AsyncChannel is a means to an end, not a useful feature on its own, so it comes low on the priority list: it's not until you run into actual issues with existing, simpler approaches that you'd reach for writing frameworks for things. Product first, framework later also has the advantage that what you're building in the framework actually has a realistic use case, which in general makes for better frameworks in shorter time.

Similarly, the primitives for threading in the language are not well developed and are known to be buggy: pretty much all primitives in the std lib are broken one way or another and can't practically be used (i.e. flowvars, spawn etc). This again speaks to the fact that a simple socket solution will be a time-saver overall, until there's an actual need. Codex in general has a long way to go before there's an actual need (i.e. it has 0 users as of now).

The 10k - 100k case is uninteresting if the project doesn't get to the point where it's able to handle 1 message, and we've been in this situation before; this issue description looks similar. My advice really is to focus development on other things and reuse simpler options, even if it means an extra copy here or there. Copying data in memory is incredibly fast, and it's extremely rare for a well-written program to have that as its bottleneck; such programs are usually doing something else wrong already, and a fancy threadpool / task scheduler might be able to shove the problem under the carpet for a while, but it won't address the more systemic issues.

We have one Chronos thread and N threads doing tasks. We can return answers over N separate pipes, but handing out tasks is necessarily an SPMC operation.

You can have the task scheduler sit on the S side and hand out tasks to workers that are free - this of course is slightly unusual as far as traditional SPMC solutions go where workers pluck tasks off a queue, but it's adequate and would allow focusing on other things for the next 6 months :)

mratsim commented 2 years ago

I don't think there is a need for a generic solution, and certainly not before Nim arc/orc works:

This "producer-consumer" pattern is basically shared-memory services and the OS handles the scheduling.

See examples here:

dryajov commented 2 years ago

I generally agree with the sentiment that this should be kept maximally simple. Both the pipe/socket approach and what @mratsim outlines above have been discussed to some extent. We should experiment with both, but I suspect that pipes/sockets will perform acceptably overall and might be less of a hassle to deal with.