AlexSabourinDev / IceBox

An open world, designer focused engine

Implement new job system #23

Closed AlexSabourinDev closed 4 years ago

AlexSabourinDev commented 4 years ago

Implemented a new job system. Here is the documentation from IBJobs.cpp.

Fixes #8

Multi-threading

Multi-threading is a fairly involved and complicated topic. As a result, we'll only quickly go over out-of-order processors and compiler reordering.

Out-of-order processors

"Modern" CPU processors are typically out-of-order processors. You can think of an in-order processor as a simple machine that consumes instructions in the order that they arrived. Instead, an out-of-order processor can start executing instructions that are further down the pipe if the instructions they depend on has already completed execution.

As an example, imagine that we're feeding our processor some instructions: A, B, C, D. If instruction B depends on instruction A, then our processor has to execute A and then B. However, if instruction C does not depend on instruction A or B and does not need the same execution unit as the other two (maybe it simply uses a memory unit instead of the ALU), then it can execute while A or B is executing.

This is important because it means that one store could theoretically complete before an earlier store instruction. (Say B and C were both store instructions.)

As a result, if C completed before B, a thread could see the changes from C while the changes from B are not yet visible. (This is not the case for Intel's CPUs as of this writing; however, moving to a weaker memory model would make it the case.)

You can guarantee that B has completed its store to memory by inserting a store fence between B and C. Something like this:

B
Store Fence
C

This ensures that all stores before the store fence have completed before C executes. This is useful if we have a flag that tells our other threads that data is ready.
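As a hedged sketch of that flag pattern (not the engine's actual code; `payload` and `ready` are illustrative names), here is what publishing data behind a ready flag with a store fence might look like. `_mm_sfence` is the SSE store-fence intrinsic from `<xmmintrin.h>`:

```cpp
#include <xmmintrin.h> // _mm_sfence

static int payload = 0;        // the data we want to publish
static volatile int ready = 0; // the flag other threads poll

void producer()
{
    payload = 42; // B: the store that must be visible first
    _mm_sfence(); // store fence: prior stores complete before later ones
    ready = 1;    // C: signal that payload is safe to read
}

void consumer()
{
    while (ready == 0) { /* spin until the flag flips */ }
    // Given the fence on the producer side, payload reads as 42 here.
    // (On a weaker memory model the consumer would also need load ordering.)
}
```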

Compiler Reordering

It is also possible for our compiler to reorder our instructions if it determines that doing so could improve performance. For example, the compiler might decide to move C before B.

In some scenarios, you want to make sure that the order in which you read/write things is maintained. In those cases you can add a compiler barrier to ensure that specific instructions aren't reordered across a boundary.

Typically, using a store fence or load fence will achieve the same results as a pure compiler barrier.
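For illustration, here is a minimal compiler-barrier sketch (names are assumptions; IBJobs.cpp may wrap a platform intrinsic instead). `std::atomic_signal_fence` is commonly used as a compiler-only fence in C++11: it constrains compiler reordering without emitting a CPU fence instruction; on GCC/Clang, `asm volatile("" ::: "memory")` is the classic equivalent:

```cpp
#include <atomic>

int data = 0;
int dataReady = 0; // illustrative flag

void publish()
{
    data = 42;

    // Compiler barrier: the compiler may not move memory accesses across
    // this point. No fence instruction is emitted for the CPU itself.
    std::atomic_signal_fence(std::memory_order_seq_cst);

    dataReady = 1;
}
```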

Volatile Store/Load

You'll notice that we use volatileStore and volatileLoad quite a bit here. This is because our compiler can decide to load a value only once, or to optimize a load away entirely, if it thinks doing so will not impact the single-threaded behaviour of the program.

volatileLoad and volatileStore are a way of telling the compiler "Don't optimize this load, it might have changed because of someone else".
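As a rough sketch, assuming these helpers simply wrap volatile-qualified accesses (the actual IBJobs.cpp definitions may differ), they could look like this:

```cpp
template <typename T>
T volatileLoad(T const* address)
{
    // The volatile qualifier forces the compiler to emit a fresh load every
    // time instead of caching the value in a register or eliding it.
    return *static_cast<T const volatile*>(address);
}

template <typename T>
void volatileStore(T* address, T value)
{
    // Likewise, the store can't be elided or coalesced by the compiler.
    *static_cast<T volatile*>(address) = value;
}
```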

Job Systems

Now that we've spoken a little bit about some of the intricacies of multi-threading, we can talk about our job system. The purpose of a job system is to minimize the creation of threads and to instead break work up into a series of small chunks: "jobs".

The reason we don't want to be creating and destroying threads continuously is that doing so has a substantial performance impact. Instead, we want to create our threads beforehand and feed them work as we have it.
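A minimal sketch of creating the workers up front (illustrative names and structure; the engine may size and store its pool differently):

```cpp
#include <thread>
#include <vector>

std::vector<std::thread> createWorkers(unsigned workerCount)
{
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (unsigned i = 0; i < workerCount; i++)
    {
        // Each thread runs the worker loop for the lifetime of the job system.
        workers.emplace_back([] { /* workerLoop(...), sketched further below */ });
    }
    return workers;
}
```

A common choice is one worker per hardware thread, e.g. derived from std::thread::hardware_concurrency().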

We feed work through a set of worker-thread-specific queues that store a list of all the jobs a worker thread is expected to complete. We want a queue per thread in order to reduce contention. If we had a single global list, all threads would have to compete with each other to pull something off of it. This would require either a lock-free approach to the global queue or a global lock. Both are acceptable; however, if we have a single queue per worker, then our threads can pull from their queue without any atomic operations or locks. This also means that they don't have to "try again" if another thread beat them to the global queue.

A potential downside is that some threads might have a lighter workload than others and end up waiting for long periods of time. If that ends up being the case, we can look into implementing a job stealing algorithm, where a thread can pull work from another thread if it runs out of work for a while.

Job Queues

Our job queue is implemented as a fixed-size ring buffer. This means that the queue cannot be resized, which allows us to avoid the challenges of resizing it.

Our queue has 2 indices: a consumer index and a producer index. The consumer index tracks where work is retrieved from the queue, and the producer index tracks where work is added to the queue.
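Concretely, the queue and its two indices might look something like this (illustrative names, not the engine's actual types; a power-of-two size lets indices wrap with a simple mask):

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t QueueSize = 1024; // power of two: `index & (QueueSize - 1)` wraps

struct JobQueue
{
    void* jobs[QueueSize] = {};             // a null slot means "no job published yet"
    uint32_t consumerIndex = 0;             // only the owning worker advances this
    std::atomic<uint32_t> producerIndex{0}; // committed by producers via compare-and-exchange
};
```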

Our approach is implemented in a lock-free manner using this algorithm: the consumer (worker thread) is the only one that can move the consumer index forward. This means that it can do so without any atomic operations.

The consumer will move the consumer index forward once it has completed the work assigned to that index. As a result, if there is no work at the index, the consumer will simply wait for work to appear there.

This means that the producer (threads launching jobs) must feed the consumer sequentially.

This approach of our consumer waiting for work to appear makes our consumer fairly tolerant of work taking some time to show up in its queue.
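Putting the consumer side together, building on the JobQueue and volatileLoad/volatileStore sketches above (executeJob is a hypothetical stand-in for invoking a job; the real loop also handles sleeping and shutdown):

```cpp
void executeJob(void* job); // hypothetical: runs the job behind this pointer

void workerLoop(JobQueue& queue)
{
    for (;;)
    {
        uint32_t slot = queue.consumerIndex & (QueueSize - 1);

        // Wait for a producer to publish a job pointer into our slot.
        void* job = volatileLoad(&queue.jobs[slot]);
        if (job == nullptr)
        {
            continue; // no work yet; spin (or go to sleep) until it appears
        }

        executeJob(job);                                         // do the work
        volatileStore(&queue.jobs[slot], static_cast<void*>(nullptr)); // release the slot
        queue.consumerIndex++; // no atomics needed: only we touch this index
    }
}
```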

The producer side of our queue is where we introduce some atomic operations to ensure that no threads introduce a race condition.

The idea behind our approach is fairly simple: our thread will simply iterate through each worker and determine whether that worker has some space in its queue.

If it has space, it will try to commit a slot by moving the producer index forward. If it succeeds at moving the producer index forward, the previous index is now committed and it can write data to it safely. We can guarantee this behaviour using a compare-and-exchange.

Once we know we can safely write to an index, we want the transfer of data to be as simple as possible. If we had to send a lot of data, our consumer would have to be notified when all of the data had been written. Thankfully, we only write a single pointer to the shared slot in the queue. We can expect this operation to be atomic, and the consumer thread should not see any partial writes. (Needs source)
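Sketching the producer side described over the last few paragraphs, again building on the JobQueue and volatile helper sketches above (illustrative; the caller iterates over the workers and calls this per queue):

```cpp
bool tryPushJob(JobQueue& queue, void* job)
{
    uint32_t producer = queue.producerIndex.load();
    uint32_t consumer = volatileLoad(&queue.consumerIndex);

    // Does this worker have space in its queue?
    if (producer - consumer >= QueueSize)
    {
        return false; // full; the caller moves on to the next worker
    }

    // Commit the slot: the compare-and-exchange only moves the producer
    // index forward if no other producing thread beat us to it.
    if (!queue.producerIndex.compare_exchange_strong(producer, producer + 1))
    {
        return false; // lost the race; retry or try another worker
    }

    // Publish: a single pointer write, so the consumer never observes a
    // partially written job.
    volatileStore(&queue.jobs[producer & (QueueSize - 1)], job);
    return true;
}
```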

Once our pointer has been written, we want to wake up our waiting thread (it might have gone to sleep while waiting, to save cycles and allow the OS to switch to a more useful thread).
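A hedged sketch of that wake-up, assuming a mutex/condition-variable pair per worker (the engine might use OS events or futexes instead; all names here are illustrative):

```cpp
#include <condition_variable>
#include <mutex>

struct WorkerSignal
{
    std::mutex lock;
    std::condition_variable sleepSignal;
    bool hasWork = false;
};

// Called by the producer after writing the job pointer.
void notifyWorker(WorkerSignal& signal)
{
    {
        std::lock_guard<std::mutex> guard(signal.lock);
        signal.hasWork = true;
    }
    signal.sleepSignal.notify_one(); // wake the worker if it went to sleep
}

// Called by the worker when it has found no work for a while.
void waitForWork(WorkerSignal& signal)
{
    std::unique_lock<std::mutex> guard(signal.lock);
    signal.sleepSignal.wait(guard, [&] { return signal.hasWork; });
    signal.hasWork = false;
}
```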

References

https://preshing.com/20120625/memory-ordering-at-compile-time/
https://stackoverflow.com/questions/4537753/when-should-i-use-mm-sfence-mm-lfence-and-mm-mfence