Improve engine's cpu affinity masking

Bruno-DaSilva commented 1 year ago

Context

We noticed long ago that CPU scheduling can be a bit wonky and sometimes doesn't schedule optimally, especially on Windows. So we've attempted to set the affinity of the worker and main threads to specific cores: https://github.com/beyond-all-reason/spring/blob/BAR105/rts/System/Threading/ThreadPool.cpp#L617C45-L617C55

Typically, we want to assign one worker to each PHYSICAL core, as that gives us the best performance (with simultaneous multithreading on, two logical cores share the same execution units).

Problem 1

The problem we've noticed is that in some cases on Linux, the ordering of logical cores can vary.

Sometimes, logical cores (0,1), (2,3), ..., (14,15) are two threads of the same physical core, meaning we'd want to select logical cores 0, 2, 4, ..., 12, 14.
In other cases, logical cores (0,8), (1,9), ..., (7,15) are paired threads on the same physical core, so we'd want to select logical cores 0, 1, 2, ..., 6, 7.

The engine currently only expects the second case, here: https://github.com/beyond-all-reason/spring/blob/ee87f0dcf04ace7af4de607edeaf3a769326b5fd/rts/System/Threading/ThreadPool.cpp#L430

This means that in the first case we pin our threads to only half of the physical cores, so each pair of workers thrashes on the same core.
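To illustrate the two layouts, here is a minimal sketch (not the engine's code) of picking one logical core per physical core on Linux. It assumes the sysfs file `topology/thread_siblings_list` is available for each CPU and that its entries are listed in ascending order; the function names `firstSibling`, `onePerPhysicalCore` and `readSiblingLists` are made up for this example.

```cpp
#include <cctype>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Return the first (lowest) CPU index in a sibling list such as "0,8" or "0-1".
// Returns -1 if the string does not start with a digit.
int firstSibling(const std::string& siblingsList) {
    if (siblingsList.empty() || !std::isdigit(static_cast<unsigned char>(siblingsList[0])))
        return -1;
    return std::stoi(siblingsList); // stoi stops at the first ',' or '-'
}

// Build an affinity mask with one bit per physical core: logical CPU i is kept
// only when it is the lowest-numbered entry in its own sibling list.
// Limited to 64 logical CPUs to keep the sketch short.
std::uint64_t onePerPhysicalCore(const std::vector<std::string>& siblingsPerCpu) {
    std::uint64_t mask = 0;
    for (std::size_t cpu = 0; cpu < siblingsPerCpu.size() && cpu < 64; ++cpu) {
        if (firstSibling(siblingsPerCpu[cpu]) == static_cast<int>(cpu))
            mask |= (std::uint64_t{1} << cpu);
    }
    return mask;
}

// Read the sibling list of every logical CPU from sysfs (Linux only).
// Stops at the first CPU whose file is missing, so offline CPUs in the
// middle of the range are not handled here.
std::vector<std::string> readSiblingLists() {
    std::vector<std::string> lists;
    for (int cpu = 0;; ++cpu) {
        std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                           "/topology/thread_siblings_list");
        std::string line;
        if (!file || !std::getline(file, line))
            break;
        lists.push_back(line);
    }
    return lists;
}
```

With this approach, `onePerPhysicalCore(readSiblingLists())` would select 0, 2, 4, ... in the first layout and 0, 1, 2, ... in the second, without hardcoding either ordering.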

Problem 2

AMD has new CPUs (e.g. 7950X3D) with 3D V-Cache that, on the multi-CCD models, show significant speed differences depending on which CCD you run on. There's currently no way to tell from the CPUID instruction which cores have the 3D V-Cache bolted onto them.

On Windows 11 (with Xbox Game Bar), and presumably on Linux at some point (maybe already?), the OS and Ryzen have paired up to schedule games onto the V-Cache CCD more intelligently. So perhaps we can put more trust in the OS CPU scheduler?

Problem 3

We ideally want to pin the main thread to the highest performing core. Can we identify that somehow?

Proposed solutions

To solve problem 1, rather than hardcoding the first N cores as our affinity mask, we should ideally ask libcpuid something like "which logical cores are the 'physical cores'?" and use that as the affinity mask for our threads. As of today this requires patching libcpuid, as its API does not support this. See: #896
To solve problem 2, we'd need to be able to better trust the OS scheduler. We could provide a mask of ALL physical cores to all threads, but if the scheduler misbehaves then we will have threads excessively context switching or sharing the same core. There's some additional testing that would need to be done here.
To solve problem 3, we'd need to somehow obtain information about which core is fastest. On Linux this is exposed through the /sys/devices/system/cpu interface, which probably means it could come from CPUID as well. The OS typically should know which core is 'fastest' in order to schedule optimally, so we should be able to get that information, too. Ideally, this change would live within the libcpuid library.
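For problem 3, a rough sketch of what reading this from sysfs could look like. It assumes that `cpufreq/cpuinfo_max_freq` (reported in kHz) is a reasonable proxy for "fastest core", which may not hold on every system; the function names `readMaxFreqsKHz` and `fastestCore` are hypothetical and nothing here is libcpuid API.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read the advertised maximum frequency (kHz) of each logical CPU from sysfs.
// Linux only; stops at the first CPU without a readable cpufreq entry.
std::vector<long> readMaxFreqsKHz() {
    std::vector<long> freqs;
    for (int cpu = 0;; ++cpu) {
        std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                           "/cpufreq/cpuinfo_max_freq");
        long khz = 0;
        if (!(file >> khz))
            break;
        freqs.push_back(khz);
    }
    return freqs;
}

// Index of the logical CPU with the highest maximum frequency.
// Ties resolve to the lowest index; returns -1 for an empty list.
int fastestCore(const std::vector<long>& maxFreqKHz) {
    int best = -1;
    for (std::size_t i = 0; i < maxFreqKHz.size(); ++i) {
        if (best < 0 || maxFreqKHz[i] > maxFreqKHz[static_cast<std::size_t>(best)])
            best = static_cast<int>(i);
    }
    return best;
}
```

The main thread could then be pinned to `fastestCore(readMaxFreqsKHz())`, falling back to the current behaviour when the result is -1 (for example on Windows, or when cpufreq is not exposed).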

See discord thread for context + discussion: https://discord.com/channels/549281623154229250/1106640705440448693

alexpyattaev commented 1 year ago

I've been looking at CPU usage in games, and it seems that the engine is barely using 2 cores. Is it because Lua is not multithreaded? If so, why would it move the work between the 2 cores?

lhog commented 1 year ago

The engine uses multithreading where possible. Usually half of the available logical cores are allocated to the thread pool. The reason you don't see MT kick in more often is that the opportunities to employ it are very limited. MT sections cannot always be made deterministic, and Lua callouts have to be executed from a single thread only, which creates a lot of bottlenecks.

Also see: https://github.com/beyond-all-reason/spring/wiki/Determinism-In-Engine
