Neos-Metaverse / NeosPublic

A public issue/wiki only repository for the NeosVR project
195 stars 9 forks source link

Better Performance with less Threadpool threads #2466

Open KyuubiYoru opened 3 years ago

KyuubiYoru commented 3 years ago

Describe the bug

On my system I have a better performance with a lower amount of Threadpool threads, this effect is stronger in Desktop mode. On the first Screenshot I had very high CPU load and unstable fps jumping between 70-100 fps. On the second Screenshot the CPU load is lower and stable 120 fps, between 130-142 fps with the steam fps limit set to 144.

Both screenshots are in the same World right after starting Neos, only difference is thread amount changed from 39 to 6 total threads.

It seems like the Threadpool takes CPU cycles from the rendering thread what causes unstable fps. The Threadpool also seem the to do a lot of "unnecessary work" what might be an underlying problem.

Expected behavior

The Threadpool shouldn't cause fps drops or CPU load that is not necessary.

Screenshots / Video

photo_2021-06-11_20-01-01 photo_2021-06-11_20-03-20

Bug information (please complete the following information):

Additional context

CPU: Intel I9-7900X 10C/20T GPU: NVIDIA GTX 1080TI

Reporters:

Kyuubi#3630

brodokk commented 3 years ago

Its maybe not useful to say it, but i have nearly the same behavior on AMD. Just a little less visible on an empty world.

CPU: AMD Ryzen 5 5600X GPU: AMD Radeon RX 6900XT

I can provide picture if needed

Frooxius commented 3 years ago

This is a bit tricky issue yeah and something that I plan on addressing. Part of it is due to some of the locking mechanisms, causing bigger lock contention. The background threads run at lower priority, so by themselves they shouldn't affect rendering too much, but that's also dependent on the system scheduling.

I think one issue here might be to do with hyper-threading, since that spawns too many threads for your CPU and that results in resource contention.

They're definitely not doing any "unnecessary work" like that though. If you reduce number of workers, they still do the same amount of work, it's just on fewer cores.

One thing that'd be useful to know though is how this changes performance characteristics in different scenarios though. Something like this can help in some cases, but hurt your performance in others, once there's a lot more work to do.

I'm not sure if we'll change the number of workers outright, as it might cause more harm than good depending on the machine and scenario, but I plan on analyzing and tuning the multithreaded performance over time so it behaves better overall.

KyuubiYoru commented 3 years ago

Maybe a setting to change the thread amount on runtime could be helpful to test it in different scenarios and collect data from what works best on different systems.

Frooxius commented 3 years ago

Maybe once the settings UI is reworked we could put it somewhere in advanced settings, but generally this shouldn't be something that most users should change. Rather our approach is to figure out the underlying issues with this (e.g. lock/resource contention causing poor performance) and fixing these.

KyuubiYoru commented 3 years ago

Not sure if there is an easy way to detect Hyper Threading and adjust the thread amount accordingly. Some test on low end systems (Intel 4 core CPU) show that changing the Threads to the amount of CPU cores can double the fps.

alexderpyfox commented 3 years ago

Maybe once the settings UI is reworked we could put it somewhere in advanced settings, but generally this shouldn't be something that most users should change. Rather our approach is to figure out the underlying issues with this (e.g. lock/resource contention causing poor performance) and fixing these.

You could put it in the Neos launcher as a setting tho then it would be not easy to access on default but we can still test it

shadowpanther commented 3 years ago

I think one issue here might be to do with hyper-threading, since that spawns too many threads for your CPU and that results in resource contention.

Would disabling the hyper-threading (in BIOS) lead to any performance increase in this case? Would this be noticeable?

Frooxius commented 3 years ago

@KyuubiYoru There are some, although not directly from C#, so it's a bit harder.

@alexderpyfox That has same issue as current settings - it's tedious to add more things to it. However we might just expose it as command line parameter perhaps, that might be much better approach overall.

@shadowpanther You'll have to try it and see. It's hard to predict how will things like this behave.

KyuubiYoru commented 3 years ago

Command line is a good idea and gives user with problems the ability to do something until this can be fixed probably.

KyuubiYoru commented 3 years ago

It seems like ManualResetEventSlim is not working as supposed, on a quick search i found only this https://stackoverflow.com/questions/28368142/manualresetevent-in-dll-used-for-unity3d might be a Unity problem.

With Monitor.Wait() and Monitor.Pulse() the CPU load with 39 Threads stays low and scales up with the load of the world.

Geenz commented 3 years ago

@Frooxius It's perhaps better to go off of physical core count rather than logical as a default. I'm seeing similar issues on my boxes where SMT is enabled. It's probably better to have three options here:

I've played around with some logic for the second option, would be pretty easy to switch between though.

KyuubiYoru commented 3 years ago

I did some tests with Unity. Using ManualResetEventSlim also causes 100% CPU Load: image_2021-06-15_13-00-29

With Monitor Wait/Pules the Load is much lower: image_2021-06-15_13-02-35

Edit: Initializing ManualResetEventSlim with SpinCount 0 also seems to fix it. image_2021-06-15_13-05-21

Frooxius commented 3 years ago

@Geenz Yes, that's what I was thinking too. Use physical cores (maybe -1) and add a commandline argument to override the count. Though there's not a nice multi-platform way to detect physical cores unfortunately, so we'll probably have to implement ones for each platform (mainly Windows and Linux).

@KyuubiYoru That can be normal. ManualResetEventSlim uses "busy spinning" (and setting it to 0 disables it), which will use CPU cycles under the assumption that the wait will be very low, which is beneficial in some cases, but not in some others. I'll have a look at the different locking mechanisms and so some profiling, it's really hard to tell much from synthetic benchmarks for these.

There's a bigger plan to rework some more of the multithreading to avoid locking as much as possible. For many of the use-cases there are actually concurrency optimized primitives and collections, but when we originally used those, they caused significant stutters due to the poor GC used in Unity, since they generate a lot more garbage, so it can be a tricky balancing act.

It's not really Unity specific issue though, the one you linked before was related to the old Mono 2.0.0 runtime used years ago, but that has been since upgraded to newer 5.10.0 or 5.11.0.

Frooxius commented 3 years ago

Added commandline options in 2021.6.18.4

- You can override number of background and high priority workers with -BackgroundWorkers <number> and -PriorityWorkers <number> commandline argument (based on request from @Kyuubi and @Alex the фурри авали🐦)
-- IMPORTANT: This is use at your own risk feature for experimenting with performance tuning. If you experience performance problems, loading, UI and other things getting stuck as result of changing those values, we won't likely provide diagnostics and support

Still working on the switch to detect number of physical cores in logical ones, as there's not a convenient API to detect it in C# in general (the headless on windows will actually detect it now, but the graphical client won't).

towneh commented 3 years ago

I feel this is quite an important topic as core to core latency (CCD to CCD) on the Ryzen architecture is something I try to account for: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5

For workloads which are synchronisation heavy and are multi-threaded up to 8 primary threads, this is a great win for the new Zen3 CCD and L3 design.

I generally run my games with a core affinity across one CCD only (8 physical, 16 smt on an AMD 5950X) set by Process Lasso.

I imagine this manual override of background/priority workers is the only way I can get NeosVR to run in this configuration without it stuttering due to being given more workers than I'm allowing cores. (which i've confirmed does happen)