AlexHarker / FrameLib

A library for arbitrary-rate arbitrary-size frame processing
BSD 3-Clause "New" or "Revised" License

Sub-optimal multithreading performance on windows. #81

Open balintlaczko opened 3 years ago

balintlaczko commented 3 years ago

Again, thanks for the great package! I am still working through the tutorials, and I thought I would report this one too. In the multithreading tutorial [image], when I switch on multithreading here [image], I get almost no change in the "Median CPU" meter (it is consistently +3-4 with multithreading). If I am using WASAPI audio drivers, the CPU level of Max.exe stays roughly the same in the Task Manager. However, if I am on ASIO, the CPU level of Max.exe jumps to 4-5x the level (in my case from around 4% to around 20%) when I turn on multithreading (the "Median CPU" meter stays more or less the same again, at +3-4 percent, as with WASAPI). This surge seems independent of the I/O or signal vector size. Audio still comes out unchanged from the patch with or without multithreading, and there is no crash or error message.

AlexHarker commented 3 years ago

Thanks - depending on the scenario you may or may not see a CPU benefit from multithreading. However, what you are describing doesn't sound ideal. The threading primitives on Windows are a bit different to those on Mac, as are the thread priority settings, and it may be that these can be tweaked to improve the situation. I will aim to take a look when I can.

balintlaczko commented 3 years ago

Thanks a lot! I suspected that this might be an issue. I remember I also had problems with a beta build of mubu a while ago when I tried multithreading on it: it worked as expected on Mac, and drove Max to a complete halt on Windows.

AlexHarker commented 3 years ago

Can you increase the value of the length of the ramp from 1024 to 8192 and check again? The wins are always likely to be better when the computer is working harder for longer periods, so that may give different results and some info on whether you ever get an improvement.

For reference my results are:

On Mac
Default settings I get 24% and 15% (multithreading off and on)
For 8192 I get 100+% and 35% (multithreading off and on)

Windows (on Mac hardware)
Default settings I get 23% and 23% (multithreading off and on)
For 8192 I get 100+% and 60% (multithreading off and on)

So - this suggests that the threading overheads are higher on Windows, which I'd be keen to reduce if possible, but the basic functionality seems to work...
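The pattern in these figures (no gain at the default ramp length, a clear gain at 8192) is what a fixed per-block threading overhead would produce. The sketch below is a toy model of that idea only: the function names, the even split of work across threads and the single constant `overhead` term are all assumptions made for illustration, and nothing here is taken from FrameLib's code or fitted to the measurements above.

```cpp
#include <cassert>

// Hypothetical model, not FrameLib code: estimated time to process one block
// when the work divides evenly across `threads` workers and every block pays
// a fixed synchronisation cost (`overhead`), in the same units as `workTime`.
double estimatedMultithreadedTime(double workTime, int threads, double overhead)
{
    assert(threads > 0);
    return workTime / threads + overhead;
}

// Speedup relative to single-threaded processing.
// Values below 1.0 mean that multithreading is slower than a single thread.
double estimatedSpeedup(double workTime, int threads, double overhead)
{
    return workTime / estimatedMultithreadedTime(workTime, threads, overhead);
}
```

In arbitrary units, `estimatedSpeedup(1.0, 4, 0.75)` is exactly 1.0 (no gain), while `estimatedSpeedup(8.0, 4, 0.75)` is roughly 2.9. Under this model a platform with a larger `overhead` needs more work per block before multithreading pays off.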

AlexHarker commented 3 years ago

Also - it is not so unexpected that the CPU measured in Task Manager would increase: if CPU is being measured across cores, the usage will go up when more cores are busy. Max, however, measures CPU in terms of time only, so more cores doing the same work in the same time will look the same. I suspect there will be a limit to how far the threading overheads can be reduced, and that it might not match the Mac implementation, but if I can do better I will.
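The difference between the two meters can be made concrete with a small sketch. The two functions below are assumptions about what each meter reports, based only on the description in this thread: a time-based figure in the style of Max's meter, and a figure summed over all cores in the style of Task Manager. The function names and the numbers in the usage note are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Time-based load: the share of the available real time that was needed to
// compute one audio vector, regardless of how many cores were involved.
double timeBasedLoad(double wallSeconds, double availableSeconds)
{
    assert(availableSeconds > 0.0);
    return 100.0 * wallSeconds / availableSeconds;
}

// Machine-wide load: busy time summed over every core, divided by the total
// capacity of all cores over the same period.
double machineWideLoad(const std::vector<double>& busySecondsPerCore,
                       double availableSeconds)
{
    assert(!busySecondsPerCore.empty() && availableSeconds > 0.0);
    double busy = 0.0;
    for (double seconds : busySecondsPerCore)
        busy += seconds;
    return 100.0 * busy / (availableSeconds * busySecondsPerCore.size());
}
```

Suppose a vector must be ready within 10 ms and takes 4 ms of wall-clock time either way. `timeBasedLoad(0.004, 0.010)` is 40% in both cases. With one of four cores busy for those 4 ms, `machineWideLoad({0.004, 0.0, 0.0, 0.0}, 0.010)` is 10%; with all four cores busy for 4 ms each it is 40%, so the second meter rises fourfold while the first does not move.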

balintlaczko commented 3 years ago

Aha! It woooorks! Tested only on ASIO at the moment, but my results with 8192 samples ramp (..and with 1024 i/o and sigvs if that matters):

CPU in Max No MT: constant 100% | MT: 27-28%

CPU in Task Mgr: No MT: around 16% (which on the 6-core machine means 100% in "Mac terms") | MT: 27-28%

I also noticed (following the fan noise ramps) that with no MT one of the cores is always near 100 degrees (and the load hops from core to core), and of course the fans ramp up desperately, while with MT all cores stay firmly at around 60 degrees and the fan calms down too.

So it seems like my original report was a false alarm, everything seems to work as intended, it's just the OS difference.

It is also interesting that with MT the Max CPU meter and the Max.exe in the Task Manager lined up (coincidence?).

Thanks a lot for the help!

AlexHarker commented 3 years ago

I'd still like to improve things further if I can, as on the same hardware here the speedup is not as good, but glad to hear that it is at least working...

balintlaczko commented 2 years ago

Hey there! Just testing the multithreading performance again on Windows with the prerelease.

On Windows 10, I get median CPU of 11-12 with multithreading OFF, and 69-70 (plus audible crackle) with multithreading ON. This is with WASAPI drivers, and NOT in exclusive mode (which is totally good almost always). If I use a (still WASAPI-based) ASIO driver IN exclusive mode, then the crackle goes away, but the huge difference in Median CPU remains. The other weird thing is that if I look at CPU in the Task Manager, with multithreading OFF I get around 0.7-1.0% CPU, which consistently drops(!?) to 0.1-0.8% (yes, more variance), mostly hovering around 0.3. I/O and signal vectors both at 128. What sense does this make?

...and then some more tests:

All tests are made in WASAPI-based ASIO (FlexAsio) in exclusive mode, 128 samples I/O and signal vector sizes, 44100Hz.

Params: streams=100, interval=512, length=1024
ST: "Median CPU"=12, Tskmgr=0.6-1%
MT: "Median CPU"=61, Tskmgr=0.1-0.7%

Params: streams=100, interval=100, length=1024
ST: "Median CPU"=47, Tskmgr=3.0-3.6%
MT: "Median CPU"=100, Tskmgr=0.1-0.9%, unusable, constant dropouts

Params: streams=100, interval=512, length=10000
ST: "Median CPU"=79, Tskmgr=5.6-5.9%
MT: "Median CPU"=45-49, Tskmgr=6.3-7.5%

Params: streams=1000, interval=512, length=1024
ST: "Median CPU"=100, Tskmgr=6.9-7.4%, unusable, constant dropouts
MT: "Median CPU"=88-95, Tskmgr=18.6-20.6%

Params: streams=100, interval=100, length=10
ST: "Median CPU"=10, Tskmgr=0.5-0.7%
MT: "Median CPU"=100, Tskmgr=0.1-1.0%, unusable, constant dropouts

AlexHarker commented 2 years ago

Thanks

For my Mac running windows I now get:

Default settings I get 17% and 17% (multithreading off and on)
For 8192 I get 100+% and 45% (multithreading off and on)

So it looks like the multithreading fixes might have made things a bit worse on Windows. I will try to improve this if I can.

AlexHarker commented 2 years ago

I've tried a few things, but none of them have significantly improved the situation. I'm keeping notes here for future reference.

Things tried:

Sadly, given that none of this has worked, at the moment there are no obvious routes to improvement that don't involve significantly rethinking the multithreading approach for Windows, and that is not guaranteed to end up with a performance win.

AlexHarker commented 2 years ago

I've just tried one last thing which is in this build:

https://drive.google.com/file/d/1juxiO7XXsnkZFVSW3KGruRxuEvVG0IFf/view?usp=sharing

@balintlaczko - could you try this build at your end and report on the scenarios you outlined above?

AlexHarker commented 2 years ago

[Edited after more investigation]

Updates: with vcredist updated on the i9, results seem much improved (but still below the Mac-side speedups). The build above is also improved, which calls into question the use of thread sleeping wherever it appears in FrameLib.
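The thread does not show where or how FrameLib sleeps its threads, so the following is only a generic illustration of the trade-off being questioned. It contrasts a worker that polls a flag and sleeps between checks with one that blocks on a condition variable until it is signalled. Every name here (`waitBySleeping`, `Signal`) and the 100 microsecond interval are hypothetical and not taken from FrameLib.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Option A: poll a flag and sleep between checks. The wake-up latency is at
// least the remaining sleep time, and can be longer if the OS timer
// granularity is coarser than the requested interval.
void waitBySleeping(const std::atomic<bool>& ready)
{
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::microseconds(100));
}

// Option B: block on a condition variable so that the signalling thread
// wakes the waiter directly instead of the waiter waking up to check.
struct Signal
{
    std::mutex mutex;
    std::condition_variable condition;
    bool ready = false;

    void notify()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            ready = true;
        }
        condition.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex);
        // The predicate guards against spurious wake-ups and against the
        // notification arriving before the waiter starts waiting.
        condition.wait(lock, [this] { return ready; });
    }
};
```

Option B trades the polling latency for a lock and a kernel-level wake-up, which has its own cost; which approach is cheaper for a given workload would need to be measured on each platform.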

Observations / things to note for now. I aim to fix as much as I can before release and return to this over time:

At some point a good goal would still be to reduce the use of locks, particularly in relation to the memory allocator, although at present a fully lock free memory allocator is probably out of scope for quite some time.
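One common way to reduce lock traffic without going fully lock-free is to put a per-thread cache in front of a locked shared pool, so that only cache misses take the lock. The sketch below illustrates that general technique under simplifying assumptions (a single fixed block size, one `LocalCache` per thread). It is not FrameLib's allocator, and `SharedPool`, `LocalCache` and `lockCount` are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Shared pool of fixed-size blocks, protected by a mutex.
class SharedPool
{
public:
    explicit SharedPool(std::size_t blockSize) : mBlockSize(blockSize) {}

    ~SharedPool()
    {
        for (void* block : mFree)
            std::free(block);
    }

    void* acquire()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        ++mLockCount;
        if (mFree.empty())
            return std::malloc(mBlockSize);
        void* block = mFree.back();
        mFree.pop_back();
        return block;
    }

    void release(void* block)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        ++mLockCount;
        mFree.push_back(block);
    }

    // Number of acquire/release calls that took the lock (for illustration).
    std::size_t lockCount()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mLockCount;
    }

private:
    std::size_t mBlockSize;
    std::mutex mMutex;
    std::vector<void*> mFree;
    std::size_t mLockCount = 0;
};

// Per-thread front end: recently freed blocks are reused without locking.
// Each instance must only ever be used from one thread.
class LocalCache
{
public:
    explicit LocalCache(SharedPool& pool) : mPool(pool) {}
    LocalCache(const LocalCache&) = delete;
    LocalCache& operator=(const LocalCache&) = delete;

    // Return any cached blocks to the shared pool when the thread is done.
    ~LocalCache()
    {
        for (void* block : mCache)
            mPool.release(block);
    }

    void* allocate()
    {
        if (mCache.empty())
            return mPool.acquire();   // cache miss: takes the shared lock
        void* block = mCache.back();
        mCache.pop_back();
        return block;                 // cache hit: no lock taken
    }

    // Note: push_back may itself allocate when the vector grows, so a
    // real-time version would reserve capacity up front.
    void deallocate(void* block) { mCache.push_back(block); }

private:
    SharedPool& mPool;
    std::vector<void*> mCache;
};
```

Allocating, freeing and allocating again through one `LocalCache` takes the shared lock once rather than three times. A real allocator would also need to handle multiple block sizes, bound the cache size, and deal with blocks freed on a different thread from the one that allocated them.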