Tom94 / tev

High dynamic range (HDR) image viewer for graphics people
BSD 3-Clause "New" or "Revised" License
1.02k stars · 86 forks

High CPU usage on Windows #153

Open diiigle opened 2 years ago

diiigle commented 2 years ago

I recently updated my tev binary (Windows, pre-built) from v1.17 to v1.19 and now experience a consistently high CPU load from tev (one thread at 100%). This happens even if I just start the application and leave it idle, with no images loaded yet.

I think it is connected to the IPC communication, as I observed some interesting behaviour there:

Background

I have an app that starts the tev process, connects to it over TCP, sends a CreateImage, then an UpdateImage packet. I can also repeat this process on the click of a button.

Steps

When I open an image for the first time, the tev instance opens correctly but doesn't show the image. tev runs at 100% CPU load on one thread. Even if I reopen the image (different name, so a new CreateImage/UpdateImage combo) from my GUI, it doesn't show either. If I close the tev window, the process lingers, still at 100% CPU load on one thread.

If I kill the tev process and click the "reopen image" button again, a second instance opens and shows the image correctly on the first go.

Thoughts

I'm inclined to rule out my side of the IPC implementation, as it works fine with v1.17 (every time) and with v1.19 (the second time). Even if this is unrelated to IPC, the high CPU load still suggests that something is spinning (possibly a busy-waiting lock) somewhere.

PS: It doesn't matter whether the instance is the primary or a secondary one; the CPU load is high in both cases.

diiigle commented 2 years ago

Breaking in a debugger stops at https://github.com/Tom94/tev/blob/master/src/main.cpp#L363 if that's of any help.

Unfortunately I'm not able to build v1.17 myself, as it complains about:

Fetched in submodule path 'dependencies/nanogui', but it did not contain 3a744201965d9191052b33f845026b5d68c79007. Direct fetching of that commit failed.

So git bisect grinds to a halt :/

Tom94 commented 2 years ago

Ouch, I must have force pushed onto the submodule at some point… a good lesson to never do that again. Sorry about breaking bisect.

In any case: thanks for reporting this! I’m currently abroad without access to a Windows machine, so it’ll be a couple of days until I can have a closer look.

Breaking on the sleep call seems correct FWIW (and should mean that that thread isn’t responsible for the 100% CPU usage… unless something is seriously wrong with sleep_for)

diiigle commented 2 years ago

For future reference: the 'missing' commit is actually related to swapping out the underlying nanogui repository; 3a74420 is the head of https://github.com/Tom94/nanogui/tree/tev. Git doesn't pick up changes to .gitmodules on its own: you have to manually delete the (initialized) submodule from .git/config and from the .git/modules/ folder, and delete the dependencies/nanogui folder. Then git submodule update --init --recursive will re-clone the correct 'old' nanogui.

I did manage to get v1.17 to build and experience the same issue there with my own built binary (remember: not with yours). Oddly enough, a colleague can replicate the problem with v1.19 when it is started through our application, but not when he starts it from Explorer / cmd.

Profiling

Profiling the CPU usage on multiple commits always returns this as the hot path: [profiler screenshot]

It seems to point to code that hasn't changed in tev for 4 years, and the underlying nanogui replicates that behaviour in both the old and the new version.

I see some locking calls in the constructor of istringstream, possibly connected to locales, just like here or here. Still puzzled what the cause for this is and why it occurs so sporadically.

Tom94 commented 2 years ago

Wow, thanks a bunch for the detailed investigation!

My best guess is a regression in the MSVC runtime and/or STL that wasn't there yet when I built the v1.17 release.

In any case -- I'm not sure we need to figure this out to solve the problem. Does it go away when you replace the istringstream dance with C++17's std::from_chars (https://en.cppreference.com/w/cpp/utility/from_chars)?

// requires #include <charconv>
Scalar value = 0;
std::string str = TextBox::value();
std::from_chars(str.data(), str.data() + str.size(), value);
return value;

If so, I wouldn't mind updating tev's nanogui fork with this change/fix. The latest tev already requires C++20 to compile, after all.

diiigle commented 2 years ago

Yes, the snippet below makes this part significantly more performant. I'm additionally avoiding the string copy (alloc + dealloc). It might be useful to push this upstream to Wenzel, in case he is willing to accept the C++17 requirement.

Scalar value = 0;
const std::string& str = TextBox::value();
std::from_chars(str.data(), str.data()+str.size(), value);
return value;

But the overall issue persists, and the hot path points at sleep_for. Apparently it internally uses sleep_until, with some clock::now() 🔥 and + arithmetic involved.

So I started looking at the value of the fpsBox. I played with your v1.19 binary and found an odd threshold that triggers the CPU load: with a value of 14 or below the CPU usage stays at 0, and with 15 and above it uses a full core continuously. 😲

Then I updated my VS (+ all SDKs) to the newest version. The issue persists, and I could replicate the 14/15 threshold with my own build (master branch, btw). It does not happen with the v1.17 binary.

Profiling

Again some profiling; this time I recorded the timings between consecutive hits on the breakpoint at the sleep_for. [timing screenshot] It seems that when the value is 15 or above, the thread never actually sleeps (<1 ms between consecutive triggers). With 14 it at least sleeps for some 140 ms (should be ~72 ms, but there might be debugger overhead).

The value fed to sleep_for is correct. I'm not sure about the type; maybe it's selecting a wrong overload. Either way, something rounds to zero somewhere...

Tom94 commented 2 years ago

Hmmm... I'm not sure whether ditching istringstream is worth it if it doesn't solve the underlying issue. The main advantage of istringstream is that it gracefully ignores leading whitespace -- plus, being locale-aware might be desired for nanogui? Perhaps worth asking in any case. IIRC, the latest nanogui version already requires C++17, so no problem on that front.

The timing issue with sleep_for is worrying to say the least. Thanks again for the continued digging! I'll try to reproduce it when I'm on my Windows machine. If nothing else helps, we could resort to waiting for a condition variable while playback is paused, rather than spinning idly... but it'd be quite unsatisfying to just work around the problem like that. :p

Tom94 commented 2 years ago

Unfortunately I'm unable to reproduce the problem on my Windows machine.

Regardless, in the hope that rounding is to blame, I experimentally changed the sleep interval to be based on an integer number of microseconds rather than floating-point seconds. Could you try the fix-high-cpu-usage branch and report back whether it makes a difference?

KarelPeeters commented 2 years ago

I'm experiencing the exact same issue, and unfortunately fix-high-cpu-usage does not seem to help.

For me the issue only happens over IPC when the process that sends data over the TCP socket is the same one that started tev. If I start tev myself and then connect to it over TCP, everything works fine.

Adding a delay between spawning tev and sending the CreateImage command also turns out to fix the issue, so there is probably a race condition there.

Tom94 commented 2 years ago

That's a good datapoint; thanks for sharing!

Would it be possible to get a minimal reproducing example (e.g. single cpp source file w/o dependencies) that I can compile locally and debug with?

diiigle commented 2 years ago

@KarelPeeters I'm somewhat happy to hear that the problem doesn't affect only me. This bug was driving me crazy and I started to question many things. Thanks for also looking into it.

@Tom94 On my end, building the fix-high-cpu-usage branch seems to solve the problem, both when starting from Explorer and as a child process with an IPC connection. I profiled the breakpoint hits inside the play loop and it correctly triggered at the set FPS frequency (with 1-2 ms debugger delay).

Unfortunately I can't provide a minimal example from my end (way too many dependencies). But if you e.g. build something with Python, I can offer to test whether it triggers the bug on my system.

KarelPeeters commented 2 years ago

I can reproduce the issue from both Python and Rust by starting tev and then immediately connecting to it. For example, in Python:

import subprocess
import time
import numpy as np
from ipc import TevIpc

print("Starting tev")
subprocess.Popen("cmake-build-release\\tev.exe")

# Adding this solves the issue
# time.sleep(1)

print("Connecting to tev")
with TevIpc() as tev:
    channels = ["R", "G", "B"]
    tev.create_image("test", 32, 32, channels)
    tev.update_image("test", np.random.rand(8, 8, 3), channels)

# keep this process alive
while True:
    time.sleep(1000)

This reliably causes tev to get stuck using a full thread's worth of CPU, with not even the initially empty image shown. I'm using the release build of fix-high-cpu-usage; the debug build appears to work fine.

I can investigate some more later today if necessary, but first I have to figure out how to get this to happen inside of a debugger or profiler.

diiigle commented 2 years ago

I did some quick tests with @KarelPeeters' script and summarize my results in the table below. For me, adding the delay does not change any behaviour, which is consistent with my earlier finding that the issue is unrelated to IPC (at least on my machine).

| Version | Starting from explorer | Subprocess + IPC | Subprocess + IPC with delay |
| --- | --- | --- | --- |
| v1.19 | 🔥 | 🔥 | 🔥 |
| v1.20 | ✔️ | ✔️ | ✔️ |
| fix-high-cpu-usage Debug | ✔️ | ✔️ | ✔️ |
| fix-high-cpu-usage Release | ✔️ | ✔️ | ✔️ |

Interestingly enough, I redownloaded the v1.19 binary, and while it still produces the high CPU load, the fps threshold at which it starts to work normally changed from 14/15 to 7/8. Why v1.20 does not exhibit the bug is beyond me. Using integer microseconds (in fix-high-cpu-usage) seems to do the trick for me; it probably takes a different path through the MSVC++ standard library than the floating-point seconds one. I suspect @KarelPeeters has a different MSVC++ runtime / redistributable version than me. That still does not explain why it is connected to IPC for him, or why he can suppress the bug by introducing that delay.

@KarelPeeters When testing your script, I noticed that sometimes an svchost.exe (responsible for inter-machine network communication) was causing the one-core CPU load instead of tev.exe. But I couldn't pinpoint when exactly it happened and when not. Are you using Resource Monitor to monitor CPU usage and looking at the exact process list? If you only look at the Task Manager CPU graph, you might mistake 'your' (IPC) bug for 'my' sleep_for bug. We will know more once you can also reproduce it inside a debugger.

Tom94 commented 2 years ago

I also cannot reproduce the issue, on both master and fix-high-cpu-usage.

I hit the merge button in any case, but will reopen the issue since things don't seem to be resolved on @KarelPeeters' end.

tazlarv commented 1 year ago

I was experiencing the same issue as @KarelPeeters, but it no longer seems to happen on version 1.25 (Windows release binary).

Specifically, whenever a tev process was started and immediately followed by establishing a TCP connection and sending a CreateImage command, the observed issue occurred (no command(s) received, one core stuck at 100% utilization).

If a delay was applied after establishing the TCP connection and before sending the CreateImage command, everything worked fine.

The observable issue, therefore, is that it was possible to establish a TCP connection with tev and send a CreateImage command before tev was "ready" to receive it. That is also why the delay in @KarelPeeters' example above (a delay after starting tev and before establishing the TCP connection) has the same effect: it gives tev additional time to initialize whatever it needs to receive commands via TCP.

I was able to replicate this issue on Windows release executable versions 1.18 to 1.24. Versions 1.17 and 1.25 work fine.

After a bit of digging with my colleague, we think that the cause may be the Windows SDK, as the Ipc implementation did not fundamentally change over time (and it, among other things, interacts with the filesystem). The Windows SDK version was changed explicitly in the CI between versions 1.17 and 1.18, and it may have changed again between 1.24 and 1.25 under the hood of the current windows-2022-based CI build.

sergeevabc commented 6 months ago

Errr... I can't even participate in the discussion because Tev refuses to start on my end.

$ tev spectrogram.png
SUCCESS  Initialized IPC, listening on 127.0.0.1:14158
INFO     Launching with 8 bits of color and LDR display support.
SUCCESS  Loaded C:\spectrogram.png via STBI after 0.078 seconds.
GLFW error 65543: WGL: A forward compatible OpenGL context requested but WGL_ARB_create_context is unavailable
ERROR    Uncaught exception: Could not create an OpenGL 3.2 context!

It seems you should document the hardware requirements that need to be met.

Tom94 commented 6 months ago

It would be helpful to know which hardware and OS / driver versions you experience this issue on.

sergeevabc commented 6 months ago

@Tom94, Core 2 Duo T7250 2GHz, 4 GB RAM (with 384 MB reserved for Intel 965 graphics, including OpenGL 2.0), Windows 7 x64.