genodelabs / genode

Genode OS Framework
https://genode.org/

sculpt/hw: noticeable performance degradation with increased uptime in virtualbox #5240

Open jschlatow opened 4 weeks ago

jschlatow commented 4 weeks ago

After almost a week of running Sculpt 24.04.1 with base-hw on my T490s, I repeatedly observed the following effect:

nfeske commented 4 weeks ago

Does this mysterious effect persist when restarting the VM?

If not, maybe you are observing an overflow of the free-running counter returned by Thread::_call_time? To rule out or detect this situation, would you consider adding a check in Timer::Device::now() in the timer driver: remember the current value, check that each new value is higher than the last one, and print an error should this invariant be violated?

Should the effect disappear upon the restart of the VM, the issue might be more local to the virtualization interface.
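The suggested invariant check could be sketched as follows. This is a standalone model, not the actual timer-driver code; `Monotonic_check` and its members are illustrative names, and the real check would live inside Timer::Device::now():

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

/* Standalone sketch of the suggested check: remember the last counter
 * value and print an error whenever time appears to move backwards.
 * All names are illustrative, not the actual Genode API. */
struct Monotonic_check
{
	uint64_t _last     = 0;
	bool     _violated = false;

	uint64_t now(uint64_t raw)  /* raw free-running counter sample */
	{
		if (raw < _last) {
			std::fprintf(stderr, "Error: counter went backwards: "
			             "%llu < %llu\n",
			             (unsigned long long)raw,
			             (unsigned long long)_last);
			_violated = true;
		}
		_last = raw;
		return raw;
	}
};
```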

jschlatow commented 3 weeks ago

Does this mysterious effect persist when restarting the VM?

Yes, it persists. Restarting the VM actually takes forever.

If not, maybe you are observing an overflow of the free-running counter returned by Thread::_call_time? To rule out or detect this situation, would you consider adding a check in Timer::Device::now() in the timer driver: remember the current value, check that each new value is higher than the last one, and print an error should this invariant be violated?

I instrumented the method as you suggested. The counter overflowed once, yet this did not have any noticeable impact. I observed the aforementioned effect only about an hour later and without any additional error message.

jschlatow commented 3 weeks ago

I just noticed that I had accidentally created this issue in the Goa repository and therefore transferred the issue.

ssumpf commented 3 weeks ago

@jschlatow: I can actually confirm this. A hard restart in Leitzentrale helps. So it looks like a problem in VBox6 or in one of its dependencies. As you stated, the problem takes a couple of hours to appear.

ssumpf commented 3 weeks ago

As additional information: I don't observe any system load in top_view, nor do CPU frequencies go up in msr_gui, where they remain at 800 MHz or below.

nfeske commented 3 weeks ago

I instrumented the method as you suggested. The counter overflowed once, [...]

That is very worrying because this can have all kinds of effects, ranging from the missed (or hours-late) firing of timeouts to the wrapping of the values returned by Timer::Session::elapsed_us, which clients do not expect. The continuity of kernel time is a fundamental assumption. Therefore, before investigating the problem anywhere else, we need to eliminate the wrapping of the time value reported by the kernel.

In my understanding, the kernel tick counter is a 64-bit value that counts at a rate of around 2 GHz. These values are converted to microseconds (timer_ticks_to_us) using a calibration value ticks_per_ms. This conversion must anticipate an uptime longer than a few hours, which is seemingly not the case right now.

I'll try to reproduce the wrap-around using a synthetic test.
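One way to make such a conversion robust against intermediate overflow is to split it into whole milliseconds and a remainder, so that the multiplication by 1000 never operates on the full tick value. The following is only a sketch under the assumptions stated above (64-bit counter, a ticks_per_ms calibration value), not the actual Genode implementation of timer_ticks_to_us:

```cpp
#include <cassert>
#include <cstdint>

/* Sketch of an overflow-tolerant ticks-to-microseconds conversion.
 * Splitting into whole milliseconds plus a remainder keeps the
 * intermediate multiplication small: 'rest' is always below
 * ticks_per_ms, so 'rest * 1000' cannot overflow 64 bits for any
 * realistic calibration value. The result is monotonic in 'ticks'.
 * (The final 'whole_ms * 1000' could still overflow, but only for
 * uptimes approaching 2^64 microseconds.) */
uint64_t ticks_to_us(uint64_t ticks, uint64_t ticks_per_ms)
{
	uint64_t const whole_ms = ticks / ticks_per_ms;
	uint64_t const rest     = ticks % ticks_per_ms;

	return whole_ms * 1000 + (rest * 1000) / ticks_per_ms;
}
```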

jschlatow commented 3 weeks ago

@nfeske I didn't write down the counter values, but I remember noticing that they were very close. Thinking about it, this didn't seem like a typical overflow situation.

chelmuth commented 3 weeks ago

By brute-force testing, I discovered that timer_ticks_to_us() does not always yield increasing results for increasing tick values; for example, with 0x4ffffffff, 0x500000000, and 0x500000001 as inputs, the clock ticks backwards by about 588 us. As I have not yet been able to fix the current implementation, you may just replace it by a simple return ticks * 1000 / ticks_per_ms for testing.
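A brute-force monotonicity test of this kind could look like the sketch below. It exercises the simple replacement formula; in the real test, `convert` would be the existing timer_ticks_to_us() instead, and the names, the start value, and the calibration value are all illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

/* Stand-in for timer_ticks_to_us: the simple replacement formula.
 * Note the multiplication overflows 64 bits for very large tick
 * values, so this is only suitable for testing. */
uint64_t convert(uint64_t ticks, uint64_t ticks_per_ms)
{
	return ticks * 1000 / ticks_per_ms;
}

/* Walk 'count' consecutive tick values starting at 'start' and report
 * the first value for which the converted time moves backwards. */
bool monotonic_around(uint64_t start, uint64_t count, uint64_t ticks_per_ms)
{
	uint64_t last = convert(start, ticks_per_ms);
	for (uint64_t t = start + 1; t < start + count; t++) {
		uint64_t const cur = convert(t, ticks_per_ms);
		if (cur < last) {
			std::printf("clock ticks backwards at ticks=%#llx\n",
			            (unsigned long long)t);
			return false;
		}
		last = cur;
	}
	return true;
}
```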

chelmuth commented 3 weeks ago

@jschlatow does the suggested change above or 6b6f31190c7089ff471599b5f5ef6eeecf1ec226 have positive impact on this issue?

jschlatow commented 3 weeks ago

@chelmuth I'm still in the process of testing the fix since the issue is a bit sporadic. Without 6b6f311, I noticed that the degradation occurred with and without a preceding timer overflow and, vice versa, I have seen timer overflows without noticing any degradation. I'll keep you updated.

jschlatow commented 2 weeks ago

@chelmuth Unfortunately, 6b6f311 does not affect this issue.