Timer inaccuracy - Githubissues

bitwiseworks / qtwebengine-chromium-os2

Port of Chromium and related tools to OS/2

9 stars 2 forks source link

Timer inaccuracy #22

Open dmik opened 3 years ago

dmik commented 3 years ago

The debug build sometimes spits assertions like this one:

[6077:16:1008/225338.806000:FATAL:histogram.cc(1096)] Check failed: false.

Deeper debugging reveals that the error comes from https://github.com/bitwiseworks/qtwebengine-chromium-os2/blob/026c9bf4c34c37bf7fbc48360ee75da8d56b0d86/chromium/base/time/time_now_posix.cc#L50

Which seems to report values that are smaller than the values it reported before. This results in negative time deltas which make hstogram code expectedly barf (negative time is not something possible in our Universe).

Here it happens both when using CLOCK_REALTIME (ends up in gettimeofday in LIBC which uses the millisecond counter from Global Info Segment, equivalent to what DosQuerySysInfo(QSV_MS_COUNT) returns if I get it right) and when using CLOCK_MONOTONIC (ends up in __libc_Back_timeHighResNano in LIBC which uses DosTmrQueryTime high resolution timer API).

It might be an SMP kernel issue, some out of sync between some data and how the thread sees it. I need to do more tests.

Note that this problem is potentially dangerous as negative deltas can screw up a lot of code.

Needed for https://github.com/bitwiseworks/qtwebengine-os2/issues/6.

dmik commented 3 years ago

Some corrections. What gettimeofday returns is basically the current system time represented as the number of milliseconds since the Epoch (Jan 1, 1970, UTC), not QSV_MS_COUNT. However, LIBC calculates it in some really strange way in __ftime by accessing GIS fields. I'm not sure what all this really means as I lack any documentation on GIS (Global info segment) and can only guess about the meaning of its fields. What I can tell is that __ftime is surely not thread safe — it uses some static variables so the results are hard to predict when several threads call it on SMP, especially when doing it for the first time. This may explain some weird values I see in CLOCK_REALTIME mode.

__libc_Back_timeHighResNano is actually not thread-safe either — the timer frequency is stored in the static variable the first time this function is called. And if two threads do it at the same time, the results might also get wrong.

At least, I'm calling DosTmrQueryTime directly in parallel (after having done thread-safe freq reading) and I don't see any negative deltas from it.

dmik commented 3 years ago

Ok, now I've got a proof that both reading the high res timer directly and via LIBC may both result in negative intervals. And while LIBC functions are not thread safe WRT the first call, it should not be related here. These are typical values when it breaks (us is LIBC reading, hr is direct from DosTmrQueryTime, ms is QSV_MS_COUNT, just for reference, the first number is the length of the last interval — between the previous and current call to ClockNow from Chromium which come in braces in that order; these counters use TLS so they are thread-safe per se; all values are in microseconds except ms which is milliseconds):

[13323:12:1009/024756.821000:ERROR:time_now_posix.cc(78)] ****** us 15 (25632304245:25632304260)
[13323:12:1009/024756.821000:ERROR:time_now_posix.cc(79)] ****** hr -54908 (25632304251:25632249343)
[13323:12:1009/024756.821000:ERROR:time_now_posix.cc(80)] ****** ms 0 (25638201:25638201)
[13323:12:1009/024812.694000:ERROR:time_now_posix.cc(78)] ****** us 18 (25648177896:25648177914)
[13323:12:1009/024812.694000:ERROR:time_now_posix.cc(79)] ****** hr -54908 (25648232831:25648177923)
[13323:12:1009/024812.694000:ERROR:time_now_posix.cc(80)] ****** ms 0 (25654074:25654074)
[13323:16:1009/024814.342000:ERROR:time_now_posix.cc(78)] ****** us 19 (25649825679:25649825698)
[13323:16:1009/024814.342000:ERROR:time_now_posix.cc(79)] ****** hr -54907 (25649880614:25649825707)
[13323:16:1009/024814.342000:ERROR:time_now_posix.cc(80)] ****** ms 0 (25655722:25655722)
[13323:1:1009/024814.672000:ERROR:time_now_posix.cc(78)] ****** us 32 (25650155204:25650155236)
[13323:1:1009/024814.672000:ERROR:time_now_posix.cc(79)] ****** hr -54899 (25650155218:25650100319)
[13323:12:1009/024814.672000:ERROR:time_now_posix.cc(78)] ****** us -54885 (25650155208:25650100323)
[13323:1:1009/024814.672000:ERROR:time_now_posix.cc(80)] ****** ms 0 (25656052:25656052)
[13323:12:1009/024814.672000:ERROR:time_now_posix.cc(79)] ****** hr 50 (25650155223:25650155273)
[13323:12:1009/024814.672000:ERROR:time_now_posix.cc(80)] ****** ms 0 (25656052:25656052)
[13323:1:1009/024815.442000:ERROR:time_now_posix.cc(78)] ****** us -54916 (25650924200:25650869284)
[13323:1:1009/024815.442000:ERROR:time_now_posix.cc(79)] ****** hr 12 (25650924205:25650924217)
[13323:1:1009/024815.442000:ERROR:time_now_posix.cc(80)] ****** ms 0 (25656822:25656822)
etc

As one may see, both us and hr result in negative intervals of about 55 ms. This looks suspicious as 55 has something to do with some timer frequency somewhere... IIRC 55ms is the old-style system timer granularity (one tick). The question is why does it get one tick back then? Looks like either SMP or heavy CPU load or both are related here. So far I'm out of any ideas on why and how to fix that.

dmik commented 3 years ago

BTW 55 ms is what 65536 cycles will count with a frequency of 1193167 (this is what DosTmrQueryFreq returns and this looks like a typical timer crystal frequency. Adding raw tick values in hex, I get this:

[16743:1:1009/040204.401000:ERROR:time_now_posix.cc(80)] ****** us 33 (30079889565:30079889598)
[16743:1:1009/040204.403000:ERROR:time_now_posix.cc(81)] ****** hr -54882 (30079889574:30079834692)
[16743:1:1009/040204.403000:ERROR:time_now_posix.cc(82)] ****** rw -65483 (85b3affd4:85b3a0009)
[16743:1:1009/040204.403000:ERROR:time_now_posix.cc(83)] ****** ms 0 (30084783:30084783)

[16743:12:1009/040204.401000:ERROR:time_now_posix.cc(80)] ****** us -54894 (30079889579:30079834685)
[16743:12:1009/040204.403000:ERROR:time_now_posix.cc(81)] ****** hr -54890 (30079889592:30079834702)
[16743:12:1009/040204.405000:ERROR:time_now_posix.cc(82)] ****** rw -65494 (85b3affea:85b3a0014)
[16743:12:1009/040204.405000:ERROR:time_now_posix.cc(83)] ****** ms 0 (30084783:30084783)

[16743:5:1009/040243.123000:ERROR:time_now_posix.cc(80)] ****** us 87 (30118612414:30118612501)
[16743:5:1009/040243.125000:ERROR:time_now_posix.cc(81)] ****** hr -54838 (30118612419:30118557581)
[16743:5:1009/040243.125000:ERROR:time_now_posix.cc(82)] ****** rw -65431 (85dfbff99:85dfb0002)
[16743:5:1009/040243.125000:ERROR:time_now_posix.cc(83)] ****** ms 0 (30123503:30123503)

[16743:16:1009/040247.299000:ERROR:time_now_posix.cc(80)] ****** us 12 (30122786868:30122786880)
[16743:16:1009/040247.299000:ERROR:time_now_posix.cc(81)] ****** hr -54910 (30122786873:30122731963)
[16743:16:1009/040247.299000:ERROR:time_now_posix.cc(82)] ****** rw -65517 (85e47ffee:85e470001)
[16743:16:1009/040247.299000:ERROR:time_now_posix.cc(83)] ****** ms 0 (30127679:30127679)

[16743:12:1009/040247.299000:ERROR:time_now_posix.cc(80)] ****** us 26 (30122786858:30122786884)
[16743:12:1009/040247.301000:ERROR:time_now_posix.cc(81)] ****** hr -54896 (30122786863:30122731967)
[16743:12:1009/040247.301000:ERROR:time_now_posix.cc(82)] ****** rw -65500 (85e47ffe2:85e470006)
[16743:12:1009/040247.303000:ERROR:time_now_posix.cc(83)] ****** ms 0 (30127679:30127679)

As one may see it really looks like some overflow handling error. When I see negative delta, the lower 16 bits of the previous reading look like 0xffe2 and the current reading is like 0x0006, so there was definitely an overflow, but the higher 16 bits are unchanged! (85e4 in the last case).

dmik commented 3 years ago

I can also confirm that commenting out PSD=ACPI.PSD in CONFIG.SYS solves the timer problem completely. Not a single failure. Of course, this makes the browser nearly unusable because of high CPU load, but that's different (#25).

So, the bug is in either ACPI or the kernel itself.

dryeo commented 3 years ago

Is it the HPET timer again? Are you lucky enough to have access to a system where HPET can be turned off?

dmik commented 3 years ago

Good call. No, I don't have a BIOS setting for it on my Asus k75vj but it has shed some light on the issue. I guess I have it sorted out now.

The data in GIS is updated by RTC on IRQ8 - a good old real time clock. This includes both the seconds since the Epoch counter (first GIS field, ULONG - what __ftime and gettimeofday use) and the free running msec counter (second GIS field, ULONG - what QSV_MS_COUNT returns and what __ftime uses to calculate milliseconds for gettimeofday and friends).
The DosTmr API of the stock OS/2 kernel uses Intel 8254, a good old clock signal chip. Its frequency is always 1.19 MHz (DosTmrQueryFreq returns 1193167 for it). It has nothing to do with HPET.
The DosTmr API of the OS4 kernel uses HPET, a modern clock signal chip. There DosTmrQueryFreq returns 14318179 on my motherboard, i.e. 14.3 MHz. I heard there are machines returning 20 MHz here.

I've double/triple/quadruple checked and the overflow bug happens only on the stock kernel and only when using the DosTmr API when ACPI.PSD is loaded (I've tried the latest version 3.23.14 BTW - the results is the same). The number of enabled cores does not matter — even if I use /MAXCPU:1, the bug still shows up. So it's definitely an ACPI.PSD problem of some sort.

On the OS4 kernel with its own ACPI4 I never get negative results even after quite some time.

Interesting enough that I also see negative intervals via GIS (us is data read from GIS.time via __ftime):

[1189:1:1009/173022.998000:ERROR:time_now_posix.cc(80)] ****** us -14400000000 (1602264622998000:1602250222998000)
[1189:1:1009/173022.998000:ERROR:time_now_posix.cc(81)] ****** hr 319 (1155364313:1155364632)
[1189:1:1009/173022.998000:ERROR:time_now_posix.cc(82)] ****** rw 380 (522ae3ec:522ae568)
[1189:1:1009/173022.998000:ERROR:time_now_posix.cc(83)] ****** ms 0 (1165808:1165808)

This is always -14400 seconds, which equals to 4 hours. Which makes me think that it might have some relation to the timezone setup (it's UTC+4 here — well, we actually have UTC+3 but TZ data in OS/2 is VERY old and doesn't know about that). And it has nothing to do with the kernel or ACPI, I see it on both kernels, with or without ACPI.PSD. Also, limiting the number of CPUS to just one makes this bug disappear. Which brings me back to the conclusion that it's a consequence of the fact that __ftime is not thread-safe and SMP mode quickly explodes it. I will check that just in case by reading GIS.time directly.

dmik commented 3 years ago

I've created a simple test case for the issue (http://rpm.netlabs.org/test/tmrtest2.zip) and reopened the respective Mantis ticket: https://mantis.arcanoae.com/view.php?id=1702#c20125. It explodes the bug almost immediately here, here is the typical output:

D:>TmrTest2.exe
DosTmrQueryFreq: ulTmrFreq=1193167 ticks/sec
Press ESC to exit
TID 4: ALERT: previous time 0x488414FFFE > current time 0x48841403F4 by 0xFC0A ticks (54076 us)
TID 6: ALERT: previous time 0x488418FCAB > current time 0x4884180001 by 0xFCAA ticks (54210 us)
TID 5: ALERT: previous time 0x48846AFFFE > current time 0x48846A033E by 0xFCC0 ticks (54228 us)
TID 5: ALERT: previous time 0x48848DFFFD > current time 0x48848D0166 by 0xFE97 ticks (54623 us)
TID 6: ALERT: previous time 0x4884BDFFFF > current time 0x4884BD057E by 0xFA81 ticks (53746 us)
TID 6: ALERT: previous time 0x4884CBFFFE > current time 0x4884CB04E8 by 0xFB16 ticks (53871 us)
TID 6: ALERT: previous time 0x488533FCBF > current time 0x4885330002 by 0xFCBD ticks (54226 us)
TID 3: ALERT: previous time 0x48857EFEE9 > current time 0x48857E0002 by 0xFEE7 ticks (54690 us)
TID 2: ALERT: previous time 0x4885CBFF17 > current time 0x4885CB0002 by 0xFF15 ticks (54729 us)

And not a single ALERT when no ACPI.PSD or under OS/4 Kernel. Other users confirm the results on several machines.

dmik commented 3 years ago

With the test build of ACPI from David (3.23.15) the problem seems to have gone — no more negative deltas and no more assertions in the debug build. However, I can't say that the timer is very smooth now. This test http://gus.biysk.ru/os4/test/timerTest.html demonstrates the following results in the release build when using the high res timer (via CLOCK_MONOTONIC in libc or directly via DosTmr, does not matter):

This is slightly worse than in Firefox 49 (with the ACPI fix, see the ArcaNoae ticket). My guess is that some timer/JS thread lacks priority boost (I remember changing thread priorities is a no-op on OS/2 ATM). Also might be related to #26.