genodelabs / genode

Genode OS Framework
https://genode.org/
Other
1.06k stars 250 forks source link

hw/rpi: timestamp integer overflow in depot_autopilot #4960

Closed chelmuth closed 10 months ago

chelmuth commented 1 year ago

I interpret the following logs as result of integer overflows / calculation errors in the timeout framework.

[2023-07-06 02:34:55] [init -> depot_autopilot] --- Run "test-timer" (max 30 sec) ---
[2023-07-06 02:34:55] [init -> depot_autopilot] 
[2023-07-06 02:34:55] [init -> depot_autopilot] 0.135 [init -> client] --- timer test ---
[2023-07-06 02:34:55] [init -> depot_autopilot] 0.160 [init -> client] register 4-seconds timeout...
[2023-07-06 02:34:59] [init -> depot_autopilot] 4.168 [init -> client] timeout fired - 0/159/160
[2023-07-06 02:35:00] [init -> depot_autopilot] 
[2023-07-06 02:35:00] [init -> depot_autopilot]  test-timer                      failed   4299.667  timeout 30 sec
[2023-07-06 02:32:09] [init -> depot_autopilot] --- Run "test-libc_vfs_ram" (max 30 sec) ---
[2023-07-06 02:32:09] [init -> depot_autopilot] 
[2023-07-06 02:32:10] [init -> depot_autopilot] 0.293 [init -> test-libc_vfs] calling mkdir(trailing_slash_dir_name, 0777) dir_name=testdir/
...
[2023-07-06 02:32:11] [init -> depot_autopilot] 1.916 [init -> test-libc_vfs] starting round 2
[2023-07-06 02:32:11] [init -> depot_autopilot] 
[2023-07-06 02:32:11] [init -> depot_autopilot]  test-libc_vfs_ram               failed   4296.903  timeout 30 sec
tomga commented 11 months ago

I investigated this and I found solution that seem to fix the issue. Commit tomga/genode-rpi@f072edab1e5a18f3e59ceaf523c1fbfe504d52c0 contains required change. Without this test-timer failed on each run and other tests failed randomly. With that change all depot autopilot tests run successfully few times.

Before I got to above solution, I browsed through time interpolation code and I believe I may have found an issue there, that can optimize calculations. In this line there is a variable defined that is not updated at all and it seems that the intention was to use it to quickly get out of probing loop, if current result is good enough. I think that tomga/genode@3290e35354772065cda5ac51b90b7bfe0309debb may be the proper fix. It seems to not break anything on rpi1 and during debugging caused quitting loop after 1 iteration, not 5.

chelmuth commented 11 months ago

@tomga I added a preliminary fixed label. I came across the missing latency_us too, but unfortunately forgot about it. @m-stein what do you think about the change?

tomga commented 11 months ago

@chelmuth, do you plan to cherry pick this change into genodelabs/genode-rpi repository?

chelmuth commented 11 months ago

@chelmuth, do you plan to cherry pick this change into genodelabs/genode-rpi repository?

Yes, but currently I do not understand what is fixed by the change...

tomga commented 11 months ago

Actually, I don't understand it fully either :-)

I saw that error comes from this function and it was the only calculation that I could imagine to suffer from being performed on too short integers.

m-stein commented 11 months ago

@tomga @chelmuth Thanks for pointing me to the latency_us issue! I found that this was broken already in the first version of this file and, somehow, made it unfixed through all my reviews throughout the years. As far as I can tell, https://github.com/tomga/genode/commit/3290e35354772065cda5ac51b90b7bfe0309debb is exactly what I meant to implement in 2017 and can be merged :)

m-stein commented 11 months ago

Regarding https://github.com/tomga/genode-rpi/commit/f072edab1e5a18f3e59ceaf523c1fbfe504d52c0: The CLO and CMP regs are uint32, so, I can imagine that the compiler used to do the subtraction in uint32 in contrast to uint64 (time_t). This makes a difference whenever CLO < CMP which should never be the case if IRQ = 1. However, I think we have a raise at hand that should be fixed by https://github.com/m-stein/genode-rpi/commit/1a315cc4ef39a3c7508aefdf4a60c117af496624.

skalk commented 11 months ago

@tomga @chelmuth regarding the system timer implementation for broadcom, the cast of the 32-bit counter value to a 64-bit value is no good solution. It might have fixed your observation by accident only?! However, if I fill value like 0x0 for the actual 32-bit counter, and 0xffffffff for the comparision register (meaning the interrupt triggered one tick before, and the counter wrapped, with this solution the calculation gets wrong. The actual problem lies in my work on "enabling strict warnings in base-hw" from February in commit genodelabs/genode-rpi@e943f72. Here I broke the overall 32-bit calculation with correct wrapping. I'll provide a solution quickly.

tomga commented 11 months ago

Probably my change just made the issue much less frequent. Probably 2^32 times less frequent, which would be definitely enough to not see it in test run.

skalk commented 11 months ago

I've provided a fix for the issue and the potential raise reported by @m-stein. Are you all fine with it? Then I would push it into genode-rpi upstream.

m-stein commented 11 months ago

I'm fine if it works on hardware.

chelmuth commented 11 months ago

No objections except a missing s in the comment about 32-bit registers.

tomga commented 11 months ago

@skalk, I also confirm that with this patch depot autopilot tests work and that this change looks definitely better than mine. Thank you.

skalk commented 11 months ago

@tomga thank you for your investigation and identifying the right place!

tomga commented 11 months ago

@tomga @chelmuth Thanks for pointing me to the latency_us issue! I found that this was broken already in the first version of this file and, somehow, made it unfixed through all my reviews throughout the years. As far as I can tell, tomga@3290e35 is exactly what I meant to implement in 2017 and can be merged :)

@chelmuth I assume that issue was marked as completed automatically and modification of interpolation code was just forgotten, am I correct?

chelmuth commented 11 months ago

@chelmuth I assume that issue was marked as completed automatically and modification of interpolation code was just forgotten, am I correct?

I intend to merge the adaption of the interpolation after the Sculpt release. The RPi commit solved the issue and triggered the close on merge.

chelmuth commented 11 months ago

@tomga merged 567457e3c144ebd88e05eacff7d38b17baf3a129 just now.

tomga commented 11 months ago

Thanks. I wasn't aware that Sculpt release is so close and thought that we may forget about this change if we have the issue closed.

nfeske commented 10 months ago

Fixed in master.