Flaky test: cluster/restart

gibmat commented 2 months ago

During a recent rebuild of dqlite 1.16.4 (using bundled libraft) on an arm64 host, one test failed. It's likely this is a flaky test:

> cluster/restart                                             
>   num_records=0, disk_mode=0                                [ OK    ] [ 0.09290296 / 0.01869385 CPU ]
>   num_records=0, disk_mode=1                                [ OK    ] [ 0.11798750 / 0.01738483 CPU ]
>   num_records=1, disk_mode=0                                [ OK    ] [ 0.16095490 / 0.01925361 CPU ]
>   num_records=1, disk_mode=1                                [ OK    ] [ 0.09314319 / 0.01966425 CPU ]
>   num_records=256, disk_mode=0                              [ ERROR ]
> Error: test/lib/server.c:65: assertion failed: rv == 0 (1 == 0)
> Error: child killed by signal 6 (Aborted)
>   num_records=256, disk_mode=1                              [ ERROR ]
> Error: test/lib/server.c:65: assertion failed: rv == 0 (1 == 0)
> Error: child killed by signal 6 (Aborted)
>   num_records=993, disk_mode=0                              [ OK    ] [ 4.22504766 / 0.85921956 CPU ]
>   num_records=993, disk_mode=1                              [ OK    ] [ 4.26175325 / 0.89829095 CPU ]
>   num_records=2200, disk_mode=0                             [ OK    ] [ 7.24100937 / 0.67460259 CPU ]
>   num_records=2200, disk_mode=1                             [ OK    ] [ 7.25931588 / 0.81881875 CPU ]

cole-miller commented 2 months ago

Thanks, we'll keep an eye on it. Unfortunately our tests are not well-isolated so this kind of flakiness is hard to squash entirely. If if happens again then LIBDQLITE_TRACE=1 output would be useful in figuring out the culprit.

gibmat commented 2 months ago

Yeah, I haven't been able to reproduce this exact failure myself. Would adding LIBDQLITE_TRACE=1 when running the tests have any adverse side effects? If not, I'll just add it to the normal packaging rules so it's always there to help with future debugging.

cole-miller commented 2 months ago

No, there shouldn't be any adverse effects.

freeekanayaka commented 2 months ago

I'd say that the problem is that those tests are time-dependent and not deterministic. I would argue that in this kind of unit/integration test suites that are ran at package build time there should be only deterministic and time-independent tests.

From what I've seen, this (and other) tests just fail because the hard-coded timeouts or time expectations don't match the capacity of the underlying hardware.

While adding LIBDQLITE_TRACE=1 might help figuring exactly what timing is wrong, I think the most robust solution would be to rewrite those tests, because tweaking timings is intrinsically fragile.

gibmat commented 2 months ago

I've enabled LIBDQLITE_TRACE=1 when running the tests, and it will be included whenever the next upload of dqlite is made to unstable.

canonical / dqlite

Flaky test: cluster/restart #643