daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Codegen test fails non-deterministically in GitHub action #729

Open pdamme opened 4 months ago

pdamme commented 4 months ago

The test suite contains a codegen test case (test/codegen/CodegenTest.cpp). Currently, this test case sometimes succeeds and sometimes fails when run in the GitHub action. For instance, in PR #728, all test cases passed in the GitHub action. After putting the single commit of that PR on the main branch (rebase & merge, resulting in 4eb07576b58e854ef096877722f0ebbc7339b5c4, with no additional changes at all), the test case failed in the CI. The problem is also affecting PR #675 at the moment, but I guess it could randomly happen in any PR or commit on main.

On my local machine, I cannot reproduce this issue. It may be related to multi-threading or any other source of non-determinism that is different in the CI environment.
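One way to probe for this kind of non-determinism locally is to re-run the failing test many times and count the failures. The following is only a minimal sketch: `count_failures` is a hypothetical helper, and the command in the example is a stand-in that fails randomly, to be replaced with the actual test invocation.

```python
import subprocess
import sys


def count_failures(cmd: list[str], runs: int) -> int:
    """Run `cmd` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Capture output so that repeated runs do not flood the terminal.
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Stand-in command (an assumption, not the real test): exits with
    # status 1 in roughly 30% of the runs to mimic a flaky test.
    flaky_cmd = [
        sys.executable,
        "-c",
        "import random, sys; sys.exit(1 if random.random() < 0.3 else 0)",
    ]
    runs = 20
    print(f"{count_failures(flaky_cmd, runs)} of {runs} runs failed")
```

If the real test never fails in such a loop on a local machine, that would support the suspicion that the cause lies in something specific to the CI environment.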

There have been problems with the codegen test case before: it got stuck non-deterministically (reported in the context of PR #675), which was fixed in c2900a39189a8e7e407082278a33f6ef3981ff76.

It is not clear whether the problem is caused by the codegen test itself or whether it merely surfaces there.

philipportner commented 3 months ago

Looks like we finally got logs of an error.

https://github.com/daphne-eu/daphne/actions/runs/9701687915/job/26775861276

On what CPU is the CI executed? Is it the same hardware every time? It could be that, for certain CPUs, the tiling logic simply does not produce the tiling that the FileCheck directives in the matmul_tile.mlir test case expect.
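To illustrate the suspicion, here is a minimal sketch of how a cache-dependent tiling heuristic could yield different tile sizes on different machines. This is not DAPHNE's actual tiling logic: `pick_tile_size`, the cache sizes, and the expected value are all hypothetical and only show why a fixed expected output could match on one runner and not on another.

```python
import math


def pick_tile_size(cache_bytes: int, elem_bytes: int = 8) -> int:
    """Hypothetical heuristic: the largest power-of-two tile edge T such
    that three T x T tiles (for A, B, and C) fit into the given cache."""
    max_edge = math.isqrt(cache_bytes // (3 * elem_bytes))
    # Round down to the nearest power of two.
    return 1 << (max_edge.bit_length() - 1)


# Tile size that a fixed test expectation might hard-code (assumed value).
EXPECTED_TILE = 64

# Two hypothetical CI runners with different cache sizes.
for cache_bytes in (256 * 1024, 1024 * 1024):
    tile = pick_tile_size(cache_bytes)
    status = "matches" if tile == EXPECTED_TILE else "does NOT match"
    print(f"cache={cache_bytes} B -> tile={tile}, {status} the expectation")
```

Under these assumptions, the same test would pass on the first machine and fail on the second, without any change to the code under test.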

philipportner commented 3 days ago

Still happening: https://github.com/daphne-eu/daphne/actions/runs/11124869078