packed::iter::performance on riscv64 has PermanentlyLocked failure running fixture #1605

Closed: EliahKagan closed this issue 2 weeks ago

EliahKagan commented 2 weeks ago

Current behavior 😯

The gix-ref-tests::refs packed::iter::performance test fails on a riscv64 machine when I run all tests and force archives to be regenerated by running:

GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --no-default-features --features max-pure --all --no-fail-fast

The way it fails is:

Error: PermanentlyLocked { resource_path: "make_repository_with_lots_of_packed_refs", mode: AfterDurationWithBackoff(180s), attempts: 202 }

With a bit of further context:

        SLOW [>180.000s] gix-ref-tests::refs packed::iter::performance
        FAIL [ 180.162s] gix-ref-tests::refs packed::iter::performance

--- STDOUT:              gix-ref-tests::refs packed::iter::performance ---

running 1 test
test packed::iter::performance has been running for over 60 seconds
test packed::iter::performance ... FAILED

failures:

failures:
    packed::iter::performance

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 124 filtered out; finished in 180.12s

--- STDERR:              gix-ref-tests::refs packed::iter::performance ---
Error: PermanentlyLocked { resource_path: "make_repository_with_lots_of_packed_refs", mode: AfterDurationWithBackoff(180s), attempts: 202 }

        PASS [ 212.628s] gix-ref-tests::refs packed::find::find_speed
------------
     Summary [ 631.358s] 2538 tests run: 2537 passed (8 slow, 1 leaky), 1 failed, 3 skipped
        FAIL [ 180.162s] gix-ref-tests::refs packed::iter::performance
error: test run failed

So this looks like a deadlock, though I don't know that it is one. This does not happen on an x64 machine; the x64 machine I have been testing on is faster, but it has the same number of cores as the riscv64 machine I've been testing on: both have four cores. The riscv64 machine has this cat /proc/cpuinfo output, and more information on the details of its hardware is available in https://github.com/zlib-ng/zlib-ng/issues/1705#issuecomment-2366412936 and on this page.

Full details of all max-pure runs are in this gist, which supersedes the earlier gist linked in https://github.com/rust-lang/libz-sys/pull/218#issuecomment-2369323532 where I first noticed the problem. The significance of that older gist is that it shows the problem happens even with max. Other runs I've done to investigate this have been with max-pure, to examine the problem independently of https://github.com/rust-lang/libz-sys/issues/200 (https://github.com/zlib-ng/zlib-ng/issues/1705).

The strangest thing is that the problem only occurs if I rebuild the code. Specifically, cleaning with git restore . followed by gix clean -xd -m '*generated*' -e is insufficient to cause the next full run to have the failure, but cleaning with git restore . followed by gix clean -xde is sufficient to cause the next full run to have the failure. I have verified that content under target/ is the only content reported by gix clean -xdn as eligible for deletion after running gix clean -xd -m '*generated*' -e, so it appears that, somehow, rebuilding is part of what is needed to produce the problem.

Although it is already present in the new gist's readme, the gist's "table of contents" serves both as a summary of which kinds of runs produce which results (in the order in which I did them) and as a collection of links to the nine runs, in case one wishes to examine them in detail. Therefore, I quote it here:

All cleaning included git restore . where applicable, even if not mentioned in the summaries below.

All tests in the newer gist were run at the current tip of main, 612896d. They were all run on Ubuntu 24.04 LTS systems. The two x86 runs were on one system, and the seven riscv64 runs were, of course, on a different system (but the same system as each other). Because the older gist also shows the problem, it is not new, at least not newer than be2f093.

Expected behavior 🤔

All tests should pass.

Secondarily, it seems to me that when git restore . has been run and no intervening commands have modified the working tree, gix clean -xd -m '*generated*' -e should be as effective as gix clean -xde at resetting the state associated with fixtures that are forced to rerun due to GIX_TEST_IGNORE_ARCHIVES=1, at least when repeated re-running has not revealed any nondeterminism in fixture-related failures. However, I am not certain that a specific, separate bug corresponds to this second expectation.

Git behavior

Probably not applicable, unless the speed at which the git commands in the fixture script are run turns out to be a contributing factor (but I think that would still not be git behavior that corresponds to the code in gitoxide where the test fails).

Steps to reproduce 🕹

On a riscv64 machine running GNU/Linux (specifically Ubuntu 24.04 LTS, if one wishes to reproduce the setup I used), either clone the gitoxide repo, or run git restore . and gix clean -xde in the already cloned gitoxide repo (after ensuring that there are no valuable modifications that could be lost by doing so). Then run:

GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --no-default-features --features max-pure --all --no-fail-fast

To observe that it reliably happens when run this way, run gix clean -xde and then that test command again. This can be done as many times as desired.

To observe that rebuilding seems to be required to produce the problem, replace gix clean -xde with gix clean -xd -m '*generated*' -e in the above procedure and verify that (except on the first run, which is already clean) the problem does not occur. Going back to gix clean -xde verifies that the order is not the cause.

In case it turns out to be relevant, running git version on the system I used reports git version 2.43.0. More specifically, it is the downstream version 1:2.43.0-1ubuntu7.1 (packaged for Ubuntu 24.04 LTS), as revealed by running apt list git.

It occurs to me that the inability to produce the problem without having just recently rebuilt might be due to the effect of rebuilding on dynamic clock speed. However, if that were the case, I would expect to observe at least some nondeterminism in the failures, since the failing test is not one of the earliest tests to run. I may be able to investigate that further if other approaches do not reveal what is going on here.
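
If I get to the clock-speed angle, a minimal sketch along the following lines could sample per-core frequency while the suite runs. It assumes the kernel exposes cpufreq through sysfs on this board (which it may not), and nothing about it is specific to gitoxide:

    // Sketch for sampling per-core clock speed on Linux via sysfs cpufreq.
    // Assumes /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq exists,
    // which depends on the kernel and board exposing a cpufreq driver.
    use std::fs;
    use std::thread::sleep;
    use std::time::Duration;

    fn main() {
        loop {
            let mut cpu = 0;
            loop {
                let path = format!("/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq");
                match fs::read_to_string(&path) {
                    // scaling_cur_freq reports kHz; print MHz for readability.
                    Ok(khz) => print!("cpu{cpu}: {} MHz  ", khz.trim().parse::<u64>().unwrap_or(0) / 1000),
                    Err(_) => break, // no more cores, or no cpufreq support
                }
                cpu += 1;
            }
            println!();
            sleep(Duration::from_secs(1));
        }
    }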

Byron commented 2 weeks ago

Thanks a lot for reporting!

My recommendation here is to try increasing the timeout for file-based locks in gix-testtools, which seems to be 180s now. This value has been increased multiple times before, as similar failures were encountered on Windows.

My guess here is that rebuilding something while the tests are running is so slow, while involving the same fixture, that the performance test starves before it even gets started: its fixture can't be run/validated because another test, probably the one building a binary, is holding it.
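
To illustrate that failure mode, here is a hypothetical, std-only sketch of duration-bounded lock acquisition with backoff. It is not the gix-lock or gix-testtools code, and every name in it is made up; it only shows how a fixed budget plus retries ends in an error shaped like the PermanentlyLocked value above once something else holds the lock for the entire duration:

    // Hypothetical sketch of duration-bounded lock acquisition with backoff,
    // NOT the gix-lock/gix-testtools implementation. It only illustrates why a
    // slow machine can exhaust the whole budget and end with an error shaped
    // like `PermanentlyLocked { .., attempts: N }`.
    use std::fs::OpenOptions;
    use std::path::Path;
    use std::thread::sleep;
    use std::time::{Duration, Instant};

    #[derive(Debug)]
    struct PermanentlyLocked {
        resource_path: String,
        budget: Duration,
        attempts: u32,
    }

    fn acquire_with_backoff(resource: &Path, budget: Duration) -> Result<std::fs::File, PermanentlyLocked> {
        let lock_path = resource.with_extension("lock");
        let start = Instant::now();
        let mut delay = Duration::from_millis(10);
        let mut attempts = 0u32;
        loop {
            attempts += 1;
            // `create_new` fails if the lock file already exists, i.e. another
            // process (here: another test generating the same fixture) holds it.
            match OpenOptions::new().write(true).create_new(true).open(&lock_path) {
                Ok(file) => return Ok(file),
                Err(_) if start.elapsed() < budget => {
                    sleep(delay.min(budget.saturating_sub(start.elapsed())));
                    // Exponential backoff, capped so polling continues near the deadline.
                    delay = (delay * 2).min(Duration::from_secs(1));
                }
                Err(_) => {
                    return Err(PermanentlyLocked {
                        resource_path: resource.display().to_string(),
                        budget,
                        attempts,
                    })
                }
            }
        }
    }

    fn main() {
        // With a 180s budget, fixture generation that outlives the budget on a
        // slow machine surfaces as the error variant instead of an acquired lock.
        match acquire_with_backoff(
            Path::new("make_repository_with_lots_of_packed_refs"),
            Duration::from_secs(180),
        ) {
            Ok(_lock) => println!("acquired"),
            Err(err) => eprintln!("Error: {err:?}"),
        }
    }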

An alternative validation would be to limit the scope of the tests that are run.

Lastly, while the tests are running, git status will show lock files in the working tree. From those it's possible to deduce which fixture they are protecting, and with that one can probably see which tests are currently running. The expectation is that the performance test is running and waiting for the lock, while one more test is running that is holding it.
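
For convenience, a throwaway scan along these lines could list any lock files without reading git status output by hand. The .lock extension and the skipped directories are assumptions, and this is not part of gitoxide:

    // Throwaway helper: recursively list `*.lock` files under the current
    // directory. The `.lock` extension and the skipped directories are
    // assumptions; this is not gitoxide code.
    use std::fs;
    use std::io;
    use std::path::{Path, PathBuf};

    fn find_lock_files(dir: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            if path.is_dir() {
                // Skip build output and git metadata; fixture locks live elsewhere.
                if path.file_name().map_or(false, |n| n == "target" || n == ".git") {
                    continue;
                }
                find_lock_files(&path, found)?;
            } else if path.extension().map_or(false, |ext| ext == "lock") {
                found.push(path);
            }
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        let mut found = Vec::new();
        find_lock_files(Path::new("."), &mut found)?;
        for path in &found {
            println!("{}", path.display());
        }
        Ok(())
    }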

EliahKagan commented 2 weeks ago

Sounds good, I'm experimenting with that now.

EliahKagan commented 2 weeks ago

I've opened #1606, which doubles the timeout and seems to fix this.

I'm not totally sure what's going on, but it looks like there was a contribution from build processes that were triggered by tests. cargo nextest builds all tests beforehand, but some tests trigger further building, and those artifacts, as noted in https://github.com/Byron/gitoxide/pull/1361#issuecomment-2149007626, are also written into target/ and will thus be reused if target/ is not cleaned.

I didn't observe lock files or anything else of note from running git status while the affected test or other slow tests were running, but with top and ps aux I did observe cargo and rustc processes, as well as git pack-refs --all, and sometimes processes like submodule-b7b02.