EliahKagan opened 6 months ago
While a number of changes have been made to the test suite, this issue holds up without modification: the same tests fail on Windows as before (and as shown above) when GIX_TEST_IGNORE_ARCHIVES=1, at c1d7c023d595eb04891b65295f001d85c9ba8074.
The gix-ref-tests::refs packed::iter::performance test may depend on the speed of the test machine, but that was the case before. On my machine, it continues to take about the same amount of time (and fail for taking too long).
In view of the insight in #1442 that the Git Bash (MSYS2) behavior of ln -s creating copies rather than symlinks is behind the new failure (not listed above) of gix-diff-tests::diff tree::changes::to_obtain_tree::many_different_states when GIX_TEST_IGNORE_ARCHIVES=1, I was curious whether causing it to actually create symlinks, by setting MSYS=winsymlinks:nativestrict, would cause any of the failing tests noted here to pass.
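For reference, such a run looks roughly like this in Git Bash (the cargo nextest flags are illustrative, mirroring the full-run command shown later in this thread, and not necessarily the exact invocation used):
MSYS=winsymlinks:nativestrict GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --workspace --no-fail-fast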
It causes one of the tests whose failure is noted here to pass:
gix::gix object::tree::diff::track_rewrites::copies_in_entire_tree_by_similarity_with_limit
The other failures here are unaffected.
It also causes 9 tests that had otherwise passed to fail, both with and without GIX_TEST_IGNORE_ARCHIVES=1. I've opened #1443 for that.
As expected per https://github.com/Byron/gitoxide/issues/1358#issuecomment-2209859265, the fixes for #1443 in #1444 have also made this test, originally reported as failing here, pass:
gix::gix object::tree::diff::track_rewrites::copies_in_entire_tree_by_similarity_with_limit
I have edited the description here accordingly, updating the title and summary list of failing tests, linking to a new gist with the current output of a full run, and also revising for clarity and adding some missing details.
Here are updated runs. The failing tests are the same. This is just to confirm that, since I'm working on some changes in gix-testtools (in #1591 and elsewhere) and ran this on the current tip of main to compare against. The reason for doing two runs within a short time of each other and reporting both in the gist is to check that the number of tests reported as leaky varies across runs, rather than having changed due to recent code changes (over time or across branches).
2024-10-22 update: The situation is unchanged at db5c9cf (gitoxide v0.38.0).
A new test from #1618, gix-merge::merge tree::run_baseline, also fails on Windows with GIX_TEST_IGNORE_ARCHIVES=1. But I have not added it to the list here, because #1652 includes a fix for it in the changes it proposes.
Thank you! Do you think it makes sense (and is in the time budget) to run the tests on Windows as well with GIX_TEST_IGNORE_ARCHIVES=1? If so, I think it makes sense to do it.
Answering my own question, it's clearly not in the cards, as the fast test takes 3 to 4 minutes on macOS and Linux but will take 14-15 minutes on Windows, while the full test takes 13 minutes on Linux.
However, what if the new internal tool would be used to detect changes compared to the base of a PR, extract the involved crates, and then would run the tests only for the crates in question with the ignore-archives flag enabled? Probably this can even be done in scripting entirely without having to compile internal at all.
What do you think?
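As a rough sketch of that idea (this assumes workspace directory names match crate names, which holds for most crates here but is an assumption of the sketch rather than part of any existing tooling):
base=$(git merge-base origin/main HEAD)          # base of the PR
for crate in $(git diff --name-only "$base" HEAD | cut -d/ -f1 | sort -u); do
    [ -f "$crate/Cargo.toml" ] || continue       # skip paths that are not crate directories
    GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run -p "$crate"
done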
However, what if the new internal tool would be used to detect changes compared to the base of a PR, extract the involved crates, and then would run the tests only for the crates in question with the ignore-archives flag enabled? Probably this can even be done in scripting entirely without having to compile internal at all.
It seems to me that, depending on how narrowly or broadly it detects crates as involved, this might either miss breakages that it is intended to catch, or implicate too many crates to confer an adequate speed improvement. My thinking is that changes to one crate can affect any crates that depend on it directly or indirectly, and the effects may only be seen in tests of functionality in the dependent crate.
You are right, my idea was probably too naive to catch everything in practice. However, what about running the full test suite just like on Linux, as an additional job which, just like the fuzzing, isn't linked to the auto-merge feature? This means that failures will trigger an email notification, so even if the changes are merged, it will become clear sooner that something broke. Depending on the longevity of a PR, it will be more common that these failures prevent a merge for the right reasons than that they fail to prevent one.
It's an improvement with well-known drawbacks. If you think the same, I think a PR would be quickly done for this one and I wouldn't steal that opportunity :).
You are right, my idea was probably too naive to catch everything in practice.
The initial idea in https://github.com/GitoxideLabs/gitoxide/issues/1358#issuecomment-2456354745 may still be worth doing, even though it would not catch everything. Maybe it could be added later, atop the subsequent simpler idea of https://github.com/GitoxideLabs/gitoxide/issues/1358#issuecomment-2457234171, as a required check (or, required or not on CI, as a tool that could be run locally, such as when targeting platforms not covered by CI). But I do not plan to implement it immediately.
However, what about running the full test suite just like on Linux, as an additional job which, just like the fuzzing, isn't linked to the auto-merge feature?
I agree with doing this.
Please note that this will probably still slow down required checks sometimes, by causing them to be queued longer. As far as I know, runners for required and non-required checks always share the same quota and there is no prioritization of required checks.
Although this check could be made dependent on other checks, that would probably be undesirable because it would make things take longer when load is low (by running it and the required checks in series instead of parallel), and because it would not address how checks from earlier events (e.g. earlier pushes to a pull request branch) would still use up quota (by which I just mean rate limiting). It's possible to cause runs on earlier commits to be canceled automatically, but I recommend against that because it would make statuses hard to understand and because comparing output for different commits can be valuable.
If new non-required checks turn out to slow down required checks too much, then they could be scaled back or removed. My guess is that it will be okay. To speed things up, I think that:
- This should be like test-fast but with GIX_TEST_IGNORE_ARCHIVES=1, and not include the other elements of the full test job.
- The GIX_TEST_IGNORE_ARCHIVES=1 job could run on whichever of ARM64 and x86-64 is faster, unless a reason arises to prefer one of them for another reason. (The x86 runner might be faster if Git for Windows does not yet have native ARM64 builds.)
If you think the same, I think a PR would be quickly done for this one
I don't anticipate that this would be quick to implement, though it may just be that I haven't thought of the right approach. I think these checks should report success when no unexpected failures happen; otherwise, they would be constantly failing.
(It may also be that they should fail when a test that is expected to fail passes, so that it can be reviewed and removed from being marked or listed as being expected to fail.)
Maybe this is easy to do, but I don't know how to do it with cargo nextest. If possible, I think the new CI check(s) should really run all the tests, even those that are expected to fail, and report their failures, showing details. As far as I know (and can find in the documentation), cargo nextest doesn't support xfail.
The only general approach I know of to do this is to parse the output of cargo nextest. This could be done on the human-readable output, but cargo nextest supports saving XML, and enabling this does not suppress what is usually displayed. It's fairly unobtrusive, being written into a subdirectory of target/, so it may be that it could just be enabled by default, but if not then a separate nextest profile can be configured.
Assuming this is done, the configuration to output XML should probably be placed in Cargo.toml and not with a nextest.toml (unless the configuration is being added on the fly on CI, rather than being committed). However, for testing I made a .config/nextest.toml file:
[profile.with-xml.junit]
path = "junit.xml"
Then I ran:
TZ=UTC GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest --profile=with-xml run --workspace --no-fail-fast
Even though not currently intended for any of the GNU/Linux runs, I did this experiment on Arch Linux rather than Windows for speed. The results are in this gist, showing both the displayed human-readable text output and the saved XML output.
The most important thing to figure out is if I am missing something such that some altogether different approach is preferable. (Thus, I am replying here with what I have found so far, rather than waiting until a PR is ready.)
The next most important thing to figure out is how to elegantly parse the output. I'm not very familiar with the JUnit XML output format, but my understanding is that it has a lot of tooling support, so there may even be purpose-built tools for this that would be preferable to using a general-purpose XML parsing tool.
Thanks a lot for the detailed response!
- This should be like test-fast but with GIX_TEST_IGNORE_ARCHIVES=1, and not include the other elements of the full test job.
That's a great idea!
I don't anticipate that this would be quick to implement, though it may just be that I haven't thought of the right approach. I think these checks should report success when no unexpected failures happen; otherwise, they would be constantly failing.
It's incredible how I could miss this, given the title of the issue is 15 tests fail on Windows with GIX_TEST_IGNORE_ARCHIVES=1, and I am very sorry to have wasted your time :/. I truly thought doing this was immediately possible.
But on the bright side, thanks to that you may have looked into it earlier and to my mind found a very viable approach. It seems that even the most simple of all ways of parsing, a grep for <failure, would already do the trick in seeing if there are failures. I really like the idea of making this job dependent on the normal test-fast, as then it would only run if test-fast succeeds, hence no new 'normal' issues were found. And if one would make that parent job the macOS version of the test, it could be done in 3-4 minutes, adding minimal latency only.
Sure, just doing a grep and counting lines isn't bulletproof, but I think it's solid enough to make use of it, probably catching all the cases that it could catch in its lifetime - after all this is quite a rare occurrence to begin with and it's unlikely that somehow one of the 15 tests is fixed while a new one appears, hence masking the anomaly.
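Something as small as this would provide that signal, assuming the junit.xml location from the with-xml profile shown earlier (the path is otherwise configurable):
grep -c '<failure' target/nextest/with-xml/junit.xml    # counts lines reporting a test failure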
Even though I started my answer somewhat disappointed I definitely finish it with excitement :). Thank you!
Since XML output offers counts of each status across the entire run, including failure and error status, I think it may actually be easier to parse that precisely out of the XML. cargo nextest doesn't seem to support writing a copy of its human-readable output to a log file. This can be achieved with tools that may not be present on Windows runners, like script, or by piping to tee. In either case, for the output to be fully readable, it should be displayed colorized when run, which complicates parsing if it is to be done precisely (ANSI color sequences ending in m have unintuitive effects on where word boundaries are, for example), and complicates debugging if parsing is to be done imprecisely.
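For illustration, the tee variant might look like this in a Unix-like shell (the log file name is arbitrary, and --color=always is only needed because piping would otherwise disable the colorized output):
GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --workspace --no-fail-fast --color=always 2>&1 | tee nextest.log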
If we do end up parsing that out and counting, then I think we can parse the count of failed tests in the summary rather than counting the number of lines that fail, and this can be matched in such a way that no failure reports can be misinterpreted as it, if we are willing to match across multiple lines. But I think that this approach, even if corrected and simplified, and even if scaled back by allowing some ambiguity, will still be more complicated than parsing XML, which (based on local testing) seems like it can be done in a couple of lines of PowerShell. For example, if we only had to count failures (and not errors) then this, which I have verified works locally, would be sufficient, and adding errors should be no hurdle:
[xml]$junit = Get-Content -Path 'target/nextest/with-xml/junit.xml'
[int]$junit.testsuites.failures
The OS of most interest for this job--and possibly the only one it will use--is Windows, which even defaults shell to pwsh on GHA runners, but I believe that shell is also available on the non-Windows GitHub-hosted runners and can be specified as the value of the shell key.
Although I had originally intended the changes in #1654 to be the first part of a PR to do all this, only those changes, conceptually unrelated to this, are ready. So I've opened that sooner.
I really like the idea of making this job dependent on the normal test-fast, as then it would only run if test-fast succeeds, hence no new 'normal' issues were found. And if one would make that parent job the macOS version of the test, it could be done in 3-4 minutes, adding minimal latency only.
I don't actually know how to depend on a specific job generated by a matrix, rather than the whole group, and I'm not sure there is a way. When I was working on the publishing workflow, I tried to do this for the jobs that created macOS universal binaries out of the x86-64 and AArch64 (ARM64) builds, and I did not find a way. (I did find claims that GitHub Actions doesn't support it, but I don't have those on hand, and I don't know if they were accurate or still accurate.) The "obvious" way to do it would be to form a job id
from a matrix value, then depend on that id
. I recall testing this and finding that an id
could not be formed that way.
Of course, there is already significant conditional special-casing in test-fast. If that were split up into three separately defined jobs, then they could easily be depended on individually. However, then they would not automatically be canceled based on each other's failures. Cancellation could presumably be implemented but it might be cumbersome. Still, just splitting them up might be worthwhile, assuming we really do want to keep the new job that runs tests on Windows with GIX_TEST_IGNORE_ARCHIVES=1 from starting early.
I am not sure if we should, though. It looks, from the testing linked above, to be the longest running job. It might be a good idea to let it have a head start, even at the expense of slowing down required jobs' progression through the queue a bit more often.
after all this is quite a rare occurrence to begin with and it's unlikely that somehow one of the 15 tests is fixed while a new one appears, hence masking the anomaly.
I think that counting failures is valuable and a reasonable place to start even if ultimately we do check for specific failures.
However, I don't know that it's unlikely that one would start working while another stops. One of the tests is a performance test, and it only sometimes fails. It usually, but not always, fails locally, and it seems usually, but not always, to pass on CI. This could easily coincide with a new failure. It did actually coincide with a local failure--vaguely referenced at the end of the d74e919 (#1652) commit message, and I hope to open an issue for it sometime soon--and I had initially missed it by looking at counts, but fortunately found it because I was also comparing to the list of failures here using this diff tool.
Still, having something start failing whenever more tests fail than we are accustomed to ever failing seems like a significant gain, whether or not it ends up being subsequently made more exacting.
Wow, I really like this PowerShell approach. It could hardly be simpler, and portable or not, it only has to work on Windows as far as I can tell.
Of course, there is already significant conditional special-casing in test-fast. If that were split up into three separately defined jobs, then they could easily be depended on individually. However, then they would not automatically be canceled based on each other's failures. Cancellation could presumably be implemented but it might be cumbersome. Still, just splitting them up might be worthwhile, assuming we really do want to keep the new job that runs tests on Windows with GIX_TEST_IGNORE_ARCHIVES=1 from starting early.
I see, so these jobs would have to run outside of the matrix, and then it's possible to depend on the one running on Windows. That seems (at least conceptually) simple.
I am not sure if we should, though. It looks, from the testing linked above, to be the longest running job. It might be a good idea to let it have a head start, even at the expense of slowing down required jobs' progression through the queue a bit more often.
Admittedly I am not thinking about queueing and quotas at all when thinking about this job. To me it's mostly about not having it fail if a test fails, as it will increase the amount of emails I get unnecessarily. From my experience, I ran into quotas the fewest of times, and if it happened it was easy enough to cancel older jobs. But that's just me and there are other workflows. Latency isn't a huge concern, so the job could also run on a cron job occasionally, so it's never more than, say, a day until one gets alerted about breakage. But I don't know how cron jobs can be triggered in PRs and only in PRs.
Still, having something start failing whenever more tests fail than we are accustomed to ever failing seems like a significant gain, whether or not it ends up being subsequently made more exacting.
Oh, I wasn't aware that there is flakiness in one of these tests :/ - thus far I was quite proud that CI is incredibly reliable right now. Maybe having that new job and the possible flakiness coming with it will be a motivation, also for me, to do something about it.
In any case, exciting research you have been doing, thank you!
To me it's mostly about not having it fail if a test fails, as it will increase the amount of emails I get unnecessarily.
Do you get separate emails for different failing jobs from the same workflow run? When I watch a repository, I get notifications and associated emails for failing CI workflows, but only one notification/email for each workflow that fails, each time the workflow is run and fails. I don't get separate emails for different jobs in the same workflow run.
If your experience is the same, then just keeping all the jobs in the ci.yml workflow, which should be done anyway due to their close relatedness and having the same event trigger, should be sufficient to avoid getting extra emails. In that case, any way of organizing the jobs and deciding/expressing their dependencies should be fine.
If you get separate emails for separate failing jobs in the same workflow run (but not for jobs that are canceled) then their dependencies could affect this, but I think I would need to better understand how that works in order to figure out what arrangement would produce the smallest number of low-value notifications.
I've opened #1657 to run Windows tests on CI with GIX_TEST_IGNORE_ARCHIVES=1 and assert that no more than the expected number of failures occur. Relating to https://github.com/GitoxideLabs/gitoxide/issues/1358#issuecomment-2460384492, I have not (yet?) made the new job declare a dependency on any other job(s), but that could be done.
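For illustration only (not necessarily how #1657 implements it), such an assertion could look roughly like this in a shell step, reusing the junit.xml output and the failure count of 15 discussed above:
expected=15    # the number of failures currently expected on Windows, per this issue
actual=$(grep -c '<failure' target/nextest/with-xml/junit.xml)
if [ "$actual" -gt "$expected" ]; then
    echo "got $actual test failures, expected at most $expected" >&2
    exit 1
fi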
If your experience is the same, then just keeping all the jobs in the ci.yml workflow, which should be done anyway due to their close relatedness and having the same event trigger, should be sufficient to avoid getting extra emails. In that case, any way of organizing the jobs and deciding/expressing their dependencies should be fine.
Yes, you are right, it's actually only one per workflow (file), so that's a non-issue.
In https://github.com/GitoxideLabs/gitoxide/issues/1358#issuecomment-2457901521, I mentioned testing with TZ=UTC, but I didn't explain the reason for that, given that I hadn't used it in previous local test runs. (On CI, the time zone is already set to UTC.) The reason is #1696 (which I had not yet reported at that time).
Current behavior 😯
Running tests on Windows in a Git Bash environment (similar to the environment in which they run in Windows on CI, where bash is Git Bash), all tests are able to pass normally. However, this apparently relies on the use of generated archives.
When the tests are run with the environment variable GIX_TEST_IGNORE_ARCHIVES set to 1 rather than unset, 15 tests fail. The failing tests, as quoted from the end of the test run, are:
The full output is available in this gist, showing a test run at c2753b8, which has the fixes in #1444.
Before those fixes, one other test had been reported as failing here. See https://github.com/Byron/gitoxide/issues/1358#issuecomment-2211890726 and the old gist if interested. The discussion in #1345 is still relevant, though it links to this even older gist.
As noted in comments in #1345, the failure in compare_baseline_with_ours seems particularly interesting, since it involves an unexpected effect of .gitignore pattern matching that is different on Windows.
Expected behavior 🤔
All tests should pass, even when suppressing the use of generated archives by setting GIX_TEST_IGNORE_ARCHIVES=1. Differences between Windows and other platforms should be accounted for when intentional and desirable, or fixed otherwise.
Git behavior
Not fully applicable, since this is about a failure of multiple gitoxide tests when run in a certain way.
However, some discrepancies--intended or unintended--between gitoxide and Git may turn out to be related to some of the failures. So this section may be expanded in the future, or perhaps new issues will be split out from this one.
Steps to reproduce 🕹
I ran the tests on Windows 10.0.19045 (x64) with developer mode enabled so that symlink creation is permitted even without UAC elevation, with git version 2.45.2.windows.1.
I used the current tip of the main branch, which at this time is c2753b8425c285c6b53f46eba9bc3584aa85eb01. When I opened this issue originally, that did not exist, but all experiments described here have been performed again, and the reported results have been updated. I used the latest stable Rust toolchain and cargo-nextest, though this does not seem to affect the results.
Local development environments and CI sometimes differ in relevant ways, so I also verified that the tests all pass when GIX_TEST_IGNORE_ARCHIVES is not set. This may be considered an optional step, but is beneficial because it checks that all needed dependencies are installed and working, and that failures really can be attributed to the effect of GIX_TEST_IGNORE_ARCHIVES. To run the tests that way, I ran this in Git Bash:
Then I did a full clean:
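A plausible form of those two commands (the exact flags are an assumption of this sketch, not necessarily the original invocation) is:
cargo nextest run --workspace --no-fail-fast    # baseline run, without GIX_TEST_IGNORE_ARCHIVES
cargo clean                                     # full clean before the run that ignores archives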
Then, also in Git Bash, I ran this command, which produced the test output and failures described above:
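Presumably this was along the lines of the full-run command quoted earlier in the discussion (again, the exact flags here are an assumption):
GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --workspace --no-fail-fast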
The reason the tests must be run in Git Bash is that there is a separate issue where many test failures occur when they are run from a typical PowerShell environment (#1359).
I also ran the tests, with and without GIX_TEST_IGNORE_ARCHIVES=1, on Ubuntu 22.04 LTS (x64) and macOS 14.5 (M1). As expected, all tests passed on those systems, confirming that the failures are Windows-specific.