Closed: mzabaluev closed this issue 2 years ago.
This incident, and the duration for which it has remained unfixed since it was reported and discussed in the community forums, disturbs me enough that I would like the resolution to include a post-mortem analysis of how the problem occurred.
@mzabaluev, bsdtar comes pre-installed on a clean macOS system by Apple; it is not something we install during image generation on our side. So I suggest it is better to report this issue on the Apple forums. Also, the images contain GNU tar, so you can use that instead of bsdtar.
@maxim-lobanov I admit it was a bit presumptuous of me to think that Apple would not let that kind of breakage go unnoticed due to having actually good QA... But I appear to be right: tar on my laptop running macOS 11.2, at the same version as reported by the virtual environment, does not corrupt locally built binaries on archival.
I don't have a macOS 10.15 installation at hand to verify there, but I assume that if this issue was present on Macs in general, we would have heard of this.
Hello @mzabaluev, could you please provide minimal steps to reproduce the issue, to speed up the investigation? We look forward to your reply.
Hello, I reported this issue to Github support back in December. Links to my reports are in the top post.
@maxim-lobanov I can confirm that `tar` on my multiple physical Mac machines works as expected. It is only on the GitHub runners that this problem manifests.
@Darleev The minimal steps appear to be running an Action that uses the `tar` command on a Mac runner. I assume it is possible for someone at GitHub to `ssh` into a runner and run interactive commands? I think it would be instructive to see the output of running `tar` on some files like this.
It does not seem to be straightforward to repro: in a test run archiving a smaller hello-world binary, the file came out as Mach-O (macOS won't let me run it after unpacking it from the downloaded artifact, so it's possible that it is still corrupted). With the larger build linked in the description, the extracted files are described only as "data" by `file`, and their content is far more redundant than uncorrupted binaries should be, judging by the archive size being about 1/3 of normal.
Will try with some larger binary from the system and checksums, some time later tonight.
Tried archiving `clang` and `ld` copied from /Library/Developer/CommandLineTools/usr/bin; no corruption seen with these files.
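To make the verification flow being discussed concrete, here is a minimal sketch of the checksum round-trip (the file name `jcli` and the random-data stand-in are illustrative assumptions; the real runs archived cargo-built binaries):

```shell
# Sketch: checksum a file, round-trip it through tar, checksum again.
set -e
workdir=$(mktemp -d)
cd "$workdir"
# Stand-in for a built binary; the real case used cargo-built executables.
dd if=/dev/urandom of=jcli bs=1024 count=64 2>/dev/null
chmod +x jcli
# Portable SHA-256 helper: macOS ships shasum, most Linux systems sha256sum.
sum() {
  if command -v shasum >/dev/null; then shasum -a 256 "$1"; else sha256sum "$1"; fi | cut -d' ' -f1
}
before=$(sum jcli)
tar -czf jcli.tar.gz jcli
mkdir extracted
tar -xzf jcli.tar.gz -C extracted
after=$(sum extracted/jcli)
if [ "$before" = "$after" ]; then echo "checksums match"; else echo "CORRUPTED"; fi
```

On an unaffected system both checksums agree; in the corrupting environment they would differ.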
Hello @mzabaluev, I've tried to reproduce your archive and have checked the hashes of the original and extracted files; everything works without any errors: https://github.com/andy-mishechkin/Azure-PipeLine-Test/runs/1862611131?check_suite_focus=true The workflow yaml is here: https://github.com/andy-mishechkin/Azure-PipeLine-Test/blob/master/.github/workflows/test.yml Could you provide some other repro steps where you got a corrupt archive? And do you get this corruption consistently, or only sometimes? Thank you.
https://github.com/actions/cache/issues/403#issuecomment-677967547
What makes it really weird: if I add `sleep 10` after `cargo test` and before creating the tar, it works fine.
It seems to occur with binary outputs of two compiler/linker toolchains based on LLVM: clang and rustc. It does not occur with every output, either: a small hello-world binary is not corrupted.
My only plausible guess is that the virtual filesystem used by the macOS runner has a bug that is triggered by specific file access patterns of the linker and bsdtar.
> Could you provide some other repro steps where you got a corrupt archive? And do you get this corruption consistently, or only sometimes?
The workflow linked in the description has been producing corrupt binaries regularly since at least the end of October. It seems to not corrupt with 100% regularity though: this release had two tarballs with macOS binaries built with different CPU optimization settings; only one of the archives got its contents corrupted.
Thanks @mzabaluev for digging this deep!
@mzabaluev, could you please add `sleep 10` after the cargo/rust invocation, before using tar, to check if it helps, as @Cyberbeni mentioned above?
It looks like the process that creates/builds the files holds them for some additional time after execution, and `tar` tries to archive files that have not been fully written yet. That would explain why we can't reproduce it with tar directly and why `sleep 10` resolves the issue.
We don't tweak the filesystem in any way on our side during image generation, so there shouldn't be any difference from a local system.
Does it reproduce for you only with Rust? We still can't reproduce the issue on a small project.
(I also had this issue using the `@actions/cache` npm package in my GitHub Action after building with the Swift compiler, which is also LLVM-based; fixed with v1.0.6 since that uses gnu-tar on macOS)
> @mzabaluev, could you please add sleep 10 after the cargo / rust invocation before using tar to check if it helps as @Cyberbeni mentioned above?
It seems to eliminate the corruption, indeed. In a run with the same steps minus the `sleep 10`, `tar` archives file contents that differ from what `shasum` had read just prior.
> It looks like that process that creates / builds files, blocks them for some additional time after execution. And tar tries to archive files that were not written fully yet.
I doubt it: the compiler process that built the `jormungandr` binary exits some 7 minutes before tar is invoked, yet both binaries end up corrupted in the archive. `cargo` runs the compiler as a regular child process; it is not left running in the background when the command exits. It's not that the files are incomplete, either: the `file` utility does not recognize the Mach-O magic at the head of the extracted files, meaning that their content is completely different. The size of both files also corresponds to what they would likely be originally, but gzip is able to compress the contents to about 1/3 of the expected archive size, meaning the entropy of the corrupted content is much lower than that of real machine code.
Perhaps the idle 10 seconds lets some settling happen in an OS or virtualization storage layer that is otherwise prevented by busy processes?
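The compressibility reasoning above can be illustrated with a small sketch (file names and sizes are arbitrary): gzip barely shrinks high-entropy random data, while low-entropy redundant data collapses dramatically.

```shell
# Compare how well gzip compresses random (high-entropy) vs. zero-filled
# (low-entropy) data of the same size.
set -e
d=$(mktemp -d)
cd "$d"
dd if=/dev/urandom of=random.bin bs=1024 count=256 2>/dev/null
dd if=/dev/zero of=zeros.bin bs=1024 count=256 2>/dev/null
gzip -k random.bin zeros.bin
# Portable file-size helper: GNU stat first, then BSD (macOS) stat.
size() { stat -c %s "$1" 2>/dev/null || stat -f %z "$1"; }
rs=$(size random.bin.gz)
zs=$(size zeros.bin.gz)
echo "random: $rs bytes compressed, zeros: $zs bytes compressed"
if [ "$zs" -lt "$rs" ]; then echo "redundant data compresses far better"; fi
```

By the same logic, an archive that comes out at 1/3 of the expected size suggests its contents are far more redundant than genuine machine code.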
> We don't tweak filesystem somehow from our side during image-generation so there shouldn't be any difference with local system.
You do run it under VMware on a virtual disk volume, though.
> Does it reproduce for you only with Rust? We still can't reproduce the issue on small project.
I was unable to reproduce it with a smaller Rust project, or by archiving copies of some large binaries installed in the system.
I would use DTrace to see how the linker and tar access the files, but it does not seem to be possible with the cloud-hosted runners.
@mzabaluev it seems the problem is that the build has not completed at the moment you calculate the checksum and start to tar the binaries.
I've just confirmed that the 10.15 osx tar works as expected on arbitrary artificial binary files generated with `dd if=/dev/random`, and I suspect there's something wrong with the binaries produced by the build.
While I'm still investigating the cargo step, can you please try disabling any parallel build settings, if there are any?
Calling `sudo /usr/sbin/purge` before building the tarball on macOS is a known workaround.
See https://github.com/marketplace/actions/rust-cache#cache-details
and https://github.com/rust-lang/cargo/issues/8603
and https://github.com/actions/cache/issues/403
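A hedged sketch of that workaround as a shell step (the file name is an illustrative stand-in; `/usr/sbin/purge` exists only on macOS, so the sketch skips it where absent):

```shell
# Flush dirty filesystem buffers before archiving, then create the tarball.
set -e
d=$(mktemp -d)
cd "$d"
echo 'stand-in for a built binary' > jormungandr
if [ -x /usr/sbin/purge ]; then
  sudo /usr/sbin/purge   # macOS-only: forces cached/dirty pages to disk
fi
tar -czf release.tar.gz jormungandr
echo "created $(ls release.tar.gz)"
```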
> it seems the problem is the build has not completed at the moment you are calculating the checksum and starting to tar the binaries.
It does not seem so from anything I can observe: the `cargo` process has exited, after waiting on any tool processes it launched to build the executables, before tar gets to run. I'd like to run a tracing tool to capture the timeline of the processes, but lack the knowledge to do it on macOS.
There is a report of this (or similar?) problem being reproducible on a MacBook Pro: https://github.com/Homebrew/brew/issues/6539#issuecomment-542799372
https://github.com/rust-lang/cargo/issues/8603#issuecomment-684085412 suspects that an interaction of tar with sparse files on APFS may be at fault. I can't confirm whether the LLVM toolchain uses that feature.
@mzabaluev as a first step, can you please check whether the problem can be solved with the -S tar option during the unpacking process: `tar -xzSf ...`?
For further investigation we need to reproduce the problem on a machine with SIP disabled; do you think it is possible to reduce the faulty build to a smaller size? Unfortunately we cannot use the artifacts, because copying the binaries into the artifacts may discard their sparseness.
> @mzabaluev as a first step, can you please check whether the problem can be solved with the -S tar option during the unpacking process: `tar -xzSf ...`?
This did not solve it: https://github.com/mzabaluev/jormungandr/runs/2112154653
> For further investigation we need to reproduce the problem on a machine with SIP disabled; do you think it is possible to reduce the faulty build to a smaller size?
I'm not sure, the result seems sensitive to how sizable the binaries are. This run produced a small test binary that was archived without corruption.
@mzabaluev ok, and as the last step, to confirm the problem is not related to sparse files, can you please run `ls -ls` against every binary to be included in the archive. This command shows the actual blocks used to store the file content; that way we'll know whether the files created by the build are sparse or not.
> @mzabaluev ok, and as the last step, to confirm the problem is not related to sparse files, can you please run `ls -ls` against every binary to be included in the archive. This command shows the actual blocks used to store the file content; that way we'll know whether the files created by the build are sparse or not.
Here you are, though I can't make much of this information.
I have also reduced the build to a single binary that is the smaller of the two.
According to `diskutil info -all | grep "Block Size"`, the macos-latest images have mounted filesystems with 512- or 4096-byte blocks.
On the other hand, the `ls -ls` output

`328 -rwxr-xr-x 2 runner staff 8554552 Mar 16 14:56 jcli`

shows a total file size of 8554552, much more than the allocated 328 × 4096 = 1,343,488. This is evidence that `jcli` is a sparse file and osx tar skips the "holes" inside it, and thus the checksum of the untarred file differs from that of the input.
@mzabaluev in order to confirm we have a sparse-file issue, can you please confirm the untarred binaries still work as expected despite the checksum check failing?
@maxim-lobanov most likely we have sparse files being archived, and osx tar and GNU tar handle them differently. GNU tar by default makes no attempt to detect "holes" and stores the file content in full (https://www.gnu.org/software/tar/manual/html_node/sparse.html), while osx tar may skip them while tarring and not recreate them on restoring.
As a result, I see nothing we can do about the problem.
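For comparison, a sketch of a GNU tar round-trip of a sparse file (this assumes `tar` on the test machine is GNU tar; on the macOS images the GNU binary is `gtar`). With `--sparse`, hole maps are recorded explicitly and the content survives intact:

```shell
# Round-trip a sparse file through GNU tar with explicit sparse handling.
set -e
d=$(mktemp -d)
cd "$d"
# 1 MiB sparse file: one real byte at the end, a hole before it.
dd if=/dev/zero of=sparse.bin bs=1 count=1 seek=1048575 2>/dev/null
tar --sparse -czf sparse.tar.gz sparse.bin
mkdir out
tar -xzf sparse.tar.gz -C out
cmp sparse.bin out/sparse.bin && echo "contents identical after round-trip"
```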
> osx tar might skip them while tarring and does not create on restoring.
This still does not explain why it works after a 10-second sleep. Does the OS patch the holes in the sparse files in the meantime? I suspect the problem may be in the kernel.
> @mzabaluev in order to confirm we have sparse file issue can you please confirm the untarred binaries still work as expected despite the checksum check fails?
https://github.com/mzabaluev/jormungandr/runs/2132586952?check_suite_focus=true#step:9:29
The original sparse file binary can be run by the OS. The untarred file is not recognized as an executable format.
I have also added a `sleep 10` step and ran `ls -ls` on the original binary again; the block count does not change.
https://github.com/actions/virtual-environments/issues/2619#issuecomment-788397841
Does calling `sudo /usr/sbin/purge` before building the tarball work around this issue?
@mzabaluev can you please add another line of shell code to your build: `ls /usr/local/opt/runner/settings/instancename`? It will print a string we can use to identify the environment the build is executed in.
@mzabaluev as an addition to @dsame's comment: we suspect storage issues in particular environments, but need to gather enough data to prove it.
> Does calling `sudo /usr/sbin/purge` before building the tarball work around this issue?
It does: https://github.com/mzabaluev/jormungandr/runs/2153654779
> @mzabaluev can you please add another line of shell code to your build: `ls /usr/local/opt/runner/settings/instancename`? It will print a string we can use to identify the environment the build is executed in.
> we suspect some storage issues in particular environments but need to gather enough data to prove it
Good if this information helps, but the issue has been 100% reproducible with a particular source tree.
@mzabaluev not 100%, actually. I've forked your repo and made a few runs to test our assumptions: 2 out of 4 passed, and we observe a correlation with particular environments. Moreover, it looks like we understand why it happens, but I'm not sure we can fix it from the image side, since it's related to the underlying infrastructure. We're going to escalate the issue to the engineering team, but unfortunately I'm not sure how much time it will take them to fix it.
FWIW I had the same problem using the system `tar -czvf` command without installing gnu-tar. The system tar seems to work fine when I do the same locally on my MacBook, and the `sudo /usr/sbin/purge` work-around in the Action seems to have fixed it.
Before: https://github.com/anentropic/chuckd/runs/2470461074#step:10:7 After: https://github.com/anentropic/chuckd/runs/2470588309#step:11:7
Pulling my hair out over this. What I've experienced so far in my workflow: the `md5` of the extracted binary differs from the `md5` taken in step 3) 👎, and attempting to run it gives "cannot execute binary file" 👎. Very random; sometimes steps 6) and 7) are good.
See https://github.com/actions/virtual-environments/issues/2619#issuecomment-788397841
@trung Either run `sudo /usr/sbin/purge` before building the tarball or use GNU tar: `brew install gnu-tar && gtar --version`
> Either run sudo /usr/sbin/purge before building the tarball or use GNU Tar. brew install gnu-tar && gtar --version
`gtar` is already available in the macOS image. No need to do `brew install`.
The problem happens when archiving sparse files (a common case being binaries created by LLVM tools): Apple tar drops the "holes" during compression and cannot restore the executable to its original size.
There's no known workaround except to either use GNU tar or try the "-S" option with the "-x" mode of Apple tar.
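A sketch of the "prefer GNU tar" selection, as a hypothetical helper (on the macOS runner image the GNU binary is named `gtar`; elsewhere the system `tar` is used as a fallback):

```shell
# Prefer GNU tar when available, fall back to the system tar otherwise.
TAR=$(command -v gtar || command -v tar)
"$TAR" --version | head -n 1
```

A workflow would then invoke `"$TAR" -czf ...` instead of calling `tar` directly.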
Description
The bsdtar binary installed in the virtual environment produces tarballs with corrupted file contents. This was first observed months ago: https://github.com/actions/upload-artifact/issues/151 https://github.community/t/bizarre-issue-with-tar-archive-for-macos-build/145377
The silent corruption resulted in our GitHub workflows producing unusable release assets for months before anyone noticed, and cost some of our team half a day of rubbing our eyes, running our build system locally to check for broken output, and questioning our sanity. #1534 provides a band-aid by installing GNU tar into a location that is not listed in PATH, but it assumes that workflow users are either aware of this bug or have other reasons not to use bsdtar.
Area for Triage:
Apple
Question, Bug, or Feature?:
Bug
Virtual environments affected
Image version 20210130.1
Expected behavior
The tarballs produced by tar contain files with their original content intact.
Actual behavior
The tarball created to pack some binaries is unpacked to corrupted files that cannot be executed.
Repro steps
This workflow:
https://github.com/mzabaluev/jormungandr/blob/2d9ad4c97a7051d2a19bec5bfc42c36243d37f23/.github/workflows/release.yml#L273
The resulting macOS asset with corrupted binaries: https://github.com/mzabaluev/jormungandr/releases/download/nightly.20210204/jormungandr-0.10.0-nightly.20210204-x86_64-apple-darwin-generic.tar.gz