actions / runner-images

GitHub Actions runner images

tar corrupts archive contents on macOS #2619

Closed. mzabaluev closed this issue 2 years ago.

mzabaluev commented 3 years ago

Description
The bsdtar binary installed in the virtual environment produces tarballs with corrupted file contents. This was first observed months ago: https://github.com/actions/upload-artifact/issues/151 https://github.community/t/bizarre-issue-with-tar-archive-for-macos-build/145377

The silent corruption resulted in our GitHub workflows producing unusable release assets for months before anyone noticed, and cost some of us on the team half a day of rubbing our eyes, running our build system locally to check for broken output, and questioning our sanity. #1534 provides a band-aid by installing GNU tar into a location that is not listed in PATH, but it assumes that workflow users are either aware of this bug or have other reasons not to use bsdtar.

Area for Triage:
Apple

Question, Bug, or Feature?:
Bug

Virtual environments affected

Image version 20210130.1

Expected behavior
The tarballs produced by tar contain files with their original content intact.

Actual behavior
The tarball created to pack some binaries is unpacked to corrupted files that cannot be executed.

Repro steps
This workflow:

https://github.com/mzabaluev/jormungandr/blob/2d9ad4c97a7051d2a19bec5bfc42c36243d37f23/.github/workflows/release.yml#L273

The resulting macOS asset with corrupted binaries: https://github.com/mzabaluev/jormungandr/releases/download/nightly.20210204/jormungandr-0.10.0-nightly.20210204-x86_64-apple-darwin-generic.tar.gz

mzabaluev commented 3 years ago

This incident, and how long it has remained unfixed since it was first reported and discussed in the community forums, disturbs me enough that I would like the resolution to include a post-mortem analysis of how the problem occurred.

maxim-lobanov commented 3 years ago

@mzabaluev, bsdtar is pre-installed on a clean macOS system by Apple; it is not something we install during image generation on our side, so it may be better to report this issue on an Apple forum. The images also contain GNU tar, so you can use it instead of bsdtar.
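For readers hitting this, switching a packaging step to the preinstalled GNU tar could look roughly like the sketch below; the gtar name is what the image ships, while the archive and binary names are taken from the thread's project purely for illustration.

```sh
# Sketch: use GNU tar (available as `gtar` on the macOS images) instead of the system bsdtar.
gtar --version                                              # should print "tar (GNU tar) ..."
gtar -czf release.tar.gz -C target/release jormungandr jcli
```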

mzabaluev commented 3 years ago

@maxim-lobanov I admit it was a bit presumptuous of me to think that Apple, having actually good QA, would not let that kind of breakage go unnoticed... But I appear to be right: tar on my laptop running macOS 11.2, reporting the same version as in the virtual environment, does not corrupt the locally built binaries on archival.

I don't have a macOS 10.15 installation at hand to verify there, but I assume that if this issue was present on Macs in general, we would have heard of this.

Darleev commented 3 years ago

Hello @mzabaluev, could you please provide minimal steps to reproduce the issue to speed up the investigation? We look forward to your reply.

spinicist commented 3 years ago

Hello, I reported this issue to GitHub support back in December. Links to my reports are in the top post.

@maxim-lobanov I can confirm that tar on my multiple physical Mac machines works as expected. It is only on the GitHub runners that this problem manifests.

@Darleev The minimal steps appear to be running an Action that uses the tar command on a Mac runner. I assume it is possible for someone at GitHub to ssh into a runner and run interactive commands? It would be instructive to see what running tar on some files like this produces.
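For what it's worth, the kind of interactive check being suggested could be as simple as the sketch below; the binary name and paths are placeholders, not the exact commands from the affected workflow.

```sh
# Round-trip a freshly built binary through the system tar and compare checksums.
shasum -a 256 target/release/jormungandr              # checksum before archiving
tar -czf out.tar.gz -C target/release jormungandr     # system bsdtar
mkdir -p unpacked && tar -xzf out.tar.gz -C unpacked
shasum -a 256 unpacked/jormungandr                    # should match the first sum
file unpacked/jormungandr                             # should report Mach-O, not "data"
```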

mzabaluev commented 3 years ago

It does not seem to be straightforward to repro: in a test run archiving a smaller hello-world binary, the file came out recognized as Mach-O by file (macOS won't let me run it after unpacking it from the downloaded artifact, so it's possible that it is still corrupted). With the larger build linked in the description, the files are described only as "data" by file, and judging by the archive being about 1/3 of its normal size, their content is far more redundant than uncorrupted binaries should be.

mzabaluev commented 3 years ago

Will try with some larger binary from the system and checksums, some time later tonight.

mzabaluev commented 3 years ago

Tried archiving clang and ld copied from /Library/Developer/CommandLineTools/usr/bin, no corruption seen with these files.

andy-mishechkin commented 3 years ago

Hello, @mzabaluev. I've tried to reproduce your archive and have checked the hashes of the original and extracted files; everything works without any errors: https://github.com/andy-mishechkin/Azure-PipeLine-Test/runs/1862611131?check_suite_focus=true The workflow yaml is here: https://github.com/andy-mishechkin/Azure-PipeLine-Test/blob/master/.github/workflows/test.yml Could you provide some other repro steps where you've got a corrupt archive? And do you get this corruption consistently or only sometimes? Thank you.

Cyberbeni commented 3 years ago

https://github.com/actions/cache/issues/403#issuecomment-677967547

What makes it really weird...if I add sleep 10 after cargo test and before creating the tar, it works fine.

mzabaluev commented 3 years ago

It seems to occur with binary outputs of two compiler/linker toolchains based on LLVM: clang and rustc. It does not occur with every output, either: a small hello-world binary is not corrupted.

mzabaluev commented 3 years ago

My only plausible guess is that the virtual filesystem used by the macOS runner has a bug that is triggered by specific file access patterns of the linker and bsdtar.

mzabaluev commented 3 years ago

Could you provide some other repro steps where you've got a corrupt archive? And do you get this corruption consistently or only sometimes?

The workflow linked in the description has been producing corrupt binaries regularly since at least the end of October. It seems to not corrupt with 100% regularity though: this release had two tarballs with macOS binaries built with different CPU optimization settings; only one of the archives got its contents corrupted.

spinicist commented 3 years ago

Thanks @mzabaluev for digging this deep!

maxim-lobanov commented 3 years ago

@mzabaluev , could you please add sleep 10 after cargo / rust invocation before using tar to check if it helps as @Cyberbeni mentioned above?

It looks like the process that creates/builds the files holds on to them for some additional time after it finishes, and tar tries to archive files that have not been fully written yet. That would explain why we can't reproduce it with tar directly and why sleep 10 helps to resolve the issue.

We don't tweak the filesystem in any way on our side during image generation, so there shouldn't be any difference from a local system.

Does it reproduce for you only with Rust? We still can't reproduce the issue on a small project.
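For concreteness, the experiment being requested amounts to something like this in the packaging step; it is a sketch, with the binary names taken from the thread's project and the 10-second figure from the suggestion above.

```sh
cargo build --release
sleep 10          # pause between the build and the archive step, as suggested above
tar -czf release.tar.gz -C target/release jormungandr jcli
```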

Cyberbeni commented 3 years ago

(I also had this issue using the @actions/cache npm package in my GitHub Action after building with the Swift compiler, which is also LLVM-based; it was fixed with v1.0.6 since that uses gnu-tar on macOS.)

mzabaluev commented 3 years ago

@mzabaluev , could you please add sleep 10 after cargo / rust invocation before using tar to check if it helps as @Cyberbeni mentioned above?

It does seem to eliminate the corruption, indeed. In a run with the same steps but without the sleep 10, tar archives file contents that are different from what shasum had read just prior.

It looks like the process that creates/builds the files holds on to them for some additional time after it finishes, and tar tries to archive files that have not been fully written yet.

I doubt it: the compiler process that built the jormungandr binary exits some 7 minutes before tar is invoked, yet both binaries end up corrupted in the archive. cargo runs the compiler as a regular child process; it is not left running in the background when the command exits. It's not that the files are incomplete, either: the file utility does not recognize the Mach-O magic at the head of the extracted files, meaning that their content is completely different. The sizes of both files also correspond to what they would likely be originally, but gzip is able to compress the contents to about 1/3 of the expected size of the archive, meaning that the corrupted content is far more redundant (lower entropy) than real binaries should be.
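A rough way to quantify that observation, as a sketch (not something run in the original workflow; the path is a placeholder):

```sh
# Compare the apparent size of the binary with how small its content compresses.
ls -l target/release/jormungandr               # apparent size in bytes
gzip -c target/release/jormungandr | wc -c     # a healthy Mach-O binary compresses far less aggressively
```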

Perhaps the idle 10 seconds let some settlement happen in an OS or virtualization storage layer that is otherwise prevented by busy processes?

We don't tweak the filesystem in any way on our side during image generation, so there shouldn't be any difference from a local system.

You do run it under VMWare on a virtual disk volume, though.

Does it reproduce for you only with Rust? We still can't reproduce the issue on a small project.

I was unable to reproduce it with a smaller Rust project, or by archiving copies of some large binaries installed in the system.

mzabaluev commented 3 years ago

I would use DTrace to see how the linker and tar access the files, but it does not seem to be possible with the cloud-hosted runners.

dsame commented 3 years ago

@mzabaluev it seems the problem is that the build has not completed at the moment you calculate the checksum and start to tar the binaries.

I've just confirmed that the macOS 10.15 tar works as expected on artificial binary files generated with dd if=/dev/random, and I suspect there's something wrong with the binaries produced by the build.
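The artificial-file check described above might look like this; it is a sketch, with arbitrary sizes and file names.

```sh
dd if=/dev/random of=testfile bs=1m count=8          # 8 MiB of random (incompressible) data
shasum -a 256 testfile
tar -czf test.tar.gz testfile
mkdir -p out && tar -xzf test.tar.gz -C out
shasum -a 256 out/testfile                           # matches in this experiment
```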

While I'm still investigating the cargo step, can you please try disabling any parallel build settings, if there are any?

sjackman commented 3 years ago

Calling sudo /usr/sbin/purge before building the tarball on macOS is a known workaround. See https://github.com/marketplace/actions/rust-cache#cache-details and https://github.com/rust-lang/cargo/issues/8603 and https://github.com/actions/cache/issues/403
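As a workflow step, the workaround from those links is just the following; the archive and binary names below are placeholders.

```sh
sudo /usr/sbin/purge                              # force the disk cache to be flushed first
tar -czf release.tar.gz -C target/release jormungandr jcli
```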

mzabaluev commented 3 years ago

it seems the problem is that the build has not completed at the moment you calculate the checksum and start to tar the binaries.

It does not seem so from anything I can observe: the cargo process, after waiting on any tool processes it launched to build the executables, has exited before tar gets to run. I'd like to run a tracing tool to capture the timeline of the processes, but I lack the knowledge to do it on macOS.

There is a report of this (or similar?) problem being reproducible on a MacBook Pro: https://github.com/Homebrew/brew/issues/6539#issuecomment-542799372

https://github.com/rust-lang/cargo/issues/8603#issuecomment-684085412 suspects that an interaction of tar with sparse files on APFS may be at fault. I can't confirm whether the LLVM toolchain uses that feature.

dsame commented 3 years ago

@mzabaluev as a first step, can you please check whether the problem can be solved with the -S tar option during unpacking: tar -xzSf ...?

For further investigation we need to reproduce the problem on a machine with SIP disabled. Do you think it is possible to reduce the faulty build to a smaller size? Unfortunately we cannot use the artifacts, because copying the binaries into the artifacts may lose their sparseness.
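The suggested check, spelled out as a command (a sketch; the archive name is a placeholder):

```sh
# Extract with bsdtar's -S option, which restores entries as sparse files on extraction.
tar -xzSf release.tar.gz
```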

mzabaluev commented 3 years ago

@mzabaluev as a first step, can you please check whether the problem can be solved with the -S tar option during unpacking: tar -xzSf ...?

This did not solve it: https://github.com/mzabaluev/jormungandr/runs/2112154653

For further investigation we need to reproduce the problem on a machine with SIP disabled. Do you think it is possible to reduce the faulty build to a smaller size?

I'm not sure: the result seems sensitive to the size of the binaries. This run produced a small test binary that was archived without corruption.

dsame commented 3 years ago

@mzabaluev ok, and as the last step, to confirm whether the problem is related to sparse files, can you please run ls -ls against every binary to be included in the archive? This command shows the actual blocks used to store the file content, so we will know whether the files created by the build are sparse or not.

mzabaluev commented 3 years ago

@mzabaluev ok, and as the last step, to confirm whether the problem is related to sparse files, can you please run ls -ls against every binary to be included in the archive? This command shows the actual blocks used to store the file content, so we will know whether the files created by the build are sparse or not.

Here you are, though I can't make much of this information.

I have also reduced the build to a single binary that is the smaller of the two.

dsame commented 3 years ago

According to diskutil info -all | grep "Block Size", the macos-latest images have mounted filesystems with 512- or 4096-byte blocks.

On the other hand, the ls -s output

328 -rwxr-xr-x 2 runner staff 8554552 Mar 16 14:56 jcli

shows that the total file size of 8554552 bytes is much larger than 328 × 4096 = 1,343,488 bytes. This is evidence that jcli is a sparse file and macOS tar skips the "holes" inside it, which is why the checksum of the untarred file differs from that of the input.

@mzabaluev, in order to confirm that we have a sparse-file issue, can you please check whether the untarred binaries still work as expected even though the checksum comparison fails?
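The arithmetic in this comment, shown against the ls -ls output (a sketch; the numbers are the ones quoted above):

```sh
ls -ls jcli
# 328 -rwxr-xr-x 2 runner staff 8554552 Mar 16 14:56 jcli
#  ^ allocated blocks            ^ apparent size in bytes
# allocated: 328 blocks * 4096 bytes/block = 1343488 bytes
# apparent : 8554552 bytes, far more than what is allocated -> jcli is sparse
```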

dsame commented 3 years ago

@maxim-lobanov most likely we have sparse files being archived, and macOS tar and GNU tar handle them differently. GNU tar by default does not attempt to detect "holes" and simply includes them in the archive (https://www.gnu.org/software/tar/manual/html_node/sparse.html), while macOS tar might skip them while tarring and not recreate them on restoring.

As a result, I see nothing we can do about the problem.
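For reference, GNU tar's behavior described above, written as commands (a sketch; gtar is the GNU tar binary on the image and jcli is the binary from the thread):

```sh
gtar -czf release.tar.gz jcli       # default: hole regions are read back as zero bytes and stored
gtar -czSf release.tar.gz jcli      # -S / --sparse: detect holes and record them explicitly
```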

mzabaluev commented 3 years ago

macOS tar might skip them while tarring and not recreate them on restoring.

This still does not explain why it works after a 10-second sleep. Does the OS patch the holes in the sparse files in the meantime? I suspect the problem may be in the kernel.

mzabaluev commented 3 years ago

@mzabaluev, in order to confirm that we have a sparse-file issue, can you please check whether the untarred binaries still work as expected even though the checksum comparison fails?

https://github.com/mzabaluev/jormungandr/runs/2132586952?check_suite_focus=true#step:9:29

The original sparse file binary can be run by the OS. The untarred file is not recognized as an executable format.

I have also added a sleep 10 step and run ls -ls on the original binary again; the block count does not change.

sjackman commented 3 years ago

https://github.com/actions/virtual-environments/issues/2619#issuecomment-788397841

Does calling sudo /usr/sbin/purge before building the tarball work around this issue?

dsame commented 3 years ago

@mzabaluev can you please add another line of shell code to your build ls /usr/local/opt/runner/settings/instancename? It will print a string we can use to identify the environment the build is executed in.

miketimofeev commented 3 years ago

@mzabaluev, as an addition to @dsame's comment: we suspect some storage issues in particular environments but need to gather enough data to prove it

mzabaluev commented 3 years ago

#2619 (comment)

Does calling sudo /usr/sbin/purge before building the tarball work around this issue?

It does: https://github.com/mzabaluev/jormungandr/runs/2153654779

mzabaluev commented 3 years ago

@mzabaluev can you please add another line of shell code to your build ls /usr/local/opt/runner/settings/instancename? It will print a string we can use to identify the environment the build is executed in.

Voilà

we suspect some storage issues in particular environments but need to gather enough data to prove it

Good if this information helps, but the issue has been 100% reproducible with a particular source tree.

miketimofeev commented 3 years ago

@mzabaluev not 100% actually. I've forked your repo and made a few runs to prove our assumptions: 2 out of 4 passed, and we observe a correlation with particular environments. [screenshot of the run results] Moreover, it looks like we understand why it happens, but we're not sure we can fix it from the image side since it's related to the underlying infrastructure. We're going to escalate the issue to the engineering team, but unfortunately I'm not sure how much time it will take them to fix it.

anentropic commented 3 years ago

FWIW I had the same problem, using the system tar -czvf command without installing gnu-tar.

The system tar seems to work fine when I do the same locally on my MacBook.

The sudo /usr/sbin/purge workaround in the Action seems to have fixed it.

before: https://github.com/anentropic/chuckd/runs/2470461074#step:10:7 after: https://github.com/anentropic/chuckd/runs/2470588309#step:11:7

trung commented 3 years ago

Pulling my hair out over this. What I've experienced so far in my workflow:

  1. Build the binary 👍
  2. Run the binary 👍
  3. Take the md5
  4. Tar the binary
  5. Untar the binary
  6. Take the md5 of the binary - differs from step 3) 👎
  7. Run the binary - failed: cannot execute binary file 👎

Very random; sometimes 6) and 7) are fine. The steps are sketched as shell below.
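The numbered steps above, roughly as shell (a sketch; ./mybinary and its output names are placeholders, and the build step depends on the project's toolchain):

```sh
./mybinary                                             # (2) run the freshly built binary - works
md5 ./mybinary                                         # (3) record the checksum
tar -czf out.tar.gz mybinary                           # (4) tar
mkdir -p unpacked && tar -xzf out.tar.gz -C unpacked   # (5) untar
md5 unpacked/mybinary                                  # (6) sometimes differs from step 3
unpacked/mybinary                                      # (7) sometimes: "cannot execute binary file"
```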

sjackman commented 3 years ago

See https://github.com/actions/virtual-environments/issues/2619#issuecomment-788397841

@trung Either run sudo /usr/sbin/purge before building the tarball or use GNU Tar. brew install gnu-tar && gtar --version

trung commented 3 years ago

Either run sudo /usr/sbin/purge before building the tarball or use GNU Tar. brew install gnu-tar && gtar --version

gtar is already available in the macOS image; no need to do brew install.

dsame commented 2 years ago

The problem happens when archiving sparse files (a common case is a binary created by LLVM tools): Apple tar drops the "holes" during compression and cannot restore the executable to its original size.

There is no known workaround except to either use GNU tar or try applying the "-S" option with the "-x" mode of Apple tar.
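Put as commands, those two suggestions look like the sketch below (gtar ships on the image; note that the -S extraction option did not help in the run linked earlier in this thread, so GNU tar is the safer of the two):

```sh
gtar -czf release.tar.gz mybinary        # create the archive with GNU tar instead of bsdtar
tar -xzSf release.tar.gz                 # or try Apple tar's -S (sparse) option on extraction
```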