Status: Open. jaraco opened this issue 1 month ago.
When I downgrade to 4.28, the problem goes away. The issue exists on 4.29 also.
@jaraco The above worked for me without modification on Docker Desktop for Mac v4.30.0.
> @jaraco The above worked for me without modification on Docker Desktop for Mac v4.30.0.
Are you using the same macOS version and architecture?
Interestingly, after upgrading from 4.28.0 to 4.30.0, the problem seemed to be gone... or not. Okay, let me capture some of the steps. From earlier, I ran
docker buildx build --platform linux/amd64 -t jaraco/multipy-tox .
and the command failed after info: installing component 'rust-docs' (same Segmentation fault). I ran the command again and it failed. Now I'm beginning to wonder if it's an intermittent issue, so I ran the command again, but this time it failed with a different error (a panic).
I ran it 5 more times. The first three times I got the Segmentation fault. The fourth time, the process hung after 'downloading component rustc'. The fifth time, the command succeeded. For good measure, I ran it one more time and got another Segmentation fault.
So it does appear as if whatever is happening is intermittent, which means it may also be sensitive to the host hardware and OS. I'm on a 2023 Macbook Pro 14" (M3 Pro, 36GB RAM).
I'd worry the issue was unique to my environment, but others in the rust bug were able to replicate it, so we know it's not just me.
> @jaraco The above worked for me without modification on Docker Desktop for Mac v4.30.0.
> Are you using the same macOS version and architecture?
The host OS is macOS 14.5 (Apple Silicon) and the guest OS is Ubuntu (Noble) x86_64. Next, I ran it 20 times in a loop, and saw no errors here.
Next, my default builder, desktop-linux, after Docker Desktop for Mac installation, looks like the following:
➜ docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
default docker
\_ default \_ default running v0.13.2 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/mips64le, linux/mips64
desktop-linux* docker
\_ desktop-linux \_ desktop-linux running v0.13.2 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/mips64le, linux/mips64
I discovered that disabling Rosetta suppresses the error. I disabled that setting, restarted the engine, then ran the repro 5 times without failure. I then re-enabled Rosetta, restarted the engine, and the repro elicited the error twice in a row. After running the command, here's what I see for buildx ls:
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
multi* docker-container
\_ multi0 \_ desktop-linux running v0.13.2 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
default docker
\_ default \_ default running v0.13.2 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
desktop-linux docker
\_ desktop-linux \_ desktop-linux running v0.13.2 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
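(Aside: for anyone who wants to check or script that Rosetta toggle without clicking through the UI, Docker Desktop for Mac keeps its settings in a JSON file. The path and the useVirtualizationFrameworkRosetta key in the sketch below are assumptions based on current releases and may vary by version:)
#!/usr/bin/env python
# Hedged sketch: report Docker Desktop's Rosetta setting from its
# settings file. The path and key name are assumptions that may vary
# by Docker Desktop version.
import json
import pathlib

settings_path = pathlib.Path.home() / (
    'Library/Group Containers/group.com.docker/settings.json'
)
settings = json.loads(settings_path.read_text())
print('Rosetta enabled:', settings.get('useVirtualizationFrameworkRosetta'))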
Next, I installed Docker on my 2020 Mac mini (with M1 chip, running macOS 14.4.1) and ran the command and it doesn't reproduce there (succeeded twice), so it does appear as if the issue is sensitive to the speed or number of cores or actual silicon layout of the M3 macbook. Maybe Rosetta is exercising some unique features of these newer chips. Or maybe Rosetta actually has a bug on a newer chip.
Oh, wow. I updated my mac mini from macOS 14.4.1 to 14.5, and now it is also replicating the failure. So that excludes the concerns about chip speed and chip generation.
@jaraco I have Rosetta 2 enabled as you can see from the image:
Next, here's a gist of a complete run of your command:
Yesterday, I also fully uninstalled and reinstalled Docker Desktop for Mac.
Can you confirm you have Rosetta installed? pgrep oahd should return a process ID if Rosetta is installed.
> pgrep oahd
The above produces the following:
➜ pgrep oahd
849
Yes, Rosetta 2 is running: Activity Monitor shows a couple of Apple processes running as Intel.
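For the record, both checks can be scripted; the sketch below assumes /usr/bin/true ships as a universal binary (true on stock macOS):
#!/usr/bin/env python
# Hedged sketch: confirm Rosetta 2 is installed and functional on the host.
import subprocess

# oahd is Rosetta's launch daemon; pgrep exits 0 when it's running.
installed = subprocess.run(
    ['pgrep', 'oahd'], capture_output=True).returncode == 0

# arch -x86_64 forces the x86_64 slice of a universal binary, which can
# only execute under Rosetta on Apple Silicon.
functional = subprocess.run(
    ['arch', '-x86_64', '/usr/bin/true']).returncode == 0

print(f'Rosetta installed: {installed}, functional: {functional}')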
Okay. I'll see if I can replicate the issue in a UTM VM. I'm worried that it won't fail, but if it does, then at least that's something that can be shipped.
I created a UTM VM of macOS, with the hope of replicating the issue in a clean environment. Unfortunately, Docker will not run in that VM because the guest doesn't have access to the hypervisor (no nested virtualization).
It occurred to me I might possibly be able to replicate the issue by doing something similar to what Docker does: using Rosetta to emulate x86_64 to run Linux, and seeing if the issue reproduces there. It's a bit of a long shot. I couldn't find the documented setting to enable Rosetta, and installing Linux using emulation is taking forever. I see now that UTM's Rosetta support exists to execute amd64 binaries inside an arm64 Linux guest. I don't think I'll be able to mimic closely enough what Docker does with virtualization to replicate the issue. We'll need to rely on a real macOS machine.
I'm struggling to think how to make more progress on this issue. I'm tempted to just leave it open for now and disable Rosetta as a workaround, but ideally I'd like to get the issue to a state where it's at least theoretically solvable by Docker.
Since you're unable to replicate the issue but other people are, perhaps we could find some other people willing to run the test?
If I gave you a login on my mac mini, could you possibly use that to replicate the issue and bisect the differences between that machine and your own? If the issue doesn't reproduce in your profile on my machine, I could take over the login and bisect the differences between that profile and the one where the failure occurs.
Alternatively, is there something more I can run that will help diagnose the issue when it occurs?
> It occurred to me I might possibly be able to replicate the issue by doing something similar to what Docker does,
For the record, I did create a virtualized AMD64 Linux machine using UTM (without Rosetta). It was dog slow, taking a couple of hours to install and log in, but it finally completed, and I was able to confirm that the issue doesn't occur there. No big surprise, though, since it's not using Rosetta and has very limited performance (it was slow enough that the progress bars appeared during the 'installing' steps, which aren't observed on faster machines).
> Since you're unable to replicate the issue but other people are, perhaps we could find some other people willing to run the test?
Yes, I'll ask my friend if he can run the test and report the findings here.
> If I gave you a login on my mac mini, could you possibly use that to replicate the issue and bisect the differences between that machine and your own? If the issue doesn't reproduce in your profile on my machine, I could take over the login and bisect the differences between that profile and the one where the failure occurs.
Yes, I can do that and report back any findings that I see.
> Alternatively, is there something more I can run that will help diagnose the issue when it occurs?
I would recommend posting a reference to this issue within the Docker Slack because the Docker core team has intimate knowledge about the Docker Engine and Docker Desktop for Mac tooling. Have you tried uninstalling and reinstalling Docker Desktop For Mac? If not, I would consider giving that a shot.
@jaraco I think I have roughly the same laptop as you: M3 Max, 36GB, Sonoma 14.5.
First I ran the repro with the default Docker Desktop settings and it went well. Then I noticed in your diagnostic that the OOM killer was triggered. It seems your Docker Desktop VM is running with 3.8GB.
I changed my settings to allocate only 3.8GB to the VM (Settings > Resources > Advanced > Memory Limit).
Rerunning your command line provoked a segfault.
Can you try to have a look at the Memory Limit setting and raise it? I tested with 8GB and it worked.
@jpbriend Good catch! I have an M1 Max, 64 GB, and Sonoma 14.5. My memory is set to 16 GB because I tend to run several containers for a given project.
I'm afraid that's not the issue for me. Although I have the default settings on my M1 mac mini, on my main system I'd previously bumped the memory limit to 7.9 GB:
I bumped that limit several weeks ago (maybe months) due to another project where I was hitting the memory limit, so most of the test reports I've done here were with that setting. Only on the mac mini were the settings at the default (4GB). Is it possible the OOM in the diagnostic was from a couple of months ago?
I went ahead and bumped that to 8.8 GB, clicked Apply and Restart, then re-ran the command. The first attempt succeeded. The subsequent attempt once again triggered the segfault. I ran it twice more, both times with success. The next two attempts failed. For good measure, I bumped the memory limit to 16GB and it failed on the first attempt.
Moreover, I'd be surprised if installing rust-docs were an operation requiring multiple gigabytes to complete.
When I monitor the container memory usage during the run, it doesn't exceed 200MB.
That makes me think the memory limit is a red herring.
Still, that's great that you've managed to elicit the failure.
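For anyone who wants to reproduce that measurement, here's a rough sketch: it polls docker stats once per second for a named container while the repro runs in another terminal. Pass whatever name you gave docker run --name; nothing here is specific to this issue.
#!/usr/bin/env python
# Hedged sketch: poll a running container's memory usage once per second.
# Usage: start the repro with --name in one terminal, then:
#   python poll-mem.py <container-name>
import subprocess
import sys
import time

name = sys.argv[1]
while True:
    # --no-stream emits a single sample; MemUsage reads like "180MiB / 7.9GiB".
    sample = subprocess.run(
        ['docker', 'stats', '--no-stream', '--format',
         '{{.MemUsage}}', name],
        capture_output=True, text=True,
    )
    if sample.returncode:
        break  # container exited (or never existed)
    print(sample.stdout.strip())
    time.sleep(1)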
> If I gave you a login on my mac mini, could you possibly use that to replicate the issue and bisect the differences between that machine and your own? If the issue doesn't reproduce in your profile on my machine, I could take over the login and bisect the differences between that profile and the one where the failure occurs.
> Yes, I can do that and report back any findings that I see.
Can you share your SSH public key (or point me to where I can find it), and I'll set up your account. Do you have IPv6, or do I need to expose an IPv4 port?
> Have you tried uninstalling and reinstalling Docker Desktop For Mac? If not, I would consider giving that a shot.
I have not, but I have installed clean on the mac mini, others have reproduced the issue, and I have downgraded and upgraded Docker, so I'm confident the issue isn't unique to the installation on this machine.
I've noticed that the issue is less severe for me today than yesterday, maybe only failing 50-60% of the time. The main difference is that today I'm running on battery power instead of AC. I also tried turning down the number of CPU cores to 1, and I couldn't get it to fail after several attempts. Turning the CPU cores to 2 did trigger the failure, but less frequently, suggesting that concurrency is a factor. @conradwt could you try with cores set to 4 or 8 to see if a smaller number of cores might help replicate the issue on your system?
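(A possible shortcut for these experiments: docker run accepts a --cpus flag that caps a single container's CPU allotment, which avoids restarting the engine between trials. I haven't verified it provokes the fault the same way the global Docker Desktop CPU setting does, so treat the sweep below as a sketch:)
#!/usr/bin/env python
# Hedged sketch: sweep per-container CPU limits via docker run --cpus,
# as a possible stand-in for changing Docker Desktop's global CPU setting.
import subprocess

repro = (
    "apt update && apt install -y wget && "
    "wget https://sh.rustup.rs -O - | sh -s -- -y"
)

for cpus in (1, 2, 4, 8):
    proc = subprocess.run([
        'docker', 'run', '--rm', '--platform', 'linux/amd64',
        '--cpus', str(cpus), 'ubuntu:noble', 'bash', '-c', repro,
    ])
    print(f'cpus={cpus} exit={proc.returncode}')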
@jaraco Here are my results from running it using 4 and 8 CPUs:
4 CPUs (10 iterations): Pass: 7 Fail: 3
8 CPUs (10 iterations): Pass: 2 Fail: 8
BTW, I used the following script:
#!/bin/bash
# Loop 10 times
for i in {1..10}
do
echo "Iteration $i"
# Run the Docker command
docker run --platform linux/amd64 ubuntu:noble bash -c "apt update; apt install -y wget; wget https://sh.rustup.rs -O - | sh -s -- -y"
# Sleep for 10 seconds
sleep 10
done
Usage:
./github-7295 >& output.txt
cat output.txt | grep "Segmentation fault" | wc -l
The crash occurs in installing the rust-docs component, which is infamous for causing filesystem slowdowns during installation because it creates a lot of small files in parallel, as fast as possible. If you want a better way to hit the crash, run rustup component remove rust-docs; rustup component add rust-docs a few times once rustup is installed.
I suspect that any kind of multithreaded stress-testing tool for filesystems can hit this crash. I was initially concerned that this was a bug in rustup, but installing the docs under valgrind or strace makes the install run much slower and the crash go away, while it still crashes under gdb/lldb; this looks like a data race reached by stressing the filesystem. So, generally, anything that really stresses the filesystem is what I would try.
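To make that concrete, here's a minimal sketch of such a stress test: many threads creating small files in parallel, loosely mimicking a rust-docs install. The file counts and sizes are arbitrary, and I haven't confirmed this particular script trips the fault; it would need to run inside the linux/amd64 container.
#!/usr/bin/env python
# Hedged sketch: stress the filesystem by creating many small files
# from multiple threads, loosely mimicking a rust-docs install.
import concurrent.futures
import pathlib

root = pathlib.Path('stress')
root.mkdir(exist_ok=True)

def write_files(worker, count=2000):
    # Each worker writes its own batch of small files as fast as possible.
    for n in range(count):
        (root / f'{worker}-{n}.html').write_text('x' * 512)

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(write_files, range(16)))
print('done')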
> Here are my results from running it using 4 and 8 CPUs:
That's great! Am I right in thinking this is the first time you've been able to replicate the failure? How many CPUs did you have Docker configured before when the issue wouldn't occur? I'm guessing it was 14 or 16 depending on your chip. If it was 1, that might explain why you couldn't replicate the issue (as concurrency was effectively disabled). If it was 14 or 16, that's surprising, because my machine fails easily with 12 configured (and yours seems to be more prone to failure with 8 vs 4).
> If you want a better way to hit the crash
This was really helpful. I used this idea to create jaraco/for-mac-issue7295 from this Dockerfile:
FROM ubuntu:noble
RUN apt update
RUN apt install -y curl
# fetch the rustup installer
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rustup-init
# install only the minimal profile; rust-docs gets added at run time
RUN sh rustup-init -y --profile minimal
ENV PATH=$PATH:/root/.cargo/bin
CMD rustup component add rust-docs
I did try installing with the default profile and then running rustup component remove rust-docs, but that approach fails when rustup tries to remove content added in an earlier layer.
Built and uploaded it with docker build --platform linux/amd64 --tag jaraco/for-mac-issue-7295 --push .
Now I (and others) can run a test quickly and easily with docker run --platform linux/amd64 -it jaraco/for-mac-issue-7295.
Inspired by Conrad's script, I created this Python script to run the command multiple times and summarize the results:
#!/usr/bin/env python
import subprocess
import sys

# docker run reports death-by-signal as 128 + the signal number.
codes = dict(
    SIGSEGV=139,  # 128 + 11
    SIGABRT=134,  # 128 + 6
    OK=0,
)
code_names = {v: k for k, v in codes.items()}

def run_test():
    cmd = ['docker', 'run', '--platform', 'linux/amd64', 'jaraco/for-mac-issue-7295']
    proc = subprocess.run(cmd, capture_output=True)
    return proc.returncode

def run(n_runs=10):
    print(f"Running the command {n_runs} times")
    returncodes = [run_test() for n in range(n_runs)]
    # Map known codes to names; report unrecognized codes numerically.
    results = [code_names.get(code, code) for code in returncodes]
    failures = list(filter(None, returncodes))
    successes = n_runs - len(failures)
    pct = successes / n_runs
    print(f"Success {pct:.0%} {results}")

__name__ == '__main__' and run(*map(eval, sys.argv[1:]))
And here's an example run:
@ py -m check-issue7295
Running the command 10 times
Success 30% ['OK', 'OK', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'OK', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV']
With n-cpus set to just 2, the success rate is higher:
@ py -m check-issue7295 20
Running the command 20 times
Success 85% ['SIGSEGV', 'OK', 'OK', 'SIGSEGV', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'SIGSEGV', 'OK', 'OK']
Repeating the test with and without Rosetta confirms the high failure rate with Rosetta and the low one without, but it also reveals a dramatic difference in performance. Without Rosetta:
draft @ time py -m check-issue7295 20
Running the command 20 times
Success 100% ['OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK']
198.36 real 0.71 user 0.34 sys
With Rosetta:
draft @ time py -m check-issue7295 20
Running the command 20 times
Success 20% ['OK', 'SIGSEGV', 'OK', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'OK', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'OK', 'SIGSEGV']
33.90 real 0.72 user 0.32 sys
I don't think that indicates a performance difference. Crashing midway through is often faster than doing the entire task.
> I don't think that indicates a performance difference. Crashing midway through is often faster than doing the entire task.
I'd considered that, but it also felt much slower. Since performance is potentially a factor, I ran the test comparing two successful runs, and it reported 3.1x the latency when not using Rosetta. With Rosetta:
draft @ time py -m check-issue7295 1
Running the command 1 times
Success 100% ['OK']
3.12 real 0.05 user 0.02 sys
Without Rosetta:
draft @ time py -m check-issue7295 1
Running the command 1 times
Success 100% ['OK']
9.80 real 0.04 user 0.02 sys
So it seems about half of the 5.8x extra latency was due to disabling Rosetta and half was due to the early termination caused by the fault.
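Spelling out that arithmetic with the figures from the timings above:
# Restating the wall-clock timings above (seconds):
with_rosetta_ok, without_rosetta_ok = 3.12, 9.80      # single successful runs
with_rosetta_20, without_rosetta_20 = 33.90, 198.36   # 20-run batches

print(without_rosetta_ok / with_rosetta_ok)   # ~3.1x: slowdown from disabling Rosetta alone
print(without_rosetta_20 / with_rosetta_20)   # ~5.9x: overall slowdown across the batches
# The leftover factor (~5.9 / 3.1 = ~1.9x) matches the with-Rosetta runs
# mostly crashing partway through rather than completing.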
> Here are my results from running it using 4 and 8 CPUs:
> That's great! Am I right in thinking this is the first time you've been able to replicate the failure? How many CPUs did you have Docker configured before when the issue wouldn't occur? I'm guessing it was 14 or 16 depending on your chip. If it was 1, that might explain why you couldn't replicate the issue (as concurrency was effectively disabled). If it was 14 or 16, that's surprising, because my machine fails easily with 12 configured (and yours seems to be more prone to failure with 8 vs 4).
Yes, this is the first time that I've replicated the issue, because I wasn't previously running the command back-to-back within a loop. Also, my default Docker Desktop CPU setting is 4.
Same issue here:
3.320 info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
3.764 info: latest update on 2024-05-02, rust version 1.78.0 (9b00956e5 2024-04-29)
3.766 info: downloading component 'cargo'
5.038 info: downloading component 'clippy'
5.303 info: downloading component 'rust-docs'
7.312 info: downloading component 'rust-std'
10.74 info: downloading component 'rustc'
28.38 info: downloading component 'rustfmt'
28.95 info: installing component 'cargo'
29.59 info: installing component 'clippy'
29.81 info: installing component 'rust-docs'
31.18 Segmentation fault
It only happens with a docker amd64 build.
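One way to double-check that it's emulation-specific is to run the same rustup install natively (arm64) and emulated (amd64) back to back; a sketch, assuming ubuntu:noble on Docker Hub is multi-arch (it is):
#!/usr/bin/env python
# Hedged sketch: run the same rustup install natively (arm64) and under
# emulation (amd64) to confirm only the emulated run faults.
import subprocess

repro = (
    "apt update && apt install -y wget && "
    "wget https://sh.rustup.rs -O - | sh -s -- -y"
)
for platform in ('linux/arm64', 'linux/amd64'):
    proc = subprocess.run([
        'docker', 'run', '--rm', '--platform', platform,
        'ubuntu:noble', 'bash', '-c', repro,
    ])
    print(f'{platform}: exit code {proc.returncode}')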
I've made the Python script available as part of the jaraco.docker package, which means it can be readily installed into and run from a Python environment, or run with pip-run. Here are my latest results on Docker 4.31:
@ pip-run jaraco.docker -- -m jaraco.docker.check-issue7295
Running the command 10 times
Success 30% ['OK', 'SIGSEGV', 'SIGSEGV', 'OK', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'SIGSEGV', 'OK']
Reporting the same issue with stable or nightly:
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly-x86_64-unknown-linux-gnu -y --verbose:
0.511 info: downloading installer
2.149 info: profile set to 'default'
2.149 info: default host triple is x86_64-unknown-linux-gnu
2.150 verbose: creating update-hash directory: '/root/.rustup/update-hashes'
2.151 verbose: installing toolchain 'nightly-x86_64-unknown-linux-gnu'
2.151 verbose: toolchain directory: '/root/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu'
2.152 info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'
2.156 verbose: creating temp root: /root/.rustup/tmp
2.157 verbose: creating temp file: /root/.rustup/tmp/g0qjkpvwmr9wzdsv_file
2.157 verbose: downloading file from: 'https://static.rust-lang.org/dist/channel-rust-nightly.toml.sha256'
2.157 verbose: downloading with reqwest
2.426 verbose: deleted temp file: /root/.rustup/tmp/g0qjkpvwmr9wzdsv_file
2.426 verbose: no update hash at: '/root/.rustup/update-hashes/nightly-x86_64-unknown-linux-gnu'
2.426 verbose: creating temp file: /root/.rustup/tmp/jd2b20zi8o39a8ct_file.toml
2.426 verbose: downloading file from: 'https://static.rust-lang.org/dist/channel-rust-nightly.toml'
2.426 verbose: downloading with reqwest
2.515 verbose: checksum passed
2.541 verbose: deleted temp file: /root/.rustup/tmp/jd2b20zi8o39a8ct_file.toml
2.541 info: latest update on 2024-06-18, rust version 1.81.0-nightly (59e2c01c2 2024-06-17)
2.543 info: downloading component 'cargo'
2.543 verbose: creating Download Directory directory: '/root/.rustup/downloads'
2.544 verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-06-18/cargo-nightly-x86_64-unknown-linux-gnu.tar.xz'
2.544 verbose: downloading with reqwest
3.146 verbose: checksum passed
3.147 info: downloading component 'clippy'
3.147 verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-06-18/clippy-nightly-x86_64-unknown-linux-gnu.tar.xz'
3.147 verbose: downloading with reqwest
3.324 verbose: checksum passed
3.324 info: downloading component 'rust-docs'
3.324 verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-06-18/rust-docs-nightly-x86_64-unknown-linux-gnu.tar.xz'
3.324 verbose: downloading with reqwest
4.640 verbose: checksum passed
4.640 info: downloading component 'rust-std'
4.640 verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-06-18/rust-std-nightly-x86_64-unknown-linux-gnu.tar.xz'
4.640 verbose: downloading with reqwest
6.835 verbose: checksum passed
6.835 info: downloading component 'rustc'
6.836 verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-06-18/rustc-nightly-x86_64-unknown-linux-gnu.tar.xz'
6.836 verbose: downloading with reqwest
12.54 verbose: checksum passed
12.54 info: downloading component 'rustfmt'
12.54 verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-06-18/rustfmt-nightly-x86_64-unknown-linux-gnu.tar.xz'
12.54 verbose: downloading with reqwest
12.72 verbose: checksum passed
12.72 info: installing component 'cargo'
12.72 verbose: creating temp directory: /root/.rustup/tmp/nzx2ftixnzeo5spb_dir
13.30 verbose: deleted temp directory: /root/.rustup/tmp/nzx2ftixnzeo5spb_dir
13.30 info: installing component 'clippy'
13.30 verbose: creating temp directory: /root/.rustup/tmp/jrp16eqls6od27vy_dir
13.54 verbose: creating temp file: /root/.rustup/tmp/r34bncoi5nkzoy6k_file
13.54 verbose: creating temp file: /root/.rustup/tmp/fc5jojgspc8rjxm4_file
13.54 verbose: deleted temp directory: /root/.rustup/tmp/jrp16eqls6od27vy_dir
13.54 info: installing component 'rust-docs'
13.54 verbose: creating temp directory: /root/.rustup/tmp/mg2ub6ot5otvqkf__dir
13.91 sh: line 570: 46 Segmentation fault "$@"
Description
In https://github.com/rust-lang/rust/issues/125430, I reported an issue where Rust fails to install in a Linux amd64 container under Docker on macOS 14.5 on ARM. During the docs install, a Segmentation fault occurs. The issue only occurs with --platform linux/amd64. Analysis in that other issue suggests a root cause in the kernel or Docker.
Reproduce
Expected behavior
The build should complete successfully as it does in other environments.
docker version
docker info
Diagnostics ID
31BF3D41-B6F8-43A0-8FBC-2021581F5862/20240526183503
Additional Info
No response