bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build

"unlinkat ([path]) (Directory not empty)" errors with experimental_worker_for_repo_fetching #22680

Open jfirebaugh opened 4 months ago

jfirebaugh commented 4 months ago

Description of the bug:

After upgrading to Bazel 7.2.0 and removing --experimental_worker_for_repo_fetching=off from .bazelrc (because https://github.com/bazelbuild/bazel/issues/21803 was fixed), we started to observe failures of the following form:

ERROR: /home/circleci/figma/WORKSPACE.bazel:157:27: fetching node_repositories rule //external:nodejs_linux_amd64: java.io.IOException: unlinkat (/home/circleci/.cache/bazel/_bazel_circleci/a6e8cb0ce75ceb1ea1faa0cca0b0ae0d/external/nodejs_linux_amd64/bin/nodejs) (Directory not empty)
ERROR: no such package '@@nodejs_linux_amd64//': unlinkat (/home/circleci/.cache/bazel/_bazel_circleci/a6e8cb0ce75ceb1ea1faa0cca0b0ae0d/external/nodejs_linux_amd64/bin/nodejs) (Directory not empty)

These errors appear to be linked to repository fetch restarts, in particular "fetch interrupted due to memory pressure; restarting". In both cases where I have seen this error, it happened in a repo whose fetch got restarted, judging by either the presence of that log message or the JSON trace profile.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Bazel Slack discussion: https://bazelbuild.slack.com/archives/CA31HN1T3/p1718057176146129

Any other information, logs, or outputs that you want to share?

No response

jfirebaugh commented 4 months ago

I am presently confirming that --experimental_worker_for_repo_fetching=off mitigates the issue.

jfirebaugh commented 4 months ago

Still on 7.2.0, haven't seen this again with --experimental_worker_for_repo_fetching=off.
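
For reference, the mitigation both of these comments describe is the flag from the original report; a minimal .bazelrc sketch (exact placement in your rc files may vary):

    # Workaround: disable the repo-fetching worker until this bug is fixed
    common --experimental_worker_for_repo_fetching=off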

fmeum commented 4 months ago

I managed to reproduce this via https://github.com/bazelbuild/bazel/pull/22748#issuecomment-2167837460:

INFO: repository @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2' used the following cache hits instead of downloading the corresponding file.
 * Hash '16a51843ef28d79f06c864eb305266b3daa1dc2a932af02a82ab139e42c8f2c2aed34dbca2ba8187134c16415e9f4cc6ca0e9ea40083df6a63564e7c9e0204ad' for https://registry.npmjs.org/decode-uri-component/-/decode-uri-component-0.2.2.tgz
If the definition of 'repository @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2' was updated, verify that the hashes were also updated.
com.google.devtools.build.lib.rules.repository.RepositoryFunction$RepositoryFunctionException: java.io.IOException: /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
    at com.google.devtools.build.lib.rules.repository.RepositoryFunction.setupRepoRoot(RepositoryFunction.java:139)
    at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.lambda$fetch$1(StarlarkRepositoryFunction.java:154)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
    at java.base/java.util.concurrent.ThreadPerTaskExecutor$TaskRunner.run(ThreadPerTaskExecutor.java:314)
    at java.base/java.lang.VirtualThread.run(VirtualThread.java:309)
Caused by: java.io.IOException: /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
    at com.google.devtools.build.lib.unix.NativePosixFiles.remove(Native Method)
    at com.google.devtools.build.lib.unix.UnixFileSystem.delete(UnixFileSystem.java:444)
    at com.google.devtools.build.lib.vfs.FileSystem.deleteTree(FileSystem.java:232)
    at com.google.devtools.build.lib.vfs.Path.deleteTree(Path.java:594)
    at com.google.devtools.build.lib.rules.repository.RepositoryFunction.setupRepoRoot(RepositoryFunction.java:136)
    ... 6 more
ERROR: <builtin>: fetching npm_import_rule rule //:aspect_rules_js~~npm~npm__decode-uri-component__0.2.2: java.io.IOException: /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
ERROR: no such package '@@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2//': /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
ERROR: /Users/fmeum/git/examples/frontend/BUILD.bazel:10:22: //:.aspect_rules_js/node_modules/decode-uri-component@0.2.2/pkg depends on @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2//:pkg in repository @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 which failed to fetch. no such package '@@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2//': /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
Use --verbose_failures to see the command lines of failed build steps.
ERROR: Analysis of target '//react/src:test_lib' failed; build aborted: Analysis failed
INFO: Elapsed time: 38.951s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED:
    Fetching repository @@aspect_rules_swc~~swc~swc_darwin-arm64; fetch interrupted due to memory pressure; restarting. 19s
    Fetching repository @@aspect_rules_js~~npm~npm__buffer-from__1.1.2; fetch interrupted due to memory pressure; restarting. 11s
    Fetching https://registry.npmjs.org/@typescript-eslint/type-utils/-/type-utils-5.44.0.tgz; 11s
    Fetching repository @@aspect_rules_js~~npm~npm__esbuild-windows-64__0.15.18; fetch interrupted due to memory pressure; restarting. 10s
    Fetching repository @@aspect_rules_js~~npm~npm__esbuild-freebsd-arm64__0.15.18; fetch interrupted due to memory pressure; restarting. 8s
    Fetching https://registry.npmjs.org/is-weakset/-/is-weakset-2.0.2.tgz; 7s
    Fetching repository @@aspect_rules_js~~npm~npm__at_next_swc-win32-arm64-msvc__13.0.5; fetch interrupted due to memory pressure; restarting. 6s
    Fetching repository @@aspect_rules_js~~npm~npm__randombytes__2.1.0; fetch interrupted due to memory pressure; restarting. 5s ... (20 fetches)
meteorcloudy commented 4 months ago

@bazel-io fork 7.2.1

jjmaestro commented 3 months ago

I have a repro of this bug here: https://github.com/jjmaestro/bazel_issue_22680_repro

It always fails when I run it inside that Docker container.

@fmeum and @iancha1992 & @moroten (from #22748) could you check it out and see if this also fails for you? Thanks!!

PS: also, sorry for the commit noise; I did a few push --force and didn't realize that GH would post the activity here because I had linked the issue πŸ˜“

iancha1992 commented 3 months ago

@jjmaestro Did you test with 7.2.1? According to https://github.com/bazelbuild/bazel/issues/23029#issuecomment-2237340388, the fix may have been included in 7.2.1.

jjmaestro commented 3 months ago

@jjmaestro Did you test with 7.2.1? According to #23029 (comment), the fix may have been included in 7.2.1.

Yup, that's exactly what I did. The repro repo that I linked is set to 7.2.1rc2. Can you give it a quick check? :)

It's literally one git clone and make away πŸ‘Œ That will run a Docker image with Bazel 7.2.1rc2 and do one bazel build that works (cache dir inside Docker) and two that fail, with the cache directory outside Docker. One of those also uses --experimental_worker_for_repo_fetching=off, which was mentioned above as mitigating this issue, but it still fails in the repro I built.
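
Roughly, running the repro looks like this (a sketch; the clone directory name is just inferred from the repo URL, and make drives the Docker image and the three bazel builds described above):

    git clone https://github.com/jjmaestro/bazel_issue_22680_repro
    cd bazel_issue_22680_repro
    make  # runs the Docker image with Bazel 7.2.1rc2 and the three bazel builds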

I hope this will help debug what's causing this!

jjmaestro commented 3 months ago

Ah, ok, sorry, my brain just caught on to what you are saying πŸ˜… I've updated the repo and re-run it with 7.2.1, and it's still broken. I had tested it with both versions myself; dunno why my brain short-circuited and committed the rc2 version instead of the released version 😫 Sorry for the noise!

Can you check it out and see if that helps with identifying what's going on?

Thanks!

meteorcloudy commented 3 months ago

@jjmaestro Does it work with Bazel built from last_green? (Sorry, I'm a bit too occupied right now to try the repro myself)

jjmaestro commented 3 months ago

@meteorcloudy not sure, I haven't tested it yet with a custom-built version. I was just starting to clone the repo and read up on how to build Bazel from source, so I'll play with this a bit more and will get back to you!

fmeum commented 3 months ago

I ran the reproducer, but all three builds passed for me.

jjmaestro commented 3 months ago

@fmeum wow, that's surprising! It's a 100% repro for me; it always fails when I run it. And I had a friend run the repro and it also failed for them, which is why I thought it could be something in Bazel. But maybe it's Docker and/or some incompatibility or configuration issue?

I'm running Docker 26.1.4, build 5650f9b on a Mac Studio (macOS 13.6.6 on an M1 Max), so Docker runs on linux/arm64.

What's your setup? Also, any ideas as to how I could debug this further? I'd love to get some help and see if I can dig into this further :)

Thanks!!

fmeum commented 3 months ago

I ran this in a Debian VM on a Mac M3 host.

Could you share the stack traces and build failures you got?

jjmaestro commented 3 months ago

I ran this in a Debian VM on a Mac M3 host.

Yeah, I just ran it on a Debian VM (debian12 running in UTM) on my Mac and it does work... so my guess is that there's something going on with the virtualization that Docker uses on macOS.

Could you share the stack traces and build failures you got?

This is the error I get running build with --verbose_failures:

$ bazel --output_base /tmp/out/bazel build --verbose_failures //testimage:tarball

INFO: Repository rules_python~~python~python_3_11_aarch64-unknown-linux-gnu instantiated at:
  <builtin>: in <toplevel>
Repository rule python_repository defined at:
  /tmp/out/bazel/external/rules_python~/python/repositories.bzl:443:36: in <toplevel>
INFO: repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu' used the following cache hits instead of downloading the corresponding file.
 * Hash 'b3a7199ac2615d75fb906e5ba556432efcf24baf8651fc70370d9f052d4069ee' for https://github.com/indygreg/python-build-standalone/releases/download/20240415/cpython-3.11.9+20240415-aarch64-unknown-linux-gnu-install_only.tar.gz
If the definition of 'repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu' was updated, verify that the hashes were also updated.
ERROR: An error occurred during the fetch of repository 'rules_python~~python~python_3_11_aarch64-unknown-linux-gnu':
   Traceback (most recent call last):
        File "/tmp/out/bazel/external/rules_python~/python/repositories.bzl", line 255, column 28, in _python_repository_impl
                rctx.delete("share/terminfo")
Error in delete: java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: /tmp/out/bazel/external/rules_pkg~/pkg/private/tar/BUILD:47:10: @@rules_pkg~//pkg/private/tar:build_tar depends on @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//:python_runtimes in repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu which failed to fetch. no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: Analysis of target '//testimage:tarball' failed; build aborted: Analysis failed
INFO: Elapsed time: 7.711s, Critical Path: 0.01s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED: 
    Fetching repository @@_main~_repo_rules~postgres_src; starting 7s
    Fetching repository @@rules_foreign_cc~~tools~cmake-3.23.2-linux-aarch64; starting 7s
    Fetching /tmp/out/bazel/external/_main~_repo_rules~postgres_src; Extracting REL_16_3.tar.gz 6s
    Fetching https://github.com/Kitware/CMake/releases/download/v3.23.2/cmake-3.23.2-linux-aarch64.tar.gz; 20.1 MiB (44.5%) 6s

Re. the stack traces, I'm a Bazel n00b so I don't know if there's any other flag or way to get more verbose output or logs. If there is, please let me know and I'll get them.

Thanks!

jjmaestro commented 3 months ago

It seems the offending line is rctx.delete("share/terminfo") at line 255 in https://github.com/bazelbuild/rules_python/blob/084b877c98b580839ceab2b071b02fc6768f3de6/python/repositories.bzl#L237-L256

So, Bazel fails when trying to delete that directory with a "Directory not empty" error, or something like that :-?

I'll try to dig deeper and see if I can get more info on the Bazel side
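
For context, this is roughly what that part of the repo rule does (a heavily simplified, hypothetical Starlark sketch, not the actual rules_python code): extract the Python archive and then rctx.delete() an unwanted directory, which is where the IOException surfaces.

    # Hypothetical, stripped-down repo rule that deletes a subdirectory after
    # extraction, similar to what rules_python does with share/terminfo.
    def _python_repository_impl(rctx):
        rctx.download_and_extract(
            url = rctx.attr.url,
            sha256 = rctx.attr.sha256,
        )
        # This is the kind of call that fails with
        # "unlinkat (.../share/terminfo/X) (Directory not empty)".
        rctx.delete("share/terminfo")
        rctx.file("BUILD.bazel", 'filegroup(name = "files", srcs = glob(["**"]))')

    python_repository = repository_rule(
        implementation = _python_repository_impl,
        attrs = {
            "url": attr.string(),
            "sha256": attr.string(),
        },
    )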

fmeum commented 3 months ago

Since the stack trace is produced by an rctx.delete call in a Python repo rule and the crash still happens with virtual workers turned off, I wonder whether this may be a legitimate domain-specific non-Bazel issue. Have you raised this on the rules_python repo? I won't rule out that this is ultimately a Bazel issue though.

jjmaestro commented 3 months ago

No, I haven't yet. It could definitely be either, but I suspect the error is somewhere in Bazel, because that rctx.delete in rules_python works under certain conditions, e.g. when the directory / files are not inside a Docker volume running on the "Docker virtualization" (whatever that is; I need to check what VM it uses and how it's different from e.g. UTM, where it doesn't repro)... and that, to me, points more to a potential bug somewhere deeper than a rule.

I'll try to dig a bit more into this and see where it goes!

jjmaestro commented 3 months ago

Small update: IMHO there's definitely something weird going on... to me, this definitely looks like some sort of race condition / non-deterministic behavior. When I add printf statements to do some print-debugging, I get a different error than when I just run Bazel built from main!?

When I just compile main at [17eadaf366] and use that bazel-dev binary, I get LOG-main_without_prints, which is the same error I've been getting when I use a 7.2.1 binary:

INFO: Repository rules_python~~python~python_3_11_aarch64-unknown-linux-gnu instantiated at:
  <builtin>: in <toplevel>
Repository rule python_repository defined at:
  /tmp/out/bazel/external/rules_python~/python/repositories.bzl:443:36: in <toplevel>
ERROR: An error occurred during the fetch of repository 'rules_python~~python~python_3_11_aarch64-unknown-linux-gnu':
   Traceback (most recent call last):
    File "/tmp/out/bazel/external/rules_python~/python/repositories.bzl", line 255, column 28, in _python_repository_impl
        rctx.delete("share/terminfo")
Error in delete: java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
Analyzing: target //testimage:tarball (277 packages loaded, 9517 targets configured)
[1 / 1] checking cached actions
ERROR: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: /tmp/out/bazel/external/rules_pkg~/pkg/private/tar/BUILD:47:10: @@rules_pkg~//pkg/private/tar:build_tar depends on @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//:python_runtimes in repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu which failed to fetch. no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: Analysis of target '//testimage:tarball' failed; build aborted: Analysis failed
INFO: Elapsed time: 63.166s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED: 

But when I run the repro in Docker using a compiled bazel-dev with this patch applied to main at [17eadaf366], I get a different error: LOG-main_with_prints:

<PRINT-DEBUG LOGS SHOWING DELETED PATHS>
(...)
ERROR: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': error globbing [bin/**, extensions/**, include/**, lib/**, libs/**, share/**] - [**/* *, lib/libpython3.11.so, lib/**/*.a, lib/python3.11/**/test/**, lib/python3.11/**/tests/**, **/__pycache__/*.pyc.*, **/__pycache__/*.pyc, **/__pycache__/*.pyo] op=FILES: /tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/lib/python3.11/site-packages/pip/_internal/operations/build (No such file or directory)
ERROR: /tmp/out/bazel/external/rules_pkg~/pkg/private/tar/BUILD:47:10: Target '@@rules_pkg~//pkg/private/tar:build_tar' depends on toolchain '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//:python_runtimes', which cannot be found: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': error globbing [bin/**, extensions/**, include/**, lib/**, libs/**, share/**] - [**/* *, lib/libpython3.11.so, lib/**/*.a, lib/python3.11/**/test/**, lib/python3.11/**/tests/**, **/__pycache__/*.pyc.*, **/__pycache__/*.pyc, **/__pycache__/*.pyo] op=FILES: /tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/lib/python3.11/site-packages/pip/_internal/operations/build (No such file or directory)'
ERROR: Analysis of target '//testimage:tarball' failed; build aborted: Analysis failed
INFO: Elapsed time: 50.333s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED: 

This looks to me like the delete that failed before now succeeds when the logging / printf lines are executed, and then the build fails somewhere else, or something like that.

In any case, whether this is what's happening or something else, it seems quite weird to me that these errors would be caused by rules_python, and that I would get different results (different errors) just by having or not having printf statements in the Bazel code :-? Maybe I'm missing something and/or doing something wrong, dunno.

I'll keep checking further, hopefully I'll be able to isolate / get closer to what's actually failing.

fmeum commented 3 months ago

CC @rickeylev @aignas in case you know of any logic in the rules_python repo rules that could result in concurrent access to the files it deletes.

jjmaestro commented 3 months ago

@fmeum BTW, just found this while looking for the implementation of deleteTreesBelow: https://github.com/bazelbuild/bazel/blob/d915b98376295bd010f669e7a60815a3eeca3d0c/src/main/native/unix_jni.cc#L1075-L1089

Part of the comment was added by @oquenchil in https://github.com/bazelbuild/bazel/commit/d6c79dbde3365c2e0e370eba4942ea49e3533395, and the commit message mentions:

Asynchronous deletion may make it more likely for the problem to surface but the likelihood is still very low.

Could this be the issue? Maybe running the deletes on a Docker volume on macOS makes triggering the bug more likely :-?

rickeylev commented 3 months ago

We used to see concurrent write issues on Windows because of pyc creation, but fixed it. This would happen during the build phase, not the repo phase.

Part of how the pyc creation process works: Python writes foo.pyc.NNNNN, where NNNNN is a timestamp, then does mv(foo.pyc.NNNNN, foo.pyc) to perform an atomic rename. This happens after it has created the pycache directory and within that directory (assuming it could; if it can't, it doesn't try to precompile).

To fix this, we made two changes:

  1. The globs ignore pyc files and pycache directories (sketched below).

  2. The repo files/dirs are all made read-only.

On Windows, if a user is an admin, (2) doesn't actually work. Hence (1) is the defense in that case.

On Mac/Linux, we never saw issues.

Hope this helps
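
To illustrate fix (1): the exclude patterns are visible in the "error globbing" output earlier in this thread; a rough sketch of that kind of glob (not the verbatim rules_python BUILD content) looks like this:

    # Exclude byte-compiled artifacts so concurrently written pyc files
    # (including in-progress foo.pyc.NNNNN temp files) never feed the build.
    filegroup(
        name = "files",
        srcs = glob(
            include = ["bin/**", "include/**", "lib/**", "share/**"],
            exclude = [
                "**/__pycache__/*.pyc",
                "**/__pycache__/*.pyc.*",
                "**/__pycache__/*.pyo",
            ],
            allow_empty = True,
        ),
    )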


jjmaestro commented 3 months ago

Digging into why this happens in Docker running on macOS, I see that it's using virtiofs to share the filesystem in those volumes. And, looking into virtiofs and Docker... well, there are quite a few present and past issues, with quite the repertoire, like https://github.com/docker/for-mac/issues/7004 and a lot of permissions issues.

I've also checked past issues with Bazel and virtiofs and sure enough, found a past permission issue: MacOS + Docker + Cache on the VirtioFS volume => "ERROR: Failed to initialize sandbox: /sandbox/inaccessibleHelperDir (Permission denied)" #16364 .

I'm trying to "prove this" by using Docker with gRPC-FUSE instead but... that also fails the build with a different error 🀣 😭

# bazel --output_base /tmp/out/bazel build --verbose_failures //testimage:tarball
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Analyzed target //testimage:tarball (278 packages loaded, 18569 targets configured).
ERROR: /tmp/out/bazel/external/rules_foreign_cc~/toolchains/private/BUILD.bazel:23:10: output tree artifact external/rules_foreign_cc~/toolchains/private/make was not created
ERROR: /tmp/out/bazel/external/rules_foreign_cc~/toolchains/private/BUILD.bazel:23:10: BootstrapGNUMake external/rules_foreign_cc~/toolchains/private/make [for tool] failed: not all outputs were created or valid
Target //testimage:tarball failed to build
INFO: Elapsed time: 148.828s, Critical Path: 24.86s
INFO: 110 processes: 32 internal, 78 processwrapper-sandbox.
ERROR: Build did NOT complete successfully

And... after that failure, I can't remove the bazel cache! πŸ˜…

$ rm -rf /tmp/out/bazel 
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/bin/make': Operation not permitted
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/include/gnumake.h': Operation not permitted
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/share/info/make.info': Operation not permitted
(...)
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/share/man/man1': Operation not permitted

I get the same errors even when trying to remove the folder outside Docker; I need sudo rm on macOS to remove it. Something seems very broken in Docker, both with virtiofs and gRPC-FUSE.

So, I have a feeling there are one or more bugs in virtiofs and/or in how Docker-on-Mac uses virtiofs, and that's what's causing the breakage 😞 I'll probably just avoid all of this and work in a Linux VM directly so that there's no volume shared with macOS 🀷

Thanks for all your help!

jjmaestro commented 3 months ago

BTW, for completeness, I've also tried removing Docker Desktop and running the docker CLI with Lima:

So, my guess remains what I mentioned in my last comment: something's going on with Docker Desktop and virtiofs; maybe there's a bug in the virtiofs version that Docker Desktop uses, or something in how Docker Desktop uses virtiofs triggers the issue, or something else.