TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0

Upload never finishes (or takes a **very** long time) #1298

Open avdv opened 2 weeks ago

avdv commented 2 weeks ago

I am currently trying to use nativelink 0.5.1 with buck2. It works fine when building our backend, but building the frontend on macOS aarch64 never finishes; it gets stuck in the re_upload phase:

[2024-08-30T09:32:37.761+00:00] Waiting on root//frontend:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]
...

[2024-08-30T09:59:15.457+00:00] Waiting on root//frontend:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]
...
[2024-08-30T10:05:15.465+00:00] Waiting on root//frontend:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]

(it's still running and I'll see if it finishes at all; I'll report back what happened)

I have started nativelink locally, using the basic config, just setting additional_environment.
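For reference, the local setup looks roughly like this (a sketch; the config path and the additional_environment edit are assumptions based on the nativelink example configs):

# Fetch the example config from the nativelink repo (path assumed), edit the
# worker's additional_environment entries, then start the server locally.
curl -LO https://raw.githubusercontent.com/TraceMachina/nativelink/main/nativelink-config/examples/basic_cas.json
# ... edit additional_environment in basic_cas.json ...
nativelink basic_cas.json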

In the build, there are these genrules:

genrule(
    name = "node_modules",
    srcs = [
        "package.json",
        "yarn.lock",
    ],
    out = "node_modules",
    cmd = "$(exe toolchains//:yarn) install && mv node_modules ../out",
    labels = [
        "yarn_install",  # causes this genrule to run locally (needs network access)
    ],
)

genrule(
    name = "tsc_generated",
    srcs = [
        ":generate_api_types",
        ":node_modules",
        "package.json",
        "tsconfig.json",
    ],
    out = "dist",
    cmd = """\
$(exe toolchains//:yarn) run tsc --outDir ${OUT} --rootDir . --declaration true
""",
    default_outs = ["dist"],
)

So, yes, the node_modules output is indeed large:

$ fd . -tf buck-out/v2/gen/root/904931f735703749/frontend/__node_modules__/out/node_modules | wc -l
92822

(image: "node_modules is heavier than the universe" meme)

avdv commented 2 weeks ago

(it's still running and I'll see if it finishes at all; I'll report back what happened)

It just failed because it ran out of disk space. This was on a GitHub Actions macos-latest-xlarge runner which, according to the docs, should have 14 GiB of SSD storage available. So it seems it used up all the space while trying to upload the node_modules directory.
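One quick way to confirm where the space goes on a runner while this is happening (a sketch; /tmp/nativelink matches the worker/CAS path that shows up later in this thread):

# Show free space and how much the nativelink work/CAS directories have grown.
df -h /tmp
du -sh /tmp/nativelink 2>/dev/null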

adam-singer commented 2 weeks ago

@avdv is it possible to create an isolated project setup that fails or point to something open source that uses buck2 / node that exhibits this failure?

avdv commented 2 weeks ago

Hi @adam-singer, here's a repository which contains a reproducer: https://github.com/avdv/nl-repro/

I cancelled the workflow run after ~1h, see https://github.com/avdv/nl-repro/actions/runs/10664811435/job/29556846974

So, yes, the node_modules output is indeed large:

$ fd . -tf buck-out/v2/gen/root/904931f735703749/frontend/__node_modules__/out/node_modules | wc -l
92822

BTW, these numbers are from my local Linux system. There were no symlinks involved, but maybe that is different on macOS?

avdv commented 1 week ago

Note, I also created a workflow running on Linux. It shows the same symptom: https://github.com/avdv/nl-repro/actions/runs/10679208408/job/29597917819

MarcusSorealheis commented 1 week ago

Are we sure this is a NativeLink issue?

avdv commented 1 week ago

Are we sure this is a NativeLink issue?

We are currently using bazel-remote-worker and BuildBuddy RE with our project. Both work fine for us (although bazel-remote-worker is very slow). That leads me to think the problem is related to NativeLink...

MarcusSorealheis commented 1 week ago

@avdv I'm not totally sure what happened here because we haven't seen it ourselves. My best guess is that a regression we saw in 0.5.1 is the culprit, but I am not sure about the details.

Can you attempt the same operation with 0.5.3 and let me know if the system behaves as expected?

If it works, feel free to close this issue.

MarcusSorealheis commented 1 week ago

On second thought, we'll attempt to run the reproducer on 0.5.3 and report back. Stay tuned!

avdv commented 1 week ago

On second thought, we'll attempt to run the reproducer on 0.5.3 and report back. Stay tuned!

Already did that today, I cancelled it after ~50 mins: https://github.com/avdv/nl-repro/actions/runs/10723077069/job/29735546354

allada commented 1 week ago

I started looking into this a couple of days ago, but then had a conference to attend. I'll try to get some time tomorrow or this weekend to deep-dive into this. These kinds of things usually end up being a config problem somewhere, but I'll keep you posted.

allada commented 1 week ago

It appears the issue is that you are using the default nativelink container to run the jobs in, but this container is bare-bones and has pretty much nothing installed in it, not even bash. So buck2 tries to run:

/usr/bin/env bash -e buck-out/v2/gen/root/904931f735703749/__hello_world__/sh/genrule.sh

It ends up failing because bash is not available in the environment where nativelink deploys the command. What I would suggest instead is to use nativelink as a layer in your image, or, since our published nativelink binaries have no dependencies and are statically compiled, you can copy the binary from a host machine (i.e. another container) and mount it into the container at runtime.
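For example, a rough sketch of the copy-and-mount approach (the image tag, the binary path inside the image, and the target image name are assumptions; check the image you actually pull):

# Copy the statically linked nativelink binary out of the published image...
docker create --name nl-extract ghcr.io/tracemachina/nativelink:v0.5.3
docker cp nl-extract:/usr/local/bin/nativelink ./nativelink
docker rm nl-extract
# ...and mount it into the toolchain image you actually want to run jobs in
# (my-toolchain-image and the config path are placeholders).
docker run --rm -v "$PWD/nativelink:/usr/local/bin/nativelink:ro" \
  my-toolchain-image nativelink /path/to/worker.json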

Buck2 did get an "I am not able to run this command" response, but seems to keep waiting anyway.

avdv commented 1 week ago

Thank you for looking into it, @allada

It appears the issue is that you are using the default nativelink container to run the jobs in, but this container is bare-bones and has pretty much nothing installed in it, not even bash. So buck2 tries to run:

/usr/bin/env bash -e buck-out/v2/gen/root/904931f735703749/__hello_world__/sh/genrule.sh

It ends up failing because bash is not available in the environment where nativelink deploys the command.

OK, good point. I have now removed the hello_world target (which was just a left-over from buck2 init).

Also, I am now extracting the nativelink binary from the Docker image, see https://github.com/avdv/nl-repro/actions/runs/10744420726 (BTW, it reports version 0.5.1 although I am using the 0.5.3 Docker image). There are no errors reported, so I suspect that the action never actually runs and is just waiting for the upload to finish...

Note: in my actual project we were running nativelink on macOS, and in my initial repro on Linux (https://github.com/avdv/nl-repro/actions/runs/10679208408) I was building nativelink with Nix (https://github.com/avdv/nl-repro/commit/fef57740c45b211d10ab6c66e084b11c587179b1#diff-fde0e5d64aae13964fdda6d47af304cf1a7015cbc17e440ac4a5e662ee1d875eR25).
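For completeness, the from-source build via Nix looks roughly like this (a sketch, assuming the repo's flake exposes nativelink as its default package):

# Build nativelink from its Nix flake and run it with a local config.
nix build github:TraceMachina/nativelink
./result/bin/nativelink basic_cas.json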

allada commented 1 week ago

I've been looking into this and when I run it locally I get:

Action failed: root//:tsc_generated (genrule)
Remote command returned non-zero exit code 2
stdout:
yarn run v1.22.22
$ /tmp/nativelink/work/8bbebf93-60fd-41c8-b369-768322fb2a7f/work/buck-out/v2/gen/root/904931f735703749/__tsc_generated__/srcs/node_modules/.bin/tsc --outDir ../out/dist --rootDir . --declaration true
error TS18003: No inputs were found in config file '/tmp/nativelink/work/8bbebf93-60fd-41c8-b369-768322fb2a7f/work/buck-out/v2/gen/root/904931f735703749/__tsc_generated__/srcs/tsconfig.json'. Specified 'include' paths were '["src/**/*"]' and 'exclude' paths were '[]'.
Files:                          0
Lines of Library:               0
Lines of Definitions:           0
Lines of TypeScript:            0
Lines of JavaScript:            0
Lines of JSON:                  0
Lines of Other:                 0
Nodes of Library:               0
Nodes of Definitions:           0
Nodes of TypeScript:            0
Nodes of JavaScript:            0
Nodes of JSON:                  0
Nodes of Other:                 0
Identifiers:                    0
Symbols:                        7
Types:                         57
Instantiations:                 0
Memory used:               21216K
Assignability cache size:       0
Identity cache size:            0
Subtype cache size:             0
Strict subtype cache size:      0
Program time:               0.00s
printTime time:             0.00s
Emit time:                  0.00s
Total time:                 0.00s
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
stderr:
error Command failed with exit code 2.
Build ID: b45ebc64-6f60-496c-a8fb-a8fac0c30988
Network: (GRPC-SESSION-ID)
Jobs completed: 5. Time elapsed: 12.7s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 1, local: 0)
BUILD FAILED
Failed to build 'root//:tsc_generated (prelude//platforms:default#904931f735703749)'

I want to make sure I'm debugging the right thing here, so just to clarify, you are not even able to get to this stage, correct?

If that is the case, one thing I do see is just shy of 100k files, 16k directories, and 677 symlinks, for a total of ~750 MB.

This leads me to think it is possible that the worker is being bound by IO and/or kernel calls. GitHub runners are famous for slow disk performance. Given this, what I might suggest is mounting a tmpfs or ramfs and configuring the nativelink worker to use that mount instead. This may not be the best long-term solution, but at least it will tell us whether the GitHub runners are disk-IO bound.
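Something along these lines, for example (a sketch; the mount point is arbitrary and the exact config keys depend on your nativelink config):

# Back the worker's scratch space with RAM instead of the runner's slow disk.
sudo mkdir -p /mnt/nl-tmpfs
sudo mount -t tmpfs -o size=4G tmpfs /mnt/nl-tmpfs
# Then point the filesystem store / worker work directory paths in the
# nativelink JSON config at /mnt/nl-tmpfs (instead of /tmp/nativelink) and
# restart the worker.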

avdv commented 1 week ago

I've been looking into this and when I run it locally I get:

Action failed: root//:tsc_generated (genrule)
Remote command returned non-zero exit code 2
[... same tsc output and build summary as quoted in the previous comment ...]
BUILD FAILED
Failed to build 'root//:tsc_generated (prelude//platforms:default#904931f735703749)'

This is what I would expect to see too.

I want to make sure I'm debugging the right thing here, so just to clarify, you are not even able to get to this stage, correct?

Yes, correct :+1: .

Running the yarn command remotely is not really necessary, so I replaced the yarn command for tsc_generated with:

ls -lh node_modules/ >&2 ; mkdir ${OUT}; echo done > ${OUT}/stamp

(see https://github.com/avdv/nl-repro/commit/6a9865165169093026780e0036f3040645f3fbfe)

I ran buck2 build -v2,stderr --prefer-remote :tsc_generated and forgot about it... it ran for over 23 hours :open_mouth: on my local machine before I cancelled it.

$ buck2 build -v2,stderr --prefer-remote :tsc_generated
File changed: root//.#BUCK
File changed: root//BUCK
Build ID: 03766e8d-435a-4427-b54e-6798674ffcc3
Network: (GRPC-SESSION-ID)
Command: build.                      Remaining: 1/8. Time elapsed: 23:08:26.1s
--------------------------------------------------------------------------------
root//:tsc_generated -- action (genrule) [re_upload]               23:08:26.1s

Running the command again, it succeeded in about 55 seconds:

$ buck2 build -v2,stderr --prefer-remote :tsc_generated
stderr for root//:tsc_generated (genrule):
total 0
drwxr-xr-x 1 claudio users   18 Sep  8 16:29 @aashutoshrathi
drwxr-xr-x 1 claudio users   18 Sep  8 16:29 @ampproject
drwxr-xr-x 1 claudio users 2.8K Sep  8 16:29 @babel
drwxr-xr-x 1 claudio users   92 Sep  8 16:29 @bazel
...
Build ID: 207239c9-5bea-45de-8edc-970b0cfae932
Network: (GRPC-SESSION-ID)
Jobs completed: 4. Time elapsed: 54.0s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 1, local: 0)
BUILD SUCCEEDED

If that is the case, one thing I do see is just shy of 100k files, 16k directories, and 677 symlinks, for a total of ~750 MB.

This leads me to think it is possible that the worker is being bound by IO and/or kernel calls. GitHub runners are famous for slow disk performance. Given this, what I might suggest is mounting a tmpfs or ramfs and configuring the nativelink worker to use that mount instead. This may not be the best long-term solution, but at least it will tell us whether the GitHub runners are disk-IO bound.

Since it also happens on my machine, I don't think that is the issue here. Rather, it looks like it got stuck somehow even though everything is already uploaded, since a subsequent run "immediately" succeeds...

Here's the latest run on Github: https://github.com/avdv/nl-repro/actions/runs/10760776248/job/29839142839

allada commented 1 week ago

Oh, I wonder if you are hitting the max open files limit.

Try running (in the same shell):

ulimit -n 65000

Then also increase max_open_files in the nativelink JSON config to something very high, and see if that helps.
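Concretely, something like this (a sketch; the exact location of the max_open_files key may differ between config versions):

# Raise the soft limit in the shell that launches nativelink...
ulimit -n 65000
# ...and bump the matching limit in the nativelink JSON config, e.g.
#   "global": { "max_open_files": 65000 }
# then restart nativelink and retry the build.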

You are probably running on a Mac, and I believe it has much lower default limits than Linux.

avdv commented 1 week ago

Oh, I wonder if you are hitting the max open files limit.

Try running (in the same shell):

ulimit -n 65000

Then also increase max_open_files in the nativelink JSON config to something very high, and see if that helps.

You are probably running on a Mac, and I believe it has much lower default limits than Linux.

No, I am using Linux (NixOS actually). The open file descriptor limits are already set quite high, at least the hard limit is:

$ ulimit -H -n
524288
$ ulimit -S -n
1024

I tried setting the soft limit to 65000 anyway, changed the global.max_open_files setting to 61512, ran rm -rf /tmp/nativelink, restarted nativelink, and ran buck2 clean and buck2 build :tsc_generated, but the issue persists.

avdv commented 1 week ago

BTW, I also tried increasing the limit on macOS (https://github.com/avdv/nl-repro/commits/main/). On the first run (https://github.com/avdv/nl-repro/actions/runs/10768327053/job/29857259444) it helped, and I got:

[2024-09-09T07:24:41.843+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]
[2024-09-09T07:24:48.941+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:24:55.945+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:25:03.040+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:25:10.041+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:25:15.630+00:00] Action failed: root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) (genrule)
[2024-09-09T07:25:15.630+00:00] Remote command returned non-zero exit code 127
[2024-09-09T07:25:15.630+00:00] Stdout: <empty>
[2024-09-09T07:25:15.630+00:00] Stderr:
buck-out/v2/gen/root/200212f73efcd57d/__tsc_generated__/sh/genrule.sh: line 5: yarn: command not found

[2024-09-09T07:25:15.643+00:00] Cache hits: 0%
[2024-09-09T07:25:15.643+00:00] Commands: 3 (cached: 0, remote: 2, local: 1)
[2024-09-09T07:25:15.643+00:00] Network: (GRPC-SESSION-ID)
[2024-09-09T07:25:15.644+00:00]
[2024-09-09T07:25:15.644+00:00] BUILD ERRORS (1)
[2024-09-09T07:25:15.644+00:00] The following actions failed during the execution of this command:
[2024-09-09T07:25:15.644+00:00] Action failed: root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) (genrule)
[2024-09-09T07:25:15.644+00:00] Remote command returned non-zero exit code 127
...

So I replaced the yarn command with ls -lh node_modules/ >&2 ; mkdir ${OUT}; echo done > ${OUT}/stamp, similar to the Linux branch, and it got stuck again (https://github.com/avdv/nl-repro/actions/runs/10768573202/job/29857960048) :sob:

allada commented 1 week ago

Remember you now need to install yarn.

If you run buck2 a second time, does it start working? When I was testing this, I did notice that it looked like buck2 stopped talking to nativelink after the first handshake, but I didn't think much of it, so I restarted the buck2 build and it started working. Is this similar to what you are seeing?

Edit: It appears that eventually buck2 does upload the files and it does execute.

FYI: The fact that it even got to:

buck-out/v2/gen/root/200212f73efcd57d/__tsc_generated__/sh/genrule.sh: line 5: yarn: command not found

Means that nativelink did try to execute the command.

I suggest running nativelink with RUST_LOG=info. This will be very verbose, but it will give a lot of insight into what the worker/scheduler/CAS is doing.
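For example (using whatever binary and config you already start nativelink with; the config filename here is just a placeholder):

# Re-run the same nativelink invocation with info-level logging and keep a copy of the output.
RUST_LOG=info nativelink basic_cas.json 2>&1 | tee nativelink.log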