facebook / buck2

Build system, successor to Buck
https://buck2.build/
Apache License 2.0

Remote configuration is not reloaded on build; the server must be restarted. #156

Closed. stagnation closed this issue 1 year ago.

stagnation commented 1 year ago

Hi!

I was tinkering a little with the engflow example and a local BuildBarn deployment (https://github.com/buildbarn/bb-deployments)*. At first I just got errors about the URL missing a scheme, and the same with every modification to the config. But building the same project with a debug build from source I got a new error. So it must be that the remote execution configuration is read and cached by the server.

WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 4 seconds: Error {
    context: "RE: creating client",
    source: Error {
        context: "Error creating CAS client",
        source: Error {
            context: "Error connecting",
            source: tonic::transport::Error(
                Transport,
                hyper::Error(
                    Connect,
                    "invalid URL, scheme is missing",
                ),
            ),
        },
    },
}

I expect buck2 build //... to pick up changes. But restarting the server works fine: buck2 kill; buck2 server; buck2 build //.... [EDIT]: https://github.com/facebook/buck2/issues/156#issuecomment-1506769115 : it is better to use an explicit server call in the pipeline, so other subcommands can work with the server. This composes better than just kill-and-build.
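For instance (using the commands that come up later in this thread), an explicit server start lets a debug subcommand run against the fresh daemon before the build:

buck2 kill
buck2 server
buck2 debug set-log-filter hyper=debug,tonic=debug
buck2 build //...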


Forgive the nested questions; if you wish, I can open a new issue.

Now I have a very minimal config that is picked up, but it does not work: I get a CorruptMessage error.

~/tas/buc/exa/rem/engflow $ cat .buckconfig
[repositories]
root = .
prelude = prelude

[buck2]
# Adjust as needed.
digest_algorithms = SHA256

[buck2_re_client]
# Adjust as needed
action_cache_address = grpc://127.0.0.1:8991
engine_address = grpc://127.0.0.1:8991
cas_address = grpc://127.0.0.1:8991

[build]
  execution_platforms = root//platforms:platforms

The error:

WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 3 seconds: Error {
    context: "RE: creating client",
    source: Error {
        context: "Error creating CAS client",
        source: Error {
            context: "Error connecting",
            source: tonic::transport::Error(
                Transport,
                hyper::Error(
                    Connect,
                    Custom {
                        kind: InvalidData,
                        error: CorruptMessage,
                    },
                ),
            ),
        },
    },
}

gRPC service listing:

$ grpcurl -plaintext 127.0.0.1:8991 list
build.bazel.remote.execution.v2.ActionCache
build.bazel.remote.execution.v2.Capabilities
build.bazel.remote.execution.v2.ContentAddressableStorage
build.bazel.remote.execution.v2.Execution
google.bytestream.ByteStream
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection

I can use this BuildBarn deployment with Bazel, so I think this is some confusion about the configuration. I do not want TLS here and did not configure any certs; could that be related to the corruption?


I'm not familiar with debugging Tokio code, and did not manage to hit any breakpoints in REClientBuilder::build_and_connect with rr or lldb, so I do not know much more about the problem. I would like the error messages to echo the gRPC address (at least at high verbosity); that would make the initial problem of reloading the config easier to understand.


Versions:

buck2 --version
buck2 8432bb6e5967c530067997d8adddf798997b2898 <local>
target/debug/buck2 --version
buck2 05abd9003675537b6f33c0566ac87532e4f468cd <local>

* Not exactly the same configuration, but if you want I can spend the time to reproduce this with the open source code.

krallin commented 1 year ago

I expect buck2 build //... to pick up changes. But restarting the server works fine: buck2 kill; buck2 build //....

Yeah, that's expected. I think it could be better documented, but while your expectation is understandable, I don't think that's something we can realistically support at this time.

Now I have a very minimal config that is used, but does not work for CorruptMessage error-reasons.

I think the address might be http://localhost:8980, try that? (I don't quite recall why, but it seems to be what I've got in a local config I used for testing with BuildBarn.)

krallin commented 1 year ago

I would like the error messages to echo the grpc-address (at least with a high verbosity)

Sure, that seems rather easy to do.

stagnation commented 1 year ago

Thanks, I'm happy for the sanity check. That port would be the regular frontend rather than my proxy; sounds good.

~/tas/buc/exa/rem/engflow $ git diff
diff --git a/examples/remote_execution/engflow/.buckconfig b/examples/remote_execution/engflow/.buckconfig
index d8d0063e03..a7950f11c6 100644
--- a/examples/remote_execution/engflow/.buckconfig
+++ b/examples/remote_execution/engflow/.buckconfig
@@ -8,9 +8,9 @@ digest_algorithms = SHA256

 [buck2_re_client]
 # Adjust as needed
-action_cache_address = grpc://127.0.0.1:8991
-engine_address = grpc://127.0.0.1:8991
-cas_address = grpc://127.0.0.1:8991
+action_cache_address = http://localhost:8980
+engine_address = http://localhost:8980
+cas_address = http://localhost:8980

 [build]
   execution_platforms = root//platforms:platforms
~/tas/buc/exa/rem/engflow $ grpcurl -plaintext 127.0.0.1:8980 list
build.bazel.remote.execution.v2.ActionCache
build.bazel.remote.execution.v2.Capabilities
build.bazel.remote.execution.v2.ContentAddressableStorage
build.bazel.remote.execution.v2.Execution
google.bytestream.ByteStream
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection

But I get the same error.

WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 2 seconds: Error {
    context: "RE: creating client",
    source: Error {
        context: "Error creating CAS client",
        source: Error {
            context: "Error connecting",
            source: tonic::transport::Error(
                Transport,
                hyper::Error(
                    Connect,
                    Custom {
                        kind: InvalidData,
                        error: CorruptMessage,
                    },
                ),
            ),
        },
    },
}

I'll go and set this up with the unmodified deployment to reduce the surface area.

krallin commented 1 year ago

It might help to run:

buck2 debug set-log-filter hyper=debug,tonic=debug

And then see if you get more debug info out of them?

(bear in mind this does not persist across restarts)

krallin commented 1 year ago

But I get the same error.

I'm guessing that's after having restarted the daemon, correct?

stagnation commented 1 year ago

Yes :)

~/tas/buc/exa/rem/engflow buck2 kill; buck2 debug set-log-filter hyper=debug,tonic=debug; buck2 build //...
killing buckd server
Command failed: Failed to connect to buck daemon. Try running `buck2 clean` and your command afterwards. Alternatively, try running `rm -rf ~/.buck/buckd` and your command afterwards

Caused by:
    0: Error connecting to the daemon, daemon stderr follows:
       2023-04-13T10:45:20.169409Z  INFO daemon_listener: buck2::commands::daemon: Daemonized.
       2023-04-13T10:45:20.169471Z  INFO daemon_listener: buck2::commands::daemon: Starting tokio runtime...
       2023-04-13T10:45:20.170971Z  INFO daemon_listener: buck2::commands::daemon: Listening.
       2023-04-13T10:45:20.171070Z  INFO daemon_listener: buck2_server::daemon::state: Reading config...
       2023-04-13T10:45:20.171311Z  INFO daemon_listener: buck2_server::daemon::state: Starting...
       2023-04-13T10:45:20.173314Z  INFO daemon_listener: buck2_server::daemon::state: Daemon state is ready.
       2023-04-13T10:45:20.221033Z  WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 1 seconds: Error {
           context: "RE: creating client",
           source: Error {
               context: "Error creating CAS client",
               source: Error {
                   context: "Error connecting",
                   source: tonic::transport::Error(
                       Transport,
                       hyper::Error(
                           Connect,
                           Custom {
                               kind: InvalidData,
                               error: CorruptMessage,
                           },
                       ),
                   ),
               },
           },
       }
       2023-04-13T10:45:21.223567Z  WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 2 seconds: Error {
           context: "RE: creating client",
           source: Error {
               context: "Error creating CAS client",
               source: Error {
                   context: "Error connecting",
                   source: tonic::transport::Error(
                       Transport,
                       hyper::Error(
                           Connect,
                           Custom {
                               kind: InvalidData,
                               error: CorruptMessage,
                           },
                       ),
                   ),
               },
           },
       }

    1: establishing connection to Buck daemon or start a daemon
    2: connecting to existing buck daemon
    3: No existing connection and not asked to start one
    4: failed to connect to port 40359
    5: transport error
    6: error trying to connect: tcp connect error: Connection refused (os error 111)
    7: tcp connect error: Connection refused (os error 111)
    8: Connection refused (os error 111)

Thanks for this, it seems more tractable. Though I could read this as having two possible causes: a problem with the buck client-server connection (over port 40359), or that it attempts remote execution over that port instead of the expected 8980.

krallin commented 1 year ago

No no, that's not quite right; the issue here is just that you need to run the set-log-filter command once the daemon is running (not before).

Try this:

buck2 kill
buck2 server
buck2 debug set-log-filter hyper=debug,tonic=debug
buck2 build //...

krallin commented 1 year ago

FWIW, if you wanted to patch your client so it logs the address:

diff --git a/buck2/remote_execution/oss/re_grpc/src/client.rs b/fbcode/remote_execution/oss/re_grpc/src/client.rs
--- a/buck2/remote_execution/oss/re_grpc/src/client.rs
+++ b/buck2/remote_execution/oss/re_grpc/src/client.rs
@@ -156,11 +156,11 @@
             let address = substitute_env_vars(address).context("Invalid address")?;

             anyhow::Ok(
-                Channel::from_shared(address)?
+                Channel::from_shared(address.clone())?
                     .tls_config(tls_config.clone())?
                     .connect()
                     .await
-                    .context("Error connecting")?,
+                    .with_context(|| format!("Error connecting to `{}`", address))?,
             )
         };

(I'm currently landing that)

stagnation commented 1 year ago

Thank you, this has most of what we want; it is clear where it tries to connect.

~/tas/buc/exa/rem/engflow buck2 kill; and buck2 server; and buck2 debug set-log-filter hyper=debug,tonic=debug; and buck2 build //...
killing buckd server
buckd.endpoint=tcp:34007
...
2023-04-13T12:24:43.512747Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:43.512784Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:43.513601Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:43.513790Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:43.513898Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:43.514080Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:43.514147Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:51.516058Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:51.516139Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:51.516203Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:51.516233Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:51.516424Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:51.516465Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:51.516519Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:51.516718Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:51.516794Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
Action failed: root//tests:tests (stage0)
Internal error (stage: remote_action_cache): RE: creating client: Error creating CAS client: Error connecting: transport error: error trying to connect: received corrupt message: received corrupt message
stdout:
stderr:
Build ID: 8c5a62cb-b9a5-4412-95f2-9db4f89e95b9
Jobs completed: 10. Time elapsed: 45.0s.
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518278Z DEBUG hyper::client::connect::dns: resolving host="localhost"
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518321Z DEBUG hyper::client::connect::dns: resolving host="localhost"
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518346Z DEBUG hyper::client::connect::dns: resolving host="localhost"
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518365Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518521Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518550Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518580Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518646Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518684Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
BUILD FAILED
Failed to build 'root//tests:tests (<unspecified>)'

And your commit is great for the initial problem :)

debug-buck build //...
2023-04-13T12:26:49.309385Z  WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 1 seconds: Error {
    context: "RE: creating client",
    source: Error {
        context: "Error creating CAS client",
        source: Error {
            context: "Error connecting to `http://localhost:8980`",
            source: tonic::transport::Error(
                Transport,
                hyper::Error(
                    Connect,
                    Custom {
                        kind: InvalidData,
                        error: CorruptMessage,
                    },
                ),
            ),
        },
    },
}

Everything looks as it should. So I think it would be best to create a reproduction with the unmodified deployment. I'm working on that. Stay tuned.

stagnation commented 1 year ago

I get the same error when using the open-source docker-compose deployment, with no modifications to the configuration.

Start bb-deployments:

# start
cd bb-deployments/docker-compose
./run.sh

# In another terminal
bb-deployments/docker-compose $ docker-compose ps | grep 8980
docker-compose_frontend_1                         /app/cmd/bb_storage/bb_sto ...   Up       0.0.0.0:8980->8980/tcp,:::8980->8980/tcp, 9980/tcp
bb-deployments/docker-compose $ grpcurl -plaintext 127.0.0.1:8980 list
build.bazel.remote.execution.v2.ActionCache
build.bazel.remote.execution.v2.Capabilities
build.bazel.remote.execution.v2.ContentAddressableStorage
build.bazel.remote.execution.v2.Execution
google.bytestream.ByteStream
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection

With the code from master (https://github.com/buildbarn/bb-deployments) and a fix: https://github.com/buildbarn/bb-deployments/pull/88

commit 270ce2daa8a3f930ad0d54db6928cceef15a366f (HEAD, origin/feature/shellcheck-docker-compose-run)
Author: Nils Wireklint <nils@meroton.com>
Date:   Thu Apr 13 14:49:09 2023 +0200

    Correct the permissions for the storage dirs

    Shellcheck warns that mkdir -p combined with -m only sets the permission
    on the innermost directory. To set the permissions for all directories
    in the path a separate call to chmod is required.
    https://www.shellcheck.net/wiki/SC2174

commit d63c032b2b4d96f93cb889f95add15c26118d771 (origin/master, origin/HEAD)
Author: Fredrik Medley <fredrik@meroton.com>
Date:   Wed Feb 22 09:17:15 2023 +0100

    Add missing scheduler ingress in k8s deployment

Using this .buckconfig (and an unmodified toolchain):

[repositories]
root = .
prelude = prelude

[buck2]
# Adjust as needed.
digest_algorithms = SHA256

[buck2_re_client]
# Adjust as needed
action_cache_address = http://localhost:8980
engine_address = http://localhost:8980
cas_address = http://localhost:8980

[build]
  execution_platforms = root//platforms:platforms

Pretty vertical one-liner (finds the commit whose vendored remoteexecution.proto matches a given digest):

git log --oneline -20 \
    | choose 0 \
    | xargs -n1 sh -c '
        commit=$1
        echo -n $commit " "
        git show $commit:build/bazel/remote/execution/v2/remoteexecution.proto | sha256sum
      ' \
    | grep 1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62 \
    | choose 0 \
    | xargs -n1 git log -1

krallin commented 1 year ago

I don't really have a great answer for you right now; it may be related to TLS, but if you specify http:// I don't think we'd normally expect TLS to be used.

That said, you could go and try to remove .tls_config(tls_config.clone())? in buck2/remote_execution/oss/re_grpc/src/client.rs and see if that fixes it?

(if so we should probably add a flag for it but also figure out why tonic is trying to connect using TLS if the protocol is http://)
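For illustration, a minimal sketch of such a scheme check against the same tonic Endpoint API used in the patch above (hypothetical, not a landed fix; address and tls_config as in REClientBuilder::build_and_connect):

// Only apply TLS when the scheme asks for it; a plaintext http:// endpoint
// then never receives a ClientHello, whose plaintext rejection is likely
// what rustls surfaces as CorruptMessage.
let mut endpoint = Channel::from_shared(address.clone())?;
if address.starts_with("https://") {
    endpoint = endpoint.tls_config(tls_config.clone())?;
}
anyhow::Ok(
    endpoint
        .connect()
        .await
        .with_context(|| format!("Error connecting to `{}`", address))?,
)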

stagnation commented 1 year ago

Yep!

~/tas/buck2 $ git diff
diff --git a/remote_execution/oss/re_grpc/src/client.rs b/remote_execution/oss/re_grpc/src/client.rs
index a34b006d71..ab22b9d756 100644
--- a/remote_execution/oss/re_grpc/src/client.rs
+++ b/remote_execution/oss/re_grpc/src/client.rs
@@ -157,7 +157,6 @@ impl REClientBuilder {

             anyhow::Ok(
                 Channel::from_shared(address.clone())?
-                    .tls_config(tls_config.clone())?
                     .connect()
                     .await
                     .with_context(|| format!("Error connecting to `{}`", address))?,
~/tas/buck2 $ cargo build
   Compiling remote_execution v0.1.0 (/home/nils/task/buck2/remote_execution/oss/re_grpc)
warning: unused variable: `tls_config`
   --> remote_execution/oss/re_grpc/src/client.rs:152:13
    |
152 |         let tls_config = &tls_config;
    |             ^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_tls_config`
    |
    = note: `#[warn(unused_variables)]` on by default

   Compiling buck2_execute v0.1.0 (/home/nils/task/buck2/app/buck2_execute)
   ...

Then it works; we get an in-band error from the scheduler that there are no workers for the requested image:

debug-buck kill; debug-buck server; debug-buck build //...
killing buckd server
buckd.endpoint=tcp:46665
Action failed: root//tests:tests (stage1)
Internal error (stage: remote_call_error): Remote Execution Error (GRPC-SESSION-ID): RE: execution with digest c35fac36089b9ed90f33f250edb9bc3d301a5670b1474d3151861a6cd29dcab8:142: Failed to start remote execution: status: FailedPrecondition, message: "No workers exist for instance name prefix \"\" platform {\"properties\":[{\"name\":\"container-image\",\"value\":\"docker://ubuntu@sha256:b2175cd4cfdd5cdb1740b0e6ec6bbb4ea4892801c0ad5101a81f694152b6c559\"}]}", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }
stdout:
stderr:
Build ID: 0d566783-29ee-4f92-a683-bd28d958545e
RE: GRPC-SESSION-ID
Jobs completed: 11. Time elapsed: 0.0s. Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD FAILED
Failed to build 'root//tests:tests (<unspecified>)'

We may want to polish the formatting of the error, but it is a familiar error from BuildBarn. And I can work that out: just change the Docker image and set an instance name (a sketch follows below).
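A minimal sketch of the instance-name half, assuming a buck2 version whose [buck2_re_client] section accepts an instance_name key; the value fuse here is the one bb-deployments' own Bazel examples use, so adjust it to your frontend's configuration:

[buck2_re_client]
action_cache_address = http://localhost:8980
engine_address = http://localhost:8980
cas_address = http://localhost:8980
# Hypothetical: must match an instance name the BuildBarn frontend routes.
instance_name = fuse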

Also: I love the starlark-rust linter and LSP; here's my vim:

       # This could be any Docker image.
       # This one is Ubuntu 22.04 (i.e. its current LTS release): https://hub.docker.com/_/ubuntu
I      image = "docker://ubuntu@sha256:b2175cd4cfdd5cdb1740b0e6ec6bbb4ea4892801c0ad5101a81f694152b6c559"     ■ Unused assignment of `image`
┃      image = "docker://ghcr.io/catthehacker/ubuntu:act-22.04@sha256:5f9c35c25db1d51a8ddaae5c0ba8d3c163c5e9a4a6cc97acd409ac7eae239448"
krallin commented 1 year ago

Cool, I'll try to debug what's up with Tonic

avdv commented 4 months ago

So it must be that the remote execution configuration is read and cached by the server.

This implies that one cannot use --config-file (nor --flagfile; by the way, why the inconsistency between the option names, one with a dash and one without?) for RE options, since there is no way to give these options to the server when it is started, right?