I expect buck2 build //... to pick up changes. But restarting the server works fine: buck2 kill; buck2 build //....
Yeah, that's expected. I think it could be better documented, but while your expectation is understandable, I don't think that's something we can realistically support at this time.
Now I have a very minimal config that is used, but it does not work, failing with CorruptMessage errors.
I think the port might be http://localhost:8980, try that? (I don't quite recall why, but it seems to be what I've got in a local config I used for testing with Buildbarn.)
I would like the error messages to echo the gRPC address (at least at high verbosity).
Sure, that seems rather easy to do.
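For illustration, a minimal sketch of the idea, assuming the tonic Channel API used by the OSS RE client (the helper name connect_logged is hypothetical; the actual patch appears further down in this thread):
use anyhow::Context;
use tonic::transport::Channel;
// Hypothetical helper: wrap the connect error so the failing address
// becomes part of the error chain that buck2 reports.
async fn connect_logged(address: String) -> anyhow::Result<Channel> {
    Channel::from_shared(address.clone())?
        .connect()
        .await
        .with_context(|| format!("Error connecting to `{}`", address))
}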
Thanks, I'm happy for the sanity check. That port would be the regular frontend rather than my proxy, which sounds good.
~/tas/buc/exa/rem/engflow $ git diff
diff --git a/examples/remote_execution/engflow/.buckconfig b/examples/remote_execution/engflow/.buckconfig
index d8d0063e03..a7950f11c6 100644
--- a/examples/remote_execution/engflow/.buckconfig
+++ b/examples/remote_execution/engflow/.buckconfig
@@ -8,9 +8,9 @@ digest_algorithms = SHA256
[buck2_re_client]
# Adjust as needed
-action_cache_address = grpc://127.0.0.1:8991
-engine_address = grpc://127.0.0.1:8991
-cas_address = grpc://127.0.0.1:8991
+action_cache_address = http://localhost:8980
+engine_address = http://localhost:8980
+cas_address = http://localhost:8980
[build]
execution_platforms = root//platforms:platforms
~/tas/buc/exa/rem/engflow $ grpcurl -plaintext 127.0.0.1:8980 list
build.bazel.remote.execution.v2.ActionCache
build.bazel.remote.execution.v2.Capabilities
build.bazel.remote.execution.v2.ContentAddressableStorage
build.bazel.remote.execution.v2.Execution
google.bytestream.ByteStream
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
But I get the same error.
WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 2 seconds: Error {
context: "RE: creating client",
source: Error {
context: "Error creating CAS client",
source: Error {
context: "Error connecting",
source: tonic::transport::Error(
Transport,
hyper::Error(
Connect,
Custom {
kind: InvalidData,
error: CorruptMessage,
},
),
),
},
},
}
I'll go and set this up with the unmodified deployment to reduce the surface area.
It might help to run:
buck2 debug set-log-filter hyper=debug,tonic=debug
And then see if you get more debug info out of them?
(bear in mind this does not persist across restarts)
But I get the same error.
I'm guessing that's after having restarted the daemon, correct?
Yes :)
~/tas/buc/exa/rem/engflow buck2 kill; buck2 debug set-log-filter hyper=debug,tonic=debug; buck2 build //...
killing buckd server
Command failed: Failed to connect to buck daemon. Try running `buck2 clean` and your command afterwards. Alternatively, try running `rm -rf ~/.buck/buckd` and your command afterwards
Caused by:
0: Error connecting to the daemon, daemon stderr follows:
2023-04-13T10:45:20.169409Z INFO daemon_listener: buck2::commands::daemon: Daemonized.
2023-04-13T10:45:20.169471Z INFO daemon_listener: buck2::commands::daemon: Starting tokio runtime...
2023-04-13T10:45:20.170971Z INFO daemon_listener: buck2::commands::daemon: Listening.
2023-04-13T10:45:20.171070Z INFO daemon_listener: buck2_server::daemon::state: Reading config...
2023-04-13T10:45:20.171311Z INFO daemon_listener: buck2_server::daemon::state: Starting...
2023-04-13T10:45:20.173314Z INFO daemon_listener: buck2_server::daemon::state: Daemon state is ready.
2023-04-13T10:45:20.221033Z WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 1 seconds: Error {
context: "RE: creating client",
source: Error {
context: "Error creating CAS client",
source: Error {
context: "Error connecting",
source: tonic::transport::Error(
Transport,
hyper::Error(
Connect,
Custom {
kind: InvalidData,
error: CorruptMessage,
},
),
),
},
},
}
2023-04-13T10:45:21.223567Z WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 2 seconds: Error {
context: "RE: creating client",
source: Error {
context: "Error creating CAS client",
source: Error {
context: "Error connecting",
source: tonic::transport::Error(
Transport,
hyper::Error(
Connect,
Custom {
kind: InvalidData,
error: CorruptMessage,
},
),
),
},
},
}
1: establishing connection to Buck daemon or start a daemon
2: connecting to existing buck daemon
3: No existing connection and not asked to start one
4: failed to connect to port 40359
5: transport error
6: error trying to connect: tcp connect error: Connection refused (os error 111)
7: tcp connect error: Connection refused (os error 111)
8: Connection refused (os error 111)
Thanks for this, it seems more tractable. Though I could read this as having two possible causes: a problem with the buck client-server connection (over port 40359), or that it attempts to do remote execution over that port instead of the expected 8980.
No no, that's not quite right; the issue here is just that you need to run set-log-filter once the daemon is running (not before).
Try this:
buck2 kill
buck2 server
buck2 debug set-log-filter hyper=debug,tonic=debug
buck2 build //...
FWIW, if you wanted to patch your client so it logs the address:
diff --git a/buck2/remote_execution/oss/re_grpc/src/client.rs b/fbcode/remote_execution/oss/re_grpc/src/client.rs
--- a/buck2/remote_execution/oss/re_grpc/src/client.rs
+++ b/buck2/remote_execution/oss/re_grpc/src/client.rs
@@ -156,11 +156,11 @@
let address = substitute_env_vars(address).context("Invalid address")?;
anyhow::Ok(
- Channel::from_shared(address)?
+ Channel::from_shared(address.clone())?
.tls_config(tls_config.clone())?
.connect()
.await
- .context("Error connecting")?,
+ .with_context(|| format!("Error connecting to `{}`", address))?,
)
};
(I'm currently landing that)
Thank you, this has most of what we want; it is now clear where it tries to connect.
~/tas/buc/exa/rem/engflow buck2 kill; and buck2 server; and buck2 debug set-log-filter hyper=debug,tonic=debug; and buck2 build //...
killing buckd server
buckd.endpoint=tcp:34007
...
2023-04-13T12:24:43.512747Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:43.512784Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:43.513601Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:43.513790Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:43.513898Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:43.514080Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:43.514147Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:51.516058Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:51.516139Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:51.516203Z DEBUG hyper::client::connect::dns: resolving host="localhost"
2023-04-13T12:24:51.516233Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:51.516424Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:51.516465Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
2023-04-13T12:24:51.516519Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:51.516718Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
2023-04-13T12:24:51.516794Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
Action failed: root//tests:tests (stage0)
Internal error (stage: remote_action_cache): RE: creating client: Error creating CAS client: Error connecting: transport error: error trying to connect: received corrupt message: received corrupt message
stdout:
stderr:
Build ID: 8c5a62cb-b9a5-4412-95f2-9db4f89e95b9
Jobs completed: 10. Time elapsed: 45.0s.
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518278Z DEBUG hyper::client::connect::dns: resolving host="localhost"
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518321Z DEBUG hyper::client::connect::dns: resolving host="localhost"
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518346Z DEBUG hyper::client::connect::dns: resolving host="localhost"
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518365Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518521Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518550Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518580Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518646Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
[2023-04-13T14:25:00.520+02:00] 2023-04-13T12:25:00.518684Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:8980
BUILD FAILED
Failed to build 'root//tests:tests (<unspecified>)'
And your commit is great for the initial problem :)
debug-buck build //...
2023-04-13T12:26:49.309385Z WARN buck2_execute::re::client: Failed to connect to RE, retrying after sleeping 1 seconds: Error {
context: "RE: creating client",
source: Error {
context: "Error creating CAS client",
source: Error {
context: "Error connecting to `http://localhost:8980`",
source: tonic::transport::Error(
Transport,
hyper::Error(
Connect,
Custom {
kind: InvalidData,
error: CorruptMessage,
},
),
),
},
},
}
Everything looks as it should. So I think it would be best to create a reproduction with the unmodified deployment. I'm working on that. Stay tuned.
I get the same error when using the open source docker-compose deployment, with no modifications to the configuration.
Start bb-deployments:
# start
cd bb-deployments/docker-compose
./run.sh
# In another terminal
bb-deployments/docker-compose $ docker-compose ps | grep 8980
docker-compose_frontend_1 /app/cmd/bb_storage/bb_sto ... Up 0.0.0.0:8980->8980/tcp,:::8980->8980/tcp, 9980/tcp
bb-deployments/docker-compose $ grpcurl -plaintext 127.0.0.1:8980 list
build.bazel.remote.execution.v2.ActionCache
build.bazel.remote.execution.v2.Capabilities
build.bazel.remote.execution.v2.ContentAddressableStorage
build.bazel.remote.execution.v2.Execution
google.bytestream.ByteStream
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
With the code from master and a fix (https://github.com/buildbarn/bb-deployments, https://github.com/buildbarn/bb-deployments/pull/88):
commit 270ce2daa8a3f930ad0d54db6928cceef15a366f (HEAD, origin/feature/shellcheck-docker-compose-run)
Author: Nils Wireklint <nils@meroton.com>
Date: Thu Apr 13 14:49:09 2023 +0200
Correct the permissions for the storage dirs
Shellcheck warns that mkdir -p combined with -m only sets the permission
on the innermost directory. To set the permissions for all directories
in the path a separate call to chmod is required.
https://www.shellcheck.net/wiki/SC2174
commit d63c032b2b4d96f93cb889f95add15c26118d771 (origin/master, origin/HEAD)
Author: Fredrik Medley <fredrik@meroton.com>
Date: Wed Feb 22 09:17:15 2023 +0100
Add missing scheduler ingress in k8s deployment
Using this .buckconfig (and an unmodified toolchain):
[repositories]
root = .
prelude = prelude
[buck2]
# Adjust as needed.
digest_algorithms = SHA256
[buck2_re_client]
# Adjust as needed
action_cache_address = http://localhost:8980
engine_address = http://localhost:8980
cas_address = http://localhost:8980
[build]
execution_platforms = root//platforms:platforms
Do you think this is related to TLS? We do not have grpcs set up, and do not provide a cert to buck2.
I also checked that bb-deployments uses a comparable version of the Remote Execution API. The protos should be backward compatible, but the checked-in copy is from November and there have been some changes since then. bb-deployments uses the API version from a commit range spanning Oct 21 - Nov 9.
How to find the proto versions:
bb-deployments $ cd $(bazel info output_base)/external
.../external $ find . -name '*remote_apis*'
./com_github_bazelbuild_remote_apis
./com_github_bazelbuild_remote_apis/remote_apis_deps.bzl
./com_github_buildbarn_bb_storage/patches/com_github_bazelbuild_remote_apis
./@com_github_bazelbuild_remote_apis.marker
./bazel_gazelle_go_repository_cache/pkg/mod/github.com/buildbarn/bb-storage@v0.0.0-20221221163740-08ce6e432da0/patches/com_github_bazelbuild_remote_apis
./bazel_gazelle_go_repository_cache/pkg/mod/github.com/buildbarn/bb-storage@v0.0.0-20220912104841-b7137ecb5613/patches/com_github_bazelbuild_remote_apis
./bazel_gazelle_go_repository_cache/pkg/mod/github.com/bazelbuild/remote-apis@v0.0.0-20220727181745-aa29b91f336b/remote_apis_deps.bzl
./bazel_gazelle_go_repository_cache/pkg/mod/github.com/bazelbuild/remote-apis@v0.0.0-20221109204407-3a21deee813d/remote_apis_deps.bzl
.../external $ cd com_github_bazelbuild_remote_apis/
.../external/com_github_bazelbuild_remote_apis $ sha256sum build/bazel/remote/execution/v2/remote_execution.proto
1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62
# all versions in the last 20 commits
remote-apis $ git log --oneline -20 | choose 0 | xargs -n1 sh -c 'commit=$1; echo -n $commit " "; git show $commit:build/bazel/remote/execution/v2/remote_execution.proto | sha256sum' _
...
# matching the content
remote-apis $ git log --oneline -20 | choose 0 | xargs -n1 sh -c 'commit=$1; echo -n $commit " "; git show $commit:build/bazel/remote/execution/v2/remote_execution.proto | sha256sum' _ | grep 1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62
3a21dee 1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62 -
7d1354e 1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62 -
f608918 1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62 -
# show the full commit for each matching proto version
git log --oneline -20 \
  | choose 0 \
  | xargs -n1 sh -c 'commit=$1; echo -n $commit " "; git show $commit:build/bazel/remote/execution/v2/remote_execution.proto | sha256sum' _ \
  | grep 1e410bef48eecad2126efbde7e898661aca999064120eb2d5ae2fc58ae8e3d62 \
  | choose 0 \
  | xargs -n1 git log -1
I don't really have a great answer for you right now. It may be related to TLS, but if you specify http:// I don't think we'd normally expect TLS to be used.
That said, you could go and try to remove .tls_config(tls_config.clone())? in buck2/remote_execution/oss/re_grpc/src/client.rs and see if that fixes it?
(If so, we should probably add a flag for it, but also figure out why tonic is trying to connect using TLS when the protocol is http://.)
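For illustration, a minimal sketch of a scheme-conditional variant, assuming the tonic Channel/Endpoint API used in client.rs with tonic's tls feature enabled (the helper name connect_channel and the scheme check are assumptions, not the change that was landed):
use anyhow::Context;
use tonic::transport::{Channel, ClientTlsConfig};
// Hypothetical sketch: only attach the TLS config when the address scheme
// asks for TLS, so plain http:// endpoints connect without a TLS handshake.
async fn connect_channel(address: String, tls_config: &ClientTlsConfig) -> anyhow::Result<Channel> {
    let endpoint = Channel::from_shared(address.clone())?;
    let endpoint = if address.starts_with("https://") {
        endpoint.tls_config(tls_config.clone())?
    } else {
        endpoint
    };
    endpoint
        .connect()
        .await
        .with_context(|| format!("Error connecting to `{}`", address))
}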
Yep!
~/tas/buck2 $ git diff
diff --git a/remote_execution/oss/re_grpc/src/client.rs b/remote_execution/oss/re_grpc/src/client.rs
index a34b006d71..ab22b9d756 100644
--- a/remote_execution/oss/re_grpc/src/client.rs
+++ b/remote_execution/oss/re_grpc/src/client.rs
@@ -157,7 +157,6 @@ impl REClientBuilder {
anyhow::Ok(
Channel::from_shared(address.clone())?
- .tls_config(tls_config.clone())?
.connect()
.await
.with_context(|| format!("Error connecting to `{}`", address))?,
~/tas/buck2 $ cargo build
Compiling remote_execution v0.1.0 (/home/nils/task/buck2/remote_execution/oss/re_grpc)
warning: unused variable: `tls_config`
--> remote_execution/oss/re_grpc/src/client.rs:152:13
|
152 | let tls_config = &tls_config;
| ^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_tls_config`
|
= note: `#[warn(unused_variables)]` on by default
Compiling buck2_execute v0.1.0 (/home/nils/task/buck2/app/buck2_execute)
...
Then it works: we get an in-band error from the scheduler that there are no workers for the requested image.
debug-buck kill; debug-buck server; debug-buck build //...
killing buckd server
buckd.endpoint=tcp:46665
Action failed: root//tests:tests (stage1)
Internal error (stage: remote_call_error): Remote Execution Error (GRPC-SESSION-ID): RE: execution with digest c35fac36089b9ed90f33f250edb9bc3d301a5670b1474d3151861a6cd29dcab8:142: Failed to start remote execution: status: FailedPrecondition, message: "No workers exist for instance name prefix \"\" platform {\"properties\":[{\"name\":\"container-image\",\"value\":\"docker://ubuntu@sha256:b2175cd4cfdd5cdb1740b0e6ec6bbb4ea4892801c0ad5101a81f694152b6c559\"}]}", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }
stdout:
stderr:
Build ID: 0d566783-29ee-4f92-a683-bd28d958545e
RE: GRPC-SESSION-ID
Jobs completed: 11. Time elapsed: 0.0s. Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD FAILED
Failed to build 'root//tests:tests (<unspecified>)'
We may want to polish the formatting of the error, but it is a familiar error from BuildBarn. And I can work that out myself: just change the docker image and set an instance name.
Also: I love the starlark-rust linter and LSP; here's my vim:
# This could be any Docker image.
# This one is Ubuntu 22.04 (i.e. its current LTS release): https://hub.docker.com/_/ubuntu
I image = "docker://ubuntu@sha256:b2175cd4cfdd5cdb1740b0e6ec6bbb4ea4892801c0ad5101a81f694152b6c559" ■ Unused assignment of `image`
┃ image = "docker://ghcr.io/catthehacker/ubuntu:act-22.04@sha256:5f9c35c25db1d51a8ddaae5c0ba8d3c163c5e9a4a6cc97acd409ac7eae239448"
Cool, I'll try to debug what's up with Tonic
So it must be that the remote execution configuration is read and cached by the server. This implies that one cannot use --config-file (nor --flagfile; by the way, why the inconsistency between the options, where one name uses a dash and the other does not?) for RE options, since there is no way to give these options to the server when it is started, right?
Hi!
I was tinkering a little bit with the engflow example and a local BuildBarn deployment (https://github.com/buildbarn/bb-deployments)*. At first I just got errors about the URL missing a scheme, the same with every modification to the config. But building the same project with a debug build from source I got a new error, so it must be that the remote execution configuration is read and cached by the server.
I expect buck2 build //... to pick up changes. But restarting the server works fine: buck2 kill; buck2 server; buck2 build //....
[EDIT]: https://github.com/facebook/buck2/issues/156#issuecomment-1506769115: it is better to use an explicit server call in the pipeline, so other subcommands can work with the server. This composes better than just kill and build.
Forgive the nested questions; if you wish I can open a new issue.
Now I have a very minimal config that is used, but it does not work, failing with CorruptMessage errors. The error:
gRPC service listing:
I can use this BuildBarn deployment with Bazel, so I think this is some confusion about the configuration. I do not want TLS here and did not configure any certs; could that be related to the corruption?
I'm not familiar with debugging Tokio code, and did not manage to hit any breakpoints in REClientBuilder::build_and_connect with rr or lldb, so I do not know much more about the problem. I would like the error messages to echo the gRPC address (at least at high verbosity); that would make the initial problem of reloading the config easier to understand.
Versions:
* Not exactly the same configuration, but if you want I can spend the time to reproduce this with the open source code.