bazelbuild / bazel-buildfarm

Bazel remote caching and execution service
https://bazel.build
Apache License 2.0

Remote builds stuck on Buildfarm that is deployed with Helm. #1765

Closed monaka closed 2 weeks ago

monaka commented 3 weeks ago

I deployed Buildfarm with ArgoCD and the official Helm chart.

bazel-buildfarm.app.yaml
```
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bazel-buildfarm
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: bazel-buildfarm
    server: in-cluster
  source:
    helm:
      releaseName: bazel-buildfarm
      parameters:
        - name: config.worker.limitGlobalExecution
          value: 'true'
        - name: config.worker.sandboxSettings.alwaysUseSandbox
          value: 'false'
        - name: config.worker.sandboxSettings.alwaysUseCgroups
          value: 'false'
        - name: redis.master.resources.limits.cpu
          value: 1000m
        - name: redis.master.resources.limits.memory
          value: 2Gi
        - name: redis.master.resources.requests.cpu
          value: 1000m
        - name: redis.master.resources.requests.memory
          value: 2Gi
        - name: redis.replica.resources.limits.cpu
          value: 1000m
        - name: redis.replica.resources.limits.memory
          value: 2Gi
        - name: redis.replica.resources.requests.cpu
          value: 1000m
        - name: redis.replica.resources.requests.memory
          value: 2Gi
        - name: server.image.tag
          value: '2.10.2'
        - name: server.resources.limits.cpu
          value: 2000m
        - name: server.resources.limits.memory
          value: 8Gi
        - name: server.resources.requests.cpu
          value: 2000m
        - name: server.resources.requests.memory
          value: 8Gi
        - name: server.serviceMonitor.enabled
          value: 'true'
        - name: shardWorker.image.tag
          value: '2.10.2'
        - name: shardWorker.resources.limits.cpu
          value: 16000m
        - name: shardWorker.resources.limits.memory
          value: 32Gi
        - name: shardWorker.resources.requests.cpu
          value: 16000m
        - name: shardWorker.resources.requests.memory
          value: 32Gi
        - name: shardWorker.serviceMonitor.enabled
          value: 'true'
        - name: shardWorker.tolerations[0].key
          value: service
        - name: shardWorker.tolerations[0].operator
          value: Equal
        - name: shardWorker.tolerations[0].value
          value: build_android_os
        - name: shardWorker.tolerations[0].effect
          value: NoSchedule
    path: kubernetes/helm-charts/buildfarm
    repoURL: 'https://github.com/bazelbuild/bazel-buildfarm'
    targetRevision: HEAD
  project: build-team
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true
```

After Pods are stable, I tried a remote build from another Pod like this.

bazel run --remote_executor=grpc://bazel-buildfarm-server.bazel-buildfarm.svc:8980 :main
main.cc
```
#include <iostream>

int main( int argc, char *argv[] ) {
  std::cout << "Hello, World!2" << std::endl;
}
```
BUILD
```
cc_binary(
    name = "main",
    srcs = ["main.cc"],
)
```

(And an empty WORKSPACE file.)

The build should finish in a few seconds, but it never completed. (In the log below, I stopped it with Ctrl-C.)

$ ./bazel run --remote_executor=grpc://bazel-buildfarm-server.bazel-buildfarm.svc:8980 :main
INFO: Invocation ID: 92fe990e-5ebc-44cb-ae33-218fd3ae05bb
INFO: Analyzed target //:main (0 packages loaded, 0 targets configured).
[1 / 4] Compiling main.cc; 212s remote, remote-cache
^C
Bazel caught interrupt signal; cancelling pending invocation.
Target //:main failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: build interrupted
INFO: Elapsed time: 273.143s, Critical Path: 273.00s
INFO: 2 processes: 2 internal.
ERROR: Build did NOT complete successfully
ERROR: Build failed. Not running target

I believe these steps are almost the same as https://bazelbuild.github.io/bazel-buildfarm/docs/quick_start/. Should I add more settings to the Helm chart?

monaka commented 3 weeks ago

I also tried remote caching (CAS) in the same environment. CAS seems to work well.

monaka commented 3 weeks ago

This issue might be somewhat related to https://github.com/bazelbuild/bazel-buildfarm/issues/1749

werkt commented 2 weeks ago

I just tried the helm install with minikube, using our recommended port forwarding specification, and had no trouble building your example program.

Using bazel-buildfarm-server.bazel-buildfarm.svc as the name to contact the running service implies that your client bazel environment has some association with the cluster. Are you sure that you can contact it at all on the indicated port (see bf-cat) for Capabilities, the most basic communication? That is after verifying that bazel-buildfarm-server.bazel-buildfarm.svc resolves, of course.
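Before reaching for a gRPC-level check such as Capabilities, a plain TCP probe can rule out the most basic failure. The following is only a minimal sketch: `can_connect` is a hypothetical helper name, and a successful TCP connect shows that the port accepts connections, not that the Capabilities call itself would succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it does not speak gRPC.
    """
    try:
        # create_connection resolves the name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

Run from the client Pod, `can_connect("bazel-buildfarm-server.bazel-buildfarm.svc", 8980)` would be the call matching the host and port used in this thread.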

monaka commented 2 weeks ago

bazel-buildfarm-server.bazel-buildfarm.svc resolves.
```
% kubectl get svc -n bazel-buildfarm
NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
bazel-buildfarm-redis-headless   ClusterIP   None             <none>        6379/TCP            112d
bazel-buildfarm-redis-master     ClusterIP   10.100.239.216   <none>        6379/TCP            112d
bazel-buildfarm-redis-replicas   ClusterIP   10.100.51.9      <none>        6379/TCP            112d
bazel-buildfarm-server           ClusterIP   10.100.76.80     <none>        8980/TCP,9090/TCP   112d
bazel-buildfarm-shard-worker     ClusterIP   10.100.6.122     <none>        8982/TCP,9090/TCP   112d
% kubectl exec -it aaos-app-7f8479b66d-qwr9b -- nslookup bazel-buildfarm-server.bazel-buildfarm.svc
Server:   10.100.0.10
Address:  10.100.0.10#53

Name:     bazel-buildfarm-server.bazel-buildfarm.svc.cluster.local
Address:  10.100.76.80
```

And as I wrote above, CAS seems to work fine. Only remote execution is not working.

$ bazel clean
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.

$ bazel run --remote_cache=grpc://bazel-buildfarm-server.bazel-buildfarm.svc:8980 :main
INFO: Invocation ID: 4e7c1beb-3f46-4032-b6bb-f0061b2e42bd
INFO: Analyzed target //:main (83 packages loaded, 382 targets configured).
INFO: Found 1 target...
Target //:main up-to-date:
  bazel-bin/main
INFO: Elapsed time: 0.477s, Critical Path: 0.06s
INFO: 7 processes: 2 remote cache hit, 5 internal.
INFO: Build completed successfully, 7 total actions
INFO: Running command line: bazel-bin/main
Hello, World!2

INFO: 7 processes: 2 remote cache hit, 5 internal.
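The process summary line is a quick way to compare runs: the cache run above reports remote cache hits, while the stuck remote-execution run reported only internal processes. Here is a minimal sketch for pulling those counts out of a log line, assuming the summary format shown in this thread; `parse_process_summary` is a hypothetical helper, and other Bazel versions may format the line differently.

```python
import re


def parse_process_summary(line: str) -> dict:
    """Parse a Bazel 'INFO: N processes: ...' line into {kind: count}.

    Hypothetical helper; assumes the format seen in the logs above.
    Returns an empty dict if the line is not a process summary.
    """
    match = re.search(r"(\d+) process(?:es)?: (.*?)\.?\s*$", line)
    if not match:
        return {}
    counts = {"total": int(match.group(1))}
    for part in match.group(2).split(","):
        # Each part looks like "2 remote cache hit" or "5 internal".
        number, _, kind = part.strip().partition(" ")
        if number.isdigit() and kind:
            counts[kind] = int(number)
    return counts
```

For the two summary lines in this thread, the sketch yields a `remote cache hit` entry for the cache run and only an `internal` entry for the interrupted remote-execution run.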

monaka commented 2 weeks ago

Hmm... possibly it depends on the configuration of the K8s cluster? My Buildfarm is on AWS EKS.

I'll try this on my Azure AKS and Minikube.

monaka commented 2 weeks ago

Additional info:

I have two EKS clusters with Buildfarm installed. On one of them RBE doesn't work, as I reported here. But ... the other works well.

Even though I can't find the difference between the two clusters, I agree that Buildfarm works well on EKS.

I'll close this for now until I have more information, and I'll keep investigating the details.
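When two clusters behave differently with nominally the same deployment, one way to hunt for the difference is to export each cluster's effective settings as nested mappings (for example, parsed from JSON) and diff them key by key. This is only a sketch under that assumption: `flatten` and `diff_values` are hypothetical helpers, and nothing here is specific to Buildfarm or to its Helm chart.

```python
def flatten(values: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_values(left: dict, right: dict) -> dict:
    """Return {dotted_key: (left_value, right_value)} for keys that differ.

    A key missing on one side is reported with None on that side, so an
    explicit None and a missing key are treated as equal.
    """
    a, b = flatten(left), flatten(right)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

Calling `diff_values(working_cluster, broken_cluster)` on two such mappings returns only the settings that differ, which narrows down where to look.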