acorn-io / runtime

A simple application deployment framework built on Kubernetes
https://docs.acorn.io/
Apache License 2.0
1.13k stars 100 forks source link

Error: failed to list workers: Unavailable: websocket: close sent #967

Closed angelbarrera92 closed 1 year ago

angelbarrera92 commented 1 year ago

Hi!

Im having the same issue as https://github.com/acorn-io/acorn/issues/551 but...

I've install acorn by running: acorn install --ingress-class-name nginx --cluster-domain .k.microcloud.dev

$ acorn check
NAME                  PASSED    MESSAGE
RBAC                  true      User can create namespaces
NodesReady            true      All nodes are ready
DefaultStorageClass   true      Found default storage class managed-nfs-storage
IngressCapability     true      Ingress is ready
Exec                  true      Successfully executed command in container replica
  ✔  Checks PASSED

Then i follow the getting started example,

$ acorn run -n awesome-acorn .
W1207 13:01:03.425074   79023 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "/run/buildkit/buildkitd.sock",
  "ServerName": "localhost",
  "Attributes": {},
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: websocket: close sent
Error: failed to list workers: Unavailable: websocket: close sent
Error: failed to list workers: Unavailable: connection closed before server preface received

debug logs:

I1207 13:02:19.040270   79340 loader.go:374] Config loaded from file:  /home/angel/personal/infra/k8s-config/kubeconfig.yaml
I1207 13:02:19.041368   79340 round_trippers.go:553]    in 0 milliseconds
I1207 13:02:19.062098   79340 round_trippers.go:553] GET https://192.168.113.1:6443/apis/api.acorn.io/v1/namespaces/acorn/apps/awesome-acorn 404 Not Found in 20 milliseconds
I1207 13:02:19.096168   79340 round_trippers.go:553] GET https://192.168.113.1:6443/apis/api.acorn.io/v1/namespaces/acorn/infos 200 OK in 33 milliseconds
I1207 13:02:19.159037   79340 loader.go:374] Config loaded from file:  /home/angel/personal/infra/k8s-config/kubeconfig.yaml
I1207 13:02:19.159996   79340 round_trippers.go:553]    in 0 milliseconds
I1207 13:02:19.253042   79340 round_trippers.go:553] GET https://192.168.113.1:6443/apis/api.acorn.io/v1/namespaces/acorn/builders 200 OK in 4 milliseconds
I1207 13:02:19.285763   79340 round_trippers.go:553] GET https://192.168.113.1:6443/apis/api.acorn.io/v1/namespaces/acorn-system/builders/buildkitd 200 OK in 32 milliseconds
I1207 13:02:19.286807   79340 round_trippers.go:553]    in 0 milliseconds
I1207 13:02:19.286940   79340 clientconn.go:178] "[core] [Channel #1] Channel created\n"
I1207 13:02:19.286977   79340 logging.go:43] "[core] [Channel #1] original dial target is: \"unix:///run/buildkit/buildkitd.sock\"\n"
I1207 13:02:19.287034   79340 logging.go:43] "[core] [Channel #1] parsed dial target is: {Scheme:unix Authority: Endpoint:run/buildkit/buildkitd.sock URL:{Scheme:unix Opaque: User: Host: Path:/run/buildkit/buildkitd.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}\n"
I1207 13:02:19.287059   79340 logging.go:43] "[core] [Channel #1] Channel authority set to \"localhost\"\n"
I1207 13:02:19.287263   79340 logging.go:43] "[core] [Channel #1] Resolver state updated: {\n  \"Addresses\": [\n    {\n      \"Addr\": \"/run/buildkit/buildkitd.sock\",\n      \"ServerName\": \"\",\n      \"Attributes\": {},\n      \"BalancerAttributes\": null,\n      \"Type\": 0,\n      \"Metadata\": null\n    }\n  ],\n  \"ServiceConfig\": null,\n  \"Attributes\": null\n} (resolver returned new addresses)\n"
I1207 13:02:19.287313   79340 logging.go:43] "[core] [Channel #1] Channel switches to new LB policy \"pick_first\"\n"
I1207 13:02:19.287367   79340 clientconn.go:725] "[core] [Channel #1 SubChannel #2] Subchannel created\n"
I1207 13:02:19.287563   79340 logging.go:43] "[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING\n"
I1207 13:02:19.287611   79340 logging.go:43] "[core] [Channel #1 SubChannel #2] Subchannel picks a new address \"/run/buildkit/buildkitd.sock\" to connect\n"
I1207 13:02:19.287659   79340 pickfirst.go:114] "[core] pickfirstBalancer: UpdateSubConnState: 0xc000c82090, {CONNECTING <nil>}\n"
I1207 13:02:19.287686   79340 logging.go:43] "[core] [Channel #1] Channel Connectivity change to CONNECTING\n"
I1207 13:02:19.287734   79340 picker_wrapper.go:161] "[core] blockingPicker: the picked transport is not ready, loop back to repick\n"
I1207 13:02:20.420832   79340 controlbuf.go:529] "[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"\n"
I1207 13:02:20.420908   79340 logging.go:43] "[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE\n"
I1207 13:02:20.420987   79340 pickfirst.go:114] "[core] pickfirstBalancer: UpdateSubConnState: 0xc000c82090, {TRANSIENT_FAILURE connection closed before server preface received}\n"
I1207 13:02:20.421035   79340 logging.go:43] "[core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE\n"
I1207 13:02:20.421460   79340 logging.go:43] "[core] [Channel #1] Channel Connectivity change to SHUTDOWN\n"
I1207 13:02:20.421520   79340 logging.go:43] "[core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN\n"
I1207 13:02:20.421567   79340 clientconn.go:1457] "[core] [Channel #1 SubChannel #2] Subchannel deleted\n"
I1207 13:02:20.421603   79340 clientconn.go:1056] "[core] [Channel #1] Channel deleted\n"
Error: failed to list workers: Unavailable: connection closed before server preface received

acorn build isn't working neither, same error.

angelbarrera92 commented 1 year ago

Im trying to troubleshoot the issue,

Im checking the buildkit pod is up and running along with the containerd-config ds:

$ kubectl get pods -n acorn-system -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP            NODE            NOMINATED NODE   READINESS GATES
acorn-controller-66778bbdb8-kltpx   1/1     Running   0          5m39s   10.244.3.14   kube-worker-3   <none>           <none>
acorn-api-6f866c5897-rcbj8          1/1     Running   0          5m39s   10.244.3.15   kube-worker-3   <none>           <none>
containerd-config-path-sxzgl        1/1     Running   0          4m7s    10.244.2.15   kube-worker-2   <none>           <none>
containerd-config-path-srvv7        1/1     Running   0          4m7s    10.244.1.35   kube-worker-1   <none>           <none>
containerd-config-path-j72bz        1/1     Running   0          4m7s    10.244.3.19   kube-worker-3   <none>           <none>
buildkitd-bb74c9fcc-54sqt           2/2     Running   0          4m7s    10.244.3.18   kube-worker-3   <none>           <none>

Then i've check the nodePort services

$ kubectl get svc -n acorn-system
NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
acorn-api   ClusterIP   10.111.215.156   <none>        7443/TCP         3m27s
registry    NodePort    10.110.198.37    <none>        5000:32516/TCP   115s
$ nc -zv 192.168.113.4 32516
Ncat: Version 7.93 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.113.4:32516.
Ncat: 0 bytes sent, 0 bytes received in 0.07 seconds.
cjellick commented 1 year ago

Thanks for the excellent write up. We're looking into it

cjellick commented 1 year ago

Could you post the logs to the buildkit pod?

ibuildthecloud commented 1 year ago

Not immediately helpful, but this entire data path is changing in 0.5.0. This is literally what I'm doing right now. We will no longer be talking to buildkit remote anymore. Right now we are running the grpc socket over a API subresource, piped to a pod port forward. All of that madness will go away (and be replaced with new madness that is surely to be better).

angelbarrera92 commented 1 year ago

Not immediately helpful, but this entire data path is changing in 0.5.0. This is literally what I'm doing right now. We will no longer be talking to buildkit remote anymore. Right now we are running the grpc socket over a API subresource, piped to a pod port forward. All of that madness will go away (and be replaced with new madness that is surely to be better).

I'm really looking forward to see how is going to be implemented the build process. This is a topic I'm always interested in.

Could you post the logs to the buildkit pod?

Absolutely, forgot to paste them before, good point:

$ kubectl logs -f buildkitd-bb74c9fcc-54sqt -n acorn-system -c buildkitd
+ mount -t binfmt_misc none /proc/sys/fs/binfmt_misc
+ basename /usr/bin/qemu-aarch64
+ i=qemu-aarch64
+ '[' -e /proc/sys/fs/binfmt_misc/qemu-aarch64 ]
+ grep -q POCF /proc/sys/fs/binfmt_misc/qemu-aarch64
+ umount /proc/sys/fs/binfmt_misc
+ '[' -e /usr/bin/qemu-x86_64 ]
+ '[' -e /usr/bin/qemu-aarch64 ]
+ QEMU_PRESERVE_ARGV0=1 binfmt -install arm64
installing: arm64 qemu-aarch64 already registered
{
  "supported": [
    "linux/amd64",
    "linux/arm64",
    "linux/386"
  ],
  "emulators": [
    "qemu-aarch64"
  ]
}
+ exec /usr/local/bin/buildkitd --debug --addr unix:///run/buildkit/buildkitd.sock --addr tcp://0.0.0.0:8080
time="2022-12-07T12:15:13Z" level=info msg="auto snapshotter: using overlayfs"
time="2022-12-07T12:15:13Z" level=warning msg="using host network as the default"
time="2022-12-07T12:15:13Z" level=info msg="found worker \"poiqkbe0whz2trgf9i931ryod\", labels=map[org.mobyproject.buildkit.worker.executor:oci org.mobyproject.buildkit.worker.hostname:buildkitd-bb74c9fcc-54sqt org.mobyproject.buildkit.worker.network:host org.mobyproject.buildkit.worker.oci.process-mode:sandbox org.mobyproject.buildkit.worker.selinux.enabled:false org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/amd64/v2 linux/amd64/v3 linux/arm64 linux/386]"
time="2022-12-07T12:15:13Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
time="2022-12-07T12:15:13Z" level=info msg="found 1 workers, default=\"poiqkbe0whz2trgf9i931ryod\""
time="2022-12-07T12:15:13Z" level=warning msg="currently, only the default worker can be used."
time="2022-12-07T12:15:14Z" level=warning msg="TLS is not enabled for tcp://0.0.0.0:8080. enabling mutual TLS authentication is highly recommended"
time="2022-12-07T12:15:14Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
time="2022-12-07T12:15:14Z" level=info msg="running server on [::]:8080"
cjellick commented 1 year ago

Based on this comment: https://github.com/acorn-io/acorn/issues/967#issuecomment-1341224876

I'm going to kick this to QA to see if we can repro

sangee2004 commented 1 year ago

@angelbarrera92 This issue should have been addressed with the latest acorn release. Could you please confirm that you dont see this problem anymore when testing with latest acorn release - https://github.com/acorn-io/acorn/releases/tag/v0.6.0

sangee2004 commented 1 year ago

Closing this issue. @angelbarrera92 please feel free to reopen this issue if you still encounter this issue.