Tencent / caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs
Other

Errors after deploying lighthouse and lighthouse-plugin #25

Closed GeorgeSen closed 2 years ago

GeorgeSen commented 2 years ago

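Both lighthouse and lighthouse-plugin are deployed and the relevant kubelet parameters have been changed, but startup still fails.

kubelet fails immediately, reporting that it cannot get the docker version.

The lighthouse process also reports errors: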

I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio

GeorgeSen commented 2 years ago

The lighthouse and lighthouse-plugin process configuration matches the community code. For lighthouse-plugin I added two options: --hostname-override= --kubeconfig=

GeorgeSen commented 2 years ago

lighthouse configuration:

/etc/lighthouse/config:

ARGS="--config=/etc/lighthouse/config.yaml --logtostderr=false --v=10 --log-dir=/data0/gelanjie/caelus_log/lighthouse"

/etc/lighthouse/config.yaml :

apiVersion: componentconfig.lighthouse.io/v1alpha1
kind: HookConfiguration
timeout: 10
listenAddress: unix:///var/run/lighthouse.sock
webhooks:

GeorgeSen commented 2 years ago

plugin-server:

/etc/plugin-server/config:

ARGS="--feature-gates=AllAlpha=true --logtostderr=false --hostname-override=sbd2-lgna2-a10-bcc-2 --v=10 --log-dir=/data0/gelanjie/caelus_log/plugin-server --listen-address=unix://@plugin-server --kubeconfig=/root/.kube/config"

GeorgeSen commented 2 years ago

kubelet error:

Connecting to docker on unix:///var/run/lighthouse.sock
Start docker client with request timeout=2m0s
failed to run Kubelet: failed to create kubelet: failed to get docker version: request returned Bad Gateway for API route and

ddongchen commented 2 years ago

Change the lighthouse configuration to:

apiVersion: lighthouse.io/v1alpha1
kind: hookConfiguration
timeout: 10
listenAddress: unix:///var/run/lighthouse.sock
webhooks:

Could you test with that?

mYmNeo commented 2 years ago

Please provide the output of docker info.

GeorgeSen commented 2 years ago

docker info:

Containers: 34
Running: 17
Paused: 0
Stopped: 17
Images: 17
Server Version: 18.09.6
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30-dirty
init version: fec3683
Security Options: seccomp
Profile: default
Kernel Version: 4.19.95-21
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 236
Total Memory: 957.9GiB
Name: sbd2-lgna2-a10-bcc-2
ID: MSMC:CI72:UZC2:332D:O5SZ:MIJR:4DWF:5ITS:DDP7:EJ3A:PYDK:NNDY
Docker Root Dir: /data0/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:

Live Restore Enabled: true
Product License: Community Engine

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

mYmNeo commented 2 years ago

Is your docker daemon listening on /var/run/docker.sock? If not, you need to configure RemoteEndpoint as described in the documentation: https://github.com/Tencent/caelus/blob/a57ba749dcd5cd92df622eee764b4201c91ef033/contrib/lighthouse/pkg/apis/componentconfig.lighthouse.io/v1alpha1/types.go#L38 In your case docker cannot be reached.

GeorgeSen commented 2 years ago

docker is indeed listening on /var/run/docker.sock

[root@sbd2-lgna2-a10-bcc-2] /data0$ ls -al /var/run/docker.sock
srw-rw---- 1 root docker 0 12月 9 20:31 /var/run/docker.sock

Are there any tools or techniques for troubleshooting this? Thanks!

mYmNeo commented 2 years ago

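Run docker -H unix:///var/run/lighthouse.sock info directly and check whether it returns data and how long it takes. The log you provided shows that the request was canceled. Either the docker request timeout set on your kubelet is too short, or the timeout in the lighthouse configuration needs to be increased.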
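For anyone who wants to script this check, here is a minimal Python sketch that sends one GET over a unix socket and reports the status and elapsed time. The names `UnixHTTPConnection` and `timed_get` are made up for this illustration and are not part of caelus or docker; the sketch only assumes that the socket speaks plain HTTP, as the Docker API does.

```python
import http.client
import socket
import time


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that dials a unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def timed_get(socket_path, url_path="/info", timeout=10.0):
    """Return (status, elapsed_seconds, body_length) for one GET request."""
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    start = time.monotonic()
    try:
        conn.request("GET", url_path)
        resp = conn.getresponse()
        body = resp.read()
        return resp.status, time.monotonic() - start, len(body)
    finally:
        conn.close()
```

Calling `timed_get("/var/run/docker.sock")` and then `timed_get("/var/run/lighthouse.sock")` makes it easy to compare the daemon with the proxy in front of it: a 502 that comes back almost instantly suggests the proxy failed to reach its backend, while a long delay points more toward a timeout setting.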

GeorgeSen commented 2 years ago

[root@sbd2-lgna2-a10-bcc-2] /data0$ docker -H unix:///var/run/lighthouse.sock info
request returned Bad Gateway for API route and version http://%2Fvar%2Frun%2Flighthouse.sock/v1.39/info, check if the server supports the requested API version

[root@sbd2-lgna2-a10-bcc-2] /data0$ docker -H unix:///var/run/docker.sock info
Containers: 34
Running: 1
Paused: 0
Stopped: 33
Images: 17
Server Version: 18.09.6

mYmNeo commented 2 years ago

Please post the lighthouse log from this point in time. The complete request log is needed.

GeorgeSen commented 2 years ago

I rebuilt a binary from the community code and tried again.

I ran docker -H unix:///var/run/lighthouse.sock info twice,

as well as curl -v --unix-socket /var/run/lighthouse.sock http://127.0.0.1/v1.39/info

The complete lighthouse log for that period:

[root@sbd2-lgna2-a10-bcc-2] /data0$ cat /data0/gelanjie/caelus_log/lighthouse/lighthouse.INFO
Log file created at: 2021/12/10 15:16:43
Running on machine: sbd2-lgna2-a10-bcc-2
Binary: Built with gc go1.16.10 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1210 15:16:43.107530 17467 util.go:69] FLAG: --add-dir-header="false"
I1210 15:16:43.107680 17467 util.go:69] FLAG: --alsologtostderr="false"
I1210 15:16:43.107683 17467 util.go:69] FLAG: --config="/etc/lighthouse/config.yaml"
I1210 15:16:43.107693 17467 util.go:69] FLAG: --help="false"
I1210 15:16:43.107695 17467 util.go:69] FLAG: --log-backtrace-at=":0"
I1210 15:16:43.107699 17467 util.go:69] FLAG: --log-dir="/data0/gelanjie/caelus_log/lighthouse"
I1210 15:16:43.107701 17467 util.go:69] FLAG: --log-file=""
I1210 15:16:43.107703 17467 util.go:69] FLAG: --log-file-max-size="1800"
I1210 15:16:43.107704 17467 util.go:69] FLAG: --log-flush-frequency="5s"
I1210 15:16:43.107706 17467 util.go:69] FLAG: --logtostderr="false"
I1210 15:16:43.107708 17467 util.go:69] FLAG: --skip-headers="false"
I1210 15:16:43.107709 17467 util.go:69] FLAG: --skip-log-headers="false"
I1210 15:16:43.107711 17467 util.go:69] FLAG: --stderrthreshold="2"
I1210 15:16:43.107712 17467 util.go:69] FLAG: --v="10"
I1210 15:16:43.107714 17467 util.go:69] FLAG: --version="false"
I1210 15:16:43.107716 17467 util.go:69] FLAG: --vmodule=""
I1210 15:16:43.108173 17467 hook_manager.go:124] Hook timeout: 10 seconds
I1210 15:16:43.108178 17467 hook_manager.go:132] Register hook docker, endpoint unix://@plugin-server
I1210 15:16:43.108182 17467 hook_manager.go:136] Register PreHook post /containers/create with unix://@plugin-server
I1210 15:16:43.108186 17467 hook_manager.go:136] Register PreHook post /containers/{name:.}/update with unix://@plugin-server
I1210 15:16:43.108188 17467 hook_manager.go:164] Build router: post /containers/create
I1210 15:16:43.108202 17467 hook_manager.go:164] Build router: post /containers/{name:.}/update
I1210 15:16:43.108328 17467 hook_manager.go:101] Hook manager is running
I1210 15:17:32.250451 17467 hook_manager.go:343] Unhandled request GET /_ping
I1210 15:17:32.250505 17467 log.go:184] http: proxy error: context canceled
I1210 15:17:32.251220 17467 hook_manager.go:343] Unhandled request GET /v1.39/info
I1210 15:17:32.251236 17467 log.go:184] http: proxy error: context canceled
I1210 15:17:38.916794 17467 hook_manager.go:343] Unhandled request GET /_ping
I1210 15:17:38.916843 17467 log.go:184] http: proxy error: context canceled
I1210 15:17:38.917189 17467 hook_manager.go:343] Unhandled request GET /v1.39/info
I1210 15:17:38.917221 17467 log.go:184] http: proxy error: context canceled
I1210 15:18:03.286403 17467 hook_manager.go:343] Unhandled request GET /v1.39/info
I1210 15:18:03.286440 17467 log.go:184] http: proxy error: context canceled
I1210 15:18:07.702430 17467 hook_manager.go:343] Unhandled request GET /v1.39/info
I1210 15:18:07.702468 17467 log.go:184] http: proxy error: context canceled
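When the log gets long, it can help to summarize which pass-through requests fail. The following is a rough Python sketch that parses lines in the format stated in the log header ([IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg) and counts every "Unhandled request" entry that is directly followed by an "http: proxy error" entry. The function name `failed_passthroughs` is hypothetical, and pairing entries by adjacency is an assumption that only holds when requests are not interleaved.

```python
import re
from collections import Counter

# Header layout taken from the log file itself:
# [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
LINE_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)
UNHANDLED_RE = re.compile(r"^Unhandled request (?P<method>[A-Z]+) (?P<path>\S+)$")


def failed_passthroughs(lines):
    """Count pass-through requests that are immediately followed by a proxy error."""
    failures = Counter()
    pending = None  # (method, path) of the most recent "Unhandled request" entry
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip the file header and anything that is not a log entry
        msg = m.group("msg")
        unhandled = UNHANDLED_RE.match(msg)
        if unhandled:
            pending = (unhandled.group("method"), unhandled.group("path"))
        elif pending and msg.startswith("http: proxy error"):
            failures[pending] += 1
            pending = None
        else:
            pending = None
    return failures
```

Feeding it the lines of lighthouse.INFO (for example via `open(path).read().splitlines()`) gives a per-route count, which makes it obvious when every proxied request fails rather than only specific routes.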

mYmNeo commented 2 years ago

Which user do you start lighthouse as?

GeorgeSen commented 2 years ago

As root.

[root@sbd2-lgna2-a10-bcc-2] /data0$ ls -al /var/run/lighthouse.sock
srwxr-xr-x 1 root root 0 12月 10 13:56 /var/run/lighthouse.sock

[root@sbd2-lgna2-a10-bcc-2] /data0$ ls -al /var/run/docker.sock
srw-rw---- 1 root docker 0 12月 10 11:08 /var/run/docker.sock
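The same check as the ls output above can be scripted. This is a small sketch using only the Python standard library; `describe_socket` is a hypothetical helper name. It reports whether a path is a socket and prints its mode in the same form as ls -l, which is mainly useful when the proxy runs as a non-root user (root, as here, is not blocked by the srw-rw---- mode).

```python
import os
import stat


def describe_socket(path):
    """Return type, permission string, and numeric owner for a filesystem path."""
    st = os.stat(path)
    return {
        "is_socket": stat.S_ISSOCK(st.st_mode),
        "mode": stat.filemode(st.st_mode),  # same layout as the first column of ls -l
        "uid": st.st_uid,
        "gid": st.st_gid,
    }


def can_connect(path):
    """On Linux, connecting to a unix socket needs write permission on the socket file."""
    return os.access(path, os.W_OK)
```

For example, `describe_socket("/var/run/docker.sock")` on the host above should report a socket owned by root with group docker, and `can_connect` tells you whether the current user would be allowed to dial it.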

mYmNeo commented 2 years ago

Does running curl --unix-socket /var/run/lighthouse.sock http://127.0.0.1/v1.39/info produce any output?

GeorgeSen commented 2 years ago

No output.

[root@sbd2-lgna2-a10-bcc-2] /var/run$ curl --unix-socket /var/run/lighthouse.sock http://127.0.0.1/v1.39/info
[root@sbd2-lgna2-a10-bcc-2] /var/run$ curl --unix-socket /var/run/lighthouse.sock http://127.0.0.1/v1.39/version
[root@sbd2-lgna2-a10-bcc-2] /var/run$ curl --unix-socket /var/run/lighthouse.sock http://127.0.0.1/info

GeorgeSen commented 2 years ago

[root@sbd2-lgna2-a10-bcc-2] /data0$ curl -v --unix-socket /var/run/lighthouse.sock http://127.0.0.1/v1.39/info
About to connect() to 127.0.0.1 port 80 (#0)
Trying /var/run/lighthouse.sock...
Failed to set TCP_KEEPIDLE on fd 3
Failed to set TCP_KEEPINTVL on fd 3
Connected to 127.0.0.1 (/var/run/lighthouse.sock) port 80 (#0)
GET /v1.39/info HTTP/1.1
User-Agent: curl/7.29.0
Host: 127.0.0.1
Accept: */*

HTTP/1.1 502 Bad Gateway
Date: Fri, 10 Dec 2021 07:18:23 GMT
Content-Length: 0

Connection #0 to host 127.0.0.1 left intact

mYmNeo commented 2 years ago

@GeorgeSen Please pull the latest code and try again.

GeorgeSen commented 2 years ago

It works now.

GeorgeSen commented 2 years ago

@mYmNeo I tried the new code and it runs without problems. When creating a pod I added this annotation:

mixer.kubernetes.io/app-class: "greedy"

However, the cgroup of the pod's container was not placed under /sys/fs/cgroup/cpu,cpuacct/kubepods/offline:

lighthouse log:

I1210 16:57:12.913446 118028 hook_manager.go:343] Unhandled request GET /images/nm-operator:v0_1/json
I1210 16:57:12.914741 118028 hook_manager.go:343] Unhandled request GET /images/nm-operator:v0_1/json
I1210 16:57:12.916540 118028 hook_manager.go:343] Unhandled request GET /version
I1210 16:57:12.917098 118028 hook_manager.go:333] Handle request POST /containers/create
I1210 16:57:12.917134 118028 hook_manager.go:313] PreHook request /containers/create, body: {"Hostname":"","Domainname":"","User":"0","AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["USER=root","PID_FILE=/tmp/hadoop-root-nodemanager.pid","CONTAINER_EXECUTOR=org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor","GINIT_PORT=10010","GROUP=root","HADOOP_CONF_DIR=/opt/module/hadoop-3.1.3/etc/hadoop","HADOOP_YARN_HOME=/opt/module/hadoop-3.1.3/","CGROUP_PATH=/sys/fs/cgroup","NM_LOCAL_DIRS=/hadoop-data","MY_NODE_IP=10.26.0.10","KUBERNETES_PORT=tcp://10.243.0.1:443","KUBERNETES_PORT_443_TCP=tcp://10.243.0.1:443","KUBERNETES_PORT_443_TCP_PROTO=tcp","KUBERNETES_PORT_443_TCP_PORT=443","KUBERNETES_PORT_443_TCP_ADDR=10.243.0.1","KUBERNETES_SERVICE_HOST=10.243.0.1","KUBERNETES_SERVICE_PORT=443","KUBERNETES_SERVICE_PORT_HTTPS=443"],"Cmd":null,"Healthcheck":{"Test":["NONE"]},"Image":"sha256:30ce197b3c5b0d06d3bf0e6a320f141067321af7845d588c2ea8d8bea3464bb9","Volumes":null,"WorkingDir":"","Entrypoint":null,"OnBuild":null,"Labels":{"annotation.io.kubernetes.container.hash":"bd015f2b","annotation.io.kubernetes.container.restartCount":"0","annotation.io.kubernetes.container.terminationMessagePath":"/dev/termination-log","annotation.io.kubernetes.container.terminationMessagePolicy":"File","annotation.io.kubernetes.pod.terminationGracePeriod":"30","io.kubernetes.container.logpath":"/var/log/pods/hadoop-yarn_node-manager-9rzvq_03ec582f-954a-4dee-a3dd-ea0ddf9b04b3/node-manager/0.log","io.kubernetes.container.name":"node-manager","io.kubernetes.docker.type":"container","io.kubernetes.pod.name":"node-manager-9rzvq","io.kubernetes.pod.namespace":"hadoop-yarn","io.kubernetes.pod.uid":"03ec582f-954a-4dee-a3dd-ea0ddf9b04b3","io.kubernetes.sandbox.id":"a7853f436d7aa84756f2a00d429f7df946099c98f952f122937c58e8c29e1435"},"HostConfig":{"Binds":["/var/lib/kubelet/pods/03ec582f-954a-4dee-a3dd-ea0ddf9b04b3/volumes/kubernetes.io~secret/default-token-lvn46:/var/run/secrets/kubernetes.io/serviceaccount:ro","/var/lib/kubelet/pods/03ec582f-954a-4dee-a3dd-ea0ddf9b04b3/etc-hosts:/etc/hosts","/var/lib/kubelet/pods/03ec582f-954a-4dee-a3dd-ea0ddf9b04b3/containers/node-manager/f78b29f6:/dev/termination-log"],"ContainerIDFile":"","LogConfig":{"Type":"","Config":null},"NetworkMode":"container:a7853f436d7aa84756f2a00d429f7df946099c98f952f122937c58e8c29e1435","PortBindings":null,"RestartPolicy":{"Name":"no","MaximumRetryCount":0},"AutoRemove":false,"VolumeDriver":"","VolumesFrom":null,"CapAdd":null,"CapDrop":null,"Capabilities":null,"Dns":null,"DnsOptions":null,"DnsSearch":null,"ExtraHosts":null,"GroupAdd":null,"IpcMode":"container:a7853f436d7aa84756f2a00d429f7df946099c98f952f122937c58e8c29e1435","Cgroup":"","Links":null,"OomScoreAdj":1000,"PidMode":"","Privileged":true,"PublishAllPorts":false,"ReadonlyRootfs":false,"SecurityOpt":["seccomp=unconfined"],"UTSMode":"","UsernsMode":"","ShmSize":67108864,"ConsoleSize":[0,0],"Isolation":"","CpuShares":2,"Memory":0,"NanoCpus":0,"CgroupParent":"kubepods-besteffort-pod03ec582f_954a_4dee_a3dd_ea0ddf9b04b3.slice","BlkioWeight":0,"BlkioWeightDevice":null,"BlkioDeviceReadBps":null,"BlkioDeviceWriteBps":null,"BlkioDeviceReadIOps":null,"BlkioDeviceWriteIOps":null,"CpuPeriod":100000,"CpuQuota":0,"CpuRealtimePeriod":0,"CpuRealtimeRuntime":0,"CpusetCpus":"","CpusetMems":"","Devices":[],"DeviceCgroupRules":null,"DeviceRequests":null,"DiskQuota":0,"KernelMemory":0,"KernelMemoryTCP":0,"MemoryReservation":0,"MemorySwap":0,"MemorySwappiness":null,"OomKillDisable":null,"PidsLimit":null,"Ulimits":null,"CpuCount":0,"CpuPercent":0,"IOMaximumIOps":0,"IOMaximumBandwidth":0,"MaskedPaths":null,"ReadonlyPaths":null},"NetworkingConfig":null}
I1210 16:57:12.917140 118028 hook_manager.go:200] Send to PreHook handler 0
I1210 16:57:12.917148 118028 hook_connector.go:70] Send request POST /prehook/containers/create for non-versioned
I1210 16:57:12.919436 118028 hook_connector.go:82] Decode response /prehook/containers/create for non-versioned
I1210 16:57:12.919766 118028 hook_manager.go:174] Send data to backend path /containers/create
I1210 16:57:12.926452 118028 hook_manager.go:177] Finish backend path /containers/create
I1210 16:57:12.926467 118028 hook_manager.go:280] PostHook request /containers/create, body: {"statusCode":201,"body":{"Id":"9f17870589ebcc5593dccb0dc67b83c72815c360028f31061f597d6716d34f8e","Warnings":null} }
I1210 16:57:12.926860 118028 hook_manager.go:343] Unhandled request POST /containers/9f17870589ebcc5593dccb0dc67b83c72815c360028f31061f597d6716d34f8e/start
I1210 16:57:12.960252 118028 hook_manager.go:343] Unhandled request GET /containers/9f17870589ebcc5593dccb0dc67b83c72815c360028f31061f597d6716d34f8e/json
I1210 16:57:13.006822 118028 hook_manager.go:343] Unhandled request GET /containers/9f17870589ebcc5593dccb0dc67b83c72815c360028f31061f597d6716d34f8e/json
I1210 16:57:13.007458 118028 hook_manager.go:343] Unhandled request GET /containers/9f17870589ebcc5593dccb0dc67b83c72815c360028f31061f597d6716d34f8e/json
I1210 16:57:13.008184 118028 hook_manager.go:343] Unhandled request GET /containers/a7853f436d7aa84756f2a00d429f7df946099c98f952f122937c58e8c29e1435/json
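The interesting field in that request body is HostConfig.CgroupParent, and it is easy to miss in a dump this size. As a rough aid, the Python sketch below pulls it out of a PreHook log entry. The helper name `cgroup_parent_from_log` is invented for this example, and the marker string is simply copied from the log above. Note that this is the body kubelet sent to lighthouse; the log does not show what the plugin returned.

```python
import json

# Text that precedes the JSON body in the lighthouse log entry shown above.
MARKER = "PreHook request /containers/create, body: "


def cgroup_parent_from_log(line):
    """Return HostConfig.CgroupParent from a lighthouse PreHook log line, or None."""
    idx = line.find(MARKER)
    if idx < 0:
        return None
    # raw_decode parses the first JSON value and ignores whatever follows it,
    # which matters when several log entries were pasted onto one line.
    body, _ = json.JSONDecoder().raw_decode(line[idx + len(MARKER):])
    return body.get("HostConfig", {}).get("CgroupParent")
```

For the entry above this yields kubepods-besteffort-pod03ec582f_954a_4dee_a3dd_ea0ddf9b04b3.slice, a systemd-style slice name.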

mYmNeo commented 2 years ago

Your docker info shows that the cgroup driver is systemd, not cgroupfs.
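To illustrate why the driver matters: with the systemd driver, CgroupParent is a slice name, and systemd nests slices by their dash-separated prefixes, so the container ends up under a kubepods.slice/... hierarchy rather than under a cgroupfs-style path such as kubepods/offline. The Python sketch below shows that general expansion rule. The function names are hypothetical, and nothing here is taken from the caelus code.

```python
def expand_systemd_slice(slice_name):
    """Expand 'a-b-c.slice' to 'a.slice/a-b.slice/a-b-c.slice' (systemd slice nesting)."""
    suffix = ".slice"
    if not slice_name.endswith(suffix) or "/" in slice_name:
        raise ValueError(f"not a systemd slice name: {slice_name!r}")
    parts = slice_name[: -len(suffix)].split("-")
    if any(part == "" for part in parts):
        raise ValueError(f"invalid slice name: {slice_name!r}")
    # Each dash-separated prefix is itself a slice and a parent directory.
    prefixes = ["-".join(parts[: i + 1]) + suffix for i in range(len(parts))]
    return "/".join(prefixes)


def looks_like_systemd_parent(cgroup_parent):
    """True if a CgroupParent value uses systemd slice naming rather than a cgroupfs path."""
    return cgroup_parent.endswith(".slice")
```

Applied to the CgroupParent in the request body above, `expand_systemd_slice` gives kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod03ec582f_954a_4dee_a3dd_ea0ddf9b04b3.slice, which is the relative path to look for under the cpu,cpuacct mount on a systemd-driver node.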

ddongchen commented 2 years ago

@GeorgeSen More detailed documentation has been uploaded: https://github.com/Tencent/caelus/blob/master/doc/start.md https://github.com/Tencent/caelus/blob/master/doc/config.md The entry point is: image

Welcome to caelus.