Cloud-and-Distributed-Systems / Erms


[Failed to deploy] pods from stateless services can't run correctly #10

Closed: qwz111 closed this issue 1 month ago

qwz111 commented 1 month ago

I started a cluster with 4 nodes on Alibaba Cloud ECS and deployed Prometheus on it, as shown below:

(venv) root@k8s-master01:/home/ecs-user/Erms# kubectl get nodes
NAME           STATUS   ROLES           AGE    VERSION
k8s-master01   Ready    control-plane   148m   v1.28.13
k8s-worker01   Ready    <none>          143m   v1.28.13
k8s-worker02   Ready    <none>          143m   v1.28.13
k8s-worker03   Ready    <none>          143m   v1.28.13

NAME                                  READY   STATUS    RESTARTS   AGE
blackbox-exporter-56cdcfc64f-2kjhg    3/3     Running   0          28s
grafana-6b58c766c5-pl97s              1/1     Running   0          27s
kube-state-metrics-5c55c74596-825b8   3/3     Running   0          27s
node-exporter-6cdn6                   2/2     Running   0          27s
node-exporter-795jk                   2/2     Running   0          27s
node-exporter-cx85x                   2/2     Running   0          27s
node-exporter-qr7n4                   2/2     Running   0          27s
prometheus-adapter-77f8587965-bw2wn   1/1     Running   0          27s
prometheus-adapter-77f8587965-fm27g   1/1     Running   0          27s
prometheus-k8s-0                      2/2     Running   0          25s
prometheus-k8s-1                      2/2     Running   0          25s
prometheus-operator-8cdc4f659-7p9cf   2/2     Running   0          27s

I installed the Python dependencies in requirements.txt and changed Erms-main/configs/media-global.yaml as follows:

figure_path: figures_media
yaml_repo_path: yamlRepository/mediaMicroservice
namespace: media-microsvc
app_img: "nicklin9907/erms:mediamicroservice-1.0"
nodes_for_test:
- k8s-worker01
prometheus_host: http://localhost:30090
nodes_for_infra:
- k8s-worker02
- k8s-worker03
pod_spec:
  cpu_size: 0.1
  mem_size: 200Mi
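
The prometheus_host entry above assumes that Prometheus is reachable on NodePort 30090 from the node where main.py runs; a quick check against Prometheus's /-/ready endpoint is, for example:

curl -s http://localhost:30090/-/ready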

When I run main.py, the following problem occurs:

Waiting for deployment finished...
Unfinished Pods: cast-info-service-679bddc67-dzlpw, nginx-web-server-9c9ccfbcb-flbx5, plot-service-77ff5cbd9b-mhds5, review-storage-service-556d7767c7-dnjvm, unique-id-service-cc97dd58c-bhx78, user-service-56659446f6-h5rl9
Unfinished Pods: 
Deployment finished! Used time: 10s
Traceback (most recent call last):
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1073, in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohappyeyeballs/impl.py", line 104, in start_connection
    raise first_exception
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohappyeyeballs/impl.py", line 81, in start_connection
    sock = await _connect_sock(
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohappyeyeballs/impl.py", line 166, in _connect_sock
    await loop.sock_connect(sock, address)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 501, in sock_connect
    return await fut
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 541, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 30092)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ecs-user/Erms/main.py", line 61, in <module>
    full_init("media", 30092)
  File "/home/ecs-user/Erms/testing/testCollection.py", line 29, in full_init
    DEPLOYER.full_init(app, configs.GLOBAL_CONFIG.nodes_for_infra, port)
  File "/home/ecs-user/Erms/deployment/deployer.py", line 65, in full_init
    main(server_address=f"http://localhost:{port}")
  File "/home/ecs-user/Erms/scripts/mediaMicroservice/write_movie_info.py", line 101, in main
    loop.run_until_complete(future)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/ecs-user/Erms/scripts/mediaMicroservice/write_movie_info.py", line 45, in write_cast_info
    resps = await asyncio.gather(*tasks)
  File "/home/ecs-user/Erms/scripts/mediaMicroservice/write_movie_info.py", line 8, in upload_cast_info
    async with session.post(addr + "/wrk2-api/cast-info/write", json=cast) as resp:
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/client.py", line 1353, in __aenter__
    self._resp = await self._coro
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/client.py", line 657, in _request
    conn = await self._connector.connect(
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 564, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 975, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1350, in _create_direct_connection
    raise last_exc
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1319, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/home/ecs-user/Erms/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1088, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:30092 ssl:default [Connect call failed ('127.0.0.1', 30092)]
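
The script posts to http://localhost:30092, presumably the NodePort of nginx-web-server, so the refused connection means nothing is answering on that port yet. The services and their node ports can be listed with, for example:

kubectl get svc -n media-microsvc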

Then I examined the state of the pods:

root@k8s-master01:~# kubectl get pods -n media-microsvc
NAME                                       READY   STATUS             RESTARTS      AGE
cast-info-memcached-6546b8d9fc-5spgg       1/1     Running            0             5m6s
cast-info-mongodb-5cd7556875-x4m99         1/1     Running            0             5m5s
cast-info-service-679bddc67-dzlpw          0/1     CrashLoopBackOff   5 (43s ago)   4m56s
compose-review-memcached-f95b899bd-pblwg   1/1     Running            0             5m6s
compose-review-service-7bb8bd754c-qn2w4    0/1     CrashLoopBackOff   5 (64s ago)   4m56s
jaeger-98468fd56-zl67s                     1/1     Running            0             5m7s
movie-id-memcached-55df48b9d7-2nbvd        1/1     Running            0             5m6s
movie-id-mongodb-5fb5c4f7cf-2srbk          1/1     Running            0             5m6s
movie-id-service-f5797fb59-xq4vg           0/1     CrashLoopBackOff   5 (44s ago)   4m56s
movie-info-memcached-9bc8f9944-k8h95       1/1     Running            0             5m6s
movie-info-mongodb-65878b55b8-fvwkd        1/1     Running            0             5m6s
movie-info-service-6689459d7-vmcms         0/1     CrashLoopBackOff   5 (52s ago)   4m56s
movie-review-mongodb-fbb6cdd8-5n72k        1/1     Running            0             5m6s
movie-review-redis-987658f56-mk8lv         1/1     Running            0             5m5s
movie-review-service-6688b5cbf9-wndmw      0/1     CrashLoopBackOff   5 (49s ago)   4m55s
nginx-web-server-9c9ccfbcb-flbx5           0/1     CrashLoopBackOff   5 (84s ago)   4m56s
plot-memcached-5bd5446c8d-5tvm7            1/1     Running            0             5m6s
plot-mongodb-869955484d-v7gpg              1/1     Running            0             5m5s
plot-service-77ff5cbd9b-mhds5              0/1     CrashLoopBackOff   5 (33s ago)   4m56s
rating-redis-f6d9768fd-6rr65               1/1     Running            0             5m7s
rating-service-66bb4b5796-q4xld            0/1     CrashLoopBackOff   5 (54s ago)   4m56s
review-storage-memcached-d87678b65-wjh6s   1/1     Running            0             5m6s
review-storage-mongodb-55bb7bf6df-rh5nl    1/1     Running            0             5m4s
review-storage-service-556d7767c7-dnjvm    0/1     CrashLoopBackOff   5 (34s ago)   4m56s
text-service-868d7ff7b8-pxqwt              0/1     CrashLoopBackOff   5 (26s ago)   4m55s
unique-id-service-cc97dd58c-bhx78          0/1     CrashLoopBackOff   5 (75s ago)   4m56s
user-memcached-57c6bbc55b-872jv            1/1     Running            0             5m4s
user-mongodb-895c6b984-9767r               1/1     Running            0             5m5s
user-review-mongodb-7686748f8c-vps44       1/1     Running            0             5m6s
user-review-redis-676fc6bbd9-79lqh         1/1     Running            0             5m5s
user-review-service-56cbb74799-t5jq6       0/1     CrashLoopBackOff   5 (41s ago)   4m56s
user-service-56659446f6-h5rl9              0/1     CrashLoopBackOff   5 (45s ago)   4m56s

So the entry point of the media service, nginx-web-server, has a pod that is not ready. When I examine the log of nginx-web-server-9c9ccfbcb-flbx5, I see:

root@k8s-master01:~# kubectl logs nginx-web-server-9c9ccfbcb-flbx5 -n media-microsvc
2024/08/17 16:50:32 [error] 1#1: Failed to construct tracer: Error resolving address: Temporary failure in name resolution
nginx: [error] Failed to construct tracer: Error resolving address: Temporary failure in name resolution
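
For reference, "Temporary failure in name resolution" usually concerns cluster DNS: the nginx tracer is presumably trying to resolve the jaeger address, and the jaeger pod itself shows Running above. Cluster DNS can be checked from inside the namespace with a throwaway busybox pod, for example:

kubectl run dnscheck -n media-microsvc --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
kubectl run dnscheck -n media-microsvc --rm -it --restart=Never --image=busybox:1.28 -- nslookup jaeger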

It seems that media-microsvc cannot be deployed properly. I guess I didn't modify the configuration files (i.e., *-global.yaml and utils.py) correctly, but I don't know how to fix it.

Nick-LCY commented 1 month ago

Hi, have you tried other applications like hotel reservation? Based on the YAML you provided, I think you are configuring things correctly, so I recommend you test another application first to see whether it's a problem with your k8s cluster.

qwz111 commented 1 month ago

Hi, thank you for your advice! I tried to test hotel reservation on my k8s cluster, but when I deploy only the stateless deployments (running only init_app() in main.py), it still has some errors. Here is the log of the entry point of hotel-reservation, namely the frontend:

root@master:~# kubectl logs frontend-6685f495bd-vbszm -n hotel-reserv
2024/08/18 06:48:35 TLS disabled
2024-08-18T06:48:35Z INF cmd/frontend/main.go:21 > Reading config...
2024-08-18T06:48:35Z INF cmd/frontend/main.go:36 > Read target port: 5000
2024-08-18T06:48:35Z INF cmd/frontend/main.go:37 > Read consul address: consul:8500
2024-08-18T06:48:35Z INF cmd/frontend/main.go:38 > Read jaeger address: jaeger:6831
2024-08-18T06:48:35Z INF cmd/frontend/main.go:45 > Initializing jaeger agent [service name: frontend | host: jaeger:6831]...
2024-08-18T06:48:35Z PNC cmd/frontend/main.go:48 > Got error while initializing jaeger agent: lookup jaeger on 10.96.0.10:53: no such host
panic: Got error while initializing jaeger agent: lookup jaeger on 10.96.0.10:53: no such host

goroutine 1 [running]:
github.com/rs/zerolog.(*Logger).Panic.func1({0xc000024120, 0x0})
        /go/src/github.com/harlow/go-micro-services/vendor/github.com/rs/zerolog/log.go:359 +0x2d
github.com/rs/zerolog.(*Event).msg(0xc00006a060, {0xc000024120, 0x57})
        /go/src/github.com/harlow/go-micro-services/vendor/github.com/rs/zerolog/event.go:149 +0x2b8
github.com/rs/zerolog.(*Event).Msgf(0xc00006a060, {0x8dfe90, 0xc00011e470}, {0xc000187c00, 0xb, 0x8cc659})
        /go/src/github.com/harlow/go-micro-services/vendor/github.com/rs/zerolog/event.go:129 +0x4e
main.main()
        /go/src/github.com/harlow/go-micro-services/cmd/frontend/main.go:48 +0xa3b

By the way, when I run only full_init("hotel", 30096) in main.py, it doesn't have this error (this case deploys all services in the hotel application). It seems that there is no jaeger host, yet the jaeger service is in the stateful service set, in yamlRepository/hotelReservation/non-test. So I want to know: do I need to deploy the jaeger service before I deploy only the stateless services?

Nick-LCY commented 1 month ago

Yes, jaeger is classified as part of the stateful services, and you need to deploy the stateful services first. They can be deployed through full_init().
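
For example, once full_init("hotel", 30096) has brought up the stateful services from yamlRepository/hotelReservation/non-test, a stateless-only init_app() run should find jaeger resolvable; a quick check that jaeger is actually up is:

kubectl get pods,svc -n hotel-reserv | grep jaeger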

qwz111 commented 1 month ago

ok, it works now. Thank you very much!