COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] HMIs were unreachable: "Unexpected error when handling authentication request to identity provider" #840

Closed suberti-ads closed 1 year ago

suberti-ads commented 1 year ago

Environment:

Traceability:

Current Behavior: Many HMIs were unreachable (apisix/elasticsearch/scdf/Graylog)

ElasticsearchUnreacheable

Expected Behavior: Connection successful to both Elasticsearch instances

Steps To Reproduce: Seen on the LIVE environment

Test execution artefacts (i.e. logs, screenshots…)

Whenever possible, first analysis of the root cause. Sample below for the unreachable Elasticsearch.

This does not seem to be an issue with the Elasticsearch state itself:

 safescale  gw-cluster-ops  ~  ES_NAMESPACE="database";

 safescale  gw-cluster-ops  ~  ES_SVC="elasticsearch-processing-es-coordinating.${ES_NAMESPACE}.svc.cluster.local";

 safescale  gw-cluster-ops  ~  ES_PORT="9200";

 safescale  gw-cluster-ops  ~  curl -X GET http://${ES_SVC}:${ES_PORT}/_cat/health
1677227047 08:24:07 elasticsearch-processing green 5 3 876 438 0 0 0 165 9.1h 100.0%
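
For a more detailed view, the same endpoint can be queried for cluster health and node list. This is only an illustrative sketch reusing the ES_NAMESPACE/ES_SVC/ES_PORT variables defined above:

# Overall cluster health with details (status, node count, unassigned shards)
curl -s "http://${ES_SVC}:${ES_PORT}/_cluster/health?pretty"
# List the cluster nodes to confirm they are all present
curl -s "http://${ES_SVC}:${ES_PORT}/_cat/nodes?v"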

Hereafter the pod states (non-Running pods only):

 safescale  gw-cluster-ops  ~  kubectl get po -A -o wide | grep -v " Running"
NAMESPACE         NAME                                                              READY   STATUS             RESTARTS           AGE     IP                NODE                   NOMINATED NODE   READINESS GATES
infra             kafka-cluster-kafka-exporter-74676dbb6-l42xf                      1/2     CrashLoopBackOff   143 (2m21s ago)    36d     10.244.226.80     cluster-ops-node-9     <none>           <none>
infra             keda-operator-778cf49bcf-2t9lt                                    1/2     CrashLoopBackOff   1040 (69s ago)     3d21h   10.244.142.62     cluster-ops-node-139   <none>           <none>
infra             prometheus-operator-kube-p-admission-create--1-cpxpq              0/1     Completed          0                  162d    10.244.24.3       cluster-ops-node-8     <none>           <none>
logging           loki-ingester-10                                                  0/2     Init:0/1           0                  8h      <none>            cluster-ops-node-5     <none>           <none>
logging           loki-ingester-11                                                  0/2     Init:0/1           0                  8h      <none>            cluster-ops-node-6     <none>           <none>
logging           loki-ingester-12                                                  0/2     Init:0/1           0                  8h      <none>            cluster-ops-node-8     <none>           <none>
logging           loki-ingester-13                                                  0/2     Init:0/1           0                  4h4m    <none>            cluster-ops-node-9     <none>           <none>
logging           loki-ingester-14                                                  0/2     Init:0/1           0                  4h4m    <none>            cluster-ops-node-4     <none>           <none>
logging           loki-ingester-4                                                   0/2     Init:0/1           0                  9h      <none>            cluster-ops-node-146   <none>           <none>
logging           loki-ingester-5                                                   0/2     Init:0/1           0                  9h      <none>            cluster-ops-node-146   <none>           <none>
logging           loki-ingester-6                                                   0/2     Init:0/1           0                  9h      <none>            cluster-ops-node-146   <none>           <none>
logging           loki-ingester-7                                                   0/2     Init:0/1           0                  9h      <none>            cluster-ops-node-146   <none>           <none>
logging           loki-ingester-8                                                   0/2     Init:0/1           0                  8h      <none>            cluster-ops-node-5     <none>           <none>
logging           loki-ingester-9                                                   0/2     Init:0/1           0                  8h      <none>            cluster-ops-node-146   <none>           <none>
monitoring        grafana-deployment-65b58f694f-bkc2s                               1/2     Terminating        0                  3d17h   10.244.226.131    cluster-ops-node-9     <none>           <none>
monitoring        grafana-deployment-65b58f694f-xwj6p                               0/2     Init:0/2           0                  68m     <none>            cluster-ops-node-7     <none>           <none>
processing        ingestion-auxip-part1-ingestion-auxip-filter-v11-75d768c64h88cv   1/2     CrashLoopBackOff   95 (4m33s ago)     7d15h   10.244.144.10     cluster-ops-node-141   <none>           <none>
processing        ingestion-auxip-part1-ingestion-auxip-trigger-v11-856dc7c458dct   1/2     CrashLoopBackOff   95 (4m32s ago)     7d15h   10.244.118.84     cluster-ops-node-103   <none>           <none>
processing        ingestion-auxip-part1-ingestion-auxip-uncompress-v11-768dd5f27c   1/2     CrashLoopBackOff   97 (4m46s ago)     8d      10.244.118.58     cluster-ops-node-103   <none>           <none>
processing        ingestion-auxip-part1-ingestion-auxip-worker-v11-76c6b57c9n7vj5   1/2     CrashLoopBackOff   95 (4m32s ago)     7d15h   10.244.118.83     cluster-ops-node-103   <none>           <none>
processing        ingestion-pedc-part1-ingestion-edip-filter-v66-f9894d8f9-nvt8f    1/2     CrashLoopBackOff   95 (4m16s ago)     15h     10.244.4.25       cluster-ops-node-138   <none>           <none>
processing        rs-user-web-client-test-connection                                1/2     NotReady           0                  84d     10.244.23.56      cluster-ops-node-14    <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-2trjv             0/2     Evicted            0                  10d     <none>            cluster-ops-node-104   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-4frj2             0/2     Evicted            0                  3d23h   <none>            cluster-ops-node-103   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-d85jh             0/2     Evicted            0                  4d      <none>            cluster-ops-node-141   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-dw4dw             0/2     Evicted            0                  4d8h    <none>            cluster-ops-node-103   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-gh8sz             0/2     Evicted            0                  11d     <none>            cluster-ops-node-104   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-jk5cc             0/2     Evicted            0                  10d     <none>            cluster-ops-node-104   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-l69qb             0/2     Evicted            0                  3d23h   <none>            cluster-ops-node-103   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-m29p8             0/2     Evicted            0                  11d     <none>            cluster-ops-node-104   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-n9zzz             0/2     Evicted            0                  11d     <none>            cluster-ops-node-104   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-nrm7h             0/2     Evicted            0                  4d      <none>            cluster-ops-node-141   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-shtnh             0/2     Evicted            0                  11d     <none>            cluster-ops-node-104   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-tjwfs             0/2     Evicted            0                  4d7h    <none>            cluster-ops-node-141   <none>           <none>
processing        s1-l0aiop-part1-execution-worker-v26-75f77776f6-zc2mj             0/2     Evicted            0                  4d      <none>            cluster-ops-node-103   <none>           <none>
rook-ceph         rook-ceph-osd-prepare-cluster-ops-node-18--1-ntg9k                0/1     Completed          0                  7h7m    10.244.143.169    cluster-ops-node-18    <none>           <none>
rook-ceph         rook-ceph-osd-prepare-cluster-ops-node-19--1-lj4mz                0/1     Completed          0                  7h7m    10.244.74.170     cluster-ops-node-19    <none>           <none>
rook-ceph         rook-ceph-osd-prepare-cluster-ops-node-20--1-kb9dw                0/1     Completed          0                  7h7m    10.244.210.167    cluster-ops-node-20    <none>           <none>
rook-ceph         rook-ceph-osd-prepare-cluster-ops-node-21--1-cqhgt                0/1     Completed          0                  7h7m    10.244.218.172    cluster-ops-node-21    <none>           <none>
rook-ceph         rook-ceph-osd-prepare-cluster-ops-node-22--1-wxpwc                0/1     Completed          0                  7h7m    10.244.185.72     cluster-ops-node-22    <none>           <none>
security          falco-exporter-w47db                                              1/2     CrashLoopBackOff   208 (2m59s ago)    34d     10.244.141.58     cluster-ops-node-16    <none>           <none>
security          graylog-provisioner-job--1-fhd4t                                  0/1     Completed          4                  141d    10.244.22.28      cluster-ops-node-7     <none>           <none>
security          nmap-job-27953640--1-6c9r2                                        0/1     Completed          0                  3h15m   10.244.144.43     cluster-ops-node-141   <none>           <none>

It seems to be linked to an authentication/routing issue.

Hereafter the logs seen on pod apisix-68fbd7c95c-64zpm:

2023/02/24 09:27:54 [error] 47#47: *382169795 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer
2023/02/24 09:27:54 [error] 48#48: *382154503 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer
2023/02/24 09:27:54 [error] 44#44: *382142602 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 8s, context: ngx.timer
2023/02/24 09:27:54 [error] 46#46: *382211737 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer
2023/02/24 09:27:55 [error] 54#54: *382165707 stream [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 8s, context: ngx.timer
2023/02/24 09:27:55 [error] 50#50: *382195401 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 16s, context: ngx.timer
2023/02/24 09:27:55 [error] 54#54: *382147271 stream [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer
2023/02/24 09:27:55 [error] 44#44: *382162511 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer
2023/02/24 09:27:55 [error] 54#54: *382169928 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer
2023/02/24 09:27:55 [error] 49#49: *382211818 [lua] config_etcd.lua:567: no healthy etcd endpoint available, next retry after 4s, context: ngx.timer

Hereafter the etcd status:

apisix-etcd-0                              1/2     Running   0                34d
apisix-etcd-1                              1/2     Running   0                34d
apisix-etcd-2                              1/2     Running   0                34d
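
The cluster state can also be confirmed from inside one of the members with etcdctl. This is a sketch only: the networking namespace is inferred from the service names in the logs below, and the etcd container name is an assumption based on the Bitnami chart:

# Health and status of all members of the etcd cluster
kubectl -n networking exec apisix-etcd-0 -c etcd -- etcdctl endpoint health --cluster -w table
kubectl -n networking exec apisix-etcd-0 -c etcd -- etcdctl endpoint status --cluster -w table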

All of these pods stopped working yesterday at the same time, 2023/02/23 23:11:05:

{"level":"warn","ts":"2023-02-23T23:11:05.123Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"4.999397595s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"health\" ","response":"","error":"context deadline exceeded"}
{"level":"info","ts":"2023-02-23T23:11:05.124Z","caller":"traceutil/trace.go:171","msg":"trace[569336086] range","detail":"{range_begin:health; range_end:; }","duration":"4.999507227s","start":"2023-02-23T23:11:00.124Z","end":"2023-02-23T23:11:05.124Z","steps":["trace[569336086] 'agreement among raft nodes before linearized reading'  (duration: 4.999395102s)"],"step_count":1}
{"level":"warn","ts":"2023-02-23T23:11:05.124Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2023-02-23T23:11:00.124Z","time spent":"4.999572615s","remote":"10.244.150.74:57816","response type":"/etcdserverpb.KV/Range","request count":0,"request size":8,"response count":0,"response size":0,"request content":"key:\"health\" "}
{"level":"warn","ts":"2023-02-23T23:11:05.132Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":11653774113246473325,"retry-timeout":"500ms"}
{"level":"info","ts":"2023-02-23T23:11:05.160Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"}
{"level":"info","ts":"2023-02-23T23:11:05.162Z","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"apisix-etcd-2","data-dir":"/bitnami/etcd/data","advertise-peer-urls":["http://apisix-etcd-2.apisix-etcd-headless.networking.svc.cluster.local:2380"],"advertise-client-urls":["http://apisix-etcd-2.apisix-etcd-headless.networking.svc.cluster.local:2379","http://apisix-etcd.networking.svc.cluster.local:2379"]}
{"level":"info","ts":"2023-02-23T23:11:05.166Z","caller":"etcdserver/server.go:1444","msg":"skipped leadership transfer; local server is not leader","local-member-id":"3cebe7ced66da1ba","current-leader-member-id":"d762e7c0665ea948"}
WARNING: 2023/02/23 23:11:05 [core] grpc: addrConn.createTransport failed to connect to {0.0.0.0:2379 0.0.0.0:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused". Reconnecting...

apisix-etcd-0.log.gz

Trying to restart one of them. To be continued.
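
For reference, restarting the members one at a time could look like the following sketch (assuming the pods belong to an apisix-etcd StatefulSet in the networking namespace, so each deleted pod is recreated automatically):

# Delete one member; the StatefulSet controller recreates it
kubectl -n networking delete pod apisix-etcd-0
# Wait until the recreated pod is Ready (2/2) before moving to the next one
kubectl -n networking get pod apisix-etcd-0 -w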

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

Woljtek commented 1 year ago

The described behavior is a side effect of the crash of the apisix-etcd pods, which disables HMI login. Logs from apisix-etcd-1:

{"level":"warn","ts":"2023-02-23T23:11:30.426Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12198146718204584367,"retry-timeout" :"500ms"}                                                                                                                                                                                                  
{"level":"warn","ts":"2023-02-23T23:11:30.927Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12198146718204584367,"retry-timeout" :"500ms"}                                                                                                                                                                                                  
{"level":"warn","ts":"2023-02-23T23:11:31.428Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12198146718204584367,"retry-timeout" :"500ms"}                                                                                                                                                                                                  
{"level":"warn","ts":"2023-02-23T23:11:31.918Z","caller":"etcdserver/v3_server.go:852","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}       

These crashes came from having no space left on the rook-ceph-fs disks:

[root@rook-ceph-tools-64c596cd69-npv4p /]# ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL    USED    RAW USED  %RAW USED
hdd    15 TiB  1.9 TiB  13 TiB  13 TiB    87.32
TOTAL  15 TiB  1.9 TiB  13 TiB  13 TiB    87.32

--- POOLS ---
POOL               ID  PGS  STORED   OBJECTS  USED     %USED   MAX AVAIL
.mgr                1    1  389 KiB        2  1.1 MiB  100.00        0 B
ceph-blockpool      2   32  1.9 TiB  516.19k  5.8 TiB  100.00        0 B
ceph-fs-metadata    3   32   16 GiB    1.33M   48 GiB  100.00        0 B
cephfs-replicated   4   32  3.3 TiB   29.21M   6.9 TiB  100.00        0 B

The ongoing deployment of the 3 TB of PVs for S2-GRID explains the high occupancy rates.

To work around this issue, the CS team reduced the number of rook-ceph replicas (2 -> 1). Result: image.png
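
For the record, lowering the replication factor of the full pools from the rook-ceph toolbox could look like the sketch below. The pool names are taken from the ceph df output above; whether the CS team applied the change this way or through the Rook custom resources is an assumption:

# Reduce the replica count of the data pools from 2 to 1
ceph osd pool set cephfs-replicated size 1
ceph osd pool set ceph-blockpool size 1
# Check that the raw usage drops once the extra replicas are trimmed
ceph df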

A workshop shall be planned in order to review the new rook-ceph-fs OPS configuration.

Woljtek commented 1 year ago

For CCB: as the issue is worked around, I reduced the priority to minor. I propose to accept this issue into OPS for the organization of the workshop.

LAQU156 commented 1 year ago

IVV_CCB_2023_w09 : Moved into "Accepted OPS", action on OPS side, Priority minor (cf previous comment)

pcuq-ads commented 1 year ago

SYS_CCB_2023_w12 : The workaround is still active. There is a need to set replication to 1.

pcuq-ads commented 1 year ago

RSRRv2_SystemCCB : to be fixed in phase 1.

suberti-ads commented 1 year ago

Fix in this commit: https://github.com/COPRS/rs-config/commit/13aba5356dae12e141bd89676245faede91f0e25

In my opinion we can close this issue: it was an incident caused by not having enough disk space in the rook-ceph pool. This change should not be part of the default configuration.
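
To double-check the replication factor actually in effect after the change, it can be read back from the rook-ceph toolbox (pool names taken from the ceph df output earlier in this issue):

# Current replica count of each pool
ceph osd pool get cephfs-replicated size
ceph osd pool get ceph-blockpool size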