The described behavior is a side effect of the apisix-etcd pods crashing, which disables HMI login. Logs from apisix-etcd-1:
{"level":"warn","ts":"2023-02-23T23:11:30.426Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12198146718204584367,"retry-timeout" :"500ms"}
{"level":"warn","ts":"2023-02-23T23:11:30.927Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12198146718204584367,"retry-timeout" :"500ms"}
{"level":"warn","ts":"2023-02-23T23:11:31.428Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":12198146718204584367,"retry-timeout" :"500ms"}
{"level":"warn","ts":"2023-02-23T23:11:31.918Z","caller":"etcdserver/v3_server.go:852","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}
These crashes were caused by a lack of free space on the rook-ceph-fs disks:
[root@rook-ceph-tools-64c596cd69-npv4p /]# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 15 TiB 1.9 TiB 13 TiB 13 TiB 87.32
TOTAL 15 TiB 1.9 TiB 13 TiB 13 TiB 87.32
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 389 KiB 2 1.1 MiB 100.00 0 B
ceph-blockpool 2 32 1.9 TiB 516.19k 5.8 TiB 100.00 0 B
ceph-fs-metadata 3 32 16 GiB 1.33M 48 GiB 100.00 0 B
cephfs-replicated 4 32 3.3 TiB 29.21M 6.9 TiB 100.00 0 B
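For reference, the replication factor behind these figures can be checked from the rook-ceph-tools pod. A minimal sketch, assuming the pool names shown in the ceph df output above:
# Per-pool usage and current replication factor
ceph df detail
ceph osd pool get ceph-blockpool size
ceph osd pool get cephfs-replicated size
ceph osd pool get ceph-fs-metadata size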
The current deployment of the 3 TB PVs for S2-GRID explains the high occupancy rates.
To work around this issue, the CS team reduced the number of rook-ceph replicas (2 -> 1). Result:
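For the record, a minimal sketch of such a replica reduction applied directly to the pools listed above; the actual change was made through the rook-ceph configuration (see the commit linked below), so the pool names and method here are assumptions:
# Reduce the replication factor from 2 to 1 on the data pools (roughly halves raw usage)
ceph osd pool set ceph-blockpool size 1
ceph osd pool set cephfs-replicated size 1
# min_size must not exceed size
ceph osd pool set ceph-blockpool min_size 1
ceph osd pool set cephfs-replicated min_size 1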
A workshop shall be planned to review the new rook-ceph-fs OPS configuration.
For CCB: As the issue is worked around, I reduced the priority to minor. I propose to assign this issue to OPS for the organization of the workshop.
IVV_CCB_2023_w09: Moved to "Accepted OPS"; action on the OPS side; priority minor (cf. previous comment).
SYS_CCB_2023_w12: The workaround is still active. There is a need to set the replication to 1.
RSRRv2_SystemCCB: To be fixed in phase 1.
Fix in this commit: https://github.com/COPRS/rs-config/commit/13aba5356dae12e141bd89676245faede91f0e25
In my opinion, we could close this issue: it was an incident caused by insufficient disk space on the rook-ceph pool, and this change should not be part of the default configuration.
Environment:
Traceability:
Current Behavior: Many HMIs were unreachable (apisix/elasticsearch/scdf/Graylog)
Expected Behavior: Successful connection to both Elasticsearch instances
Steps To Reproduce: Observed on the live platform
Test execution artefacts (i.e. logs, screenshots…)
Whenever possible, first analysis of the root cause: sample analysis for Elasticsearch, which was unreachable.
This does not seem to be an issue with the Elasticsearch state itself:
Hereafter, the pod state:
It seems to be linked to an authentication/routing issue.
Hereafter, the log seen on pod apisix-68fbd7c95c-64zpm:
Hereafter, the etcd status:
All of these pods failed at the same time yesterday, 2023/02/23 at 23:11:05.
apisix-etcd-0.log.gz
Trying to restart one of them. To be continued.
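A minimal sketch of restarting a single member, assuming the pods are managed by the apisix-etcd StatefulSet so the deleted pod is recreated automatically (namespace is a placeholder):
# Delete one etcd pod; the StatefulSet recreates it and it rejoins the cluster
kubectl -n <namespace> delete pod apisix-etcd-0
kubectl -n <namespace> get pods -w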
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)