galasa-dev / projectmanagement

Project Management repo for Issues and ZenHub

etcd DB ran out of space #1084

Open hbunchgithub opened 2 years ago

hbunchgithub commented 2 years ago
Problem: Galasa test runs in Jenkins are failing with the error: Failed to schedule runs 'wazi-vtp-galasa-ecoystem1.fyre.ibm.com:8080 failed to respond , schedule response is null
Solution: The log for the galasa_api Docker container showed:
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: etcdserver: mvcc: database space exceeded
The etcd server DB is out of space.
Get a shell session in the container:
$ docker exec -it galasa_cps sh
Check the status of the etcd DB (it will likely show a NOSPACE alarm):
$ ETCDCTL_API=3 etcdctl --write-out=table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| 127.0.0.1:2379 | 8e9e05c52164694d |   3.4.9 |  2.1 GB |      true |      false |        73 |    9229920 |            9229920 |  memberID:10276657743932975437 |
|                |                  |         |         |           |            |           |            |                    |                  alarm:NOSPACE |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
Issue the following to compact and defrag:
$ rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
$ ETCDCTL_API=3 etcdctl compact $rev
compacted revision 1516
$ ETCDCTL_API=3 etcdctl defrag
Finished defragmenting etcd member[127.0.0.1:2379]
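
The rev= assignment depends on the JSON from endpoint status containing a "revision":<number> field. As a minimal sketch, the extraction can be wrapped in a small function so it is easy to check on its own. The name extract_revision is hypothetical, and the sketch assumes the revision appears exactly once in the output (a single endpoint).

```shell
# Hypothetical helper: reads the output of
#   etcdctl endpoint status --write-out=json
# on stdin and prints the current revision number.
extract_revision() {
    # The first grep isolates '"revision":<digits>';
    # the second keeps everything from the first digit onwards.
    grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9].*'
}
```

It could then be used as: rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | extract_revision)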

Defrag may take quite a while. When it is done, reset the alarm:
$ ETCDCTL_API=3 etcdctl alarm disarm
Check the status again:
$ ETCDCTL_API=3 etcdctl --write-out=table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 8e9e05c52164694d |   3.4.9 |   41 kB |      true |      false |        73 |    9230139 |            9230139 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
The DB size should now be very small and the alarm should be gone.
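
For anyone who hits this repeatedly, the whole sequence can be sketched as one shell function. This is only a sketch under assumptions: it is meant to run inside the etcd container with etcdctl on the PATH, it talks to the default local endpoint, and it assumes `etcdctl alarm list` prints the same alarm:NOSPACE text seen in the ERRORS column. The function names has_nospace_alarm and reclaim_etcd_space are hypothetical.

```shell
# Succeeds if the `etcdctl alarm list` output on stdin reports NOSPACE.
has_nospace_alarm() {
    grep -q 'alarm:NOSPACE'
}

# Compact, defrag and disarm, in that order.
# The body runs in a subshell so that `set -e` and the export
# do not leak into the calling shell.
reclaim_etcd_space() (
    set -e
    export ETCDCTL_API=3

    if ! etcdctl alarm list | has_nospace_alarm; then
        echo "no NOSPACE alarm raised; nothing to do"
        exit 0
    fi

    # Current revision, taken from the JSON status output.
    rev=$(etcdctl endpoint status --write-out=json \
        | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9].*')

    etcdctl compact "$rev"     # discard history before this revision
    etcdctl defrag             # release the freed space (can be slow)
    etcdctl alarm disarm       # clear the NOSPACE alarm
    etcdctl --write-out=table endpoint status
)
```

Calling reclaim_etcd_space from the shell opened with docker exec would then run the steps only when the alarm is actually raised. Because of `set -e`, it stops at the first step that fails instead of disarming the alarm on a database that is still full.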

techcobweb commented 1 year ago

Apologies for not getting to this one before now... Has this happened since? Wondering if it is still an issue.

hbunchgithub commented 1 year ago

Unless you have fixed the problem, it's still an issue. It was opened at the request of the Galasa team, Mr Davies I think.