Seagate / cortx-hare

CORTX Hare configures Motr object store, starts/stops Motr services, and notifies Motr of service and device faults.
https://github.com/Seagate/cortx
Apache License 2.0
13 stars 80 forks

CORTX-32095: Problem: hax entrypoint reply may delay on restart #2124

Closed mssawant closed 2 years ago

mssawant commented 2 years ago

Hare sends an OFFLINE event whenever a process restarts; it does not check whether the process is a hax or Motr client process. When the hax process restarts, it sends a FirstEntrypointRequest and Hare tries to send OFFLINE for hax itself. Since no halink has been established yet, the delivery times out, which can further delay entrypoint processing and may cause other side effects. There is no need to send OFFLINE for hax and Motr client processes.

Solution: Skip hax and Motr client processes when sending OFFLINE for FirstEntrypoint requests.

Signed-off-by: Mandar Sawant <mandar.sawant@seagate.com>
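The skip rule described in this PR can be illustrated with a minimal sketch. This is not the actual hax code: `ProcType`, `Process`, `should_send_offline`, `handle_first_entrypoint_request`, and the two callbacks are hypothetical names chosen for illustration, and the real Hare implementation may structure this differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ProcType(Enum):
    """Hypothetical process categories, named for illustration only."""
    HAX = "hax"
    MOTR_CLIENT = "motr_client"
    IOS = "ios"


# Assumption: these are the process types for which OFFLINE is skipped
# on a first entrypoint request, as proposed in the PR description.
SKIP_OFFLINE = frozenset({ProcType.HAX, ProcType.MOTR_CLIENT})


@dataclass(frozen=True)
class Process:
    fid: str             # illustrative identifier, not a real Motr fid
    proc_type: ProcType


def should_send_offline(proc: Process) -> bool:
    """Return True only for processes that should get an OFFLINE event."""
    return proc.proc_type not in SKIP_OFFLINE


def handle_first_entrypoint_request(
    proc: Process,
    send_offline: Callable[[Process], None],
    reply_entrypoint: Callable[[Process], None],
) -> None:
    """Optionally send OFFLINE for the restarted process, then always reply."""
    if should_send_offline(proc):
        send_offline(proc)
    # The reply is never blocked by an OFFLINE delivery for hax or
    # Motr client processes, because that delivery is not attempted.
    reply_entrypoint(proc)


if __name__ == "__main__":
    for p in (Process("hax-1", ProcType.HAX), Process("ios-1", ProcType.IOS)):
        handle_first_entrypoint_request(
            p,
            send_offline=lambda q: print("OFFLINE ->", q.fid),
            reply_entrypoint=lambda q: print("entrypoint reply ->", q.fid),
        )
```

Passing the two actions in as callbacks keeps the decision (`should_send_offline`) separate from delivery, so the rule can be checked without a running cluster.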

mssawant commented 2 years ago

retest this please

vaibhavparatwar commented 2 years ago

please retest this

yeshpal-jain-seagate commented 2 years ago

Tested with ongoing IO during m0d failures and rgw failures; no issues observed.

[root@ssc-vm-g4-rhev4-0717 ~]# s3bench -accessKey=sgiamadmin -accessSecret=ldapadmin -bucket=test-40039-bucket-1-healthy-$(date +%d%m-%H%M%S) -endpoint=https://192.168.47.71:30443 -numClients=10 -numSamples=100 -objectNamePrefix=object-degraded -objectSize=16Mb -skipSSLCertVerification=True -s3MaxRetries=3 -region us-east-1 -validate -skipCleanup
Write done in 16s with 0 errors
Read done in 9s with 0 errors
Validate done in 10s with 0 errors

With a custom build, I started a script that runs s3bench in parallel, waits some time for the IO to complete, and then kills one of the ioservices. This happens in a loop. So far the test has completed a few rounds and I do not see any issues.

cortx-data-ssc-vm-g2-rhev4-3290-79ddd7768d-h6nkz     4/4   Running   4 (76m ago)     171m
cortx-data-ssc-vm-g4-rhev4-1669-5b4ff96464-t5gb9     4/4   Running   3 (8m20s ago)   171m
cortx-data-ssc-vm-g4-rhev4-1714-f854fb64c-44hhm      4/4   Running   1 (25m ago)     171m
cortx-data-ssc-vm-g4-rhev4-1715-869779dbbd-2vrfq     4/4   Running   4 (3m22s ago)   171m
cortx-data-ssc-vm-g4-rhev4-1719-69b59b9c6-l92jn      4/4   Running   1 (19m ago)     171m
cortx-ha-785fd4968f-r2bcg                            3/3   Running   0               167m
cortx-kafka-0                                        1/1   Running   2 (174m ago)    175m
cortx-kafka-1                                        1/1   Running   2 (174m ago)    175m
cortx-kafka-2                                        1/1   Running   1 (174m ago)    175m
cortx-server-ssc-vm-g2-rhev4-3290-5849954ddf-q8q9q   2/2   Running   1 (108m ago)    169m
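The looped test described in this comment could be sketched roughly as follows. The actual script was not posted, so everything here is an assumption: `run_fault_rounds`, `run_io`, and `kill_service` are hypothetical names, and the callbacks stand in for whatever launches s3bench and kills an ioservice in the real setup.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def run_fault_rounds(
    run_io: Callable[[], int],            # hypothetical: runs one IO job, returns error count
    kill_service: Callable[[str], None],  # hypothetical: kills the named ioservice
    services: Sequence[str],
    rounds: int,
    workers: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run parallel IO, wait for it to finish, kill one service; repeat."""
    rng = rng or random.Random()
    killed: List[str] = []
    for round_no in range(1, rounds + 1):
        # Start `workers` IO jobs in parallel and wait for all of them.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(run_io) for _ in range(workers)]
            errors = sum(f.result() for f in futures)
        if errors:
            raise RuntimeError(f"round {round_no}: IO reported {errors} errors")
        # Pick one service to kill before the next round of IO.
        victim = rng.choice(list(services))
        kill_service(victim)
        killed.append(victim)
    return killed


if __name__ == "__main__":
    # Stand-in callbacks; a real run would invoke the IO tool and the
    # cluster's own mechanism for stopping a service.
    victims = run_fault_rounds(
        run_io=lambda: 0,
        kill_service=lambda name: print("killing", name),
        services=["ios-a", "ios-b"],
        rounds=3,
    )
    print("killed in order:", victims)
```

Stopping on the first round with IO errors keeps the failing state available for inspection instead of letting later rounds mask it.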

Commit details for reference:

[root@cortx-data-headless-svc-ssc-vm-g2-rhev4-3290 /]# rpm -qa | grep cortx
cortx-provisioner-2.0.0-5045_325a3e0b.noarch
cortx-hare-2.0.0-6887_gitab286e2.el8.x86_64
cortx-motr-2.0.0-6887_gite1f5e80e.el8.x86_64
cortx-py-utils-2.0.0-6887_76bd9e4b.noarch