Closed ghouscht closed 2 weeks ago
Hi @ghouscht. Thanks for your PR.
I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
Project coverage is 68.75%. Comparing base (3de0018) to head (8369f07). Report is 14 commits behind head on main.
:exclamation: Current head 8369f07 differs from pull request most recent head 04c042c
Please upload reports for the commit 04c042c to get more accurate results.
Files with missing lines | Patch % | Lines |
---|---|---|
server/storage/backend/backend.go | 50.00% | 2 Missing :warning: |
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
The e2e test looks good.
The proposed solution is to restore the environment (i.e. reopen bbolt) when defragmentation fails, and to panic if the restore itself fails. If bbolt fails to open, etcdserver can't serve any requests, so panicking makes sense. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu
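The restore-or-panic idea described above can be illustrated with a minimal Go sketch. It is not the etcd implementation: `kvStore`, `open`, `defragment`, and `errRewrite` are hypothetical names invented for illustration, and the real backend deals with bbolt handles and transactions rather than a boolean flag.

```go
package main

import (
	"errors"
	"fmt"
)

// kvStore is a hypothetical stand-in for a backend that owns a database handle.
type kvStore struct {
	open   func() error // hypothetical opener that (re)opens the database file
	isOpen bool
}

// errRewrite is a sample failure used to exercise the restore path.
var errRewrite = errors.New("rewrite failed")

// defragment marks the store as closed, runs rewrite, and then reopens the
// store whether or not rewrite succeeded. It panics only when reopening
// fails, because the store cannot serve any request in that state.
func (s *kvStore) defragment(rewrite func() error) error {
	s.isOpen = false // the handle is unavailable while the file is rewritten
	rewriteErr := rewrite()
	if openErr := s.open(); openErr != nil {
		panic(fmt.Sprintf("failed to reopen database (rewrite error: %v): %v", rewriteErr, openErr))
	}
	s.isOpen = true
	return rewriteErr // nil on success; the caller still sees a rewrite failure
}
```

With this shape, a failed rewrite is reported to the caller as an ordinary error while the store stays usable, and only an unrecoverable reopen failure stops the process.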
I added a second commit that contains a working implementation of a possible restore operation. I did some manual testing with the failpoint and the e2e test, and it seems to work. However, this opens up a whole lot of other possible problems. I highlighted some of them with TODO in the code - feedback appreciated 🙂
/retest
/ok-to-test
I think we still need to handle any error returned by defragdb, something like below. Please also add one more failpoint and a new [sub]test case.
$ git diff -l10
diff --git a/server/storage/backend/backend.go b/server/storage/backend/backend.go
index 95f5cf96f..5a9361ae8 100644
--- a/server/storage/backend/backend.go
+++ b/server/storage/backend/backend.go
@@ -522,6 +522,11 @@ func (b *backend) defrag() error {
 		if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
 			b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
 		}
+
+		// restore the bbolt transactions if defragmentation fails
+		b.batchTx.tx = b.unsafeBegin(true)
+		b.readTx.tx = b.unsafeBegin(false)
+
 		return err
 	}
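To show why the transactions need to be restored on the error path, and how an injected failure can exercise that path in a test, here is a small self-contained Go sketch. It is a reduced model and not etcd code: `miniBackend`, `tx`, `begin`, `failCopy`, and `errInjected` are hypothetical names, and `failCopy` is a plain function hook standing in for a failpoint.

```go
package main

import "errors"

// tx is a hypothetical placeholder for a bbolt transaction.
type tx struct{ writable bool }

// miniBackend is a hypothetical, heavily reduced model of a backend.
type miniBackend struct {
	batchTx *tx
	readTx  *tx
	// failCopy is a hypothetical injection hook playing the role of a failpoint.
	failCopy func() error
}

// errInjected is the error returned by the injected failure in this sketch.
var errInjected = errors.New("injected copy failure")

// begin returns a fresh transaction; it stands in for starting a real one.
func (b *miniBackend) begin(writable bool) *tx { return &tx{writable: writable} }

// defrag drops both transactions, runs the (possibly failing) copy step, and
// begins new transactions on every exit path, so that later requests never
// dereference a nil transaction.
func (b *miniBackend) defrag() error {
	b.batchTx, b.readTx = nil, nil
	var err error
	if b.failCopy != nil {
		err = b.failCopy()
	}
	b.batchTx = b.begin(true)
	b.readTx = b.begin(false)
	return err
}
```

A subtest built on this model would set `failCopy` to return `errInjected`, call `defrag`, and then check that both transactions are non-nil again before issuing further reads or writes.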
Note we need to resolve https://github.com/etcd-io/etcd/pull/18822#discussion_r1830692111 in a separate PR. Could you please raise a new issue to track it? Thanks.
Overall looks good now. Please sign off the second commit. Refer to https://github.com/etcd-io/etcd/pull/18822/checks?check_run_id=32589384806
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahrtr, ghouscht, serathius
@ghouscht can you please backport this PR to 3.5 and 3.4?
This PR contains an e2e test, a gofail failpoint, and a fix for the issue described in https://github.com/etcd-io/etcd/issues/18810.
Without the fix, the test triggers a nil pointer panic in etcd, as described in the linked issue.
I think from here on we can discuss potential solutions for the problem. @ahrtr already suggested two possible options in the linked issue.
As mentioned in https://github.com/etcd-io/etcd/pull/18822#issuecomment-2453561758, the PR now restores the environment and lets etcd continue to run.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.