etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

fix(defrag): handle no space left error #18822

Closed ghouscht closed 2 weeks ago

ghouscht commented 3 weeks ago

This PR contains an e2e test, a gofail failpoint, and a fix for the issue described in https://github.com/etcd-io/etcd/issues/18810.

Without the fix, the test triggers a nil pointer panic in etcd, as described in the linked issue:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: execute job failed
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x787ad8]

goroutine 136 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x80?, 0x2?, {0xf?, 0x0?, 0x0?})
    go.uber.org/zap@v1.27.0/zapcore/entry.go:196 +0x78
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x40005d81a0, {0x400058a780, 0x2, 0x2})
    go.uber.org/zap@v1.27.0/zapcore/entry.go:262 +0x1c4
go.uber.org/zap.(*Logger).Panic(0xcd8746?, {0xce7440?, 0xb54f60?}, {0x400058a780, 0x2, 0x2})
    go.uber.org/zap@v1.27.0/logger.go:285 +0x54
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob.func1()
    go.etcd.io/etcd/pkg/v3@v3.6.0-alpha.0/schedule/schedule.go:202 +0x24c
panic({0xb54f60?, 0x1681ec0?})
    runtime/panic.go:785 +0x124
go.etcd.io/bbolt.(*Tx).Bucket(...)
    go.etcd.io/bbolt@v1.4.0-alpha.1/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.unsafeForEach(0x0, {0xe9b988?, 0x1695e60?}, 0x4000580460)
    go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:235 +0x38
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeForEach(...)
    go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:231
go.etcd.io/etcd/server/v3/storage/backend.unsafeVerifyTxConsistency({0xea55e0, 0x40000bafc0}, {0xe9b988, 0x1695e60})
    go.etcd.io/etcd/server/v3/storage/backend/verify.go:97 +0xa4
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll.VerifyBackendConsistency.func1()
    go.etcd.io/etcd/server/v3/storage/backend/verify.go:90 +0x244
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0x40000efdf8)
    go.etcd.io/etcd/client/pkg/v3@v3.6.0-alpha.0/verify/verify.go:71 +0x3c
go.etcd.io/etcd/server/v3/storage/backend.VerifyBackendConsistency(...)
    go.etcd.io/etcd/server/v3/storage/backend/verify.go:75
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0x40003a3c08, 0x4000172180, 0x40005da000)
    go.etcd.io/etcd/server/v3/etcdserver/server.go:972 +0xcc
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6({0x4000196850?, 0x4000494a80?})
    go.etcd.io/etcd/server/v3/etcdserver/server.go:847 +0x28
go.etcd.io/etcd/pkg/v3/schedule.job.Do(...)
    go.etcd.io/etcd/pkg/v3@v3.6.0-alpha.0/schedule/schedule.go:41
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob(0x40000eff70?, {0xe943f8?, 0x40005b6330?}, 0x0?)
    go.etcd.io/etcd/pkg/v3@v3.6.0-alpha.0/schedule/schedule.go:206 +0x78
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0x400039a0e0)
    go.etcd.io/etcd/pkg/v3@v3.6.0-alpha.0/schedule/schedule.go:187 +0x15c
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler in goroutine 155
    go.etcd.io/etcd/pkg/v3@v3.6.0-alpha.0/schedule/schedule.go:101 +0x178
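
The `0x0` receiver in the `unsafeForEach` frame is a nil bbolt transaction; a minimal standalone sketch (hypothetical, not etcd code) of why that call panics:

```go
package main

import bolt "go.etcd.io/bbolt"

func main() {
	// After a failed defragmentation the backend can be left holding a nil
	// transaction; calling any method on it dereferences a nil pointer,
	// which is exactly the SIGSEGV shown in the trace above.
	var tx *bolt.Tx // stand-in for the nil batchTx.tx
	_ = tx.Bucket([]byte("key")) // panic: invalid memory address or nil pointer dereference
}
```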

I think from here on we can discuss potential solutions for the problem. @ahrtr already suggested two possible options in the linked issue.

As mentioned in https://github.com/etcd-io/etcd/pull/18822#issuecomment-2453561758 the PR now restores the environment and lets etcd continue to run.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

k8s-ci-robot commented 3 weeks ago

Hi @ghouscht. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

codecov-commenter commented 3 weeks ago

:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.75%. Comparing base (3de0018) to head (8369f07). Report is 14 commits behind head on main.

:exclamation: Current head 8369f07 differs from pull request most recent head 04c042c

Please upload reports for the commit 04c042c to get more accurate results.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/storage/backend/backend.go | 50.00% | 2 Missing :warning: |

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

| [Files with missing lines](https://app.codecov.io/gh/etcd-io/etcd/pull/18822?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None) | Coverage Δ | |
|---|---|---|
| [server/storage/backend/backend.go](https://app.codecov.io/gh/etcd-io/etcd/pull/18822?src=pr&el=tree&filepath=server%2Fstorage%2Fbackend%2Fbackend.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c2VydmVyL3N0b3JhZ2UvYmFja2VuZC9iYWNrZW5kLmdv) | `82.82% <50.00%> (-0.52%)` | :arrow_down: |

... and [19 files with indirect coverage changes](https://app.codecov.io/gh/etcd-io/etcd/pull/18822/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None)

```diff
@@            Coverage Diff             @@
##             main   #18822      +/-   ##
==========================================
- Coverage   68.76%   68.75%   -0.01%
==========================================
  Files         420      420
  Lines       35523    35525       +2
==========================================
- Hits        24426    24425       -1
- Misses       9665     9678      +13
+ Partials     1432     1422      -10
```

------

[Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/etcd-io/etcd/pull/18822?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None).

> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://app.codecov.io/gh/etcd-io/etcd/pull/18822?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None). Last update [3de0018...04c042c](https://app.codecov.io/gh/etcd-io/etcd/pull/18822?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None).

ahrtr commented 3 weeks ago

The e2e test looks good.

The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation somehow fails, and to panic if the restore fails as well. If the bbolt db can't be opened, then etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu
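
A minimal sketch of that restore-or-panic idea (the helper name, `origPath`, `opts`, and the log message are made up for illustration; this is not the actual patch):

```go
package backend // hypothetical placement; sketch only

import (
	bolt "go.etcd.io/bbolt"
	"go.uber.org/zap"
)

// reopenAfterFailedDefrag illustrates the restore-or-panic idea: reopen the
// original bbolt db when defragmentation failed, and panic if even the
// reopen fails, because etcdserver cannot serve requests without a backend.
func reopenAfterFailedDefrag(lg *zap.Logger, origPath string, opts *bolt.Options) *bolt.DB {
	db, err := bolt.Open(origPath, 0600, opts)
	if err != nil {
		lg.Panic("failed to reopen backend db after failed defragmentation", zap.Error(err))
	}
	return db
}
```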

ghouscht commented 3 weeks ago

> The e2e test looks good.
>
> The proposed solution is to restore the environment (i.e. reopen the bbolt db) when defragmentation somehow fails, and to panic if the restore fails as well. If the bbolt db can't be opened, then etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu

I added a second commit that contains a working implementation of a possible restore operation. I did some manual testing with the failpoint and the e2e test, and it seems to work. However, this opens up a whole lot of other possible problems. I highlighted some of them with TODO comments in the code - feedback appreciated 🙂

serathius commented 2 weeks ago

/retest

serathius commented 2 weeks ago

/ok-to-test

ahrtr commented 2 weeks ago

I think we still need to handle the error, if any, during defragdb, something like the diff below; we should also add one more failpoint and a new [sub]test case.

$ git diff -l10
diff --git a/server/storage/backend/backend.go b/server/storage/backend/backend.go
index 95f5cf96f..5a9361ae8 100644
--- a/server/storage/backend/backend.go
+++ b/server/storage/backend/backend.go
@@ -522,6 +522,11 @@ func (b *backend) defrag() error {
                if rmErr := os.RemoveAll(tmpdb.Path()); rmErr != nil {
                        b.lg.Error("failed to remove db.tmp after defragmentation completed", zap.Error(rmErr))
                }
+
+               // restore the bbolt transactions if defragmentation fails
+               b.batchTx.tx = b.unsafeBegin(true)
+               b.readTx.tx = b.unsafeBegin(false)
+
                return err
        }

https://github.com/etcd-io/etcd/blob/bd8896364a1d9adb554b538232004e5b2d6c850a/server/storage/backend/backend.go#L520-L526
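
For reference, a sketch of gofail's comment-based failpoint pattern that such a test could rely on (the failpoint name `defragdbFailExample`, the helper function, and the injected error are all made up; the real failpoint in this PR may differ). When etcd is built with failpoints enabled, a failpoint like this can be activated at runtime through the GOFAIL_FAILPOINTS environment variable or the GOFAIL_HTTP endpoint:

```go
package backend // illustrative placement; names are made up for this sketch

import "errors"

// errInjected stands in for whatever error a test wants to inject,
// e.g. "no space left on device".
var errInjected = errors.New("no space left on device")

// defragdbExample shows gofail's comment-based failpoint pattern: when
// failpoints are built in and this one is activated, the commented line
// under the marker is compiled in, so the function returns the injected
// error and the caller's error-handling path gets exercised.
func defragdbExample() error {
	// gofail: var defragdbFailExample struct{}
	// return errInjected
	return nil
}
```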

ahrtr commented 2 weeks ago

Note we need to resolve https://github.com/etcd-io/etcd/pull/18822#discussion_r1830692111 in a separate PR. Could you please raise a new issue to track it? Thanks.

ghouscht commented 2 weeks ago

> Note we need to resolve #18822 (comment) in a separate PR. Could you please raise a new issue to track it? Thanks.

https://github.com/etcd-io/etcd/issues/18841

ahrtr commented 2 weeks ago

Overall looks good now. Please sign off the second commit. Refer to https://github.com/etcd-io/etcd/pull/18822/checks?check_run_id=32589384806

k8s-ci-robot commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, ghouscht, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/etcd-io/etcd/blob/main/OWNERS)~~ [ahrtr,serathius]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

ahrtr commented 2 weeks ago

@ghouscht can you please backport this PR to 3.5 and 3.4?

ahrtr commented 2 weeks ago

@ghouscht Please also update the changelog for 3.5 and 3.4. Thanks