ahrtr opened 11 months ago
UPDATED:
When the server closes the backend, the backend stops the background commit goroutine and resets the read transaction. For baseReadTx, UnsafeRange neither checks whether tx is nil nor updates txWg, so tx can be set to nil while UnsafeRange is still running.
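A minimal sketch of the race, with simplified placeholder types rather than etcd's actual baseReadTx (the field set and locking are reduced to the essentials):

```go
package backend

import (
	"sync"

	bolt "go.etcd.io/bbolt"
)

// baseReadTx is a simplified stand-in for etcd's type, only to
// illustrate the race described above.
type baseReadTx struct {
	txMu sync.RWMutex
	tx   *bolt.Tx
	txWg *sync.WaitGroup
}

// UnsafeRange never nil-checks tx and never registers itself with
// txWg, so Close has no way to wait for an in-flight read.
func (rt *baseReadTx) UnsafeRange(bucketName []byte) {
	// If reset() runs concurrently, rt.tx may already be nil here and
	// Bucket panics with a nil pointer dereference.
	_ = rt.tx.Bucket(bucketName)
}

// reset mimics what the backend does when the server closes it.
func (rt *baseReadTx) reset() {
	rt.txMu.Lock()
	rt.tx = nil
	rt.txMu.Unlock()
}
```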
Another related case: https://github.com/etcd-io/bbolt/issues/715
@fuweid would you be able to propose a fix? The issue showed up in robustness tests, which I would prefer to keep flake-free.
Hi @serathius, sure. Will file a pull request later.
Hi @ahrtr @serathius
Sorry for taking so long on this issue. It has been fixed at the gRPC layer by https://github.com/grpc/grpc-go/commit/61eab37838ce213237ecb31aa7cdf95241851431 (released in v1.61.0 to fix a regression): all requests are tracked by the handlerWG wait group. When we call GracefulStop, it blocks until all in-flight requests have finished, including streaming RPCs like Watch/Snapshot/LeaseRenew.
We call GracefulStop when we receive a SIGTERM signal, except in cmux mode. We don't need to set up a timeout for draining, because GracefulStop always blocks until all RPCs have finished.
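A hedged sketch of that shutdown path (simplified standalone wiring, not etcd's actual code; the listen address is only an example):

```go
package main

import (
	"net"
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:2379")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// ... register gRPC services here ...
	go srv.Serve(lis)

	// Wait for SIGTERM, then drain.
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGTERM)
	<-sigc

	// GracefulStop blocks until every in-flight RPC, including
	// streams, has returned, so no extra drain timeout is needed.
	srv.GracefulStop()
}
```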
Since GracefulStop isn't applied in cmux mode, I filed https://github.com/etcd-io/etcd/pull/17790 to support graceful shutdown in cmux mode as well. Even if we run etcd without #17790, only ongoing Snapshot/Watch RPCs might panic, because we stop the applier channel and the scheduler before stopping the backend, so none of the unary RPCs will touch the closed backend.
https://github.com/etcd-io/etcd/pull/17757 is also an enhancement for the failpoint test. PTAL. Thanks!
Side note: I was using an old version (v1.60.1), so my previous approach was to introduce a txRef object to maintain a reference count of all open transactions. The caller must call txPut to release the reference explicitly. If the backend has been closed, ReadTx/ConcurrentReadTx/BatchTx should return a closed error. However, in our codebase both the mvcc and auth layers assume the backend is always valid: ReadTx/ConcurrentReadTx/BatchTx are effectively direct pointers, and UnsafeRange isn't designed to return an error. I tried updating all the interfaces to force them to return errors; it passes all the e2e tests and UTs, but the change is too invasive. It would be better if the server layer could track active RPCs, so I revisited the gRPC code and found that WaitForHandlers can help us (see the fragment after the code block below). For reference, the txRef approach looked roughly like this:
```go
// txRef maintains a reference count of open transactions so that
// Close can wait for all of them before releasing the backend.
type txRef struct {
	sync.RWMutex
	wg sync.WaitGroup
}

// TxRefReleaseFunc releases a reference acquired from the Backend.
type TxRefReleaseFunc func()

type Backend interface {
	// Each accessor returns a closed error if the backend has
	// already been closed.
	ReadTx() (ReadTx, TxRefReleaseFunc, error)
	ConcurrentReadTx() (ReadTx, TxRefReleaseFunc, error)
	BatchTx() (BatchTx, TxRefReleaseFunc, error)
}

// Caller side:
tx, txPut, err := backend.ReadTx() // or ConcurrentReadTx() / BatchTx()
if err != nil {
	return err
}
defer txPut()
// ...
```
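For reference, a minimal fragment showing the WaitForHandlers option mentioned above (an experimental google.golang.org/grpc ServerOption; this is its general usage, not etcd's exact wiring):

```go
// WaitForHandlers makes Stop wait until all outstanding method
// handlers have exited, so the server layer itself tracks active RPCs.
srv := grpc.NewServer(grpc.WaitForHandlers(true))
defer srv.Stop() // now also blocks until in-flight handlers return
```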
Bug report criteria
What happened?
Test case TestMaintenanceSnapshotCancel failed with a panic. Refer to https://github.com/etcd-io/etcd/actions/runs/7463174417/job/20307221683?pr=17220
Based on the log, the reason should be that the backend had already been closed (the member was being stopped) before the snapshot operation: https://github.com/etcd-io/etcd/blob/a2eb17c8091893796e835cd564c78a7b4c917c21/server/storage/backend/backend.go#L331
What did you expect to happen?
No panics while processing any client requests.
How can we reproduce it (as minimally and precisely as possible)?
Write an integration test that stops a member before calling the snapshot API, as sketched below.
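A hedged sketch of such a test (helper names are loosely modeled on etcd's integration test framework and may not match it exactly):

```go
package integration_test

import (
	"context"
	"testing"
	"time"

	"go.etcd.io/etcd/tests/v3/framework/integration"
)

func TestSnapshotAfterMemberStop(t *testing.T) {
	integration.BeforeTest(t)
	clus := integration.NewCluster(t, &integration.ClusterConfig{Size: 1})
	defer clus.Terminate(t)

	cli := clus.Client(0)
	// Stop the member first so its backend is closed.
	clus.Members[0].Stop(t)

	// The snapshot call should fail cleanly instead of panicking
	// the server.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Snapshot(ctx); err == nil {
		t.Fatal("expected Snapshot against a stopped member to fail")
	}
}
```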
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response