etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Change prevKv to prevKvForUpdate and prevKvForDelete in Op struct #15353

Open d3c3mber opened 1 year ago

d3c3mber commented 1 year ago

What would you like to be added?

Would it be feasible to add one more field to the Op struct to give clients more flexibility? Obtaining prevKv for updates may take a long time and may not be needed by clients.

// Op represents an Operation that kv can execute.
type Op struct {
.....
    // for watch, put, delete
    prevKV bool
.....
}
type Op struct {
.....
    // for put
    prevKVforUpdate bool

    // for delete
    prevKVforDelete bool
......
}

Why is this needed?

For etcd, whether it is native etcd or a customized version of etcd, if the K8s apiserver wants to obtain PrevKv, etcd needs to do a Range operation for every event. A Range in native etcd operates on the local boltdb and takes well under 1s, on the order of milliseconds, but the aggregation of all these Range operations can take seconds.

Given the code below from etcd3/watcher.go, where the apiserver is fetching all objects of a kind (e.g., Pod, Node, ConfigMap), oldObj is not needed. So in this case, etcd's work to obtain PrevKv could be skipped.

lavacat commented 1 year ago

Sorry, the use case isn't clear to me. Can you link apiserver code that will use this new flag?

d3c3mber commented 1 year ago

Hi @lavacat , please refer to https://github.com/kubernetes/kubernetes/blob/release-1.26/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L236

The apiserver creates the watch using the following options:

opts := []clientv3.OpOption{clientv3.WithRev(wc.initialRev + 1), clientv3.WithPrevKV()}

And in https://github.com/kubernetes/kubernetes/blob/release-1.26/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L354

in the default case (which handles Update events), prevKv is not needed, because most of the time the apiserver is watching all resources of a kind (e.g. all pods, all nodes, all configmaps), so wc.acceptAll() returns true for these resources:

    default:
        if wc.acceptAll() {
            res = &watch.Event{
                Type:   watch.Modified,
                Object: curObj,
            }
            return res
        }
......
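For illustration only, here is a minimal sketch of how that watch setup could change if the client exposed a per-event-type option. withPrevKVForDelete is a hypothetical placeholder (defined locally as a no-op so the sketch compiles against the current clientv3 API); it stands in for the option this issue proposes and does not exist today.

package watchsketch

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// withPrevKVForDelete is a hypothetical stand-in for the proposed option;
// it is a no-op placeholder so the sketch compiles against the current client.
func withPrevKVForDelete() clientv3.OpOption {
	return func(op *clientv3.Op) { /* would set the proposed prevKVforDelete field */ }
}

// watchResource sketches the apiserver-style watch setup: instead of the
// blanket clientv3.WithPrevKV(), only previous values for delete events
// would be requested.
func watchResource(ctx context.Context, c *clientv3.Client, key string, initialRev int64) clientv3.WatchChan {
	opts := []clientv3.OpOption{
		clientv3.WithRev(initialRev + 1),
		withPrevKVForDelete(), // today: clientv3.WithPrevKV()
	}
	return c.Watch(ctx, key, opts...)
}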
larryck commented 1 year ago

What would you like to be added?

Would it be feasible to add one more field to the Op struct to give clients more flexibility? Obtaining prevKv for updates may take a long time and may not be needed by clients.

// Op represents an Operation that kv can execute.
type Op struct {
.....
  // for watch, put, delete
  prevKV bool
.....
}
type Op struct {
.....
  // for put
  prevKVforUpdate bool

  // for delete
  prevKVforUpdate bool
......
}

Why is this needed?

For etcd, whether it is native etcd or a customized version of etcd, if the K8s apiserver wants to obtain PrevKv, etcd needs to do a Range operation for every event. A Range in native etcd operates on the local boltdb and takes well under 1s, on the order of milliseconds, but the aggregation of all these Range operations can take seconds.

Given the code below from etcd3/watcher.go, where the apiserver is fetching all objects of a kind (e.g., Pod, Node, ConfigMap), oldObj is not needed. So in this case, etcd's work to obtain PrevKv could be skipped.

Should the latter attribute be prevKVforDelete instead of prevKVforUpdate? A typo?

larryck commented 1 year ago

Sounds like a good idea. It would be such a waste for etcd to do a Range to find the prevKv for an Update operation when the apiserver does not even use it. Normally the number of Update requests is much larger than the number of Delete requests.

tjungblu commented 1 year ago

Can you explain what the semantics and use cases for those new fields are?

I honestly don't think I fully understood the proposal either. Is this about moving this code block into etcd: https://github.com/kubernetes/kubernetes/blob/release-1.26/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L355-L380

because it's faster to do that locally with bbolt?

d3c3mber commented 1 year ago

What would you like to be added?

Would it be feasible to add one more field to the Op struct to give clients more flexibility? Obtaining prevKv for updates may take a long time and may not be needed by clients.

// Op represents an Operation that kv can execute.
type Op struct {
.....
    // for watch, put, delete
    prevKV bool
.....
}
type Op struct {
.....
    // for put
    prevKVforUpdate bool

    // for delete
    prevKVforUpdate bool
......
}

Why is this needed?

For etcd, whether it is native etcd or a customized version of etcd, if the K8s apiserver wants to obtain PrevKv, etcd needs to do a Range operation for every event. A Range in native etcd operates on the local boltdb and takes well under 1s, on the order of milliseconds, but the aggregation of all these Range operations can take seconds. Given the code below from etcd3/watcher.go, where the apiserver is fetching all objects of a kind (e.g., Pod, Node, ConfigMap), oldObj is not needed. So in this case, etcd's work to obtain PrevKv could be skipped.

Should the latter attribute be prevKVforDelete instead of prevKVforUpdate? A typo?

Thanks @larryck, yes, it is a typo; it should be prevKVforDelete.

d3c3mber commented 1 year ago

Can you explain what the semantics and use cases for those new fields are?

I honestly don't think I fully understood the proposal either. Is this about moving this code block into etcd: https://github.com/kubernetes/kubernetes/blob/release-1.26/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L355-L380

because it's faster to do that locally with bbolt?

Hi @tjungblu, the proposal does not mean moving apiserver code into etcd. It is about adding one more field to the Op struct so that clients (e.g. the K8s apiserver) can specify whether they need prevKv for update events or for delete events.
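As a rough sketch only (the package, option names, and field placement below are assumptions based on this issue, not an agreed or existing clientv3 API), the client-side change could look roughly like this:

// Abridged stand-ins for clientv3's Op and OpOption, showing only the
// proposed split of the single prevKV flag into per-event-type flags.
package opsketch

// Op mirrors the shape of clientv3.Op with the two proposed fields.
type Op struct {
	// ...existing fields elided...
	prevKVforUpdate bool // attach the previous KV to put (update) watch events
	prevKVforDelete bool // attach the previous KV to delete watch events
}

// OpOption mirrors clientv3.OpOption.
type OpOption func(*Op)

// WithPrevKVForUpdate requests the previous key-value pair for update events only.
func WithPrevKVForUpdate() OpOption {
	return func(op *Op) { op.prevKVforUpdate = true }
}

// WithPrevKVForDelete requests the previous key-value pair for delete events only.
func WithPrevKVForDelete() OpOption {
	return func(op *Op) { op.prevKVforDelete = true }
}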

d3c3mber commented 1 year ago

@tjungblu the reason for adding the additional prevKVforUpdate field is that for some clients (especially the K8s apiserver), the prevKv for Update events is not used. We'd like to set prevKVforUpdate to false so that etcd does not have to do the prevKv lookup, which is quite time consuming.

ahrtr commented 1 year ago

I understand the feature request is to add separate prevKV flags for PUT and DELETE watch events. Note that it changes the protocol/API between client and server, so we should be very cautious about this.

@d3c3mber have you done any benchmark tests to quantify the performance benefit?

@liggitt @wojtek-t any interest in this feature request?

wojtek-t commented 1 year ago

Hmm - we could consider using that if it exists, but that's definitely not a priority from the k8s perspective.

vitoxming commented 1 year ago

I understand the feature request is to add separate prevKV flags for PUT and DELETE watch events. Note that it changes the protocol/API between client and server, so we should be very cautious about this.

@d3c3mber have you done any benchmark tests to quantify the performance benefit?

@liggitt @wojtek-t any interest in this feature request?

From the etcd code below, we can see a Range() call to get the prevKV for EACH watch event. If we don't need the prevKV for 'PUT' events, we could save time by leaving the separate prevKVforUpdate flag disabled, especially when a 'PUT' changes a batch of KVs. https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3rpc/watch.go#L415-L424

Yes, from my understanding the etcd watch protocol/API would have to change. To achieve this, prevKVforUpdate and prevKVforDelete flags would be added to WatchCreateRequest. https://github.com/etcd-io/etcd/blob/main/api/etcdserverpb/rpc.proto#L779
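Assuming the two flags were plumbed through WatchCreateRequest down to the server's send loop, the decision to run that extra Range could branch on the event type. The helper below is an illustrative sketch with simplified wiring, not actual etcd code:

package v3rpcsketch

import "go.etcd.io/etcd/api/v3/mvccpb"

// wantPrevKV decides whether the send loop should run the extra Range that
// fills in event.PrevKv, given the two proposed per-event-type flags.
func wantPrevKV(ev mvccpb.Event, prevKVforUpdate, prevKVforDelete bool) bool {
	switch ev.Type {
	case mvccpb.DELETE:
		return prevKVforDelete
	case mvccpb.PUT:
		// A PUT whose CreateRevision equals its ModRevision is a create;
		// there is no previous value to fetch, so skip the Range either way.
		if ev.Kv.CreateRevision == ev.Kv.ModRevision {
			return false
		}
		return prevKVforUpdate
	default:
		return false
	}
}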

larryck commented 1 year ago

@d3c3mber could you run some benchmark tests? The performance benefit would be more obvious if we could take a look at actual numbers.

serathius commented 1 year ago

Based on @wojtek-t's response, I don't think we want to invest in this feature, as there are many consistency and correctness issues in etcd that need addressing first.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

fcantournet commented 10 months ago

I'm very interested in this proposal. We use etcd as a KV store for service discovery @Datadog, and some objects around sharding configuration are fairly high throughput (large objects that change often). prevKV would essentially double our outgoing bandwidth. However, on deletion it would significantly simplify our code to have access to the previous value, for very little increase in bandwidth cost.