Quentin-M / etcd-cloud-operator

Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.
Apache License 2.0

How to troubleshoot memory issues in etcd #82

Open iamnst19 opened 2 months ago

iamnst19 commented 2 months ago

Hi, I would like to know how to troubleshoot memory issues in etcd and how to mitigate them.

Quentin-M commented 2 months ago

Hey!

Like you said, you'd be looking at etcd itself: the operator's own memory usage is going to be minimal, so it's best to refer to the etcd repository, docs and code. Note that etcd is started as an embedded server inside the etcd-cloud-operator process, so it may at first look as if the operator itself is taking up the memory.

iamnst19 commented 1 month ago

I think the memory spike is due to the S3 backup. How do I disable the S3 backup? Also, how and where do I add profiling (https://github.com/google/pprof) to check the memory profile?

Quentin-M commented 1 month ago

The snapshot provider streams the data from etcd to the snapshot destination, so I'd expect memory to be fine as long as everything is implemented correctly, unless etcd itself somehow has a memory spike as part of the save. Do you have a memory chart?
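
To illustrate why streaming should keep memory flat, here is a minimal sketch, not the operator's actual code. `zeroReader`, `fakeSnapshot`, `countingWriter` and `streamSnapshot` are hypothetical names standing in for the etcd snapshot stream and the snapshot destination; the point is only that copying through a fixed-size buffer does not scale with the snapshot size.

```go
package main

import (
	"fmt"
	"io"
)

// zeroReader produces an endless stream of zero bytes without holding any
// data in memory (hypothetical stand-in for an etcd snapshot stream).
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// fakeSnapshot returns a reader that yields exactly size bytes.
func fakeSnapshot(size int64) io.Reader {
	return io.LimitReader(zeroReader{}, size)
}

// countingWriter stands in for a snapshot destination such as S3 or a file.
// It discards the data and only counts the bytes it receives.
type countingWriter struct{ n int64 }

func (w *countingWriter) Write(p []byte) (int, error) {
	w.n += int64(len(p))
	return len(p), nil
}

// streamSnapshot copies src to dst through a 32 KiB buffer, so the memory
// needed for the copy stays constant regardless of the snapshot size.
func streamSnapshot(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, 32*1024)
	return io.CopyBuffer(dst, src, buf)
}

func main() {
	dst := &countingWriter{}
	n, err := streamSnapshot(dst, fakeSnapshot(64<<20)) // 64 MiB "snapshot"
	if err != nil {
		panic(err)
	}
	fmt.Printf("streamed %d bytes through a 32 KiB buffer\n", n)
}
```

If the memory chart shows spikes proportional to the database size during a save, that would point at something buffering the whole snapshot rather than at a copy loop like this one.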

Disabling S3 snapshots is not recommended as this will cripple your ability to do disaster recovery, unless you enable the file backup provider with a separate and reliable storage to use. By default, the operator requires a snapshot provider.

To enable pprof, you'd want to inject it in the main here behind a command-line flag:

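import (
  "net/http"
  _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux

  "go.uber.org/zap"
)

// flagPprof is assumed to be a *string from flag.String, e.g. "localhost:6060".
if flagPprof != nil && len(*flagPprof) > 0 {
  go func() {
    zap.S().Infof("enabling pprof on %s", *flagPprof)
    // A nil handler serves http.DefaultServeMux, which includes the pprof handlers.
    if err := http.ListenAndServe(*flagPprof, nil); err != nil {
      zap.S().Errorf("pprof server stopped: %v", err)
    }
  }()
}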
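
Once a pprof listener is up, the heap profile can be pulled over HTTP from `/debug/pprof/heap`, which is the path `net/http/pprof` registers. The following is a rough sketch of doing that from Go; `saveHeapProfile` and `serveLocalPprof` are hypothetical helpers, and the demo serves pprof from its own process only so that it runs standalone. Against the operator, you would pass whatever address you gave the (assumed) pprof flag.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"os"
)

// saveHeapProfile downloads the heap profile served at addr (host:port)
// and writes it to path. It returns the number of bytes written.
func saveHeapProfile(addr, path string) (int64, error) {
	resp, err := http.Get("http://" + addr + "/debug/pprof/heap")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(f, resp.Body)
}

// serveLocalPprof exposes this process's pprof handlers on a random local
// port, for demonstration only. It returns the address and a stop function.
func serveLocalPprof() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	go srv.Serve(ln)
	return ln.Addr().String(), func() { srv.Close() }, nil
}

func main() {
	addr, stop, err := serveLocalPprof()
	if err != nil {
		panic(err)
	}
	defer stop()

	n, err := saveHeapProfile(addr, "heap.pb.gz")
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes to heap.pb.gz\n", n)
}
```

The saved file can then be inspected with `go tool pprof heap.pb.gz`. Taking one profile before a backup and one during it makes it easier to see which allocations grow.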

iamnst19 commented 1 month ago

[Image: Screenshot 2024-07-11 at 11 18 51 AM]

The baseline has shifted and memory keeps climbing, and I can see that these spikes happen during the backup to S3. Can I make an adjustment to this?

snapshot:
    provider: s3 # This should be configured to S3 in any real environments.
    interval: 30m
    ttl: 24h

To make the backup less aggressive, should I increase the interval or reduce the TTL? If so, what values would you recommend here?
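
As a rough aid for weighing these two settings, the sketch below computes how many snapshots would be retained at steady state for a given `interval` and `ttl`. It assumes that one snapshot is taken per interval and that each is removed once it is older than the TTL; that reading of `ttl` is an assumption, not something confirmed in this thread. `retainedSnapshots` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// retainedSnapshots estimates how many snapshots exist at steady state when
// one is taken every interval and each is removed once it is older than ttl.
// The ttl-based removal is an assumption about the operator's behavior.
func retainedSnapshots(interval, ttl string) (int, error) {
	i, err := time.ParseDuration(interval)
	if err != nil {
		return 0, err
	}
	t, err := time.ParseDuration(ttl)
	if err != nil {
		return 0, err
	}
	if i <= 0 {
		return 0, fmt.Errorf("interval must be positive, got %s", interval)
	}
	return int(t / i), nil
}

func main() {
	// Compare the configured 30m interval against two longer candidates.
	for _, interval := range []string{"30m", "2h", "6h"} {
		n, err := retainedSnapshots(interval, "24h")
		if err != nil {
			panic(err)
		}
		fmt.Printf("interval=%s ttl=24h -> about %d snapshots retained\n", interval, n)
	}
}
```

Under that assumption, only the interval changes how often the save runs, and therefore how often any save-related spike occurs; the TTL only affects how many snapshots are kept. A longer interval also widens the window of data that could be lost if a restore is ever needed.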

iamnst19 commented 1 month ago

Ideally, this backup should run during off-peak hours. How can I schedule it to run once a week during off-peak hours?

iamnst19 commented 1 month ago

Can you please help here?