dragonflydb / dragonfly-operator

A Kubernetes operator to install and manage Dragonfly instances.
https://www.dragonflydb.io/docs/managing-dragonfly/operator/installation
Apache License 2.0

S3 backup crashes dragonfly after ~12 hrs or so #263

Open jonathon2nd opened 1 week ago

jonathon2nd commented 1 week ago

Sounds strange, but we had an issue in preprod that we spent a long time debugging.

We had 6 deployments like this one, and the crash happened to all of them. During this time we had one that was not used by the app and never had S3 enabled, and it never had any issues.

At the time of the suspected crash, no more logs would print, and the logs that had been printed were not very useful and did not report any major error.

Also, the S3 backup was working; I could see the backups in S3.

I was curious to know if there was an obvious issue with my YAML, or if someone could reproduce it.

---
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: api-cache
  namespace: api-cache
spec:
  args:
    - '--default_lua_flags=allow-undeclared-keys'
    - '--s3_endpoint=s3.ca-central-1.wasabisys.com/'
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: awsAccessKeyId
          name: wasabi-credentials
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: awsSecretAccessKey
          name: wasabi-credentials
    - name: AWS_REGION
      value: ca-central-1
    - name: AWS_BUCKET_NAME
      value: preprod-dragonfly-operator
  authentication:
    passwordFromSecret:
      key: password
      name: db-password
      optional: false
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.25.1
  replicas: 3
  resources:
    requests:
      cpu: 250m
      memory: 250Mi
    limits: {}
  snapshot:
    cron: "0 1 * * *"
    dir: s3://preprod-dragonfly-operator/api-cache/dragonfly/snapshots
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: linstor-replica-one-local

After cutting it down to the following, the Dragonfly crashes stopped (the settings removed are summarized after the spec).

---
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: api-cache
  namespace: api-cache
spec:
  args:
    - '--default_lua_flags=allow-undeclared-keys'
  authentication:
    passwordFromSecret:
      key: password
      name: db-password
      optional: false
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.25.1
  replicas: 3
  resources:
    requests:
      cpu: 250m
      memory: 250Mi
    limits: {}
  snapshot:
    cron: "0 1 * * *"
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: linstor-replica-one-local
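
For clarity, the only things removed relative to the crashing spec are the S3-related settings, i.e. (value/valueFrom fields elided here; full definitions are in the first spec):

args:
  - '--s3_endpoint=s3.ca-central-1.wasabisys.com/'
env:
  - name: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
  - name: AWS_REGION
  - name: AWS_BUCKET_NAME
snapshot:
  dir: s3://preprod-dragonfly-operator/api-cache/dragonfly/snapshots
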
Abhra303 commented 1 day ago

Hi @jonathon2nd, I don't see any issues with the YAMLs. Considering that your backups are uploaded, I guess the AWS credentials are OK too.

You didn't share Dragonfly CPU/memory metrics, so I am only guessing that the CPU or RSS memory was probably higher than it should have been, and it crashed. Maybe try increasing CPU/memory and keeping an eye on Dragonfly RSS/CPU usage metrics when this happens next time?
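
For example, something like this (just a sketch; the limit value is a placeholder and should be sized from your actual usage) would make a memory-related crash show up explicitly as OOMKilled in the pod status:

resources:
  requests:
    cpu: 250m
    memory: 250Mi
  limits:
    memory: 1Gi   # placeholder; if Dragonfly's RSS exceeds this, the kubelet kills the container and records OOMKilled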

jonathon2nd commented 14 hours ago

I have reapplied the S3 config. I had removed it for my tests, and we had no issues while it was removed. I run the pods with limits: {}, so it's probably not that :(

I also applied the patch.

If I get another failure, I will capture all logs and metrics and post them here.
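
If the container itself crashed, the pod status should also record a termination reason, e.g. (assuming the Dragonfly container is named dragonfly; this is the shape of kubectl get pod -o yaml output, not actual captured data):

status:
  containerStatuses:
    - name: dragonfly
      lastState:
        terminated:
          exitCode: 137        # 137 = killed by SIGKILL, typical of an OOM kill
          reason: OOMKilled    # or Error, with a different exit code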

jonathon2nd commented 5 hours ago

It is dead again; it lasted about 8 hours.

No idea why the logs stop not long after startup. Attached: api-cache-0_dragonfly.log, api-cache-1_dragonfly.log, api-cache-2_dragonfly.log, pp-dragonfly-op-77f48f6b5-nkft9_manager.log

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: api-cache
  namespace: api-cache
spec:
  args:
    - '--default_lua_flags=allow-undeclared-keys'
    - '--s3_endpoint=s3.ca-central-1.wasabisys.com/'
  authentication:
    passwordFromSecret:
      key: password
      name: db-password
      optional: false
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: awsAccessKeyId
          name: wasabi-credentials
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: awsSecretAccessKey
          name: wasabi-credentials
    - name: AWS_REGION
      value: ca-central-1
    - name: AWS_BUCKET_NAME
      value: preprod-dragonfly-operator
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.25.2
  replicas: 3
  resources:
    limits: {}
    requests:
      cpu: 250m
      memory: 250Mi
  snapshot:
    cron: 0 1 * * *
    dir: s3://preprod-dragonfly-operator/api-cache/dragonfly/snapshots
    persistentVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: linstor-replica-one-local
status:
  phase: ready

Strangely, I am not seeing any snapshots this time. (screenshot)

You can see when it breaks in the graph. (screenshot)

Nothing strange with the metrics either. (screenshot)