jonathon2nd opened this issue 1 week ago
Hi @jonathon2nd, I don't see any issues with the YAMLs. Considering that your backups are uploaded, I assume the AWS credentials are fine too.
You didn't share Dragonfly CPU/memory metrics, so I can only guess that CPU or RSS memory was higher than it should have been and the process crashed. Maybe try increasing CPU/memory and keep an eye on the Dragonfly RSS/CPU usage metrics the next time this happens?
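If it helps, here is a minimal sketch of what raising the resources on the Dragonfly resource could look like; the numbers are illustrative assumptions, not values tuned to this workload. With a real memory limit set, an OOM kill at least shows up clearly as the container's termination reason in `kubectl describe pod`.

```yaml
# Illustrative values only: size limits from observed RSS under load.
spec:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
```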
I have reapplied the s3 config. I had removed it for my tests, and we have not had any issues with s3 removed. I run the pods with `limits: {}`, so it's probably not that :(
Also applied the patch.
If I get another failure, I will capture all logs and metrics and post them here.
It is dead again; it lasted about 8 hours.
I have no idea why the logs stop not long after startup. api-cache-0_dragonfly.log api-cache-1_dragonfly.log api-cache-2_dragonfly.log pp-dragonfly-op-77f48f6b5-nkft9_manager.log
```yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: api-cache
  namespace: api-cache
spec:
  args:
    - '--default_lua_flags=allow-undeclared-keys'
    - '--s3_endpoint=s3.ca-central-1.wasabisys.com/'
  authentication:
    passwordFromSecret:
      key: password
      name: db-password
      optional: false
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: awsAccessKeyId
          name: wasabi-credentials
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: awsSecretAccessKey
          name: wasabi-credentials
    - name: AWS_REGION
      value: ca-central-1
    - name: AWS_BUCKET_NAME
      value: preprod-dragonfly-operator
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.25.2
  replicas: 3
  resources:
    limits: {}
    requests:
      cpu: 250m
      memory: 250Mi
  snapshot:
    cron: 0 1 * * *
    dir: s3://preprod-dragonfly-operator/api-cache/dragonfly/snapshots
    persistentVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: linstor-replica-one-local
status:
  phase: ready
```
Strangely, I am not seeing any snapshots this time.
You can see when it breaks in the graph.
Nothing strange with the metrics either.
This may sound strange, but we had an issue in preprod that we spent a long time debugging.
We had 6 deployments like this, and it would happen to all of them. During this time we had one that was not used by the app and never had s3 enabled, and we never had any issues with it.
At the time of the suspected crash, no more logs would print, and the logs that were printed were not very useful and did not point to any major error.
Also, the s3 backup was working; I could see the backups in s3.
I was curious whether there was an obvious issue with my YAML, or whether someone could reproduce it.
After cutting it down to the following, the Dragonfly crashes stopped.