danihodovic / celery-exporter

A Prometheus exporter for Celery metrics
MIT License
377 stars, 83 forks

Possibility of clearing metrics every X seconds (memory problem) #280

Open gciria opened 8 months ago

gciria commented 8 months ago

I am using version v0.9.2 with the variables CE_WORKER_TIMEOUT and CE_PURGE_OFFLINE_WORKER_METRICS modified, with the time changed to 20 seconds.

In my setup, every X minutes several nodes are started in batches in Kubernetes, with dozens of Celery pods consuming X queues. Prometheus scrapes the metrics from celery-exporter (9808/metrics) and stores them. The purge variables do not seem to work well in my setup: in the logs I only see 1 or 2 pods purged after many hours.

I would like to know whether it is possible to add a new parameter to purge all /metrics every X seconds, or whether you have any tips for another solution.


Thanks, and congrats on the great project.

danihodovic commented 8 months ago

@adinhodovic

adinhodovic commented 8 months ago

If your workers go offline (rotate), their metrics should be cleaned up quickly. This works fine for us with up to ~100 pods. On new releases, all metrics get cleaned up quite quickly. We purge every 5 minutes, and a worker times out after 2.5 minutes. Are you not seeing the purge message often enough?

Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality as well?

We don't have an option to clean all metrics at the moment.
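
The timing described above can be sketched roughly as follows. This is only an illustration, not the exporter's actual code: `worker_state` and both constants are hypothetical names, and the sketch assumes that "every 5 minutes" acts as the age after which an offline worker's metrics are dropped.

```python
# Values taken from the numbers quoted in this comment; names are hypothetical.
WORKER_TIMEOUT_SECONDS = 150.0  # a worker times out after 2.5 minutes
PURGE_AFTER_SECONDS = 300.0     # assumed purge threshold of 5 minutes


def worker_state(last_seen: float, now: float) -> str:
    """Classify a worker by how long it has been silent (hypothetical helper)."""
    silent_for = now - last_seen
    if silent_for > PURGE_AFTER_SECONDS:
        return "purge"    # drop every series labelled with this worker
    if silent_for > WORKER_TIMEOUT_SECONDS:
        return "offline"  # report the worker as down but keep its series
    return "online"
```

Under these assumptions, a worker that stops reporting is first marked offline and only later has its series removed, which is why the purge log line can lag well behind the moment a pod disappears.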

DvdChe commented 5 months ago

Hey,

I have the same problem on my side.

I tried setting CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true, and some metrics now have their hostname set to generic, but others are still labelled with the pod name. I also tried customizing CE_PURGE_OFFLINE_WORKER_METRICS and CE_WORKER_TIMEOUT, but there is no purge.

I tried to find out how the garbage collection works, and I think I have partially found the cause:

On my side, the problem is that self.worker_last_seen remains empty and never gets updated, so metrics are never purged.
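
To illustrate why an empty map means nothing is ever purged, here is a minimal sketch. It is not the exporter's real code: `WorkerTracker`, `on_heartbeat`, and `purge` are hypothetical names, and it assumes the map is filled only when a worker event is received, which I have not confirmed.

```python
import time
from typing import Dict, List, Optional


class WorkerTracker:
    """Hypothetical stand-in for the exporter's bookkeeping, not its real code."""

    def __init__(self, purge_after: float) -> None:
        self.purge_after = purge_after
        self.worker_last_seen: Dict[str, float] = {}

    def on_heartbeat(self, hostname: str, now: Optional[float] = None) -> None:
        # Assumption: this is the only path that records a worker.
        self.worker_last_seen[hostname] = time.time() if now is None else now

    def purge(self, now: Optional[float] = None) -> List[str]:
        now = time.time() if now is None else now
        # Only workers already present in the map can ever be considered stale.
        stale = [host for host, ts in self.worker_last_seen.items()
                 if now - ts > self.purge_after]
        for host in stale:
            del self.worker_last_seen[host]  # real code would also drop the series
        return stale


tracker = WorkerTracker(purge_after=20.0)
print(tracker.purge(now=1000.0))  # [] because no worker was ever recorded

tracker.on_heartbeat("celery@pod-a", now=1000.0)
print(tracker.purge(now=1030.0))  # ['celery@pod-a'] after 30 s of silence
```

If the real exporter behaves like this sketch, changing the timeout values cannot help while the map stays empty, because the purge loop has nothing to iterate over.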

> If your workers go offline (rotate), their metrics should be cleaned up quickly. This works fine for us with up to ~100 pods. On new releases, all metrics get cleaned up quite quickly. We purge every 5 minutes, and a worker times out after 2.5 minutes. Are you not seeing the purge message often enough?
>
> Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality as well?
>
> We don't have an option to clean all metrics at the moment.

What do you mean by "go offline"? Is it a graceful disconnection made by the workers, or something like that? (Sorry for the question, but I know absolutely nothing about Celery.)