contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/
Other
5.66k stars 226 forks source link

Question: What is the unit of "time" for Ent Metrics #478

Closed rocktavious closed 1 week ago

rocktavious commented 1 month ago

In reading through the https://github.com/contribsys/faktory/wiki/Ent-Metrics#latency and also the job execution metrics its unclear what the unit of "time" is for the metrics. Is it milliseconds? microseconds? nanoseconds?

I tried finding this in the codebase but I came up empty handed its hard to figure out where the metrics are being produced and searching for perform or latency didn't yield anything.

mperham commented 1 month ago

It is seconds.

rocktavious commented 1 month ago

@mperham are you sure? We have faktory emitting this data to datadog and are seeing values of 5000 - 6000 which if it were milliseconds would be about 5 seconds. Additionally I just added some logging to our golang worker and it seems to agree with the 5 seconds calculation. Maybe i don't understand the metric correctly but my understanding of perform is that its "job duration" because thats the time between FETCH and ACK

Sorry just looking for 100% clarification because we intend to setup an SLO with this since we are performing jobs for customers of ours and have SLAs to uphold.

runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Starting job '1463' runner=faktory
runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Finished job '1462' took '5.496016395s' and had outcome 'success' outcome=
runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Starting job '1464' runner=faktory
runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Finished job '1463' took '4.81213557s' and had outcome 'success' outcome=
runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Starting job '1465' runner=faktory
runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Finished job '1464' took '5.543517213s' and had outcome 'success' outcome=
runner-faktory-785b46877f-pj26n runner-faktory 4:09PM INF Finished job '1465' took '5.508035371s' and had outcome 'success' outcome=
runner-faktory-785b46877f-pj26n runner-faktory 4:10PM INF Starting job '1466' runner=faktory
runner-faktory-785b46877f-pj26n runner-faktory 4:10PM INF Starting job '1467' runner=faktory
runner-faktory-785b46877f-pj26n runner-faktory 4:10PM INF Finished job '1466' took '6.81928801s' and had outcome 'success' outcome=
runner-faktory-785b46877f-pj26n runner-faktory 4:10PM INF Finished job '1467' took '15.665768976s' and had outcome 'success' outcome=

Screen Shot 2024-06-06 at 11 46 45 AM

mperham commented 1 month ago
latency = float64(time.Since(tm)) / float64(time.Second)
c.Gauge(fmt.Sprintf("latency.%s", name), latency, nil, 1)

You asked about latency but are talking about perform. Latency is in seconds because it uses c.Gauge directly. perform uses c.Timing, which is in milliseconds.

https://github.com/DataDog/datadog-go/blob/1139efe797e4760daf3d30a60e3d30afc1687df8/statsd/statsd.go#L790

mperham commented 1 month ago

And yeah, that's.... less than ideal.

rocktavious commented 1 month ago

Ahh sorry ya was talking about both metrics and wires got crossed.

rocktavious commented 1 month ago

FWIW - i've updated the wiki to call this out. Up to you if you want to try and fix them to both be seconds.