Open jeluard opened 4 months ago
node:perf_hooks that only works for the main thread
I would expect perf hooks to work inside worker threads, where did you read this?
Specifically, woker_threads support is yet missing- https://github.com/siimon/prom-client/issues/401 although we incorrectly relies on [prom-client] for workers monitoring
This isn't true, not in the way its suggested here. prom-client
doesn't automagically instrument workers. We just do it manually, so we are not incorrectly relying prom-client
.
The PR linked in that issue does this too. It has the main thread communicate with the worker to get data from the worker and include it in the metric registry of the main thread.
This is exactly what we do in lodestar. See https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/node/nodejs.ts#L277
worker_threads defines eventLoopUtilization in addition to the perf_hooks one, with this comment:
The same call as `perf_hooks eventLoopUtilization`, except the values of the worker instance are returned.
Its implementation itself required low-level changes. There is no equivalent to worker_threads
for monitorEventLoopDelay so I would lean towards thinking it reports main thread info when called from within a worker
. I couldn't find conclusive info on this though.
My main feeling is that eventLoopUtilization provides extra interesting information, and explicitly supports worker_threads
.
I will update this PR so that to clarify this and correct imprecisions.
I would lean towards thinking it reports main thread info when called from within a worker
This would be really bad if nodejs just does this without having a note in the docs, which they usually do a great job of noting limitations. This is really simple to reproduce as well, if this is true we might also wanna report this upstream to node.
Based on previous data https://github.com/ChainSafe/lodestar/issues/5604, the event loop lag seems to be reported correctly.
eventLoopUtilization ... explicitly supports worker_threads
This is another case where "supports worker_threads" means "automagically handles communication between a worker and main thread".
Would agree with @nflaig that based on the data we've seen and used in the past, we can assume that monitorEventLoopDelay works properly in workers (just that the data must be manually communicated to the main thread).
eventLoopUtilization provides extra interesting information
Yeah, agree there, could be worth adding as an additional metric
This is another case where "supports worker_threads" means "automagically handles communication between a worker and main thread".
By this I mean that there are 2 distinct API, performance.eventLoopUtilization and worker.performance.eventLoopUtilization
Both can be accessed from the main thread (forworker.performance.eventLoopUtilization
, via the worker reference) and worker threads.
Both can be accessed from the main thread
I don't think we want to use this feature. (it gets more complicated when workers spawn other workers, like we currently do)
Rather we can just use the current pattern of having each worker collect its own metrics (and as a byproduct, reuse the existing mechanism for communicating those to the reporting thread).
Recommend appending new metrics collection here, which will automatically be applied to all our workers: https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/metrics/nodeJsMetrics.ts
collectNodeJSMetrics relies on prom-client to collect
eventLoopMonitoring
. This in turns relies on node:perf_hooks/monitorEventLoopDelay that only works for themain
thread orworker
from thecluster
module (as documented here).woker_threads
support is unclear, although we relies on this forworker_threads
monitoring (network and discv5).Proposed action items
Improve event loop monitoring, especially for
workers
. It looks preferable not to rely onprom-client
for this specific monitoring and add support directly inlodestar
.