beam-telemetry / telemetry_metrics_prometheus_core

Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus
Apache License 2.0

Question: Design Considerations #30

Closed spencerdcarlson closed 2 years ago

spencerdcarlson commented 4 years ago

Thanks for making this library. The abstractions seem really great. We are looking at using this library in our enterprise systems and I have a few design questions.

Multiple Handler Processes

Have you considered using a dynamic supervisor and spawning a child for every metric that is subscribed to? I'm thinking along the lines of multiple Registry children instead of just one.

My reasoning is that every call to :telemetry.execute is synchronous and has to wait for the registry to process the message. A high volume of emitted events could therefore cause a call in Plug to block, slowing down web request response times.

Maybe this is a non-issue because event handling is so efficient that it never gets bottlenecked in practice. I'm just curious.

Best-effort reporting

When an event is handled, it looks like it is written to an :ets table that has {:write_concurrency, true}, so it should be extremely fast. The only vectors I can see that could potentially cause an issue are the :keep and :tag_values options:

Telemetry.Metrics.counter("http.request.count",
  keep: fn _metadata ->
    # something with high latency, heavy resource consumption, or error-prone
    true
  end,
  tag_values: fn metadata ->
    # something with high latency, heavy resource consumption, or error-prone
    metadata
  end
)

I'm not sure how likely this is, but Murphy's Law applies. Given that a developer could hurt themselves here if they are not aware of the full impact of those functions, have you considered reporting metrics on a best-effort basis?

My thought would be to spawn a task using Task.Supervisor.async_nolink/2 as soon as an event is received. I would prefer that my application keep responding to web requests quickly, regardless of whether my metrics are reported.

If a developer did do something resource-intensive in those functions, I expect it would eventually manifest as an OOM or some other issue, but response latency would not be affected until a complete crash. The host application would be fully protected from a task that fails because of a runtime exception or a timeout.
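To make the idea concrete, here is a minimal sketch of a best-effort handler. It is not code from this library: BestEffortHandler, its supervisor name, the :report key in the config, and the max_children value are all hypothetical. It also swaps in Task.Supervisor.start_child/2 rather than async_nolink/2, on the assumption that the caller never wants the result; async_nolink would deliver reply and monitor messages to the dispatching process's mailbox.

```elixir
# Sketch only: the module name, supervisor name, and config shape are hypothetical.
defmodule BestEffortHandler do
  @supervisor BestEffortHandler.TaskSupervisor

  # Starts the task supervisor that owns the offloaded work.
  # :max_children caps the number of in-flight tasks (the value is illustrative).
  def start_link do
    Task.Supervisor.start_link(name: @supervisor, max_children: 1_000)
  end

  # Shaped like a :telemetry handler: (event, measurements, metadata, config).
  # The potentially slow or crash-prone work runs in a separate, unlinked
  # process, so the caller returns right away and survives if the work raises.
  def handle_event(_event, measurements, metadata, %{report: report}) do
    case Task.Supervisor.start_child(@supervisor, fn -> report.(measurements, metadata) end) do
      {:ok, _pid} -> :ok
      # Best effort: drop the measurement if no task could be started,
      # for example when :max_children has been reached.
      _other -> :dropped
    end
  end
end

# Usage: the report function stands in for the expensive work.
{:ok, _sup} = BestEffortHandler.start_link()
parent = self()
config = %{report: fn measurements, _metadata -> send(parent, {:reported, measurements}) end}
:ok = BestEffortHandler.handle_event([:http, :request], %{count: 1}, %{}, config)
```

The trade-off is that every event now pays for a process spawn and a copy of its measurements and metadata, and a dropped or crashed task silently loses that data point.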

Sorry for the long post and thank you if you made it this far reading. I'm curious what your thoughts are on these questions -- even if for my own learning.

Thanks, Spencer

spencerdcarlson commented 4 years ago

My first question is pointless since I didn't realize that

All the handlers are executed by the process dispatching event.

So the Registry would not be a bottleneck.
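A small script can confirm this behaviour. It is only a sketch: it assumes the :telemetry package can be fetched with Mix.install, and the WhoRanMe module, handler id, and event name are made up for illustration. The handler sends back the pid it runs in, which can then be compared with the pid that called :telemetry.execute/3.

```elixir
# Assumes the :telemetry package is available (fetched here via Mix.install).
Mix.install([{:telemetry, "~> 1.0"}])

defmodule WhoRanMe do
  # Handler: reports the pid it is executing in to the pid passed as config.
  def handle_event(_event, _measurements, _metadata, reply_to) do
    send(reply_to, {:handled_in, self()})
  end
end

# Attach the handler, passing the current pid as the handler config.
:ok = :telemetry.attach("who-ran-me", [:demo, :event], &WhoRanMe.handle_event/4, self())

# Dispatch the event from this process.
:ok = :telemetry.execute([:demo, :event], %{count: 1}, %{})

receive do
  {:handled_in, pid} ->
    # Compares the handler's pid with the dispatching pid.
    IO.inspect(pid == self(), label: "handler ran in the dispatching process")
end
```

Because the handler runs in the caller, anything slow inside :keep or :tag_values is paid for directly by the process that emitted the event.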