beam-telemetry / telemetry

Dynamic dispatching library for metrics and instrumentations.
https://hexdocs.pm/telemetry
Apache License 2.0

Event measurements #41

Closed · arkgil closed this issue 5 years ago

arkgil commented 5 years ago

The single event value has been replaced by a map of measurements. It is now up to the consumer of the event to decide which part of the payload is important. This is useful in cases where an event indicates that something happened but there are many properties describing it.

For example, a database query event may include total time, decode time, wait time and other measurements.
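As a sketch of the proposed call shape (the event name, measurement keys, units, and metadata below are invented for illustration, not part of this PR), the producer pushes every value it has in one map:

```elixir
# Sketch only: event name and keys are hypothetical. The telemetry package
# is pulled in via Mix.install so the script is self-contained.
Mix.install([{:telemetry, "~> 1.0"}])

:telemetry.execute(
  [:my_app, :repo, :query],                              # event name
  %{total_time: 1200, decode_time: 300, queue_time: 90}, # all measurements
  %{query: "SELECT * FROM users"}                        # metadata
)
```

With no handlers attached the call simply returns `:ok`; any attached consumer receives the full measurements map and picks out the keys it cares about.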

hauleth commented 5 years ago

I do not like this change. I think that the current API is what we should be using right now (I am aware of the discussion in Telemetry.Metrics). In this form it is more like a (structured) logging framework instead of a metrics dispatching bus.

arkgil commented 5 years ago

@hauleth I'm not sure I understand your reluctance. I don't think that having one value treated in a special way changes anything about this library, except for the confusion it introduces. My biggest concern with it is that the value didn't have any label: you kind of had to guess that the [:phoenix, :controller, :call] event carries latency as the value. And what if the user cares more about the payload size?

I'd say that this library was about "structured logs", or events, from the very beginning. It was clear that it wouldn't be responsible for any kind of computation or aggregation over the event stream, so to me it doesn't really make sense to have this special value anyway. Push any data you have available and let the consumer decide what to do with it. If the API screamed "metrics", there would be no need for the separation between event publishers and consumers that we have today.

hauleth commented 5 years ago

My main problem is that this will make integrations much harder than today, because I will need some way to determine which value is "special" and which is not. By the same token I could attach directly to Logger/logger and extract the measurements from the metadata available there; in that case we do not need Telemetry at all, as it brings nothing new to the table except a separate dispatch system.

TBH, in your example I would probably just emit [:phoenix, :controller, :call, :latency] as a separate event. And if you really want to emit just one event with multiple metrics, then I would keep the 3-arity function and just send a map/keyword list as the 2nd argument with all the metrics, as this would greatly simplify integrations like opencensus_telemetry, where you can see that I need to define a measurement for each metric here. Additionally, this is what makes tooling like Telemetry.Metrics possible - how would you expect it to work with an API like this?

To me this change looks like you have accounted only for the producers, without considering how the consumers will work. It would be an "okayish" change if end users implemented the consumers on their own, but if someone wants to "improve" the user experience by building an integration, then such an API will be a PITA.

arkgil commented 5 years ago

@hauleth the user needs to provide the event name anyway - what is wrong with also providing a key for the value which should be aggregated?

I'll use @josevalim's example:

distribution("phoenix.controller.call.latency")

and done. The call looks exactly the same as before. We could assume that the field name is the last segment, or use a different separator - the point is that it's still quite easy to configure.
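For illustration, a consumer-side helper (hypothetical, not part of any library) that splits such a dotted spec into an event name plus a measurement key could be as small as:

```elixir
# Hypothetical helper: treat the last segment of a dotted metric spec as
# the measurement key and the preceding segments as the telemetry event name.
defmodule MetricSpec do
  def parse(spec) when is_binary(spec) do
    parts = spec |> String.split(".") |> Enum.map(&String.to_atom/1)
    {Enum.drop(parts, -1), List.last(parts)}
  end
end

MetricSpec.parse("phoenix.controller.call.latency")
# => {[:phoenix, :controller, :call], :latency}
```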

This change gives us:

But to me the biggest gain here is the clear semantics of what an event is. I don't need to emit extra events to allow consumers to compute extra aggregations. With the current (execute/3) API, the event producer needs to know up front which fields are more important to the consumer and as such should become separate values in separate events.

arkgil commented 5 years ago

And I'd argue that what this library brings to the table is exactly this design - most solutions focus on the publisher being aware of what happens with the data it provides, i.e. it needs to update a metric, record a measurement, etc.

binaryseed commented 5 years ago

If it's the case that the core of telemetry is intended to be raw event dispatch, could we actually just use the OTP 21 logger?

Telemetry would then consist mostly of the data structure definitions / conventions for emitting.

Dealing with log levels might throw it off, but perhaps that's something that could be dealt with.

arkgil commented 5 years ago

@binaryseed good point! I have thought about it in the past, but the main reason against it is that it's the "OTP 21" logger 😉 And we still need to support at least OTP 19.

hauleth commented 5 years ago

@binaryseed that is what I meant: if we want it to be a raw event dispatcher, then do we need Telemetry at all? We could achieve the same thing using just logger. Even if logger is an OTP 21+ library, it has the advantage of being built in, and in the future, when OTP 21+ becomes more broadly used, what will be the point of bringing an additional dependency like Telemetry to the table?

I thought that Telemetry wanted to be a "modern" version of erlang-metrics, not a "specialised version of Registry".

binaryseed commented 5 years ago

In https://ferd.ca/erlang-otp-21-s-new-logger.html (see section titled "Metrics"), @ferd suggests not using logger for metrics as a general pattern, but I'm not sure the exact reasons why - possibly because it gets tangled up into any other log handler, including lots of possible noise on stdout

arkgil commented 5 years ago

@hauleth that's Telemetry.Metrics, not the Telemetry library - the latter only dispatches events.

We could achieve the same thing using just logger

The difference from logger is that you explicitly specify which event streams you consume, i.e. you provide an event name when attaching.
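For example (handler id and event name are invented for this sketch), attaching always targets a concrete event name, so unrelated events never reach the handler:

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

# Subscribe to exactly one event stream by name; nothing else is delivered.
:telemetry.attach(
  "log-query-measurements",   # unique handler id (hypothetical)
  [:my_app, :repo, :query],   # the one event this handler consumes
  fn _event, measurements, _metadata, _config ->
    IO.inspect(measurements, label: "query")
  end,
  nil
)

# Only this event reaches the handler above; handlers run synchronously.
:telemetry.execute([:my_app, :repo, :query], %{total_time: 1200}, %{})
```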

I thought that Telemetry wants to be "modern" version of the erlang-metrics not "specialised version of Registry".

I'm wondering why, to be honest - the description clearly states that "Telemetry is a dynamic dispatching library for metrics and instrumentations". We should modify it if it gives the false impression that Telemetry is an interface to different metric systems, because it's not.

ferd commented 5 years ago

In https://ferd.ca/erlang-otp-21-s-new-logger.html (see section titled "Metrics"), @ferd suggests not using logger for metrics as a general pattern, but I'm not sure the exact reasons why - possibly because it gets tangled up into any other log handler, including lots of possible noise on stdout

Mostly, my hope was that this telemetry lib would be a good candidate for metrics, as it would be better adapted to the task. Now that telemetry is aiming to just pipe in a map, well, I'm not too sure it's offering much over what logging would do anyway, aside from trying to play better with Elixir tooling, since the Elixir logger is string-only.

binaryseed commented 5 years ago

To be fair, that still is the aim of the broader telemetry project through telemetry_metrics

arkgil commented 5 years ago

@ferd Vince is right, telemetry_metrics aims to define a standard set of metric aggregations (which for now are the same as in opencensus), which are implemented by reporters. Telemetry "core" is only event dispatch. We built it so that there is a single way to hook into Ecto, Phoenix, etc. (and I hope more will follow :)).

telemetry_metrics is only one of the possible sinks. One could hook into Phoenix events and emit access logs for relevant routes. Of course the same could be done with OTP 21 logger, but we don't want to exclude libraries which use older versions of OTP.

bryannaegele commented 5 years ago

Is there anything in this PR and the proposed new APIs that prevents telemetry and telemetry reporters from being switched over to, or supporting, OTP 21's logger? I don't see what makes this an either/or proposition. The use of logger with filters plus handlers, and the current design, both seem to be implementation details that do not detract from the value proposition the telemetry project puts forth overall.

If people feel logger is a good way to go, let's add a behaviour and someone can implement on top of logger. Ultimately, it is up to the reporter to implement event handler attaching, capturing, and handling of the event data. If a reporter decided to implement only using the logger implementation, that is completely abstracted from the user.

As for the API change proposed, this discussion, the one in telemetry_metrics, and @ferd's post (great post, btw) all seem to land in the same place - emit all the data and let individual handlers decide what is important out of it and what they'd like to do with it.

josevalim commented 5 years ago

My main problem is that this will make integrations much harder than today, because I will need to have some way to determine which value is "special" and which is not.

It won't. For example, we were considering breaking the Ecto event apart, which means Ecto would now emit:

I.e. you have to know exactly the path you are subscribing to. With this PR, you can simply treat the last element of the "path" as the map key. The amount of information needed is the same. This change is really useful for event publishers, because now I can submit a single event instead of four. The consumer needs to know the same amount of information.


Mostly, my hope was that this telemetry lib would be a good candidate for metrics as it would be better adapted to the task. Now that telemetry is aiming to just pipe in a map, well I'm not too sure it's offering that much instead of what logging would do anyway

I think this is a very important question to answer. As I said above, I consider this change to be a small generalization that makes publishing events easier; I don't believe it affects consumers, because all consumers need to know an arbitrary path anyway.

My point in saying this is not to claim that we should not replace telemetry with logger, but rather that, if this version can be replaced by logger, then the previous version could equally be replaced by logger too - at least in practical terms.

The counter-argument is that, by having the measurement as an explicit part of telemetry:execute/3, the previous version gently nudges users towards metrics. But that is a conceptual difference rather than a practical one - and conceptual nudges are important. So the alternative is indeed to settle on: execute(Name, MapOfMeasurements, MapOfData).

The other practical difference between logger and telemetry is that in telemetry you subscribe to each event individually. In logger, all events go to all handlers unless you explicitly filter them out, which requires manually installing domain-based filters on every handler (if someone is aware of a better approach, please let me know). It is totally doable, but a gotcha everyone has to be wary of when installing new handlers. Logger also forces you to add a level, which is not relevant for metrics. But again, these are inconveniences and small conceptual mismatches; they don't hurt in practical terms.
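As a sketch of what that manual filtering looks like (the handler and domain names are hypothetical; the filter function itself is the stock OTP 21+ `logger_filters:domain/2`):

```elixir
# Stop anything logged under the (hypothetical) [:metrics] domain from
# reaching the :default handler. Note this has to be repeated for every
# handler you install - which is exactly the gotcha described above.
:ok =
  :logger.add_handler_filter(
    :default,
    :drop_metrics,
    {&:logger_filters.domain/2, {:stop, :sub, [:metrics]}}
  )

# A log event tagged with that domain is now dropped by this handler:
require Logger
Logger.info("query total_time=1200", domain: [:metrics])
```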


Assuming I made no mistakes in the assessment above, to me the big question then is: is having an explicit field for measurements what makes telemetry "worth it"? I personally find the small nuisances of using logger for metrics to be enough to warrant telemetry, but if the separate measurement is what makes telemetry tick for a lot of people, then we should proceed with execute(Name, MapOfMeasurements, MapOfData).

binaryseed commented 5 years ago

I believe that execute(Name, MapOfMeasurements, MapOfData) is very compelling. It makes the raw events themselves actionable: there's enough semantic information there to do something meaningful with the data, and on top of that telemetry_metrics would provide the information needed for higher-level behaviors like aggregating on metric dimensions, etc.

arkgil commented 5 years ago

Sounds good, let's go for measurements/metadata :) I think we should agree on naming now, though.

For the map of measurements we could have:

For the map of other values:

Do you have other suggestions? Measurements/metadata seems like an obvious choice, but I want to explore alternatives :)

hauleth commented 5 years ago

I like measurements/metadata, or alternatively measurements/tags, but I think that metadata is the better choice here.

binaryseed commented 5 years ago

Measurement is a data point produced after recording a quantity by a measure.

josevalim commented 5 years ago

:+1: for measurements and metadata. @ferd and @tsloughter, are you ok with telemetry:execute(Name, Measurements, Metadata)?

hauleth commented 5 years ago

@josevalim I think that fits our vision better. This would end up being similar to our current oc_stat:record/2 function.

ferd commented 5 years ago

Yeah - at least a clear distinction between the measurements and the metadata removes some ambiguity about what is a "discardable"/predictable type of information and what is not.

codecov-io commented 5 years ago

Codecov Report

Merging #41 into master will decrease coverage by 0.96%. The diff coverage is 80%.


@@            Coverage Diff             @@
##           master      #41      +/-   ##
==========================================
- Coverage    86.2%   85.24%   -0.97%     
==========================================
  Files           4        4              
  Lines          58       61       +3     
==========================================
+ Hits           50       52       +2     
- Misses          8        9       +1
Flag       Coverage         Δ
#otp18     88.13% <100%>    +0.41% ↑
#otp19     88.13% <100%>    +0.41% ↑
#otp20     88.13% <100%>    +0.41% ↑
#otp21     85%    <80%>     -0.97% ↓

Impacted Files       Coverage        Δ
src/telemetry.erl    83.33% <80%>    -1.86% ↓


arkgil commented 5 years ago

I have updated the PR to measurements/metadata. Thanks to everyone contributing to the discussion! 😊

josevalim commented 5 years ago

Beautiful, let's merge it and ship v0.4!

bryannaegele commented 5 years ago

@arkgil could you tag the commit as v0.4 and cut a release? Some security concerns were raised in Slack over the Hex package being updated and pushed with no git tags or releases.

The changelog also lists the API change as "unreleased" when it has in fact been released. We should remove that line from the changelog and probably add a migration guide to the readme.

josevalim commented 5 years ago

The lack of a tag is not really a security issue but it does make auditing harder. In general it is a best practice to tag releases, +1.

Regarding the migration guide in the README, we are fine without one. The project is before 1.0 and there is already information in the CHANGELOG and the issues tracker.


bryannaegele commented 5 years ago

Yeah, I didn't mean to imply that the lack of a tag is in itself a security issue, just that it is a potential red flag when auditing. This was raised by @shanesveller in the Slack channel. A migration guide in the README is not important, but the CHANGELOG should be updated for accuracy, with the release and a link to the forthcoming tag.

👍