beam-telemetry / telemetry

Dynamic dispatching library for metrics and instrumentations.
https://hexdocs.pm/telemetry
Apache License 2.0

Making execution order of handlers guaranteed #106

Open hubertlepicki opened 2 years ago

hubertlepicki commented 2 years ago

I have a use case where I want to include span_id and trace_id, which come from a "distributed tracing" library of choice, in my logs. Whether it's OpenTelemetry or Spandex doesn't really matter; in both cases I am facing the same possible issue.

I want span_id and trace_id attached to the log lines generated already when the web request starts, but to achieve that, the handlers which start the OpenTelemetry/Spandex span/trace need to be executed before the handlers attached by Phoenix.Logger to the same event.

This seems to work, as I can disable logger initialization on :phoenix application start and then attach it manually after I have attached the OpenTelemetry/Spandex handlers, for example:

# config/config.exs
config :phoenix, :logger, false
# lib/application.ex
    SpandexPhoenix.Telemetry.install()
    Phoenix.Logger.install()

    # or

    OpenTelemetry.register_application_tracer(:my_app)
    OpentelemetryPhoenix.setup()
    Phoenix.Logger.install()

This works as I'd expect it to, but the documentation of this project clearly says:

Note that you should not rely on the order in which handlers are invoked. https://github.com/beam-telemetry/telemetry/blob/d3ba8b24ac7b46ce03b644f96f143e8abed71d62/src/telemetry.erl#L72

And this is precisely what I am relying on (and others too, without a doubt).

It seems that the handlers are executed in the order they were inserted into the underlying ETS table, as they are fetched with :ets.lookup, which does return them in insertion order.

https://github.com/beam-telemetry/telemetry/blob/main/src/telemetry_handler_table.erl#L48
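
A quick way to see this today (the handler ids, module, and event name below are made up for illustration):

defmodule OrderDemo do
  def handle_event(_event, _measurements, _metadata, label), do: IO.puts(label)
end

:telemetry.attach("order-demo-first", [:demo, :event], &OrderDemo.handle_event/4, "first")
:telemetry.attach("order-demo-second", [:demo, :event], &OrderDemo.handle_event/4, "second")

# Currently prints "first" then "second", i.e. insertion order, but that is an
# implementation detail rather than a documented guarantee.
:telemetry.execute([:demo, :event], %{}, %{})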

Should we make this behavior official? I don't think we have to change anything in the project code itself; maybe we should just add a couple of tests to make sure the order is preserved and remove the warning in the documentation that the order is not guaranteed?

josevalim commented 2 years ago

The order comes from an implementation detail, and later on we may find more efficient implementations that do not provide consistent ordering; it would be a pity to not be able to change due to that. I also think relying on side-effects of telemetry handlers is a poor practice and I definitely don't want to encourage it. :) So I am a strong :-1:.

josevalim commented 2 years ago

In your particular case, those libraries should have a plug that sets those traces and runs before the loggers, or use another non-telemetry-related hook to set those up.
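
For illustration, a minimal sketch of such a plug, assuming a hypothetical MyTracer.start_trace/1 that starts the trace and returns the ids (the real OpenTelemetry/Spandex APIs differ):

defmodule MyAppWeb.StartTracePlug do
  @behaviour Plug

  @impl true
  def init(opts), do: opts

  @impl true
  def call(conn, _opts) do
    # Start the trace before any logging runs, so span_id/trace_id are already
    # in the Logger metadata when the request is logged.
    {span_id, trace_id} = MyTracer.start_trace(conn)
    Logger.metadata(span_id: span_id, trace_id: trace_id)
    conn
  end
end

Such a plug would have to be placed early enough in the endpoint to run before the logging in question.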

hubertlepicki commented 2 years ago

I know it's an implementation detail, but I suggest we make it an official behavior instead.

The solution to use Plug is okay, but it has to be a macro you use within your Endpoint, which then calls super/2 (like https://github.com/appsignal/appsignal-elixir-plug/blob/main/lib/appsignal_plug.ex#L52); otherwise there is no way to squeeze your code in before Phoenix emits the [:phoenix, :endpoint, :start] event and Phoenix.Logger logs "GET /".

I would love to get span_id/trace_id on the "GET /" line, and on anything that the default/used Plug pipeline produces between the start of the request and the user code as well.

I don't particularly like using the macro solution described above, but if that's the official way I'll have to stick to it.

bryannaegele commented 2 years ago

On the general proposal, I agree with José that this is an implementation detail of telemetry and not something we could facilitate even if we wanted to, given how we'd like to change the implementation. The only reason it's even in ETS is that we had to support OTP < 21.

For the specific issue I'd encourage you to bring it up in the opentelemetry channel on the Elixir slack for help.

tsloughter commented 2 years ago

I can add a couple thoughts that sadly don't help with the immediate issue:

Why isn't Phoenix just using logger instead of telemetry+logger? I'm guessing the idea was to simply have one line in the code instead of two, since you are going to need the telemetry event anyway? I believe the same filtering functionality could be achieved with logger filters.

For handlers that don't have side effects the order certainly doesn't matter, and as @josevalim says, those are the best use case for telemetry, while side effects shouldn't be encouraged. But starting an active span has a side effect, so I still wish there was a different way to wrap code (not with a Plug) around a Phoenix endpoint/plug, so that span starting and stopping could be handled that way and the underlying pdict context be guaranteed :)

hubertlepicki commented 2 years ago

@tsloughter you're actually raising an interesting point, by chance, as the other Logger statements are generated based on telemetry events as well:

https://github.com/phoenixframework/phoenix/blob/ad4cd8b224c60a644090242e6a70e2b6404c9c1a/lib/phoenix/logger.ex#L132-L138

Logger is one thing, whether it's a side-effect-free case or not, but the way Phoenix and the ecosystem embraced the Telemetry library is that it replaced the old Phoenix.Endpoint.Instrumenter. Things like Spandex, AppSignal, the NewRelic agent, etc. were previously all hooking into the Instrumenter; now they hook into Telemetry events and perform internal data collection/reporting as those events happen.

As the first example found in two minutes, I just opened the NewRelic agent and indeed they collect and report data in telemetry event handlers:

https://github.com/newrelic/elixir_agent/blob/master/lib/new_relic/telemetry/ecto/handler.ex#L54

It's probably not the use case Telemetry was designed for, but because of the way Phoenix and the ecosystem replaced their own instrumentation with it, the ecosystem around tracing/errors/logging/reporting switched to using it, and I'm not sure that's entirely a good thing.

Part of the work I do is on medical applications, where audits are critical. Currently, we're also listening to telemetry events to log things like request start/stop, or LiveView events as well. This is all good, but it adds quite a lot of complexity if you want to do it properly, and by 'properly' in that environment I mean several things:

Telemetry handlers are detached when they fail, and the system keeps running without them. We had to guard against this by doing very little in the handlers and delegating the event handling to separate processes, but as you can see from the NewRelic example above, not everyone does that.
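
Roughly, that guard looks like this (AuditSink is a hypothetical GenServer that does the real work):

defmodule Audit.TelemetryHandler do
  def attach do
    :telemetry.attach(
      "audit-endpoint-start",
      [:phoenix, :endpoint, :start],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(event, measurements, metadata, _config) do
    # Only forward the event; any crash happens in AuditSink, not here, so this
    # handler is never detached because of a bug in the audit code itself.
    GenServer.cast(AuditSink, {:telemetry, event, measurements, metadata})
  end
end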

Ideally, now that I'm thinking about it, to support the real-life ways the events are already being used in the wild, and not the idealized way where people don't do side effects in the handlers, we'd have:

I agree we're probably not using Telemetry as intended, but here we are as a community, it seems. I'm just looking for a solution here; maybe it's not to use Telemetry but to develop a separate mechanism for instrumenting these things, but it's very, very tempting to just rely on telemetry as it is currently.

bryannaegele commented 2 years ago

@hubertlepicki since handlers can be attached and detached at any time by anything, guaranteeing some form of ordering becomes much more difficult. Questions like which caller's ordering takes precedence then have to be addressed. It would also box us in on other features which have been requested and which we want to support but couldn't efficiently handle before, such as registering a handler by prefix, e.g. [:http, :request, *].

I don't mean to discount that the functionality you're describing is valuable to someone's use case. It sounds like that kind of thing would be useful to you. However, telemetry was never intended to be a generic event dispatching library, and there are already solutions for that. Perhaps some bridge could use telemetry events to feed into a solution based on gen_event and gen_statem, where you could manage more complex flows. But I'm not convinced this library, which has always been meant to be a very basic library to build on, should expand to these kinds of features.
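
A rough sketch of that bridge idea (AuditEventManager is a hypothetical, locally registered :gen_event manager whose handlers you add, remove, and supervise yourself):

defmodule Audit.TelemetryBridge do
  def attach(events) do
    :telemetry.attach_many(
      "audit-gen-event-bridge",
      events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(event, measurements, metadata, _config) do
    # Forward into :gen_event, where handler lifecycle is managed by the event
    # manager rather than by telemetry's detach-on-failure behavior.
    :gen_event.notify(AuditEventManager, {event, measurements, metadata})
  end
end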

Hopefully the new release with the failure events can help you at least handle re-attaching handlers on failure.

hubertlepicki commented 2 years ago

@bryannaegele I agree completely. For my use cases, the switch from their own instrumenter to telemetry, made by Phoenix and friends, was a step in somewhat the wrong direction, for the very reason that telemetry was never meant to be used that way.

It's almost like we need a second event dispatch mechanism, alongside Telemetry, where handlers are durable (i.e. do not detach), execute in a predictable and defined way, and failures in handlers result in overall request failure as well. I think Logger should probably hook in there too, by the way. The default console logger is pretty rock solid, but there's no guarantee people don't use custom logger backends and, correct me if I am wrong, an error in those handlers would result in detaching the logger from Phoenix events altogether (although I am imagining this scenario; it never actually happened to me, so maybe it's not an issue).

I can see this being especially useful for LiveView and channels, as with normal HTTP requests a lot of what I am describing here can actually be done with Plug.

hubertlepicki commented 2 years ago

@bryannaegele okay, thank you so much. For now your solution seems to work, i.e. I added an event handler that feeds the events I am interested in into another event dispatch mechanism. If a failure happens, I want to crash, so I do, and then I re-attach the overall event handler.
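
One way to do the re-attach part is to listen to telemetry's own handler-failure event (assumed here to be named [:telemetry, :handler, :failure] with the detached handler's id in its metadata; the handler ids and the Audit.TelemetryHandler module re-attached below are the hypothetical ones from the sketch earlier):

defmodule Durable.Reattacher do
  def install do
    :telemetry.attach(
      "durable-reattacher",
      [:telemetry, :handler, :failure],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # The audit handler crashed and was detached; put it back.
  def handle_event(_event, _measurements, %{handler_id: "audit-endpoint-start"}, _config) do
    Audit.TelemetryHandler.attach()
  end

  def handle_event(_event, _measurements, _metadata, _config), do: :ok
end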

I suspect, however, that there's a tiny chance of a race condition here: https://github.com/beam-telemetry/telemetry/blob/main/src/telemetry.erl#L164:L170

This happens because the detach removes the handler from the ETS table, then some instructions execute before we re-attach it, and I think I do see this actually happening in the wild, though fairly rarely.

Would adding an option to not detach event handlers on failure, i.e. a "durable handler", be an acceptable solution? I could work on a PR for that, but if it's a deal breaker I won't bother. I can probably live with this race condition; it seems to happen extremely rarely.

tsloughter commented 2 years ago

Very good point. Probably a good idea in my opinion.

bryannaegele commented 2 years ago

Yeah, I'm not opposed to an opt-in like that. It could be set in the user-provided config.

@josevalim what do you think? Allow the user to explicitly opt in to not detach a handler on exception?

telemetry:attach(<<"durable handler">>,
                 [some, event],
                 fun log_response_handler:handle_event/4,
                 #{durable => true}).

The only sticky bit is that the config can be any term, and we'd be forcing them to use a map if they want this feature, but I'm okay with that versus API changes or new functions.

arkgil commented 2 years ago

@bryannaegele in my view this changes the purpose of the config. It's supposed to be the config of the handler itself, not the configuration of how telemetry deals with the handler. What do you think about adding another argument? Too verbose?

tsloughter commented 2 years ago

I was going to suggest another function: telemetry:attach_durable.

bryannaegele commented 2 years ago

Given those two choices, I'd probably go with the additional map options arg for flexibility and introduce attach/5 and attach_many/5.
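
Purely as a hypothetical shape, not an existing telemetry API, the call might look like this from Elixir (LogResponseHandler mirrors the log_response_handler module in the Erlang example above):

:telemetry.attach(
  "durable handler",
  [:some, :event],
  &LogResponseHandler.handle_event/4,
  %{},               # existing handler config, still any term
  %{durable: true}   # new options map interpreted by telemetry itself
)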

hubertlepicki commented 2 years ago

It's funny how I went through the same options when looking at the code.

I agree that the HandlerConfig is meant to be passed to the handler itself, so it's not a great place to do this. Maybe it's not a great name either, as it's more of a "handler argument" than a config that describes what to do with a handler. And requiring it to be a map would mean breaking compatibility with current code, so that's not great either.

I was also thinking about attach_durable, but I think I am leaning towards adding another argument. And, maybe, renaming the current one from "handler config" to "handler argument" and calling the new one "handler config" or "options"?

hubertlepicki commented 2 years ago

I have made a proposal in the PR, but please do check my Erlang style as I'm not used to writing it.