dashbitco / broadway_dashboard

Keep track of your Broadway pipelines from Phoenix LiveDashboard
https://elixir-broadway.org/
Apache License 2.0
220 stars 17 forks source link

[error] Handler {BroadwayDashboard.Telemetry, #PID<0.xxx.0>} has failed and has been detached. #5

Closed leductam closed 3 years ago

leductam commented 3 years ago

I tried integrate broadway_dashboard to live LiveDashboard but got the error:

[error] Handler {BroadwayDashboard.Telemetry, #PID<0.xxx.0>} has failed and has been detached.
Reason={:badkey, :time, %{system_time: 16274584813510685}}
Stacktrace=[
  {BroadwayDashboard.Telemetry, :handle_event, 4,
   [file: 'lib/broadway_dashboard/telemetry.ex', line: 47]},
  {:telemetry, :"-execute/3-fun-0-", 4, ...

My deps

defp deps do
    [
      {:phoenix, github: "phoenixframework/phoenix", override: true},
      {:phoenix_html, github: "phoenixframework/phoenix_html", override: true},
      {:phoenix_live_reload, "~> 1.2", only: :dev},
      {:phoenix_live_view, github: "phoenixframework/phoenix_live_view", override: true},
      {:floki, ">= 0.30.0", only: :test},
      {:phoenix_live_dashboard, github: "phoenixframework/phoenix_live_dashboard", override: true},
      {:esbuild, "~> 0.1", runtime: Mix.env() == :dev},
      {:swoosh, "~> 1.3"},
      {:telemetry_metrics, "~> 0.4"},
      {:telemetry_poller, "~> 0.4"},
      {:jason, "~> 1.0"},
      {:plug_cowboy, "~> 2.0"},
      {:broadway, github: "dashbitco/broadway", override: true},
      {:broadway_rabbitmq, github: "dashbitco/broadway_rabbitmq", override: true},
      {:broadway_dashboard, github: "dashbitco/broadway_dashboard", override: true}
    ]
  end

Please help to clarify this issue.

josevalim commented 3 years ago

You are using a too recent broadway version. You can use the same version as in the lockfile of this project or wait until we officially release it.

josevalim commented 3 years ago

Those are the relevant CHANGELOG notes we need to update:

  * Rename :events to :messages in batcher telemetry event
  * Rename :time to :system_time in telemetry event
  * Rename `[:broadway, :consumer, *]` to `[:broadway, :batch_processor, *]` in telemetry event

We may have already applied the first.

leductam commented 3 years ago

As your advice, adjusted some measurements.xyz at BroadwayDashboard.Telemetry module.

On the dashboard the Throughput is updated, but I saw that the Pipeline was not updated the % of proc_X.

philss commented 3 years ago

This problem was fixed in master. I'm trying to figure out another bug in the metrics that is making the pipeline to not be updated after a few events.

leductam commented 3 years ago

I got this error:

[error] Handler {BroadwayDashboard.Telemetry, #PID<0.4100.0>} has failed and has been detached. Class=:error
Reason=:badarith
Stacktrace=[
  {BroadwayDashboard.Telemetry, :calc_workload, 3,
   [file: 'lib/broadway_dashboard/telemetry.ex', line: 139]},
  {BroadwayDashboard.Telemetry, :handle_event, 4,
   [file: 'lib/broadway_dashboard/telemetry.ex', line: 75]},
  {:telemetry, :"-execute/3-fun-0-", 4,
   [
     file: '/deps/telemetry/src/telemetry.erl',
     line: 135
   ]},
  {:lists, :foreach, 2, [file: 'lists.erl', line: 1342]},
  {:telemetry, :span, 3,
   [
     file: '/deps/telemetry/src/telemetry.erl',
     line: 274
   ]},
  {GenStage, :consumer_dispatch, 6, [file: 'lib/gen_stage.ex', line: 2462]},
  {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 695]},
  {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 771]},
  {:proc_lib, :wake_up, 3, [file: 'proc_lib.erl', line: 236]}
]
leductam commented 3 years ago

@philss I think that we're facing with the problem in this function: defp calc_workload(start_time, last_end_time, duration) do idle_time = start_time - last_end_time round(duration / (idle_time + duration) * 100) end

Those variables start_time, last_end_time and duration are got/calculated from values retrieved by calling :erlang.monotonic_time() which return Erlang monotonic time. There are the statement about Erlang monotonic time at https://learnyousomeerlang.com/time :

Erlang monotonic time. This is Erlang's view of the OS monotonic time if available, or the VM's own monotonic version of the system time if not available. This is the adjusted clock that is used for events, timers, and so on. Its stability makes it ideal to count a time interval. Do note that this time is monotonic, but not strictly monotonic, meaning that the clock can't go backwards, but it can return the same value many times!

The Erlang monotonic time can return the same value many times. Then idle_time and duration might coincident are zero at unspecified point in time, and idle_time + duration might zero too. As I shown that I still got the errors Class=:error Reason=:badarith due to division by zero.

I tried with temporary fix and it is worked: efp calc_workload(start_time, last_end_time, duration) do if duration > 0 do idle_time = start_time - last_end_time round(duration / (idle_time + duration) * 100) else 0 end end

Please take a look. Thanks.

philss commented 3 years ago

@leductam thank you for the details.

I think in the case we are using :erlang.monotonic_time() it's nearly impossible to return the same value twice because it will always do some type of work between the two events (start and stop). I believe the problem was fixed. But can you test with the current master version to see if the problem can occur? Thanks!

leductam commented 3 years ago

@philss sure, I performed some tests after your committed new code to master and still got the errors due to division by zero.

Following https://learnyousomeerlang.com/time:

As mentioned earlier, the Erlang monotonic time is not strictly monotonic: it will possibly return the same number twice if it's called at the same time on two different cores, for example.

philss commented 3 years ago

@leductam oh no :/ do you mind to send a PR with your suggestion/fix?