
Exceptions with breadcrumbs using Sentry.LoggerBackend #383

maxx-coffee closed this 4 years ago

maxx-coffee commented 4 years ago

Description

When a process crashes or an exception is raised, sentry-elixir captures the exception correctly. However, if you use breadcrumbs or other context, sentry-elixir does not pass that context along.

As I understand the code, we are using the process dictionary to hold all context. The problem comes up when you are using Sentry.LoggerBackend as your logger backend: the logger and the process the exception occurred in are not the same process, so no context is sent.
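
As a minimal illustration of the mechanism (the :sentry_context key here is only illustrative): data put in one process's dictionary is simply not visible from another process.

Process.put(:sentry_context, %{breadcrumbs: [%{category: "set in caller"}]})

Task.async(fn ->
  # A different process sees nothing in its own dictionary.
  Process.get(:sentry_context)
end)
|> Task.await()
#=> nil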

Expected Behavior

Context will be passed correctly when errors and exceptions happen.

Actual Behavior

Context is lost because the logger is a different process.

Steps to Reproduce

defmodule Test.MimicGenserver do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  def init(_) do
    {:ok, %{}}
  end

  def error, do: GenServer.call(__MODULE__, :error)

  def terminate(_reason, _state) do
    :normal
  end

  def handle_call(:error, _from, _state) do
    # Breadcrumb is set in the GenServer process; it never reaches the event.
    Sentry.Context.add_breadcrumb(%{category: "will not show"})
    {:stop, "crash", :error, nil}
  end
end

defmodule ClientTest do
  use ExUnit.Case

  alias Test.MimicGenserver
  import ExUnit.CaptureLog

  test "if breadcrumb exists" do
    bypass = Bypass.open()
    pid = self()
    start_supervised!(MimicGenserver)

    Bypass.expect(bypass, fn conn ->
      {:ok, body, conn} = Plug.Conn.read_body(conn)
      # With the current behaviour this fails: the breadcrumb never makes it into the payload.
      assert body =~ "will not show"
      send(pid, "API called")
      Plug.Conn.resp(conn, 200, ~s<{"id": "340"}>)
    end)

    Application.put_env(:sentry, :dsn, "http://public:secret@localhost:#{bypass.port}/1")

    capture_log(fn ->
      MimicGenserver.error()
      assert_receive "API called"
    end)
  end
end

Solution?

I threw together a solution last night that I would be willing to clean up and submit if you are open to PRs and to adding another dependency.

My idea is to move away from the process dictionary and use something like Cachex as the context store. When context is set, we obtain the PID of the calling process and use it as the key in Cachex. This would let us keep roughly 98% of the context module that's already in place. When the exception is logged, we can grab the PID from the logger metadata, look up the context for that specific process, and pass it along with the error.
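
A minimal sketch of the shape I have in mind, assuming a Cachex cache named :sentry_context has already been started (for example via Cachex.start_link(:sentry_context) under a supervisor); the module and cache names are placeholders:

defmodule MyApp.SentryContextStore do
  @cache :sentry_context

  # Merge new context for the calling process, keyed by its PID.
  def merge(new) when is_map(new) do
    pid = self()
    {:ok, current} = Cachex.get(@cache, pid)
    Cachex.put(@cache, pid, Map.merge(current || %{}, new))
  end

  # Read the context for an arbitrary PID, e.g. one taken from logger metadata.
  def get(pid) do
    case Cachex.get(@cache, pid) do
      {:ok, %{} = context} -> context
      _ -> %{}
    end
  end

  # Remove the entry once the process's context is no longer needed.
  def delete(pid), do: Cachex.del(@cache, pid)
end

The logger side would then call get/1 with the PID taken from the crash report's metadata.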

The only limitation I can think of with this solution is how we would monitor the process and delete its key when the process exits normally.

Other than that, I think this would be a major version bump, as you would also need to incorporate a supervision tree and start Sentry as an application within your application. If you didn't want a major version bump, we could break this out into another module and switch between the process-dictionary context and the Cachex context depending on whether the user mounted the application.

Final thoughts...

You may already have plans to add this or have another idea; I just wanted to throw some thoughts your way in case you need a helping hand implementing it. It's something I feel we are going to need for multiple apps we run. Otherwise we may implement some custom functionality that mimics the above, using a new logger backend, a custom context module, and a custom before_send event.

Thanks for the work! The code was extremely easy to follow and took minimal effort to understand.

mitchellhenke commented 4 years ago

@joshualcoffee thanks for opening an issue and detailing the current limitations and possibilities!

I would like to expand what's possible with Context around processes, but it's unlikely I'll add a required dependency. The most likely approach is something that allows configuring where the Context is read from and written to. The default would be the same process-dictionary functionality that exists now, and Sentry could include a configuration option to substitute a different module, with an optional dependency on something like Cachex. With that, it is potentially possible to avoid a major version change.
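
As a very rough, hypothetical sketch of that shape (the behaviour name, callbacks, and config key here are illustrative, not anything that exists in the SDK today):

defmodule Sentry.ContextStorage do
  # Hypothetical behaviour for pluggable context storage.
  @callback get_context() :: map()
  @callback set_context(map()) :: :ok
  @callback clear() :: :ok
end

defmodule Sentry.ContextStorage.ProcessDictionary do
  @behaviour Sentry.ContextStorage

  # Default implementation: the same process-dictionary behaviour as today
  # (the :sentry_context key is illustrative).
  @impl true
  def get_context, do: Process.get(:sentry_context, %{})

  @impl true
  def set_context(context) when is_map(context) do
    Process.put(:sentry_context, context)
    :ok
  end

  @impl true
  def clear do
    Process.delete(:sentry_context)
    :ok
  end
end

# Resolved from config, defaulting to the process-dictionary implementation:
# Application.get_env(:sentry, :context_storage, Sentry.ContextStorage.ProcessDictionary)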

Does something along those lines fit with what you'd like to see?

maxx-coffee commented 4 years ago

I think that's a fair point regarding the dependency. I was really on the fence about whether it even belonged. Cachex was my first go-to, as reads and writes to the context could be extensive.

I think continuing to use the process dictionary may lead to concerns bleeding across processes. For the most part, the process dictionary should only be accessed by the process that owns it. To my knowledge, the only way to retrieve the process dictionary from outside the process is through Process.info(pid), and that can result in pretty heavy memory usage, as the dictionary is copied each time you do it.

iex> Process.put("set", "me")
iex> Process.info(self())
iex> Process.info(self())

You will see the history keeps growing.
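
For reference, a small sketch of the Process.info approach, reading another process's dictionary from the outside; each call hands back a copy of the data rather than a reference to it (the :sentry_context key is illustrative):

pid =
  spawn(fn ->
    Process.put(:sentry_context, %{breadcrumbs: ["step 1"]})
    Process.sleep(:infinity)
  end)

# Give the spawned process a moment to write its dictionary (demo only).
Process.sleep(100)

{:dictionary, entries} = Process.info(pid, :dictionary)
Keyword.get(entries, :sentry_context)
#=> %{breadcrumbs: ["step 1"]}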

Outside of that, you cannot guarantee the process is still alive at the time of the crash. Say a GenServer you have set some context on crashes: by the time the error hits the logger, the process could already have been killed.

I think the main concern is that no one is able to use context or breadcrumbs on exceptions. This somewhat breaks the "let it crash" paradigm, as you now need to trap exits to really use this functionality.

I have a few ideas that may work:

Idea 1

Idea 2

I think this would require some rewriting, as the PID of the process the error came from would need to be passed to capture_error.

Side note:

A nice feature would be to allow devs to pass a context key to capture_exception or capture_message that would be used to retrieve context. The use case for this is a complex GenStage pipeline: you could set context as the pipeline processed events, and if an error occurred, that context would be passed to Sentry and tagged accordingly.

This may be out of scope for this library, as it could be a dev concern. However, it starts to move into a gray area because of crashing processes.
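
A hypothetical sketch of what I mean; MyApp.ContextStore and the key-based lookup are assumptions, not anything in sentry-elixir today, and the stored map is handed to Sentry through capture_exception's :extra option:

defmodule MyApp.Pipeline do
  # Context is written under an explicit key while a batch is processed, and
  # the same key is used to look the context back up when capturing the error.
  def process_batch(pipeline_id, events) do
    MyApp.ContextStore.put(pipeline_id, %{stage: "process_batch", batch_size: length(events)})

    Enum.map(events, &handle_event/1)
  rescue
    exception ->
      Sentry.capture_exception(exception,
        stacktrace: __STACKTRACE__,
        extra: MyApp.ContextStore.get(pipeline_id)
      )

      reraise exception, __STACKTRACE__
  end

  defp handle_event(event), do: event
end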

Solution I've come up with

Just in case anyone else is facing the same problem: I've created a new context module that uses Cachex, with the process PID as the key.

# Excerpt from the custom context module; @cachex_key and set_and_get_context/2
# are defined elsewhere in the module.
defp set_context(current, key, new, context_key) when is_map(current) and is_map(new) do
  merged_context =
    current
    |> Map.get(key, %{})
    |> Map.merge(new)

  new_context = Map.put(current, key, merged_context)
  set_and_get_context(context_key, new_context)
end

def clear_all(key) do
  Cachex.del(@cachex_key, key)
end

defp get_context(key) do
  case Cachex.get(@cachex_key, key) do
    {:ok, %{} = result} -> result
    _ -> %{}
  end
end

I also copied Sentry.LoggerBackend and modified it a bit to pass along the PID the error occurred in. I can then retrieve that PID in the before_send, where I modify the event, adding the proper context. The event is then processed accordingly.
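
Roughly, the before_send looks like this. It assumes the modified backend has stashed the crashing process's PID under a :pid key in the event's extra map, and MyApp.SentryContextStore stands in for the Cachex-backed context module; both names are illustrative:

defmodule MyApp.SentryBeforeSend do
  def before_send(event) do
    case event.extra[:pid] do
      nil ->
        event

      pid ->
        context = MyApp.SentryContextStore.get(pid)

        %{
          event
          | breadcrumbs: event.breadcrumbs ++ Map.get(context, :breadcrumbs, []),
            # Drop the PID before the event is encoded, then merge in the stored extra.
            extra: event.extra |> Map.delete(:pid) |> Map.merge(Map.get(context, :extra, %{}))
        }
    end
  end
end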

I've not fully tested this but so far it seems like it's working.

mitchellhenke commented 4 years ago

I think one of the more difficult parts will be clearing a key when it's no longer needed. It's possible to monitor the process when data gets added, but I'm not sure about the implications or details of managing that.
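
A rough sketch of what that monitoring could look like, assuming the :sentry_context Cachex cache from earlier in the thread and that the store calls track/1 whenever it writes context for a new PID; all names are illustrative:

defmodule MyApp.ContextJanitor do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # The context store calls this whenever it writes context for a new PID.
  def track(pid \\ self()), do: GenServer.cast(__MODULE__, {:track, pid})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:track, pid}, state) do
    ref = Process.monitor(pid)
    {:noreply, Map.put(state, ref, pid)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    {pid, state} = Map.pop(state, ref)
    # The tracked process is gone; drop its context entry so the cache doesn't grow.
    if pid, do: Cachex.del(:sentry_context, pid)
    {:noreply, state}
  end
end

In practice the deletion probably can't happen immediately on :DOWN, since the crash report is handled after the process has already exited, so some delay or handoff would likely be needed.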