Nebo15 / logger_json

JSON logger formatter with support for Google Cloud, DataDog and other for Elixir.
https://nebo15.github.io/logger_json/
MIT License
237 stars 92 forks source link

Add GoogleErrorReporter to format errors compatible with Google Error Reporting #58

Closed patmaddox closed 3 years ago

patmaddox commented 3 years ago

This provides a new function `LoggerJSON.Formatters.GoogleErrorReporter.report/3) that formats errors to be compatible with Google Error Reporting. Reporting an error with this function will create a new error record in GER, complete with stack trace. It uses two mechanisms to achieve this:

  1. It formats the stacktrace consistent with Ruby's backtrace format, which GER supports
  2. It adds the "@type": "type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent" metadata, which forces GER to record the error, in case something is wrong with the formatting.

Aggregated errors

image

A single error instance

image

patmaddox commented 3 years ago

I'm sure this PR will need some more work (documenting the new function at a minimum). I am pretty new to Elixir and am not familiar with the different error types, so I'd appreciate some guidance on anything I might be missing.

I realize this library isn't for error reporting, but Google Error Reporting will automatically take care of that if the error is formatted correctly - which was easy to do, as you can see. So I hope this addition will be considered.

I have a bit more work to do to fully integrate this into our app, and will be working on it over the next few days, but I wanted to share it and get some early feedback. Thanks!

coveralls commented 3 years ago

Coverage Status

Coverage increased (+1.6%) to 77.833% when pulling 8e76992cc61bc8194ef687ce5985b18e9fccd179 on patmaddox:google-error-reporter into 7f35b8c3639b79db5ee0b15bc582cd22ad0ec783 on Nebo15:master.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.6%) to 76.804% when pulling a09ddd347eba29e331030e2b42105df063919c0d on patmaddox:google-error-reporter into 7f35b8c3639b79db5ee0b15bc582cd22ad0ec783 on Nebo15:master.

AndrewDryga commented 3 years ago

Hello @patmaddox, I like this feature a lot. We would love to aggregate errors from Elixir using native GoogleCloud services :).

I think it would be better to integrate it into the existing google_cloud_logger formatter, so that if error contains a stack trace then we would properly format jsonPayload.message and set a @type (according to this doc it should work, but I'm not sure) but keep some original stacktrace in some other field (I think it would be kept in metadata already anyways).

patmaddox commented 3 years ago

@AndrewDryga I have tried both approaches - adding a separate GoogleErrorReporter, and integrating directly into the GoogleCloudLogger (which you can see at 472e16c). I've found the first way preferable. The main reason is that when explicitly logging, it includes an operation value which is a unique identifier for the request. This lets you click on that identifier and find all logging statements that include it - so you can track down all log entries that are involved in the request that led to the error. When I integrated directly into GoogleCloudLogger it didn't contain the operation field, making the log information less useful.

So, feel free to use 472e16c as a starting point, and see if you can get it going. I'd be curious to see what part I missed.

This is working and is quite useful:

image

AndrewDryga commented 3 years ago

@patmaddox the list of errors looks good indeed. How do you use it? I'm asking because it would be really nice to have both JSON structured logs in GKE and errors reported, not just one of them (JSON logger setting for log provider is singular). Maybe we can generate a custom operation instead?

patmaddox commented 3 years ago

I hooked it up with Plug.ErrorHandler.

Here's relevant config bits:

  def router do
    quote do
      use Phoenix.Router
      use Plug.ErrorHandler
      use EpicenterWeb.ErrorHandler

      import Plug.Conn
      import Phoenix.Controller
      import Phoenix.LiveView.Router
    end
  end
# This seems pretty generic and could probably get moved into logger_json library
defmodule EpicenterWeb.ErrorHandler do
  alias LoggerJSON.Formatters.GoogleErrorReporter

  defmacro __using__(_opts) do
    quote do
      defp handle_errors(conn, %{kind: kind, reason: reason, stack: stacktrace}) do
        GoogleErrorReporter.report(kind, reason, stacktrace)
        conn
      end
    end
  end
end
config :logger_json, :google_error_reporter,
  service_context: [
    service: System.fetch_env!("ERROR_REPORTER_SERVICE_NAME"),
    version: :code.priv_dir(:epicenter) |> Path.join("static/version.txt") |> File.read!() |> String.trim()
  ]
patmaddox commented 3 years ago

@AndrewDryga just following up on this - we've identified a pretty serious shortcoming with the separate ErrorHandler approach. Because it's listed in the router, it will only be used when there are errors in Phoenix HTTP requests. It does not report errors that happen in web socket connections, or app initialization. So, we're going to go back to the approach of integrating it fully, so the default error handler adds the Google Error Reporting attribute.

The one thing that we lose with the integrated approach is that Phoenix no longer adds a request_id or remote_ip to the error. I believe this is because the Phoenix request runs in one process (with Logger.metadata set), and the error handler runs in a different process and so doesn't have the request id or remote IP. We'll look into this more, but I am pretty new to Elixir / Phoenix and so am curious if you have any pointers on how we might use the default Elixir error handling while also maintaining any logger metadata that Phoenix adds.

AndrewDryga commented 3 years ago

@patmaddox I quickly checked the code and the only way I have in mind to persist request_id in the Process metadata would be to have a double approach: hook up to Plug.ErrorHandler and still have default formatter being able to format any exception in a way that would report those errors to GoogleErrorReporter. The problem we would have then is preventing the same error to be reported twice and this is something I don't know how to approach yet. We might avoid re-raising the excepting and log directly but then we might face compatibility issues with other code that expects errors to be raised by Plug/Phoenix which we should avoid. Maybe we can add some metadata to https://github.com/elixir-plug/plug/blob/v1.11.0/lib/plug/error_handler.ex#L101 and propagate it to the logger but I also don't know how exactly from the top of my head.

Sorry for taking so long to reply, I'm buried in work and life :).

AndrewDryga commented 3 years ago

This is how AppSignal library deduplicates those reports: https://github.com/appsignal/appsignal-elixir-plug/blob/main/lib/appsignal_plug.ex#L61-L62. It ignores reports from processes that are going to raise an exception which looks like something logging library should not do because it would require some central coordinator to keep a list of those ignored pids.

AndrewDryga commented 3 years ago

I'm going to merge this and use format_stacktrace you have added in GKE formatter. This way it would be enabled by default and if somebody wants to track error manually it still would be possible to do so, thank you for your contribution ❤️

AndrewDryga commented 3 years ago

@patmaddox one of the ways you can deduplicate logs can be subscribing to :exception event that is mentioned here and ignore exception using a macro.