elixir-grpc / grpc

An Elixir implementation of gRPC
https://hex.pm/packages/grpc
Apache License 2.0

Clarification on Handling Raised RPCErrors as Crash Events #389

Open ksanderer opened 1 month ago

ksanderer commented 1 month ago

Hello Team!

Firstly, thank you for the recent update and the introduction of the crash report logger metadata. This feature is indeed beneficial for debugging and monitoring purposes (related PR and issue: #381, #377).

However, I have a concern, or perhaps I have misunderstood something, regarding the current implementation, in which all raised errors are considered crash events. Based on the logs, it appears that even standard gRPC errors, such as bad_request or not_found, are treated as crash events. Here is an example:

13:12:02.786 [error] ** (GRPC.Server.Adapters.ReportException) Exception raised while handling /api.messages.v1.MessageService/UpdateMessage:
** (GRPC.RPCError) Message with provided id doesn’t exist.
...

%GRPC.Server.Adapters.ReportException{
    kind: :error,
    reason: %GRPC.RPCError{
        status: 5,  # GRPC.Status.not_found()
        message: "Message with provided id doesn’t exist."
    },
    ...
}

These are "normal" gRPC errors supported by the protocol, indicating statuses like bad_request or not_found. Treating these as crash events might lead to unnecessary noise in the logs and potentially obscure actual issues that need attention.

By "normal," I refer to the Google APIs guidelines:

Because most Google APIs use resource-oriented API design, the error handling follows the same design principle by using a small set of standard errors with a large number of resources. For example, instead of defining different kinds of "not found" errors, the server uses one standard google.rpc.Code.NOT_FOUND error code and tells the client which specific resource was not found.

Google Cloud API Error Model

This also applies to errors in "Standard methods":

If the Create method supports client-assigned resource name and the resource already exists, the request should either fail with error code ALREADY_EXISTS or use a different server-assigned resource name and the documentation should be clear that the created resource name may be different from that passed in.

Standard methods - Create

According to the documentation, the only way to propagate google.rpc.Status is by raising a GRPC.RPCError.

At my workplace, we adhere to the approach outlined in the documentation, as demonstrated below:

  def delete_message(%DeleteMessageRequest{organization_id: ""}, _stream) do
    raise GRPC.RPCError,
      status: GRPC.Status.invalid_argument(),
      message: "`organization_id` is required"
  end

Questions:

  1. Rationale: Could you please clarify the rationale behind treating all errors as crash events?
  2. Strategy: What would be the proper strategy to handle these errors or suppress these crash reports for standard gRPC errors?

Looking forward to your insights on this matter. Thank you for your time and assistance.

sleipnir commented 1 month ago

Hello @ksanderer, thanks for the comments and the nice argument. I think we didn't intend to be too picky about which error types would be logged, and preferred to log everything. However, we can revisit this behavior. You can also send a PR with the desired behavior and we can iterate on this API.

ksanderer commented 1 month ago

@sleipnir Hi, and thank you for your response!

I want to clarify whether this behavior is expected or whether we have misconfigured our workload (maybe there is a way to disable crash reports altogether?).

We have started to receive a large volume of error logs in production for errors that are expected by design and encouraged by Google's AIP guidelines.

If it's not a misconfiguration problem, I will think about a proposal and come back to discuss it.


BTW, in our codebase we have an interceptor that checks whether an error is in the BackendGrpc.ErrorStatus.ignore_errors() list. If the error is not in this list, it is considered a critical error and is logged (a simplified sketch is shown below).

(screenshot of the interceptor implementation attached)
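
For reference, a rough sketch of the idea (the server interceptor contract with init/1 and call(request, stream, next, opts) is assumed here; BackendGrpc.ErrorStatus.ignore_errors/0 is our own helper returning the status codes we consider expected):

defmodule BackendGrpc.ErrorLoggingInterceptor do
  # Sketch only: assumes the library's server interceptor callback shape.
  require Logger

  def init(opts), do: opts

  def call(req, stream, next, _opts) do
    next.(req, stream)
  rescue
    error in GRPC.RPCError ->
      # Statuses in ignore_errors/0 (e.g. invalid_argument, not_found) are
      # expected by design and are not treated as critical.
      unless error.status in BackendGrpc.ErrorStatus.ignore_errors() do
        Logger.error("Unexpected gRPC error: #{inspect(error)}")
      end

      reraise error, __STACKTRACE__
  end
end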

sleipnir commented 1 month ago

Hello @ksanderer, thanks for the clarification. Answering your question: yes, printing all messages is the expected behavior of the current implementation. But changing this behavior is not difficult, and we can adapt it to allow some kind of filtering. If you have some time, I suggest taking a look at https://github.com/elixir-grpc/grpc/pull/381; maybe @beligante can give more details about the current implementation. If you don't have that time available, I will consider taking a look at it as soon as my schedule frees up a little in the next few weeks.

beligante commented 1 month ago

Hey folks! Sorry for the delay in responding here (vacation weekend, no coding for me).

So, I believe we do treat these cases differently where I work. But my thinking when designing the code was that we handle those code paths in the adapter, on the exception path. With that being said, we could either refactor the code as @sleipnir mentioned and add filters (or configs to accept filters), or we could add code at the adapter level to handle these cases more gracefully (outside the scope of exceptions). I don't have strong opinions on either approach.
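
To make the config option concrete, one purely hypothetical shape for it could be an application-level list of statuses that should not produce crash reports, for example:

# config/config.exs
import Config

# Hypothetical key, not an existing option in the library; raw status codes
# are used here (3 = INVALID_ARGUMENT, 5 = NOT_FOUND, 6 = ALREADY_EXISTS).
config :grpc,
  crash_report_ignored_statuses: [3, 5, 6]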

Thoughts?

sleipnir commented 1 month ago

Nice @beligante! I think we could just add some filtering mechanism by type.

beligante commented 1 month ago

Sweet! I can find some time this week to work on a proposal and maybe some code, unless @ksanderer has more free time and/or prefers another approach.

ksanderer commented 1 month ago

Hi, everyone!

I like the idea of a configuration option to override the default filtering behavior (whatever the defaults would be). This will definitely solve the problem on my side.

The idea of checking whether we need to log and handle the error before doing any work also looks nice. If the error is on the excluded list and shouldn't be logged, we shouldn't go down the logging path or build an error structure at all (something like the sketch below).
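
Purely as an illustration (the module and function names here are invented; the real decision point would live inside the adapter), the check could look like this:

defmodule CrashReportFilter do
  # Illustrative only: decides whether a rescued error should become a
  # GRPC.Server.Adapters.ReportException / crash log at all.
  @default_ignored [3, 5, 6]  # INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS

  def report?(error, ignored \\ @default_ignored)

  def report?(%GRPC.RPCError{status: status}, ignored),
    do: status not in ignored

  # Anything other than an expected RPCError is still reported as a crash.
  def report?(_other, _ignored), do: true
end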

p.s. I’ll admit I don’t have much experience with the gRPC codebase, so I’m a bit concerned it might take me a while to come up with a solid pull request. I’m happy to discuss ideas and follow up with suggestions to better understand what’s going on, but I’d prefer not to dive into the coding side just yet since I’m not familiar with all the internal decisions and guidelines.

Hopefully, I’ll be able to get more hands-on in the future.