dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
10.02k stars 1.64k forks source link

[CT-1268] [Spike] Integrate dbt exceptions into structured logging system #5958

Open gshank opened 2 years ago

gshank commented 2 years ago

Right now we have structured logging events for the majority of our logging, except for exceptions. In some but not all cases we have a separate logging event. Ideally all exceptions should be logged automatically. This ticket is labeled a "spike" because in order to scope this work we need to identify a mechanism/policies for doing this.

jtcohen6 commented 2 years ago

Copying from Nate's comment in https://github.com/dbt-labs/dbt-core/issues/4836#issuecomment-1104038063, because the underlying thrust of it continues to feel important:

whenever possible, we want to be able to programmatically distinguish when the root cause of an error is a warehouse issue, an adapter issue, or a core issue. This will be a big improvement in the process for debugging failed runs.

Basically, when a dbt run fails, how can we tell whether it was user error, database error, or internal dbt error?

In a Pythonic world, we'd expect to do this via class inheritance and type introspection. Given that our log events are now proto-composed, we probably need a field on each exception-log. Something like exception_cause or exception_category. Categories I can imagine:

emmyoop commented 2 years ago

This can probably be accomplished with an enum that defines all the categories. Without verifying the syntax I think something like this would work:

enum ExceptionCause {
{
        PARSING = 0;
        DISALLOWED = 1;
        DATABASE_CONNECTION = 2;
        DATABASE_QUERY = 3;
        UNHANDLED = 4;
        ...
 }

And then within the message add

message SomeException {
    EventInfo info = 1;
    repeated ExceptionCause exception_cause = 2;
    ...
}

This also allows the flexibility to add more types to the end of the enum which I'm sure we'll end up needing.

jtcohen6 commented 2 years ago

Blocking question we need to answer: Should we have separate LogCategory versus ExceptionCategory, or combine the two into a single field?

(Idea with LogCategory is to formally serve the role that the event codes' starting letters have been informally playing: this is parsing-related, deps-related, adapter-related, ...)

emmyoop commented 1 year ago

One of the unknowns here is exactly what would be useful to consumers to get logged. We likely don't want to log all exceptions, just specific ones. Is it best to fire an event as part of those exceptions or where the exception is raised?

Another option would be to add more structured information to MainEncounteredError. This may cover a lot of use cases and then just give a single event that would need to be subscribed to for exception details. The exception could contain the error category and then the MainEncounteredError event could pass the category (connection error, etc) in the proto message.

Minimum: Easily differentiate between errors in the users code or errors related to dbt itself

Better: Categorize Errors

jtcohen6 commented 1 year ago

Another option would be to add more structured information to MainEncounteredError. This may cover a lot of use cases and then just give a single event that would need to be subscribed to for exception details. The exception could contain the error category and then the MainEncounteredError event could pass the category (connection error, etc) in the proto message.

After chatting about this with @isaac-taylor, this might get us a lot of what we need, at least in the medium-term. It would be a quality-of-life improvement over parsing the string exc from MainEncounteredError.

Even better if that catch all error had some sort of error code that we can use to categorize the error to either our system, or the user's code (like connection problems)

github-actions[bot] commented 9 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.