dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.93k stars 1.63k forks

[Bug] dbt's custom exceptions hang inside a multiprocessing context #10527

Open keraion opened 3 months ago

keraion commented 3 months ago

Is this a new bug in dbt-core?

Current Behavior

While debugging sqlfluff/sqlfluff#6037, dbt appears to hang if a dbt exception is raised. The exception cannot be pickled, which prevents further execution.

Expected Behavior

The exceptions should implement __reduce__ to allow pickling and prevent hanging.
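To illustrate the failure mode, here is a minimal sketch with a hypothetical stand-in class (`TargetNotFound` below is modeled on the traceback in the log, not dbt's actual `TargetNotFoundError`). Pickling succeeds, but unpickling re-invokes `__init__` with only `self.args`, which is missing the required positional arguments; defining `__reduce__` fixes the round trip:

```python
import pickle

class TargetNotFound(Exception):
    """Hypothetical stand-in for dbt's TargetNotFoundError."""
    def __init__(self, node, target_name, target_kind):
        self.node = node
        self.target_name = target_name
        self.target_kind = target_kind
        super().__init__(f"{node} depends on missing {target_kind} {target_name!r}")

err = TargetNotFound("model.test_dbt.my_first_dbt_model", "abc", "node")
data = pickle.dumps(err)      # pickling succeeds...
try:
    pickle.loads(data)        # ...but unpickling re-calls __init__ with only
except TypeError as exc:      # self.args and fails, as in the log below
    print("unpickle failed:", exc)

class FixedTargetNotFound(TargetNotFound):
    """Same exception, with __reduce__ supplying the constructor arguments."""
    def __reduce__(self):
        return (type(self), (self.node, self.target_name, self.target_kind))

restored = pickle.loads(pickle.dumps(
    FixedTargetNotFound("model.test_dbt.my_first_dbt_model", "abc", "node")
))
print("restored:", restored.target_name)
```

In a `multiprocessing.Pool`, this `TypeError` is raised in the parent's result-handler thread rather than surfacing to the caller, which is why the pool hangs instead of failing loudly.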

Steps To Reproduce

For these reproduction steps I'm using dbt-duckdb, but the issue applies to all adapters.

  1. Using the example models, make the first model raise a compilation error:

    --my_first_dbt_model.sql
    SELECT * from {{ ref("abc") }}
  2. Call dbt run from a python multiprocessing context.

    import multiprocessing as mp
    from dbt.cli.main import cli

    def run_dbt():
        ctx = cli.make_context(cli.name, ["run"])
        cli.invoke(ctx)

    with mp.Pool() as pool:
        pool.apply(run_dbt)


Relevant log output

```shell
02:42:36  [WARNING]: Deprecated functionality

User config should be moved from the 'config' key in profiles.yml to the 'flags' key in dbt_project.yml.
02:42:36  Running with dbt=1.8.4
02:42:37  Registered adapter: duckdb=1.8.2
02:42:37  Unable to do partial parsing because of a version mismatch
02:42:38  Encountered an error:
Compilation Error
  Model 'model.test_dbt.my_first_dbt_model' (project2/models/example/my_first_dbt_model.sql) depends on a node named 'abc' which was not found
Exception in thread Thread-8 (_handle_results):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 579, in _handle_results
    task = get()
           ^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: TargetNotFoundError.__init__() missing 3 required positional arguments: 'node', 'target_name', and 'target_kind'
```

Environment

- OS: Ubuntu 20.04
- Python: 3.11.9
- dbt: 1.8.4

Which database adapter are you using with dbt?

other (mention it in "Additional Context")

Additional Context

As noted above, I'm using dbt-duckdb. The main entry point for this error will most likely be the sqlfluff-templater-dbt package.

In sqlfluff, monkeypatching __reduce__ prevents the process from hanging.

```python
# sqlfluff_templater_dbt/templater.py
import inspect

# dbt 1.8 exposes the base exception from dbt_common; older versions may
# need a different import path.
from dbt_common.exceptions import DbtBaseException

def _dbt_exception_reduce(self):
    return (
        type(self),
        tuple(
            getattr(self, arg)
            for arg in inspect.getfullargspec(self.__init__).args
            if arg != "self"
        ),
    )

DbtBaseException.__reduce__ = _dbt_exception_reduce
```

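The monkeypatch can be sanity-checked in isolation with a toy exception hierarchy (`BaseExc`/`ChildExc` below are stand-ins, not dbt classes). Note the assumption it relies on: every subclass `__init__` must store each constructor argument as an attribute of the same name.

```python
import inspect
import pickle

# Toy hierarchy standing in for dbt's exception classes.
class BaseExc(Exception):
    pass

class ChildExc(BaseExc):
    def __init__(self, node, target_name):
        self.node = node
        self.target_name = target_name
        super().__init__(f"{node}: {target_name}")

def _exception_reduce(self):
    # Rebuild the constructor call from same-named attributes, mirroring
    # the sqlfluff monkeypatch above.
    return (
        type(self),
        tuple(
            getattr(self, arg)
            for arg in inspect.getfullargspec(self.__init__).args
            if arg != "self"
        ),
    )

# After patching the base class, pickle consults the new __reduce__ for
# every subclass, so round-tripping succeeds.
BaseExc.__reduce__ = _exception_reduce
restored = pickle.loads(pickle.dumps(ChildExc("model.a", "abc")))
print(restored.node, restored.target_name)
```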
gshank commented 3 months ago

Nothing has changed with regard to exceptions in dbt, so I'm guessing that something changed in sqlfluff. Did sqlfluff previously not run dbt in a multi-processing context? The last time somebody reported a sqlfluff multiprocessing-related issue, it was just loading the code, not attempting to execute dbt commands.

keraion commented 3 months ago

I'm not sure this is "new" per se; it's newly discovered when hitting dbt exceptions within multiprocessing. The pre-commit hooks for sqlfluff default to running in multi-process mode, which has been seeing more usage. You can see the same __reduce__ pattern on CPython's JSONDecodeError.
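For reference, CPython's `json.JSONDecodeError` takes required positional arguments (`msg`, `doc`, `pos`) but still survives pickling, because it defines `__reduce__` returning `(self.__class__, (self.msg, self.doc, self.pos))`:

```python
import json
import pickle

# Trigger a real JSONDecodeError, then round-trip it through pickle.
try:
    json.loads("{not valid json")
except json.JSONDecodeError as exc:
    original = exc
    restored = pickle.loads(pickle.dumps(exc))

print(type(restored).__name__, restored.pos == original.pos)
```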

MichelleArk commented 3 months ago

Hey @keraion, thank you for taking the time to diagnose and open this issue.

dbt does not officially support parallel execution, and it would be quite a large undertaking to do so. We'd like to get there gradually, and it sounds like implementing __reduce__ on our exceptions could be a step along the way, but we still wouldn't guarantee safe multiprocessing support at that point.

I'm going to tag this as help_wanted to indicate this isn't something the maintainer team is prioritizing but would be open to an external contribution towards.