Open OSalama opened 3 months ago
I've been running my proposed fix (#1275) in production for 2 weeks and have not seen any further occurrences of the issue.
Just ran across this today as well (Python model / GCP / Dataproc)
Unhandled error while executing /tmp/tmppi1e0ffu/run/common/models/modelname.py
503 Socket closed
Is this a new bug in dbt-bigquery?
Current Behavior
After dbt-bigquery submits a Dataproc batch job, it enters a polling loop, waiting for a response indicating the job has completed (dbt/adapters/bigquery/dataproc/batch.py#29). This polling does not retry transient errors, so the dbt run fails with an error even though the actual Dataproc job runs to completion successfully.
Expected Behavior
The Dataproc batch polling should use a retry strategy so that transient errors are retried.
The BatchControllerClient.get_batch() method accepts a retry parameter:
so we should pass one in!
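To make the intent concrete, here is a minimal, dependency-free sketch of the behavior we want from the polling step. The names poll_batch, TransientServerError, and flaky_get_batch are illustrative only, not dbt-bigquery's actual internals; the real fix would pass a google.api_core Retry object into get_batch() rather than hand-rolling a loop like this:

```python
import time

# Illustrative stand-in for the desired polling behavior: transient 5xx
# responses are retried with exponential backoff instead of failing the
# dbt run. Not dbt-bigquery's real code.

class TransientServerError(Exception):
    """Stand-in for a transient 5xx such as '503 Socket closed'."""

def poll_batch(get_batch, max_attempts=5, initial_delay=1.0,
               multiplier=2.0, sleep=time.sleep):
    """Call get_batch(), retrying transient errors with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return get_batch()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to dbt
            sleep(delay)
            delay *= multiplier

# Example: a batch lookup that fails twice with a 503, then succeeds.
calls = {"n": 0}
def flaky_get_batch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientServerError("503 Socket closed")
    return "SUCCEEDED"

state = poll_batch(flaky_get_batch, sleep=lambda _: None)  # succeeds on the 3rd call
```

With the current code, the first 503 would abort the run; with a retry predicate covering transient errors, the poll simply tries again.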
Steps To Reproduce
This depends on the GCP API returning a 5xx error, so it is not reliably reproducible; that said, we hit the error at least once a day in our runs.
1. Use dbt-bigquery 1.8.1
2. Configure Dataproc serverless in profiles.yml:
3. Submit many long-running Python models
4. Wait until one of them fails
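For reference, a Dataproc serverless profile for the second step might look roughly like this. Key names follow the dbt-bigquery documentation for Python models, but the project, dataset, bucket, and region values are placeholders and should be adapted:

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project            # placeholder GCP project id
      dataset: my_dataset
      threads: 4
      # Python model / Dataproc serverless settings
      gcs_bucket: my-dbt-staging-bucket  # placeholder staging bucket
      dataproc_region: us-central1
      submission_method: serverless
```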
Relevant log output
Environment
Additional Context
Feels similar to https://github.com/dbt-labs/dbt-bigquery/issues/682, but I believe the Dataproc element means we're dealing with a separate code path that doesn't benefit from RETRYABLE_ERRORS. The stack trace also doesn't mention dbt/adapters/bigquery/connections.py.