dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[ADAP-498] [Bug] BQ does not retry on 503 #682

Open barberscott opened 1 year ago

barberscott commented 1 year ago

Is this a new bug in dbt-bigquery?

Current Behavior

Currently, if BigQuery returns a 503 error, we do not retry, even though BigQuery recommends retrying as the course of action.

Expected Behavior

This is not a regression but rather an oversight -- 503 errors should be both retryable and reopenable, since they indicate a transient unavailable condition in BigQuery.

Steps To Reproduce

Transient -- requires intermittent error from BQ.

Relevant log output

No response

Environment

- dbt-core: all 
- dbt-bigquery: all

Additional Context

No response

dbeatty10 commented 1 year ago

Thanks for reaching out @barberscott !

We'll put this in our queue.

The solution might be as simple as adding google.cloud.exceptions.ServiceUnavailable to the list here:

https://github.com/dbt-labs/dbt-bigquery/blob/7c216445f8009baa9cec4d61dd56693be1dd79fa/dbt/adapters/bigquery/connections.py#L53-L59

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 12 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

jx2lee commented 10 months ago

@dbeatty10 I created a ServiceUnavailable instance and ran the test code (test_is_retryable).

Current state: ServiceUnavailable is not listed in RETRYABLE_ERRORS. Result: the test passed.

def test_is_retryable(self):
    _is_retryable = dbt.adapters.bigquery.connections._is_retryable
    exceptions = dbt.adapters.bigquery.impl.google.cloud.exceptions
    internal_server_error = exceptions.InternalServerError("Code Abort")
    bad_request_error = exceptions.BadRequest("Code is broken")
    connection_error = ConnectionError("Code broke")
    client_error = exceptions.ClientError("Invalid code")
    rate_limit_error = exceptions.Forbidden(
        "Code is broken", errors=[{"reason": "rateLimitExceeded"}]
    )
    # Added: a ServiceUnavailable (503) instance
    service_unavailable_error = exceptions.ServiceUnavailable("Code is broken")

    self.assertTrue(_is_retryable(internal_server_error))
    self.assertTrue(_is_retryable(bad_request_error))
    self.assertTrue(_is_retryable(connection_error))
    self.assertFalse(_is_retryable(client_error))
    self.assertTrue(_is_retryable(rate_limit_error))
    # The added assertion below passes
    self.assertTrue(_is_retryable(service_unavailable_error))

https://github.com/dbt-labs/dbt-bigquery/blob/06851679f75d18ece98c95d4eb2a0ddd16544f4d/dbt/adapters/bigquery/connections.py#L57-L63

The ServiceUnavailable class inherits from the ServerError class, which seems to be why the test above passes. I'd like to fix this, but is there anything else I should look at? 🙏
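A minimal sketch of why the modified test passes without any change to the list. The classes below are stand-ins that only mirror the inheritance relationship described above; they are not the real google.cloud.exceptions classes, and the RETRYABLE_ERRORS tuple and is_retryable function here are illustrative assumptions rather than the adapter's actual code.

```python
# Stand-in exception classes; hypothetical mirrors of the hierarchy
# described in the thread, not the real google.cloud.exceptions module.
class ServerError(Exception):
    """Stand-in base class for 5xx errors."""


class ServiceUnavailable(ServerError):
    """Stand-in for a 503 error; subclasses ServerError."""


class ClientError(Exception):
    """Stand-in base class for 4xx errors."""


# Illustrative tuple: only the base class is listed, not ServiceUnavailable.
RETRYABLE_ERRORS = (ServerError, ConnectionError)


def is_retryable(error: Exception) -> bool:
    """isinstance() also matches subclasses, so listing the base class suffices."""
    return isinstance(error, RETRYABLE_ERRORS)


print(is_retryable(ServiceUnavailable("503")))  # True
print(is_retryable(ClientError("400")))  # False
```

Under this reading, adding ServiceUnavailable explicitly would be redundant whenever its base class is already in the tuple.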

dbeatty10 commented 10 months ago

Adding it to the test_is_retryable test like that makes sense 👍

But ... the thing that is surprising to me: if ServiceUnavailable inherits from ServerError and your modified test passes, then why is this not being retried?

Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?

@jx2lee Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?

jx2lee commented 10 months ago

@dbeatty10

Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?

No, I don't expect that to be possible. Error classes are created with the from_http_status and from_grpc_status functions in google.api_core.exceptions, and the error class these functions generate for a 503 is always ServiceUnavailable.



Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?

I have never run into this issue myself... 🙃 I would need more detailed logs from when it happened.

IMO, if the issue reporter can't provide more error logs, I think it's okay to close the issue.

jx2lee commented 6 months ago

@dbeatty10 Is there anything else I should check?

rrbarbosa commented 6 months ago

We did hit this recently. We use external-tables in an on-run-start macro. We also use service account impersonation in the dbt profile. While running dbt docs generate in our CI environment, we got:

('Unable to acquire impersonated credentials', '{\n  "error": {\n    "code": 503,\n    "message": "Authentication backend unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n')

Because this happens intermittently on an isolated system, I don't have more logs.

dbeatty10 commented 6 months ago

Thanks for this report @rrbarbosa !

Since this is intermittent (and maybe relatively rare also), it has been hard to nail down.

If anyone can provide information to suggest that dbt is not retrying at least once, that would be very helpful 🙏

dbeatty10 commented 6 months ago

@jx2lee -- would you be willing to raise a PR with the addition you made to this test case?

I think that would be sufficient for us to establish that the ServiceUnavailable is retryable (which would allow us to close this issue).

jx2lee commented 6 months ago

@dbeatty10 Okay, I will create a PR that includes the test code above soon!

jx2lee commented 5 months ago

@dbeatty10 I created the PR! Could you edit the PR body or add a comment to make it easier for reviewers to understand?

OSalama commented 4 months ago

I'm not sure if this is the same code path, but we are seeing a problem with Dataproc (Python models) submitted by dbt. dbt successfully submits the batch job. Then, during the polling in dbt-labs/dbt-bigquery/dbt/adapters/bigquery/dataproc/batch.py#poll_batch_job, one of the polling calls returns a 503 that is presumably not retried, and dbt marks the model as errored even though the Dataproc job is still running in the background and eventually completes successfully.

00:25:50  BigQuery adapter: Submitting batch job with id: 5f6d87c9-4045-4208-8941-03fbb8facf30
00:29:58  Unhandled error while executing target/run/core/models/working_tables/WT_rfm_status.py
503 502:Bad Gateway
00:29:58  58 of 63 ERROR creating python table model working_tables.WT_rfm_status ........ ERROR in 248.55s

We have seen the issue twice in a week, and we are running dbt-bigquery 1.8.1.
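For illustration, retrying a single polling call on a transient error could look roughly like the sketch below. This is not the adapter's poll_batch_job implementation: TransientError, call_with_retry, and fake_poll are hypothetical names, and the backoff values are arbitrary assumptions.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical stand-in for a 502/503 response from one polling call."""


def call_with_retry(
    poll_once: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call poll_once, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return poll_once()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Usage: a fake poller whose first call fails and whose second succeeds.
responses = iter([TransientError("503"), "SUCCEEDED"])


def fake_poll() -> str:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


# Sleep is stubbed out so the example runs instantly.
state = call_with_retry(fake_poll, sleep=lambda seconds: None)
print(state)
```

With a wrapper of this kind, one failed poll would not fail the model while the batch job itself is still running.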

mkielar commented 3 months ago

Got hit by this issue today, while generating "seed" tables with DBT running in CloudBuild:

"Step #7 - "dbt-seed": ('Unable to acquire impersonated credentials', '{\n  "error": {\n    "code": 503,\n    "message": "The service is currently unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n')"

We're using impersonation with dbt-bigquery and it seems IAM was unavailable for a moment. We have no explicit retry configured, so - by the docs - it should retry once, but I see no such thing in the logs.
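The error in the impersonation reports is printed as a two-element tuple whose second element is a JSON body, rather than as one of the API exception classes discussed earlier. As a sketch only, a check for the 503 code inside such a body might look like the following. looks_like_503 and CredentialsError are hypothetical names, not part of dbt-bigquery or google-auth, and the code assumes the exception's arguments have the shape shown in the log above.

```python
import json


class CredentialsError(Exception):
    """Hypothetical stand-in for the exception raised by the credentials layer."""


def looks_like_503(error: Exception) -> bool:
    """Return True if any string argument is a JSON body with error.code == 503."""
    for arg in error.args:
        if not isinstance(arg, str):
            continue
        try:
            body = json.loads(arg)
        except ValueError:
            continue  # plain message text, not JSON
        if not isinstance(body, dict):
            continue
        details = body.get("error")
        if isinstance(details, dict) and details.get("code") == 503:
            return True
    return False


# Arguments copied from the shape of the logged error above.
err = CredentialsError(
    "Unable to acquire impersonated credentials",
    '{\n  "error": {\n    "code": 503,\n    "message": "The service is currently unavailable.",\n    "status": "UNAVAILABLE"\n  }\n}\n',
)
print(looks_like_503(err))  # True
```

Whether a check like this belongs in the retry path depends on which exception class the credentials layer actually raises and where it is raised, which the logs in this thread do not show.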

mikealfare commented 5 days ago

GH closed this because an attached PR was merged. I think there is more to this, so I'm leaving it open.