Open barberscott opened 1 year ago
Thanks for reaching out @barberscott !
We'll put this in our queue.
The solution might be as simple as adding google.cloud.exceptions.ServiceUnavailable to the list here:
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
@dbeatty10 I created a ServiceUnavailable instance and ran the test code (test_is_retryable).
Current: ServiceUnavailable is not in RETRYABLE_ERRORS.
Result: Test passed.
def test_is_retryable(self):
    _is_retryable = dbt.adapters.bigquery.connections._is_retryable
    exceptions = dbt.adapters.bigquery.impl.google.cloud.exceptions
    internal_server_error = exceptions.InternalServerError("Code Abort")
    bad_request_error = exceptions.BadRequest("Code is broken")
    connection_error = ConnectionError("Code broke")
    client_error = exceptions.ClientError("Invalid code")
    rate_limit_error = exceptions.Forbidden(
        "Code is broken", errors=[{"reason": "rateLimitExceeded"}]
    )
    # add service_unavailable_error
    service_unavailable_error = exceptions.ServiceUnavailable("Code is broken")

    self.assertTrue(_is_retryable(internal_server_error))
    self.assertTrue(_is_retryable(bad_request_error))
    self.assertTrue(_is_retryable(connection_error))
    self.assertFalse(_is_retryable(client_error))
    self.assertTrue(_is_retryable(rate_limit_error))
    # the assertion below passed
    self.assertTrue(_is_retryable(service_unavailable_error))
The ServiceUnavailable class inherits from the ServerError class, so it seems to pass the above test. I'd like to fix this, but is there anything else I should look at? 🙏
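The inheritance point can be illustrated with a minimal sketch. The classes below are stand-ins that mirror only the ServerError / ServiceUnavailable relationship described above; they are not the real google.cloud.exceptions classes, and RETRYABLE_ERRORS and is_retryable here are simplified assumptions rather than the adapter's actual code.

```python
# Stand-in exception hierarchy (hypothetical; mirrors only the parent/child
# relationship discussed above, not the real google.cloud.exceptions classes).
class ServerError(Exception):
    """Stand-in base class for 5xx errors."""


class ServiceUnavailable(ServerError):
    """Stand-in for a 503 error; subclasses ServerError."""


class ClientError(Exception):
    """Stand-in for a 4xx error, which is not retried in this sketch."""


# Assumed shape of a retryable-errors tuple: only the base class is listed.
RETRYABLE_ERRORS = (ServerError, ConnectionError)


def is_retryable(error: Exception) -> bool:
    # isinstance() also matches subclasses, so ServiceUnavailable is covered
    # without being listed explicitly.
    return isinstance(error, RETRYABLE_ERRORS)
```

Under these assumptions, a check based on isinstance would already treat a ServiceUnavailable instance as retryable, which is consistent with the test passing without any change to the list.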
Adding it to the test_is_retryable test like that makes sense 👍
But ... the thing that is surprising to me: if ServiceUnavailable inherits from ServerError and your modified test passes, then why is this not being retried?
Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?
@jx2lee Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?
@dbeatty10
Is it possible that the BigQuery client is raising a different error class for 503 errors other than ServiceUnavailable?
No, I don't think that's possible.
Error classes are created by the from_http_status and from_grpc_status functions in google.api_core.exceptions, and for a 503 the class these functions generate is always ServiceUnavailable.
Do you happen to have any python stacktraces available where you ran into this problem and dbt-bigquery didn't retry?
I have never run into this issue myself... 🙃 I would need more detailed logs from when it happened.
IMO, if the issue reporter can't provide more error logs, I think it's okay to close the issue.
@dbeatty10 Is there anything else I should check?
We did hit this recently. We use external-tables in an on-run-start macro. We also use service account impersonation in the dbt profile. While running dbt docs generate in a CI environment we got:
('Unable to acquire impersonated credentials', '{\n "error": {\n "code": 503,\n "message": "Authentication backend unavailable.",\n "status": "UNAVAILABLE"\n }\n}\n')
Because this happens intermittently on an isolated system, I don't have more logs.
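As a rough sketch of how such a failure could be recognized as transient, the snippet below inspects the two-element payload shown in the log above. It assumes the exception exposes that tuple as its args; the function name is_transient_credentials_error and the stand-in Exception are hypothetical, and nothing like this is confirmed to exist in dbt-bigquery.

```python
import json


def is_transient_credentials_error(exc: Exception) -> bool:
    """Return True if exc.args carries a JSON body reporting HTTP 503."""
    # Assumed layout: args == (message, json_body), as in the log line above.
    if len(exc.args) < 2 or not isinstance(exc.args[1], str):
        return False
    try:
        body = json.loads(exc.args[1])
    except json.JSONDecodeError:
        return False
    error = body.get("error", {}) if isinstance(body, dict) else {}
    return isinstance(error, dict) and error.get("code") == 503


# Stand-in exception built from the payload reported above.
example = Exception(
    "Unable to acquire impersonated credentials",
    '{\n "error": {\n "code": 503,\n "message": "Authentication backend '
    'unavailable.",\n "status": "UNAVAILABLE"\n }\n}\n',
)
```

If the error in this report is not raised as a ServiceUnavailable instance, a retry check based on exception classes would not match it, so that possibility may be worth ruling out.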
Thanks for this report @rrbarbosa !
Since this is intermittent (and maybe relatively rare also), it has been hard to nail down.
If anyone can provide information to suggest that dbt is not retrying at least once, that would be very helpful 🙏
@jx2lee -- would you be willing to raise a PR with the addition you made to this test case?
I think that would be sufficient for us to establish that ServiceUnavailable is retryable (which would allow us to close this issue).
@dbeatty10 Okay, I'll create a PR that includes the above test code soon!
@dbeatty10 I created the PR! Could you edit the PR body or add a comment to make it easier for reviewers to understand?
I'm not sure if this is the same code path, but we are seeing a problem with Dataproc jobs (Python models) that dbt submits. dbt successfully submits the batch job; then, during the polling in dbt-labs/dbt-bigquery/dbt/adapters/bigquery/dataproc/batch.py#poll_batch_job, one of the polling calls returns a 503 that is presumably not retried, and dbt marks the model as failed, even though the Dataproc job is still running in the background and eventually completes successfully.
00:25:50 BigQuery adapter: Submitting batch job with id: 5f6d87c9-4045-4208-8941-03fbb8facf30
00:29:58 Unhandled error while executing target/run/core/models/working_tables/WT_rfm_status.py
503 502:Bad Gateway
00:29:58 58 of 63 ERROR creating python table model working_tables.WT_rfm_status ........ ERROR in 248.55s
We have seen the issue twice in a week, and we are running dbt-bigquery 1.8.1.
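One way a single polling call could tolerate a transient 503 is sketched below. This is a generic retry wrapper with exponential backoff, not the adapter's poll_batch_job code; call_with_retry and all of its parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    transient: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on `transient` errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except transient:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

A polling loop could wrap each status request in a helper like this, so that one failed request does not fail the model while the batch job is still running. Which exception classes count as transient is left to the caller.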
Got hit by this issue today, while generating "seed" tables with dbt running in Cloud Build:
"Step #7 - "dbt-seed": ('Unable to acquire impersonated credentials', '{\n "error": {\n "code": 503,\n "message": "The service is currently unavailable.",\n "status": "UNAVAILABLE"\n }\n}\n')"
We're using impersonation with dbt-bigquery, and it seems IAM was unavailable for a moment. We have no explicit retry configured, so, according to the docs, it should retry once, but I see no sign of that in the logs.
GH closed this because an attached PR was merged. I think there is more to this, so I'm leaving it open.
Is this a new bug in dbt-bigquery?
Current Behavior
Currently, if BigQuery returns a 503 error, we do not retry, even though BigQuery recommends that as the course of action.
Expected Behavior
This is not a regression but rather an oversight -- 503 errors should be both retryable and reopenable since they indicate a transient unavailable condition in BigQuery.
Steps To Reproduce
Transient -- requires intermittent error from BQ.
Relevant log output
No response
Environment
Additional Context
No response