PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

ssl.SSLEOFError during Google Cloud Run Job #13738

Open jeremy-thomas-roc opened 4 months ago

jeremy-thomas-roc commented 4 months ago

Bug summary

While a flow run was executing, Prefect logged a long stack trace in the UI that did not appear in the Cloud Run Job logs. The trace reports an SSL EOF error raised while the worker was monitoring the flow run and warns that something may have gone wrong, but the flow run is neither marked as failed nor canceled. The Cloud Run job itself kept executing, since nothing failed on the Google side, but the flow run stays in the Running state in Prefect until I cancel it manually.

Reproduction

This has happened to multiple flow runs and is not limited to any deployment in particular.

Error

An error occurred while monitoring flow run '604acb03-918c-4103-bf6f-2d39fcc85617'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/workers/base.py", line 908, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/site-packages/prefect_gcp/workers/cloud_run_v2.py", line 460, in run
    result = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 136, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect_gcp/workers/cloud_run_v2.py", line 731, in _watch_job_execution_and_get_result
    execution = self._watch_job_execution(
  File "/usr/local/lib/python3.10/site-packages/prefect_gcp/workers/cloud_run_v2.py", line 805, in _watch_job_execution
    execution = ExecutionV2.get(
  File "/usr/local/lib/python3.10/site-packages/prefect_gcp/models/cloud_run_v2.py", line 361, in get
    response = request.execute()
  File "/usr/local/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/googleapiclient/http.py", line 923, in execute
    resp, content = _retry_request(
  File "/usr/local/lib/python3.10/site-packages/googleapiclient/http.py", line 222, in _retry_request
    raise exception
  File "/usr/local/lib/python3.10/site-packages/googleapiclient/http.py", line 191, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/google_auth_httplib2.py", line 209, in request
    self.credentials.before_request(self._request, method, uri, request_headers)
  File "/usr/local/lib/python3.10/site-packages/google/auth/credentials.py", line 230, in before_request
    self._blocking_refresh(request)
  File "/usr/local/lib/python3.10/site-packages/google/auth/credentials.py", line 193, in _blocking_refresh
    self.refresh(request)
  File "/usr/local/lib/python3.10/site-packages/google/oauth2/service_account.py", line 445, in refresh
    access_token, expiry, _ = _client.jwt_grant(
  File "/usr/local/lib/python3.10/site-packages/google/oauth2/_client.py", line 308, in jwt_grant
    response_data = _token_endpoint_request(
  File "/usr/local/lib/python3.10/site-packages/google/oauth2/_client.py", line 268, in _token_endpoint_request
    response_status_ok, response_data, retryable_error = _token_endpoint_request_no_throw(
  File "/usr/local/lib/python3.10/site-packages/google/oauth2/_client.py", line 215, in _token_endpoint_request_no_throw
    request_succeeded, response_data, retryable_error = _perform_request()
  File "/usr/local/lib/python3.10/site-packages/google/oauth2/_client.py", line 191, in _perform_request
    response = request(
  File "/usr/local/lib/python3.10/site-packages/google_auth_httplib2.py", line 119, in __call__
    response, data = self.http.request(
  File "/usr/local/lib/python3.10/site-packages/httplib2/__init__.py", line 1724, in request
    (response, content) = self._request(
  File "/usr/local/lib/python3.10/site-packages/httplib2/__init__.py", line 1444, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python3.10/site-packages/httplib2/__init__.py", line 1367, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/local/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.10/http/client.py", line 999, in send
    self.sock.sendall(data)
  File "/usr/local/lib/python3.10/ssl.py", line 1270, in sendall
    v = self.send(byte_view[count:])
  File "/usr/local/lib/python3.10/ssl.py", line 1239, in send
    return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)
03:35:11 PM
prefect.flow_runs.worker
Encountered an exception while waiting for job run completion - EOF occurred in violation of protocol (_ssl.c:2426)

Versions

Version:             2.16.8
API version:         0.8.4
Python version:      3.11.7
Git commit:          11cb641c
Built:               Fri, Mar 29, 2024 11:01 AM
OS/Arch:             darwin/x86_64
Profile:             default
Server type:         cloud

Additional context

This typically happens after the flow run has been running for about an hour.
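
In the traceback above, the error surfaces while the worker polls the execution status (`ExecutionV2.get`) and the Google auth client refreshes the service account token. Google access tokens usually expire after about an hour, which would fit the timing, though that connection is an assumption and not confirmed here. As a rough sketch of a possible workaround, assuming the failure is a transient connection drop, the polling call could be wrapped in a retry with exponential backoff. `call_with_retries` and the choice of `TRANSIENT_ERRORS` are hypothetical and not part of Prefect or prefect-gcp.

```python
import ssl
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Exception types treated as transient. This selection is an assumption;
# ssl.SSLEOFError (seen in the traceback) is a subclass of ssl.SSLError.
TRANSIENT_ERRORS = (ssl.SSLError, ConnectionError, TimeoutError)


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient network errors with exponential backoff.

    Hypothetical helper for illustration only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... seconds.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

A caller would pass the status request as a zero-argument callable, for example `call_with_retries(lambda: request.execute())`, where `request` stands for whatever request object the polling loop already builds. Whether retrying is safe depends on the call being idempotent, which seems plausible for a read-only status check.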

ckeogh-loam commented 3 months ago

I've seen this issue crop up multiple times on our infrastructure too.