fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

gcsfs: Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object #546

Closed sl2902 closed 6 days ago

sl2902 commented 1 year ago

I am trying to run a Prefect deployment using Docker containers. I have created a Docker container Prefect block and a GCP credentials block using the service account credentials file, which I load inside the Prefect flow. However, when I read a Parquet file (I tried with both Pandas and PyArrow), I get the following error:

02:05:07.717 | INFO    | prefect.infrastructure.docker-container - Pulling image 'sl02/xetra:latest'...
02:05:10.920 | INFO    | prefect.infrastructure.docker-container - Creating Docker container 'tough-beluga'...
02:05:10.964 | INFO    | prefect.infrastructure.docker-container - Docker container 'tough-beluga' has status 'created'
02:05:11.248 | INFO    | prefect.infrastructure.docker-container - Docker container 'tough-beluga' has status 'running'
02:05:11.579 | INFO    | prefect.agent - Completed submission of flow run '8b3319e4-047e-4976-8d0b-3848e6cf3113'
/usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
20:35:17.862 | INFO    | Flow run 'tough-beluga' - Downloading flow code from storage at 'scripts/'
20:35:21.638 | INFO    | Flow run 'tough-beluga' - Created task run 'dataset_load_check-0' for task 'dataset_load_check'
20:35:21.639 | INFO    | Flow run 'tough-beluga' - Executing 'dataset_load_check-0' immediately...
20:35:23.259 | INFO    | Task run 'dataset_load_check-0' - Finished in state Completed()
20:35:24.339 | INFO    | Flow run 'tough-beluga' - Created subflow run 'sage-hedgehog' for flow 'Pipeline to read files from GCS and load to BigQuery'
20:35:25.905 | INFO    | Flow run 'sage-hedgehog' - Created task run 'list files from gcs-0' for task 'list files from gcs'
20:35:25.906 | INFO    | Flow run 'sage-hedgehog' - Executing 'list files from gcs-0' immediately...
20:35:27.583 | INFO    | Task run 'list files from gcs-0' - Finished in state Completed()
20:35:27.585 | INFO    | Flow run 'sage-hedgehog' - file: data/xetra/2022-04-05/2022-04-05_BINS_XETR07.parquet
20:35:27.976 | INFO    | Flow run 'sage-hedgehog' - Created task run 'gcs_to_bq-0' for task 'gcs_to_bq'
20:35:27.978 | INFO    | Flow run 'sage-hedgehog' - Executing 'gcs_to_bq-0' immediately...
20:35:31.971 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
20:35:31.974 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 111] Connection refused
20:35:31.975 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: [Errno 111] Connection refused
20:35:31.976 | WARNING | google.auth._default - Authentication failed using Compute Engine authentication due to unavailable metadata server.
20:35:31.983 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f84924e3b20>: Failed to establish a new connection: [Errno -2] Name or service not known'))
20:35:31.990 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f84924e3fa0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
20:35:31.995 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f84947c54f0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
20:35:32.000 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f84947c5a00>: Failed to establish a new connection: [Errno -2] Name or service not known'))
20:35:32.005 | WARNING | google.auth.compute_engine._metadata - Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f84947c5f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
20:35:32.381 | INFO    | Task run 'gcs_to_bq-0' - Traceback (most recent call last):
20:35:32.382 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/retry.py", line 114, in retry_request
    return await func(*args, **kwargs)
20:35:32.385 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 411, in _request
    validate_response(status, contents, path, args)
20:35:32.386 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/retry.py", line 101, in validate_response
    raise HttpError(error)
20:35:32.387 | INFO    | Task run 'gcs_to_bq-0' - gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)., 401
20:35:32.380 | ERROR   | gcsfs - _request non-retriable exception: Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)., 401
20:35:32.391 | INFO    | Task run 'gcs_to_bq-0' - Traceback (most recent call last):
20:35:32.392 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 1551, in orchestrate_task_run
    result = await call.aresult()
20:35:32.393 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 181, in aresult
    return await asyncio.wrap_future(self.future)
20:35:32.394 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 194, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
20:35:32.395 | INFO    | Task run 'gcs_to_bq-0' -   File "scripts/gcs_to_bq.py", line 89, in gcs_to_bq
    with gc_fs.open(f"gs://{bucket_name}/{file}") as f:
20:35:32.396 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1135, in open
    f = self._open(
20:35:32.397 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 1312, in _open
    return GCSFile(
20:35:32.398 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 1471, in __init__
    super().__init__(
20:35:32.399 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1491, in __init__
    self.size = self.details["size"]
20:35:32.399 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 1507, in details
    self._details = self.fs.info(self.path, generation=self.generation)
20:35:32.400 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 114, in wrapper
    return sync(self.loop, func, *args, **kwargs)
20:35:32.401 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 99, in sync
    raise return_result
20:35:32.402 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 54, in _runner
    result[0] = await coro
20:35:32.403 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 791, in _info
    exact = await self._get_object(path)
20:35:32.404 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 491, in _get_object
    res = await self._call(
20:35:32.405 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 418, in _call
    status, headers, info, contents = await self._request(
20:35:32.406 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/decorator.py", line 221, in fun
    return await caller(func, *(extras + args), **kw)
20:35:32.407 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/retry.py", line 149, in retry_request
    raise e
20:35:32.408 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/retry.py", line 114, in retry_request
    return await func(*args, **kwargs)
20:35:32.409 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/core.py", line 411, in _request
    validate_response(status, contents, path, args)
20:35:32.410 | INFO    | Task run 'gcs_to_bq-0' -   File "/usr/local/lib/python3.8/site-packages/gcsfs/retry.py", line 101, in validate_response
    raise HttpError(error)
20:35:32.411 | INFO    | Task run 'gcs_to_bq-0' - gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)., 401

I don't see this issue when the script runs locally.

gcsfs version = 2023.1.0

martindurant commented 1 year ago

GCP credentials block using the service account credentials file

What does this mean?

You can specify token= when instantiating GCSFileSystem, which can point to any gcloud JSON file, it sounds like this might be what you need.
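A minimal sketch of that suggestion, with placeholder paths, project, and bucket names (none of them are from this thread). The helper below is hypothetical; only `token=` and `project=` are real `GCSFileSystem` parameters:

```python
# Hypothetical helper: build the keyword arguments gcsfs.GCSFileSystem
# accepts for explicit service-account auth. token= may point at any
# gcloud JSON key file.
def gcsfs_auth_kwargs(key_path, project=None):
    kwargs = {"token": key_path}
    if project is not None:
        kwargs["project"] = project  # recommended; matters for some operations
    return kwargs

# Usage (requires gcsfs and real credentials, so left commented here):
# import gcsfs
# fs = gcsfs.GCSFileSystem(**gcsfs_auth_kwargs("/path/to/sa.json", "my-project"))
# with fs.open("gs://my-bucket/data/file.parquet", "rb") as f:
#     data = f.read()
```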

sl2902 commented 1 year ago

This is an example of what it looks like to create a Prefect block programmatically. Prefect is a workflow orchestration tool.

from prefect_gcp import GcpCredentials

# replace this PLACEHOLDER dict with your own service account info
service_account_info = {
  "type": "service_account",
  "project_id": "PROJECT_ID",
  "private_key_id": "KEY_ID",
  "private_key": "-----BEGIN PRIVATE KEY-----\nPRIVATE_KEY\n-----END PRIVATE KEY-----\n",
  "client_email": "SERVICE_ACCOUNT_EMAIL",
  "client_id": "CLIENT_ID",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/SERVICE_ACCOUNT_EMAIL"
}

GcpCredentials(
    service_account_info=service_account_info
).save("BLOCK-NAME-PLACEHOLDER")

After persisting this on Prefect Cloud, I can reference the keys inside a flow.

You can specify token= when instantiating GCSFileSystem, which can point to any gcloud JSON file, it sounds like this might be what you need.

I tried using project = credentials, where credentials are loaded from the Prefect GCP Credentials block. Hope that makes sense.

martindurant commented 1 year ago

It is not project but token that you need to set to the saved credential file's location. Ideally, you should also provide project= (this can matter for some operations).
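Since the credentials here live in a Prefect block rather than a file on disk, one possible route (a sketch, not a confirmed fix) is to hand gcsfs the service-account dict itself: `token=` also accepts a credentials dict, not just a path. The `GcpCredentials.load(...).service_account_info.get_secret_value()` access pattern is an assumption about the prefect_gcp block API:

```python
# Hedged sketch: feed the Prefect block's service-account dict straight
# into gcsfs. token= accepts a dict of credentials as well as a file path.
def gcsfs_kwargs_from_service_account(info):
    # project= is read from the key file's own project_id; it should
    # ideally be set explicitly, as it can matter for some operations.
    return {"token": info, "project": info.get("project_id")}

# Inside a flow (assumes prefect_gcp's GcpCredentials block API):
# from prefect_gcp import GcpCredentials
# import gcsfs
# info = GcpCredentials.load("BLOCK-NAME-PLACEHOLDER").service_account_info.get_secret_value()
# fs = gcsfs.GCSFileSystem(**gcsfs_kwargs_from_service_account(info))
```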

danielgafni commented 6 days ago

I can confirm this currently works on 2024.5.0 with GOOGLE_APPLICATION_CREDENTIALS pointing to a service_account.json file.
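That approach goes through Google's Application Default Credentials: when no explicit token is given, gcsfs falls back to google_default, which honors the GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch (the key-file path is a placeholder):

```python
import os

# Point Application Default Credentials at the key file before any GCS
# client is constructed; gcsfs's default token resolution picks this up.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account.json"

# import gcsfs
# fs = gcsfs.GCSFileSystem()  # no explicit token= needed now
```

In a Docker deployment, the same variable would typically be set in the container environment (with the key file mounted in) rather than in Python.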

sl2902 commented 6 days ago

@danielgafni Thanks for the update! I will check it out