Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Logs don't show up on the console. gs_tail raises NotFound error #1758

Open erdememekligil opened 8 months ago

erdememekligil commented 8 months ago

Logs do not show up for the steps that ran on GCP.

I've traced it to gs_tail.py. It reuses the same blob object (self._blob_client) every time it fetches the latest logs. However, once this object has been loaded, its generation field is pinned to the generation the file had at that moment. When the file is updated, that generation no longer exists, so the next request made through the same object raises a NotFound error.

https://github.com/Netflix/metaflow/blob/cbf9b7f198bf2f1e255e0dda5c47324b63cc8bd3/metaflow/plugins/gcp/gs_tail.py#L49

404 GET https://storage.googleapis.com/download/storage/v1/b/testbucket/o/tf-full-stack-sysroot%2FSimpleTestFlow%2F18%2Frun_on_cpu_remote%2F274258%2F0.task_stdout.log?alt=media&generation=1709809091417399: No such object: testbucket/tf-full-stack-sysroot/SimpleTestFlow/18/run_on_cpu_remote/274258/0.task_stdout.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

The behaviour can be reproduced as follows:

>>> from google.cloud import storage
>>> b = storage.Client().bucket("testbucket").blob("hello.json")
>>> b.download_as_bytes()
b'{\n    "text": "Hello from the file in the bucket"\n}'

At this point the same file was re-uploaded, overwriting the object. Downloading through the same blob object then fails:

>>> b.download_as_bytes()
Traceback (most recent call last):
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/cloud/storage/client.py", line 1151, in download_blob_to_file
    blob_or_uri._do_download(
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/cloud/storage/blob.py", line 989, in _do_download
    response = download.consume(transport, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/resumable_media/requests/download.py", line 237, in consume
    return _request_helpers.wait_and_retry(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/resumable_media/requests/_request_helpers.py", line 155, in wait_and_retry
    response = func()
               ^^^^^^
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/resumable_media/requests/download.py", line 219, in retriable_request
    self._process_response(result)
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/resumable_media/_download.py", line 188, in _process_response
    _helpers.require_status_code(
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/resumable_media/_helpers.py", line 108, in require_status_code
    raise common.InvalidResponse(
google.resumable_media.common.InvalidResponse: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/cloud/storage/blob.py", line 1401, in download_as_bytes
    client.download_blob_to_file(
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/cloud/storage/client.py", line 1164, in download_blob_to_file
    _raise_from_invalid_response(exc)
  File "/Users/erdememekligil/miniconda3/envs/gcp-metaflow/lib/python3.11/site-packages/google/cloud/storage/blob.py", line 4457, in _raise_from_invalid_response
    raise exceptions.from_http_status(response.status_code, message, response=response)
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/download/storage/v1/b/testbucket/o/hello.json?alt=media&generation=1709894045845102: No such object: testbucket/hello.json: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
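
One possible workaround is to stop reusing the blob object and fetch a fresh handle on every poll, so a generation recorded by an earlier request is never sent again. The sketch below is only an illustration of that idea, not the actual gs_tail.py code or a tested patch: `read_new_bytes` is a hypothetical helper, and `bucket` is assumed to be a `google.cloud.storage.Bucket`. It also assumes the log object only ever grows, which is what tailing relies on.

```python
def read_new_bytes(bucket, blob_name, offset):
    """Return (data, new_offset) for the bytes stored past `offset`.

    Hypothetical helper: `bucket` is assumed to be a
    google.cloud.storage.Bucket. A new Blob handle is fetched on every
    call, so a generation seen during an earlier poll is never reused.
    """
    # get_blob() fetches current metadata and returns None if the object is missing.
    blob = bucket.get_blob(blob_name)
    if blob is None or blob.size is None or blob.size <= offset:
        # Nothing new yet (or the object does not exist yet).
        return b"", offset
    # Download only the bytes from `offset` to the end of the object.
    data = blob.download_as_bytes(start=offset)
    return data, offset + len(data)


# Intended usage against a real bucket (requires credentials, not run here):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("testbucket")
#   data, offset = read_new_bytes(bucket, "hello.json", 0)
```

The fresh handle still carries the generation returned by `get_blob`, so an overwrite that lands between the metadata fetch and the download could in principle still produce a 404. The window is a single poll rather than the lifetime of the tailer, and the next poll would start from a new handle.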

I'm not sure why this error hasn't happened until now. I've also tested it with older versions of the Google components:

Setup

metaflow                 2.11.2
google-api-core          2.17.1
google-auth              2.28.1
google-cloud-core        2.4.1
google-cloud-storage     2.15.0
google-crc32c            1.5.0
google-resumable-media   2.7.0
googleapis-common-protos 1.62.0

Older setup

metaflow                 2.11.2
google-api-core          2.10.2
google-auth              2.12.0
google-cloud-core        2.4.1
google-cloud-storage     2.5.0
google-crc32c            1.5.0
google-resumable-media   2.7.0
googleapis-common-protos 1.62.0

savingoyal commented 3 months ago

@erdememekligil apologies for the delay here. Are you still running into this issue?