fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Release 0.7.0 causes GCP Cloud Function to fail to deploy #278

Open nmetts opened 3 years ago

nmetts commented 3 years ago

What happened: We are using Cloud Functions in Google Cloud Platform with a Python 3.7 runtime, and we use gcsfs to access files stored in Google Storage buckets. After adding some new features to an existing, working function, an attempt to redeploy the function failed with the following error message in the Logs Viewer:

Error: function terminated. Recommended action: inspect logs for termination reason. Function cannot be initialized.

The Cloud Function was then unable to be tested since deployment failed.

We simplified our code until we were able to determine that the only issue was gcsfs in the requirements.txt file. After checking the release history and seeing that 0.7.0 was released on August 21, 2020, which coincided with the deploy failure of our Cloud Function, we decided to specify version 0.6.2 in the requirements.txt file. With this change made, the Cloud Function successfully deployed.
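
For reference, the workaround is just a pin in requirements.txt (other entries omitted):

gcsfs==0.6.2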

What you expected to happen: We expected our Cloud Function to continue working as it had been before.

Minimal Complete Verifiable Example:

import base64
import datetime
import json
import os

def hello_gcs(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
         event (dict): Event payload.
         context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    print(f"Processing file: {file['name']}.")

The issue appears to be with installing gcsfs 0.7.0 via pip, so no code is required to trigger this bug.

Anything else we need to know?:

Environment: Google Cloud Platform Cloud Functions

martindurant commented 3 years ago

Sorry, there isn't enough information here for me to have an opinion - what call were you attempting, and what traceback did you get?

nmetts commented 3 years ago

We weren't calling any functions from gcsfs. Just having gcsfs in the requirements.txt file caused the Cloud Function to fail to deploy. As soon as we specified gcsfs==0.6.2 in requirements.txt the Cloud Function was able to deploy. I added the MCVE to the issue. As you can see, the code snippet, which was simplified as much as possible, did not even import gcsfs, let alone call any functions from the package. Something in the Cloud Functions environment does not agree with gcsfs version 0.7.0.

martindurant commented 3 years ago

You must have some sort of logging about what went wrong during environment preparation. For example, the following works fine for a new blank environment (I create with conda, but the install is with pip, just like yours):

$ conda create -n testme python=3.7.6 pip
$ conda activate testme
$ pip install gcsfs
$ python
>>> import gcsfs
>>> gcsfs.__version__
'0.7.0'

nmetts commented 3 years ago

Deploying from the Cloud Functions GUI results in the following:

Error: function terminated. Recommended action: inspect logs for termination reason. Function cannot be initialized.

Deploying from the terminal on my local machine results in the following:

$ gcloud functions deploy test-fn --verbosity debug
DEBUG: Running [gcloud.functions.deploy] with arguments: [--verbosity: "debug", NAME: "stevea-test-fn"]
Deploying function (may take a while - up to 2 minutes)...failed.
DEBUG: (gcloud.functions.deploy) OperationError: code=13, message=Function deployment failed due to a health check failure. This usually indicates that your code was built successfully but failed during a test execution. Examine the logs to determine the cause. Try deploying again in a few minutes if it appears to be transient.
Traceback (most recent call last):
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 848, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 770, in Run
    resources = command_instance.Run(args)
  File "/Users/test_user/google-cloud-sdk/lib/surface/functions/deploy.py", line 180, in Run
    return _Run(args, track=self.ReleaseTrack())
  File "/Users/test_user/google-cloud-sdk/lib/surface/functions/deploy.py", line 150, in _Run
    return api_util.PatchFunction(function, updated_fields)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/util.py", line 308, in CatchHTTPErrorRaiseHTTPExceptionFn
    return func(*args, **kwargs)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/util.py", line 364, in PatchFunction
    operations.Wait(op, messages, client, _DEPLOY_WAIT_NOTICE)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/operations.py", line 126, in Wait
    _WaitForOperation(client, request, notice)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/operations.py", line 101, in _WaitForOperation
    sleep_ms=SLEEP_MS)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 219, in RetryOnResult
    result = func(*args, **kwargs)
  File "/Users/test_user/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/operations.py", line 65, in _GetOperationStatus
    raise exceptions.FunctionsError(OperationErrorToString(op.error))
FunctionsError: OperationError: code=13, message=Function deployment failed due to a health check failure. This usually indicates that your code was built successfully but failed during a test execution. Examine the logs to determine the cause. Try deploying again in a few minutes if it appears to be transient.
ERROR: (gcloud.functions.deploy) OperationError: code=13, message=Function deployment failed due to a health check failure. This usually indicates that your code was built successfully but failed during a test execution. Examine the logs to determine the cause. Try deploying again in a few minutes if it appears to be transient.

martindurant commented 3 years ago

This is the traceback from your CLI. Can you get the traceback/log from the actual deployment?

It may be obvious by now, but I have no idea how "Cloud Functions" work.

nmetts commented 3 years ago

Essentially, Google Cloud Functions creates a container for your code and deploys it on Google Cloud Platform to be invoked by some event (HTTP call, message to a Pub/Sub topic, etc.). You can read more about it here: https://cloud.google.com/functions/docs/writing

The traceback above is from attempting to deploy a Cloud Function to GCP, not locally; the only difference is that in that example I am executing the deploy command from my CLI. Logging for Google's Cloud Functions is abysmal and generally appears to exclude tracebacks or really any useful information at all, which is why I used the CLI example: it actually gives a debug option. This is all that is shown in the logs for the Cloud Function deployment in the GCP GUI:

Error: function terminated. Recommended action: inspect logs for termination reason. Function cannot be initialized.

martindurant commented 3 years ago

Hm, I'm a bit stuck, because the slightly more detailed log from the CLI also says "Examine the logs to determine the cause"! That suggests there is more detail somewhere.

nmetts commented 3 years ago

I'll see what I can find. If there are additional details somewhere, it is not obvious where to find them. I'll open an issue with GCP as well.

yan-hic commented 3 years ago

@martindurant just as an FYI, we faced an issue with 0.7.0 as well, not in GCF but in a local (and k8s) container. Symptom: an Airflow DAG using gcsfs timed out. A colleague mentioned a hash problem in the gcsfs lib, but I'm not sure; we will troubleshoot later. Until then we have reverted back to 0.6.2. Since we use gcsfs in GCF too, we will pin the dependency to be safe.

Do you have more details on the 0.7.0 changes (or in the fsspec dependency)? aiohttp cannot be the only change.

martindurant commented 3 years ago

0.7.0 was a big rewrite to allow async - and many code paths changed to make this happen. In this case, the changes are too many to list, really. I am not sure what "hash" problem there could be, though, and any help you can give in pinpointing it would be appreciated.

yan-hic commented 3 years ago

Here is what happens: 0.7.x has a dependency on aiohttp 3.7.3, which requires chardet <4.0.

Google has no documentation on the libs included with the Python 3.8 runtime image, but it must include chardet 4.0.0, as that's the error I get when deploying.

So the solution is either to pin a lower version in requirements.txt, e.g. chardet==3.0.4, or (what I have chosen) to skip the use of chardet as per https://docs.aiohttp.org/en/stable/ by adding aiohttp[speedups] to the requirements.
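
In requirements.txt terms, the two workarounds would look roughly like this (pins are illustrative only; other entries omitted):

# Option 1: keep chardet below 4.0
gcsfs==0.7.0
chardet==3.0.4

# Option 2 (what I chose): the speedups extra pulls in cchardet, which aiohttp prefers over chardet
gcsfs==0.7.0
aiohttp[speedups]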

@nmetts after that you should be able to deploy. @martindurant may want to add that to the gcsfs dependencies instead.

yan-hic commented 3 years ago

Update: although the function can be deployed, 0.7.x still "hangs" on any method of GCSFileSystem(). It seems the issue exists when running on any GCP resource, as we experienced the same on GKE.

To reproduce, create a gcs background function through the console and use the following code:

import gcsfs
gs_filesys = gcsfs.GCSFileSystem()

def hello_gcs(event, context):
    file = event
    print(f"Processing file: {file['name']}.")
    print(event, context)

    file_name = event['name']
    input_bucket = event['bucket']
    file_path = f'{input_bucket}/{file_name}'

    if gs_filesys.exists(file_path):
        print('Processing ' + file_path)

The exists() call never returns and the function ends up timing out. Downgrading to 0.6.2 works.

martindurant commented 3 years ago

Could you please try with gcsfs.GCSFileSystem(token="cloud")?

create a gcs background function

What does this mean?

Can you please configure the "gcsfs" logger or set the environment variable GCSFS_DEBUG? You should (hopefully) then get more information about what gcsfs is trying to do.
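
For example, something along these lines (the exact value GCSFS_DEBUG is checked against, and that it needs to be set before import, are assumptions here):

import logging
import os

# Assumed: set before gcsfs is imported so the flag is seen at import time.
os.environ["GCSFS_DEBUG"] = "1"

# Or configure the "gcsfs" logger directly.
logging.basicConfig()
logging.getLogger("gcsfs").setLevel(logging.DEBUG)

import gcsfs

# Use the credentials from the GCP metadata server, as suggested above.
gs_filesys = gcsfs.GCSFileSystem(token="cloud")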

yan-hic commented 3 years ago

Yep - I had tried all that.

I wish I could debug exists(), but locally (using Win10, Ubuntu on WSL, or a 3.8 Docker image) it all works fine. So something on GCP is holding things up. Unfortunately there is no timeout error, so I don't know what gcsfs is trying to do. The only timeout is from GCF when the function times out and the instance gets killed. Are there any other flags to produce more logs?

As to how to deploy a GCF function: https://cloud.google.com/functions/docs/deploying/console

martindurant commented 3 years ago

Do you explicitly supply a google credentials file for the function?

yan-hic commented 3 years ago

Never. gcsfs.GCSFileSystem() just works every time on < 0.7.0, either locally or on GCP.

It works locally because the env var GOOGLE_APPLICATION_CREDENTIALS points to the JSON file. Not sure how it works on GCP.
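
For reference, the local setup amounts to roughly this (the key path is a placeholder):

import os

import gcsfs

# Application Default Credentials are resolved from this env var locally.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

fs = gcsfs.GCSFileSystem()  # no explicit token: default credential lookup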

Note that 0.7.x on GCP does not error out on GCSFileSystem() instantiation - it never does, but any subsequent method call runs forever.

rm3195 commented 3 years ago

We have the same issue in a GCP Cloud Composer DAG - a call to gcsfs to list a GCS bucket just hangs forever. The interesting thing here is that the issue happens in Composer versions newer than composer-1.13.0-airflow-1.10.12; the same code on composer-1.13.0-airflow-1.10.12 runs without issues, so I guess it has something to do with Google (although they list no visible changes in the Composer changelog). I've reported an issue in the GCP bug tracker and will update it soon with the latest findings.

yan-hic commented 3 years ago

@martindurant would it make sense to fork the issue / create a new one, as deploying on GCF is different from running on GCP (any component)?

martindurant commented 3 years ago

Sure, feel free

yan-hic commented 3 years ago

Puzzling:

This cannot be deployed without pinning chardet or e.g. installing aiohttp[speedups]:

import gcsfs

gcs = gcsfs.GCSFileSystem()

def hello_gcs(event, context):

    print(gcsfs._version.get_versions())

    file_name = event['name']
    input_bucket = event['bucket']
    file_path = f'{input_bucket}/{file_name}'

    if gcs.exists(file_path):
        print('Confirmed exists: ' + file_path)

Error: Build failed: found incompatible dependencies: "aiohttp 3.7.3 has requirement chardet<4.0,>=2.0, but you have chardet 4.0.0."; Error ID: 5bbe2179

This can be deployed as-is - just installing gcsfs:

import gcsfs

gcs = gcsfs.GCSFileSystem()

def hello_gcs(event, context):
    gcs.maybe_refresh()
    gcs._connect_cloud()

    print(gcsfs._version.get_versions())

    file_name = event['name']
    input_bucket = event['bucket']
    file_path = f'{input_bucket}/{file_name}'

    if gcs.exists(file_path):
        print('Confirmed exists: ' + file_path)

yan-hic commented 3 years ago

Well, ignore the previous comment. I managed to deploy as-is just by retrying. The issue might have been intermittent on the GCP side. To be safe, though, I would add cchardet to requirements.txt.

martindurant commented 3 years ago

Would you mind making that a PR? So chardet should be v4 or any version, and does its order versus aiohttp matter?

yan-hic commented 3 years ago

I have been testing a dozen variations of a function and can confirm that the issue is with GCP when first deployed. More specifically, even this minimal function fails on the first deployment:

import gcsfs 

gcs = gcsfs.GCSFileSystem()

def hello_world(request):
    return 'test'

As such, I have logged a support ticket with GCP, as there is nothing wrong with gcsfs. It looks like GCF installs chardet 4.0 after requirements.txt is first processed.

Note: the OP may have experienced another issue, as aiohttp 3.7.3 was not out yet when this was first posted - if that is relevant in any way. The point is that today I cannot reproduce a deployment issue when selecting the 3.7 runtime.

martindurant commented 3 years ago

Thanks @yiga2. Is there a public link to the issue? If not, please report back when you hear anything.

Aside from the build issue, did you try instantiating GCSFileSystem within your function?
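
For example, something like this, reusing the handler from earlier in the thread (a sketch, not tested on GCF):

import gcsfs

def hello_gcs(event, context):
    # Construct the filesystem per invocation instead of at import time,
    # so nothing gcsfs-related runs while the function is being deployed.
    gcs = gcsfs.GCSFileSystem()

    file_path = f"{event['bucket']}/{event['name']}"
    if gcs.exists(file_path):
        print('Confirmed exists: ' + file_path)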

yan-hic commented 3 years ago

It's a private issue, to get more traction - I will report back, and will ask Google to create a public issue if it cannot be remediated quickly.

I'm responding to the second question in the other thread, as it is unrelated to GCF deployment.