allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Remove `subdir` path from GCP bucket url #1117

Closed: materight closed this 11 months ago

materight commented 1 year ago

Patch Description

When running a cloned task remotely and using GS as storage, calling task.logger.set_default_upload_destination usually fails with an error like:

GET https://storage.googleapis.com/storage/v1/b/clearml/project_a/.pipelines/Main/task_1.d7691550c955409db4622fd74496f51c/artifacts/result_TRAIN_config/result_TRAIN_config.pkl/iam/testPermissions?permissions=storage.objects.get&permissions=storage.objects.update&prettyPrint=false: Not Found

This is because the permissions are being tested on a single file instead of on the GS bucket. Removing the subdir from the bucket URL should fix it.
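In essence, the fix means the permissions test should target the bucket name alone, not the full object path. A minimal sketch of the idea (the helper below is illustrative, not the actual ClearML patch):

from urllib.parse import urlparse

def bucket_name_only(gs_url: str) -> str:
    """Strip any subdir/object path from a gs:// URL, keeping only the bucket name."""
    # e.g. "gs://clearml/project_a/.pipelines/..." -> "clearml"
    return urlparse(gs_url).netloc

# The IAM permissions test should then be issued against gs://<bucket>,
# not against the full object path shown in the error above.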

Testing Instructions

jkhenning commented 1 year ago

Hi @materight,

If the bucket has a specific subdir defined in the configuration, wouldn't it make sense to check these permissions on the subdir? Or are you saying this is not supported in Google Storage?

materight commented 1 year ago

Hi @jkhenning. The latter: this method only works on buckets: https://cloud.google.com/storage/docs/json_api/v1/buckets/testIamPermissions
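For context, that bucket-level endpoint is what the google-cloud-storage Python client exposes as Bucket.test_iam_permissions; a rough illustration (bucket name taken from the error above):

from google.cloud import storage

client = storage.Client()

# Bucket-level call, backed by the buckets/testIamPermissions endpoint.
granted = client.bucket("clearml").test_iam_permissions(
    ["storage.objects.get", "storage.objects.update"]
)
print(granted)  # the subset of the requested permissions the caller actually has

# A "bucket" name that still carries a subdir, e.g. "clearml/project_a/...",
# cannot resolve to a real bucket, which is the 404 shown in the patch description.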

jkhenning commented 1 year ago

@materight got it 👍

jkhenning commented 1 year ago

In that specific case, how did you set the default output destination? I'm not sure I get how stripping out the subdir will prevent that, since the line that actually causes the file to be mentioned in the request is the one setting the blob to the test_obj, and when setting a default output_uri you do not specify a file, so how did the file end up there?

materight commented 1 year ago

That's something I couldn't figure out either; I just call set_output_destination('gs://my-bucket'). It only happens when I run a task remotely after cloning it, so it's hard to debug.

But the main problem here is that the subdir is passed as bucket_url, so the resulting bucket object is invalid anyway. blob.exists() is then false because the bucket is invalid, so test_obj == bucket, and the request fails because the bucket also has the subdir in its URI.
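Put differently, the flow I'm describing looks roughly like this (a paraphrase of the behaviour, not the actual ClearML helper code; the bucket_url and test blob name are made up):

from google.cloud import storage

client = storage.Client()

bucket_url = "clearml/project_a/some/subdir"    # subdir wrongly included in the bucket name
bucket = client.bucket(bucket_url)              # invalid bucket object
blob = bucket.blob("permission_test.blob")      # hypothetical test object

if blob.exists():       # False, since the bucket itself is invalid
    test_obj = blob
else:
    test_obj = bucket   # falls back to the bucket

# Fails with 404: the "bucket" name still contains the subdir.
test_obj.test_iam_permissions(["storage.objects.get", "storage.objects.update"])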

jkhenning commented 1 year ago

That makes sense, but it wouldn't yield the same error 😕

materight commented 1 year ago

Sorry, what do you mean?

jkhenning commented 1 year ago

If blob.exists() is false because the bucket is invalid, and test_obj == bucket, then the resulting object does not reference the file, and the error you mentioned can't be raised, so I'm worried about the flow that generates this error (even if the fix you added is indeed required).

materight commented 1 year ago

Ah, got it. If you can give me a hint on how/where this config.subdir is set, I can take a deeper look. I'm able to reproduce it, just not locally.

jkhenning commented 1 year ago

It can basically be set in the configuration file's google.storage credentials section (see here) when configuring the credentials for the bucket. It's strange that you have it configured when running remotely but not when running locally - can you share how the agent running this remotely is configured (especially in these storage-related sections)?
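For reference, such a per-bucket subdir would sit in the credentials list of that section, along these lines (values here are illustrative):

sdk {
    google.storage {
        credentials = [
            {
                bucket: "clearml-bucket"
                subdir: "path/in/bucket"  # the per-bucket subdir in question (example value)
                project: "myproject"
                credentials_json: "/path/to/credentials.json"
            },
        ]
    }
}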

materight commented 1 year ago

I have this in the config file:

sdk {
    google.storage {
        project: "myproject"
        credentials_json: ${GOOGLE_APPLICATION_CREDENTIALS}
        pool_connections: 512
        pool_maxsize: 1024
    }
    development {
        default_output_uri: "gs://clearml-bucket"
    }
}

Also in the agent's logs I only see this:

sdk.google.storage.project = myproject
sdk.google.storage.credentials_json = /service_account_key.json
sdk.google.storage.pool_connections = 512
sdk.google.storage.pool_maxsize = 1024
sdk.development.default_output_uri = gs://clearml-bucket

By the way, I'm using pipelines; the result_TRAIN_config.pkl artifact it tries to access is a parameter of the pipeline component.

jkhenning commented 1 year ago

And can you reproduce that error message you shared?

materight commented 1 year ago

Yes, if I clone the failed task and re-enqueue it, I get the same error. But if I run a new task it doesn't happen every time.

jkhenning commented 1 year ago

The config file you attached is from your own workstation? If so, what's the config file used by the agent?

materight commented 1 year ago

It's exactly the same for both

jkhenning commented 1 year ago

And the task log in the remote run where it happens? Can you share?

materight commented 1 year ago

Sure, here: task_ff8c8917072641bca89f27ddd488d74c.log

materight commented 11 months ago

Hi @jkhenning, any update on this? Would it be possible to merge it and dig in more if the issue reappears?