allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Artifacts not being deleted when deleting task #112

Closed mmiller-max closed 1 year ago

mmiller-max commented 2 years ago

When I delete a task using the Web GUI, I see the following message:

[Screenshot 2022-02-24 at 09 51 15]

When I check the fileserver in the VM on which it's running, I can see that the artifacts are still there, for example at the location /opt/clearml/data/fileserver/project/task/artifacts/...
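One way to confirm what is left behind is to list the files under the task's artifacts directory, as in the minimal sketch below. It assumes the fileserver data directory is the one quoted above; the function name `list_leftover_files` and the `project` and `task` directory names are placeholders for illustration, not part of ClearML.

```python
from pathlib import Path
from typing import List

# Assumption: fileserver data directory, taken from the path quoted above.
FILESERVER_ROOT = Path("/opt/clearml/data/fileserver")


def list_leftover_files(root: Path, project: str, task: str) -> List[Path]:
    """Return every file still present under <root>/<project>/<task>/artifacts."""
    artifacts_dir = root / project / task / "artifacts"
    if not artifacts_dir.is_dir():
        return []  # nothing left, or the directory never existed
    return sorted(p for p in artifacts_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    # "project" and "task" are placeholders for the real directory names.
    for path in list_leftover_files(FILESERVER_ROOT, "project", "task"):
        print(path)
```

Running this on the VM before and after deleting the task in the UI would show whether any files were actually removed.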

So there seem to be two issues: one with artifacts not being deleted (which I wasn't aware happened with previous versions of the server) and one with the error message not showing what hasn't been deleted.

Server version is 1.2.0 running on GCP. Cheers!

jkhenning commented 2 years ago

Hi @mmiller-max ,

Any more info? Is this reproducible? If so, can you share a small code snippet that creates a task which exhibits this behavior when deleted using the UI?

mmiller-max commented 2 years ago

For me just this creates the error:

from clearml import Task
task = Task.init()
task.upload_artifact("artifact", {"1":1})

Then Ctrl+C and delete in UI.

And yep it's reproducible. Could it be a file permissions thing perhaps?

jkhenning commented 2 years ago

Well, seems like an obvious bug - we'll take a look, I'll update!

mmiller-max commented 2 years ago

Cheers @jkhenning !

mmiller-max commented 2 years ago

Trying to do a bit of debugging on this but can't see anything in the fileserver logs. Do I need to set something in logging.conf in either the file server or the api server?

jkhenning commented 2 years ago

I think you should see console logs. The issue might be in the WebApp...

mmiller-max commented 2 years ago

I think this is the corresponding web app log but can't see any errors with it:

35.191.10.5 - - [29/Mar/2022:13:52:30 +0000] "POST /api/v2.16/tasks.delete_many HTTP/1.1" 200 395 "https://app.{domain}/projects/c28adf12db964d169a645f5351a669de/experiments?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=-last_update&filter=&archive=true" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15" "195.224.76.82,34.149.227.91"
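To scan a larger log for delete calls, the request and status can be pulled out of each line with a regular expression. This is only a sketch: it assumes the combined access-log layout seen in the line above, and `REQUEST_RE` and `parse_request` are hypothetical names. Note that a 200 status only says the API call succeeded at the HTTP level; it does not show whether the individual files were removed.

```python
import re

# Matches the request part of an access-log line:
# "METHOD PATH PROTOCOL" STATUS BYTES
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_request(line: str):
    """Return (method, path, status) for an access-log line, or None if it does not match."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return match["method"], match["path"], int(match["status"])


# Shortened copy of the log line above (referrer and user agent trimmed).
sample = (
    '35.191.10.5 - - [29/Mar/2022:13:52:30 +0000] '
    '"POST /api/v2.16/tasks.delete_many HTTP/1.1" 200 395'
)
print(parse_request(sample))  # ('POST', '/api/v2.16/tasks.delete_many', 200)
```

Filtering on paths that contain `tasks.delete` would narrow the log to the relevant requests.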

mmiller-max commented 2 years ago

Another bit of information: I only get undefined in the error message if using the fileserver subdomain for the files (e.g. https://files.domain.com). If I use a different URL (e.g. a GCP bucket), it displays the URL (but still fails to delete).

mmiller-max commented 2 years ago

This seems to be cropping up often in Slack, e.g. here, here and here. It seems to happen regardless of whether the files are stored on the same VM as the server or elsewhere, and seems to be an issue in the Web App, as there are no logs in the fileserver.

mmiller-max commented 2 years ago

One further comment: I'm pretty sure I never saw this error with ClearML Server v1.1.1.

mmiller-max commented 1 year ago

After updating to the latest server (1.9.1) I'm no longer seeing these errors, so I'm going to close this 👏

ainoam commented 1 year ago

Appreciate the update @mmiller-max