allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

[Feature Request] Get an error message when an agent tries to write to S3 but credentials are missing or invalid #1103

Open d13g0 opened 1 year ago

d13g0 commented 1 year ago

Proposal Summary

Get an error message when the ClearML server or the ClearML agent tries to write to S3 but the credentials are missing or invalid.

Motivation

I logged a local experiment on my machine and was able to see the PyTorch model being logged and stored properly in S3 (my local clearml.conf has the AWS S3 credentials).

I then cloned the experiment and tried to run it. The experiment ran and completed. However, when I tried to download the respective models, I saw a malformed URL in the web interface that looks like this:

[Screenshot 2023-08-25 at 11 42 43 PM: malformed model URL in the web interface]

When I checked the agent that ran this task I realized that there were no AWS credentials in its clearml.conf file.

It would be extremely useful to get an error message in the task log (console).

Similarly, it would be great to have a log as well for any failed S3 operation in the ClearML server.

ainoam commented 1 year ago

@d13g0 Any errors should appear in the executed task's console log.

Note that the server is not performing S3 operations in this context.

d13g0 commented 1 year ago

Hi @ainoam thanks for your reply.

I believe there is no warning/error message when an agent tries to save a model to S3 and fails. I checked my logs and didn't see anything. So from the conversation, I think I have two requests:

  1. Get a warning/error message in the console log for storage issues

  2. Have a mechanism to inform the user of issues in tasks that are finishing as Completed

For instance, I think it would be a great idea to have a state Completed with Warnings (or something similar), as this would really help us surface and track problems such as the storage issue that I described. I imagine that the list of warnings can be a separate section under the [Execution] Tab, or, warnings could be highlighted in the console.

ainoam commented 1 year ago

Thanks for suggesting @d13g0.

We'll try to verify why storage access errors don't show up in the execution log, as they definitely should.

As for having some kind of additional attention mechanism: that definitely makes a lot of sense, though I imagine what actually merits attention will be very specific to each use case?

In the meantime, a quick hack might be to use the existing metrics reporting to flag if your code goes through such an "interesting" event, so you could easily see (and filter on) this in the experiment table - WDYT?
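A minimal sketch of that hack, assuming the ClearML SDK's `Logger.report_single_value(name, value)` call is used for the flag. The helper name `flag_event` and the metric name `storage_upload_failed` are hypothetical and only serve as illustration:

```python
def flag_event(logger, name, occurred=True):
    """Report a 0/1 single-value metric for an "interesting" event.

    `logger` is any object exposing report_single_value(name=..., value=...),
    for example a ClearML Logger (assumption: the SDK is installed and a task
    has been initialized in the running code).
    """
    value = 1 if occurred else 0
    logger.report_single_value(name=name, value=value)
    return value


# Usage with ClearML (not executed here; needs a configured server):
#   from clearml import Task
#   logger = Task.current_task().get_logger()
#   try:
#       save_model_to_storage()   # hypothetical function in your own code
#   except Exception:
#       flag_event(logger, "storage_upload_failed")
#       raise
```

A single-value metric reported this way can then be added as a column in the experiment table and used for filtering or sorting.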

d13g0 commented 1 year ago

Hi @ainoam I ran a different experiment and I did see the error messages in the agent log. Thanks for pointing it out.

Regarding what would merit attention, I think that is a good question. I believe it could be very generic, just indicating that there were warnings or errors in the console, since this is not obvious for tasks marked as completed.

I will pass along the hack to the AI team. Thanks again.

ainoam commented 1 year ago

@d13g0 I'm not sure calling attention to any warning would prove useful, as most users rarely adopt a zero-warning policy (most of the time such messages come from negligible cross-dependency issues). Because what matters is so specific to each case, it is best left to the code to decide whether to validate its results and set the task's status accordingly.
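One way the code could do this is sketched below, assuming the check is a plain format test on the model URL and that the task object exposes `mark_failed(status_reason=..., status_message=...)` as the ClearML `Task` does. The helper names are hypothetical, and how the URL is obtained from the task is left out:

```python
from urllib.parse import urlparse


def is_well_formed_s3_url(url):
    """Return True if url looks like s3://<bucket-or-host>/<key>.

    This is only a format check; it does not verify that the object exists
    or that the credentials can read it.
    """
    if not url:
        return False
    parsed = urlparse(url)
    return parsed.scheme == "s3" and bool(parsed.netloc) and len(parsed.path) > 1


def fail_task_on_bad_model_url(task, url):
    """Mark the task as failed when the model URL is not a usable S3 location."""
    if is_well_formed_s3_url(url):
        return True
    task.mark_failed(
        status_reason="model upload check failed",
        status_message=f"unexpected model URL: {url!r}",
    )
    return False
```

With this in place, a run whose model never reached S3 would no longer appear as Completed in the experiment table.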

As it seems storage access errors do appear in the logs - can we close this issue?

AlexandruBurlacu commented 9 months ago

Hey @d13g0, it's been a while, but we recently tried reproducing this error with our newest SDK release and failed to observe the behavior you're describing.

Our setup for the reproduction was:

Can you please repeat the experiment and let us know if this issue is still happening with the newest SDK version (1.13.2)?

Either way, for the upcoming release we will also introduce the possibility to perform the model upload in the main thread, which will throw an exception if for some reason the upload failed, and will give a proper traceback to aid debugging.
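To illustrate why uploading in the main thread makes failures visible, here is a generic Python sketch. It is not ClearML's implementation: `upload`, the bucket name, and the credentials argument are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def upload(path, credentials):
    """Hypothetical upload that fails when credentials are missing."""
    if not credentials:
        raise PermissionError(f"no credentials available to upload {path}")
    return f"s3://example-bucket/{path}"


def upload_in_background(executor, path, credentials):
    # Fire-and-forget: an exception raised in the worker thread is stored on
    # the returned future and stays silent unless someone inspects the future.
    return executor.submit(upload, path, credentials)


def upload_in_main_thread(path, credentials):
    # Synchronous call: a failure raises right here, with a full traceback.
    return upload(path, credentials)
```

In the background variant the calling code carries on and can finish normally even though the upload failed, whereas the main-thread variant stops at the failing call.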