allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Model weights not automatically uploaded even with output_uri set #1012

Open julianschoep opened 1 year ago

julianschoep commented 1 year ago

I've set an output_uri to S3 and am able to upload custom artifacts without problems. My models, however, are not uploaded. The documentation states that models are uploaded "automatically" if an output_uri is specified and the framework's model-saving mechanism is used (I'm using PyTorch and save the model with torch.save and the .pt file extension). Yet there are no InputModels or OutputModels in the artifacts tab, only my own custom artifacts.

Are there any other requirements for model saving to be picked up automatically? Is there a file-naming convention, or do the .pt files need to be saved in a specific directory? Or does it simply upload everything saved with torch.save as an OutputModel?
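
(For reference, the documented pattern needs nothing beyond Task.init with an output_uri plus a normal framework save. A minimal sketch, with illustrative project/task/bucket names:)

import torch
from clearml import Task

# Minimal sketch of the documented auto-upload pattern; names are illustrative
task = Task.init(
    project_name="examples",
    task_name="auto-upload-check",
    output_uri="s3://models",
)
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")  # expected to be captured as an OutputModel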


jkhenning commented 1 year ago

Hi @julianschoep, can you provide an example of how you're setting it in code (or in configuration)?

julianschoep commented 1 year ago

Sure:

from torch.utils.tensorboard import SummaryWriter
from clearml import Task
import torch

writer = SummaryWriter(log_dir=experiment_directory)
task = Task.init(
    project_name="project_name",
    task_name=task_name,
    output_uri="s3://models",
    continue_last_task=True,
    tags=tags,
)
decoder = Decoder(latent_size=latent_size, **specs["NetworkSpecs"])
decoder = ModelWrapper(decoder)
decoder = decoder.to(device)

# ... training loop ...
if epoch % log_frequency == 0:
    task.upload_artifact("sample_0", artifact_object=artifact_path)
    state_dict = {
        "decoder": decoder.state_dict(),
        "epoch": epoch,
        "optimizer": optimizer_all.state_dict(),
        "latents": lat_vecs.state_dict(),
    }
    state_path = experiment_directory / f"model_e{epoch}.pt"
    torch.save(state_dict, state_path)

With this snippet I do get the sample_0 artifact under Artifacts / Other, but not the OutputModel I would expect.

julianschoep commented 1 year ago

Perhaps a clue: I'm now trying it manually by defining an OutputModel object and calling update_weights, and I got an error because my weight_path was a pathlib.Path object instead of a str (it has no .lower() attribute). Could that be the reason? I have no idea how the automatic model uploading works under the hood :p Does it search the disk, or does it wrap torch.save somehow?

jkhenning commented 1 year ago

If you call upload_artifact directly, it will use what you provided (in which case it does need to be a str, I think). There's also automatic wrapping of torch.save.

What are you seeing now?
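
(Aside on the mechanics: "wrapping torch.save" means ClearML's framework bindings patch the save function itself rather than scanning the disk. A conceptual sketch of that pattern, not ClearML's actual implementation:)

import torch

_original_save = torch.save

def _wrapped_save(obj, f, *args, **kwargs):
    # Let the real save run first; a real hook would then register the
    # written file (e.g. as an OutputModel) and upload it to output_uri
    _original_save(obj, f, *args, **kwargs)
    print(f"checkpoint written: {f}")

torch.save = _wrapped_save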

MightyGoldenJA commented 1 year ago

Had the same issue lately; as a workaround I called setup_aws_upload() explicitly after Task.init():

task.setup_aws_upload(
    bucket='models',
    region='us-east-1',
)

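(For context: setup_aws_upload() configures the task's S3 upload settings, i.e. bucket, region, and optionally credentials, programmatically, similar to the aws section in clearml.conf.)
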
julianschoep commented 11 months ago

Ultimately I worked around it via:

import clearml

task = clearml.Task.init(output_uri="s3://bucket")
output_model = clearml.OutputModel(task=task)
# update_weights expects a str, not a pathlib.Path (hence the .lower() error)
output_model.update_weights(str(state_path))
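
(Note: update_weights uploads the weights file to the task's output_uri destination and registers it as the task's OutputModel, which is effectively the manual version of what the automatic binding should do.)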

"There's also automatic wrapping of torch.save."

Didn't there used to be? I remember not having to define this; the model files were backed up automatically.