allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

ClearML doesn't work well with huggingface accelerate checkpoints #1115

Open prassanna-ravishankar opened 1 year ago

prassanna-ravishankar commented 1 year ago

Describe the bug

I cannot use ClearML with Hugging Face Accelerate to upload checkpoints. Accelerate handles the folder structure, so checkpoints usually look like <checkpoint_folder>/iteration_4000/pytorch.bin (example). I initialise my ClearML task with Task.init(..., output_uri="s3://my-awesome-bucket/clearml"). ClearML creates some nested structure under the S3 key, but it keeps overwriting pytorch.bin for every checkpoint. Is there any way to tell it to keep the folder structure, like iteration_4000/pytorch.bin?

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1694531453622189

eugen-ajechiloae-clearml commented 11 months ago

Hi @prassanna-ravishankar! We save models based on their file name, which is why pytorch.bin gets overwritten. We do have a way to intercept model uploads, though. See this example and adapt it to your use case:

import os
from pathlib import Path

import torch
from clearml import OutputModel, Task
from clearml.binding.frameworks import WeightsFileHandler


def filter_callback(
    callback_type: WeightsFileHandler.CallbackType,
    model_info: WeightsFileHandler.ModelInfo,
):
    # Only intercept saves; pass every other event through unchanged.
    if callback_type != WeightsFileHandler.CallbackType.save:
        return model_info
    p = Path(model_info.local_model_path)
    # Rename the file so that checkpoints saved in different folders
    # no longer share the same file name.
    tag = "1" if "1" in model_info.local_model_path else "2"
    p = p.rename(Path(p.parent, "{}_{}{}".format(p.stem, tag, p.suffix)))
    # Register and upload the renamed file manually.
    out_model = OutputModel(task=Task.current_task())
    out_model.update_weights(weights_filename=p.as_posix())
    # Return None so the automatic logging does not also handle this file.
    return None


if __name__ == "__main__":
    Task.init(project_name="example", task_name="filter out")
    WeightsFileHandler.add_pre_callback(filter_callback)
    filter_out_model = torch.nn.Module()
    dont_filter_out_model = torch.nn.Module()
    # torch.save does not create missing directories, so create them first.
    os.makedirs("1_dir", exist_ok=True)
    os.makedirs("2_dir", exist_ok=True)
    torch.save(filter_out_model, "1_dir/some_model_name_c.pt")
    torch.save(dont_filter_out_model, "2_dir/some_model_name_c.pt")
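
For the Accelerate layout described in the original report (<checkpoint_folder>/iteration_4000/pytorch.bin), the "1"/"2" tag in the example could be replaced by the name of the parent folder. The following is only a sketch of that idea: unique_checkpoint_name and its levels parameter are hypothetical helpers, not part of ClearML or Accelerate, and the sketch assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def unique_checkpoint_name(local_model_path: str, levels: int = 1) -> str:
    """Return a file name prefixed with the last `levels` parent folder names.

    Hypothetical helper (not a ClearML API): it only builds a string and
    does not touch the file system.
    """
    p = PurePosixPath(local_model_path)
    # Drop the root marker so absolute paths do not yield a "/" prefix.
    parents = [part for part in p.parent.parts if part != "/"]
    prefix = parents[-levels:] if levels > 0 else []
    return "_".join([*prefix, p.name])


if __name__ == "__main__":
    for step in (4000, 8000):
        path = "checkpoints/iteration_{}/pytorch.bin".format(step)
        print(unique_checkpoint_name(path))
```

Inside filter_callback, the result could be used as the new file name when renaming model_info.local_model_path, so that each iteration uploads under a distinct name instead of overwriting pytorch.bin.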