Open prassanna-ravishankar opened 1 year ago
Hi @prassanna-ravishankar! We save models based on their file name, which is why pytorch.bin
gets overwritten.
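To make the collision concrete, here is a toy sketch of what keying by file name does to two checkpoints from different folders. The `registry` dict and the two paths are made up for illustration; this is not ClearML's actual implementation, only a model of the behaviour described above.

```python
from pathlib import Path

# Two checkpoints written by different iterations (hypothetical paths).
checkpoints = [
    "checkpoints/iteration_4000/pytorch.bin",
    "checkpoints/iteration_8000/pytorch.bin",
]

# Toy registry keyed by file name only (an assumption for illustration,
# not ClearML's real code): the folder part of the path is ignored.
registry = {}
for path in checkpoints:
    registry[Path(path).name] = path

# Both checkpoints share the key "pytorch.bin", so only the last one survives.
print(registry)
```

Because both paths end in the same file name, the later entry replaces the earlier one.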
We do have a way to intercept model uploads, though. See this example and adapt it to your use case:
from clearml import Task, OutputModel
from clearml.binding.frameworks import WeightsFileHandler
import os
import torch
from pathlib import Path


def filter_callback(
    callback_type: WeightsFileHandler.CallbackType,
    model_info: WeightsFileHandler.ModelInfo,
):
    # Only intercept saves; pass load events through unchanged.
    if callback_type != WeightsFileHandler.CallbackType.save:
        return model_info
    # Rename the file on disk so that each model gets a distinct name.
    p = Path(model_info.local_model_path)
    tag = "1" if "1" in model_info.local_model_path else "2"
    p = p.rename(Path(p.parent, "{}_{}{}".format(p.stem, tag, p.suffix)))
    # Register the renamed file manually as an output model of the current task.
    out_model = OutputModel(task=Task.current_task())
    out_model.update_weights(weights_filename=p.as_posix())
    # Returning None skips the automatic logging of the original file.
    return None


if __name__ == "__main__":
    Task.init(project_name="example", task_name="filter out")
    WeightsFileHandler.add_pre_callback(filter_callback)
    filter_out_model = torch.nn.Module()
    dont_filter_out_model = torch.nn.Module()
    # torch.save does not create missing directories, so create them first.
    os.makedirs("1_dir", exist_ok=True)
    os.makedirs("2_dir", exist_ok=True)
    torch.save(filter_out_model, "1_dir/some_model_name_c.pt")
    torch.save(dont_filter_out_model, "2_dir/some_model_name_c.pt")
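For the Accelerate layout in this issue, one possible adaptation is to build the new name from the checkpoint's parent folder instead of the hard-coded "1"/"2" tag. The helper below is a minimal sketch under that assumption: `unique_checkpoint_name` is a hypothetical name, not part of ClearML or Accelerate, and it assumes the path seen by the callback still contains the iteration folder.

```python
from pathlib import Path


def unique_checkpoint_name(local_path: str) -> str:
    """Prefix the file name with the name of its parent folder.

    Hypothetical helper for illustration; not a ClearML or Accelerate API.
    """
    p = Path(local_path)
    parent = p.parent.name
    # Fall back to the bare file name when the path has no parent folder.
    return "{}_{}".format(parent, p.name) if parent else p.name


if __name__ == "__main__":
    for path in (
        "checkpoints/iteration_4000/pytorch.bin",
        "checkpoints/iteration_8000/pytorch.bin",
    ):
        print(unique_checkpoint_name(path))
```

Inside filter_callback, the rename target could then be Path(p.parent, unique_checkpoint_name(model_info.local_model_path)). Note that renaming the file on disk changes what Accelerate later finds in that folder, so resuming from such a checkpoint may need the original name restored.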
Describe the bug
I cannot use ClearML with Hugging Face Accelerate for uploading checkpoints. Accelerate handles the folder structure, so checkpoints usually look like
<checkpoint_folder>/iteration_4000/pytorch.bin
(for example). I initialise my ClearML task with Task.init(..., output_uri="s3://my-awesome-bucket/clearml").
ClearML creates some nested structure under the S3 key, but it keeps overwriting pytorch.bin for every checkpoint. Is there any way to specify that the folder structure should be kept, like iteration_4000/pytorch.bin?
Related Discussion
https://clearml.slack.com/archives/CTK20V944/p1694531453622189