Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

put the monitor metric into default filename for ModelCheckpoint #20397

Open · VDFaller opened this issue 3 weeks ago

VDFaller commented 3 weeks ago

Description & Motivation

Tiny annoyance, but wouldn't it make sense to put the monitored metric's value into the default filename, so it's not just `epoch=X-step=Y` by default?

Pitch

Couldn't something like this work here?

    def _format_checkpoint_name(
        self,
        filename: Optional[str],
        metrics: Dict[str, Tensor],
        prefix: str = "",
        auto_insert_metric_name: bool = True,
    ) -> str:
        if not filename:
            # filename is not set, use the default name; include the monitored
            # metric when one is available
            if self.monitor is not None and self.monitor in metrics:
                filename = (
                    "{epoch}" + self.CHECKPOINT_JOIN_CHAR + "{step}"
                    + self.CHECKPOINT_JOIN_CHAR + f"{{{self.monitor}}}"
                )
            else:
                filename = "{epoch}" + self.CHECKPOINT_JOIN_CHAR + "{step}"
        # ... rest of the method unchanged
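For illustration, a sketch of the intended effect (the filenames and numbers below are made up; the value formatting comes from `auto_insert_metric_name`, which expands `{val_loss}` into `val_loss=<value>`):

    from lightning.pytorch.callbacks import ModelCheckpoint

    # Today: with no explicit `filename`, checkpoints are saved as e.g.
    # "epoch=3-step=1500.ckpt", even though a metric is being monitored.
    ckpt_cb = ModelCheckpoint(monitor="val_loss", mode="min")

    # With the pitched default, the same callback would instead produce
    # something like "epoch=3-step=1500-val_loss=0.21.ckpt", because
    # auto_insert_metric_name expands "{val_loss}" into "val_loss=<value>".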

Alternatives

No response

Additional context

Happy to put it in an MR.

cc @borda

lantiga commented 1 week ago

Thanks @VDFaller, can you briefly elaborate on the itch you'd like to scratch in your workflow?

VDFaller commented 1 week ago

@lantiga no problem. Sometimes developers hand me checkpoints in a folder with no context, and they forget to put the val loss (or whatever metric they monitored) in the filename, so I have to rerun their validation to figure out which checkpoint is which.

Maybe the value is stored somewhere other than the filename that I could extract it from? Am I making this too difficult?

Like I said, tiny annoyance.
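For reference, the monitored score does seem to be persisted inside the checkpoint itself: ModelCheckpoint serializes its state (monitor name, current_score, best_model_score, ...) under the checkpoint's "callbacks" key. A minimal sketch for digging it out, assuming a Trainer-saved checkpoint at a hypothetical path; the exact key layout varies across Lightning versions:

    import torch

    # Lightning checkpoints are plain dicts. Each callback's state lives under
    # the "callbacks" key; ModelCheckpoint's entry includes the monitor name
    # and the score at save time. Newer PyTorch versions may require
    # weights_only=False to load the non-tensor parts of the checkpoint.
    ckpt = torch.load("checkpoints/mystery.ckpt", map_location="cpu", weights_only=False)
    for key, state in ckpt.get("callbacks", {}).items():
        if key.startswith("ModelCheckpoint"):
            print(key)
            print("  monitor:", state.get("monitor"))
            print("  current_score:", state.get("current_score"))
            print("  best_model_score:", state.get("best_model_score"))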