IBM / terratorch

a Python toolkit for fine-tuning Geospatial Foundation Models (GFMs).
Apache License 2.0

terratorch CLI help printout incorrect? #108

Open Foxigod opened 3 months ago

Foxigod commented 3 months ago

Describe the issue

The help printout from the terratorch CLI appears to be inherited from the PyTorch Lightning docs, but the functionality has been modified in terratorch in a way that is inconsistent with those docs. Running terratorch fit -h produces:

[...]
  --trainer.enable_checkpointing {true,false,null}
                        If ``True``, enable checkpointing. It will configure a default ModelCheckpoint callback if there is no user-defined ModelCheckpoint in
                        :paramref:`~lightning.pytorch.trainer.trainer.Trainer.callbacks`. Default: ``True``. (type: Optional[bool], default: null)
[...]

which specifically states that a default ModelCheckpoint callback will be created if one doesn't already exist in the trainer callbacks. In my experience with terratorch this is not actually a "default" ModelCheckpoint callback; rather, it monitors a specific quantity and therefore doesn't save a new checkpoint when that quantity fails to improve. The PyTorch Lightning docs suggest that a plain, default instance of ModelCheckpoint should be created. Since that does not match my experience with terratorch, I looked into the terratorch code, where I found this:

[Screenshot of the terratorch source: a ModelCheckpoint configured with filename "{epoch}" and monitor "val/loss"]

Although I don't fully understand how all these pieces interact, this configuration is consistent with my experience with terratorch: the filename is {epoch} and the monitored quantity is val/loss. The filename and monitor parameters are also consistent with the StateDict checkpoint that gets saved.
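
For comparison, here is a minimal sketch of the two behaviours in plain Lightning terms (the parameter values are taken from my observations above; this is not the actual terratorch wiring):

  from lightning.pytorch.callbacks import ModelCheckpoint

  # What the help text promises: a default callback. With monitor=None,
  # it simply saves a checkpoint of the most recent epoch.
  default_checkpoint = ModelCheckpoint()

  # What terratorch appears to configure instead: a monitored callback
  # that only writes a new "best" checkpoint when val/loss improves.
  observed_checkpoint = ModelCheckpoint(filename="{epoch}", monitor="val/loss")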

So from my point of view, terratorch "blindly" inherits the documentation from PyTorch Lightning but then modifies the functionality, leaving the two inconsistent.

To Reproduce (optional, but appreciated)

  1. Run terratorch from the CLI with a config file that sets the trainer.enable_checkpointing option to true, and don't include any other ModelCheckpoint callback in the trainer callbacks. (This even appears to be the case for examples like sen1floods11.)
  2. Notice that the saved checkpoints are inconsistent with the documented behaviour (the sketch below shows one way to inspect the attached callbacks).
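
To check which ModelCheckpoint actually ends up on the trainer, something like the following can be used (a debugging sketch using plain Lightning; terratorch builds its Trainer through the CLI, so the construction there differs):

  from lightning.pytorch import Trainer
  from lightning.pytorch.callbacks import ModelCheckpoint

  # With enable_checkpointing=True and no user-defined ModelCheckpoint,
  # Lightning attaches its default callback; printing filename/monitor
  # reveals whether a non-default configuration was injected instead.
  trainer = Trainer(enable_checkpointing=True)
  for cb in trainer.callbacks:
      if isinstance(cb, ModelCheckpoint):
          print(type(cb).__name__, cb.filename, cb.monitor)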

Deployment information (optional)

I installed terratorch by cloning the repo and running pip3 install -e <cloned_directory>. Running git rev-parse HEAD gives the following output:

3b763912d54594f5e89092ca5999178b399c2a10

which is this commit.

Notes

The issue I am raising here is not just about this specific part of the help printout from terratorch, but about the "integrity" of the printout as a whole. Similar issues may exist in other parts that I haven't encountered.

CarlosGomes98 commented 2 months ago

Hi @Foxigod. Thanks for pointing this out! You are right: the ModelCheckpoint callback that is created is not the default one; it has different values for filename and monitor.

In my view, this is actually the most commonly desired behaviour, so I would argue we should keep the functionality as is, but update the help message for this command if possible.

As for the possibility of other similar issues: all modifications to the core of the default LightningCLI are concentrated in cli_tools.py. Having checked that module, there should be no other differences apart from the two new arguments, predict_output_dir and out_dtype, which are useful for inference.

Having said that, this is a useful reminder that modifications we make to the LightningCLI class (in our case, a subclass), in particular through methods such as set_defaults, may require corresponding updates to the documentation / help messages.
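
For illustration, here is a minimal sketch of the general mechanism (assumed class and argument names; this is not the actual code in cli_tools.py): a LightningCLI subclass can force a configured ModelCheckpoint through add_lightning_class_args and set_defaults, while the --trainer.enable_checkpointing help string inherited from Lightning stays unchanged:

  from lightning.pytorch.callbacks import ModelCheckpoint
  from lightning.pytorch.cli import LightningCLI

  class MyCLI(LightningCLI):  # hypothetical subclass name
      def add_arguments_to_parser(self, parser):
          # Force a ModelCheckpoint callback and override its defaults.
          # The help text inherited from Lightning is not regenerated,
          # so `fit -h` still describes the stock behaviour.
          parser.add_lightning_class_args(ModelCheckpoint, "state_dict_checkpoint")
          parser.set_defaults({
              "state_dict_checkpoint.filename": "{epoch}",
              "state_dict_checkpoint.monitor": "val/loss",
          })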