Lightning-AI / pytorch-lightning


Introduce `Logger.experiment_dir` #14188

Open awaelchli opened 2 years ago

awaelchli commented 2 years ago

🚀 Feature

Introduce a property/method experiment_dir on the abstract logger interface. All loggers must implement it and return the path where their files are saved.

Motivation

The Trainer currently determines where checkpoints are saved like so:

os.path.join(logger.save_dir, str(logger.name), logger.version, "checkpoints")

As you can see, it assumes that there is a rigid folder structure save_dir/name/version. Every logger has to follow this structure. However, at least one logger (wandb) struggles with this, because it has an additional concept of "project" that does not fit well within this layout (https://github.com/Lightning-AI/lightning/issues/14054). Furthermore, the "name" may change AFTER the files were written, so there is a disconnect between the folder structure and what is represented "online" in the wandb ui.

One observation is that the trainer actually never cares about any of the subfolders. All it needs is the full path. This is why I believe that if we let the logger decide how to assemble the path, we will avoid many issues and disagreements with 3rd party experiment management.

We are struggling with making all loggers agree to a standardized layout, so maybe we shouldn't try to do that, at least not in every aspect of how the loggers work (https://github.com/Lightning-AI/lightning/issues/12028#issuecomment-1088325894).

Pitch

Introduce

def experiment_dir(self) -> str:
    """Returns the (relative) path to where logs and artifacts get saved for the current version."""

This should probably be a relative path, so that the Trainer can prepend a root dir if needed (default cwd). An implementation of a logger could then look like this:

def experiment_dir(self):
    return os.path.join(self.save_dir, self.name, self.version)

And the checkpointing in the Trainer would save files to

os.path.join(logger.experiment_dir(), "checkpoints") 

for example.
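To make the pitch concrete, here is a minimal self-contained sketch of how the interface and one logger could fit together. The classes and paths below are simplified stand-ins for illustration, not the actual Lightning implementation:

import os
from abc import ABC, abstractmethod


class Logger(ABC):
    """Simplified sketch of the abstract logger interface."""

    @abstractmethod
    def experiment_dir(self) -> str:
        """Return the (relative) path where logs and artifacts for the current version are saved."""


class FolderLogger(Logger):
    """Hypothetical logger that keeps the classic save_dir/name/version layout."""

    def __init__(self, save_dir: str, name: str, version: str) -> None:
        self.save_dir = save_dir
        self.name = name
        self.version = version

    def experiment_dir(self) -> str:
        # Each logger decides for itself how the path is assembled.
        return os.path.join(self.save_dir, self.name, self.version)


# The Trainer only consumes the assembled path, e.g. for checkpoints:
logger = FolderLogger("lightning_logs", "my_experiment", "version_0")
ckpt_dir = os.path.join(logger.experiment_dir(), "checkpoints")
print(ckpt_dir)  # e.g. lightning_logs/my_experiment/version_0/checkpoints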

Implications

  1. After introducing experiment_dir, we could consider deprecating save_dir and log_dir, or removing them from the base interface.
  2. The code here in Trainer.log_dir would probably get simplified: https://github.com/Lightning-AI/lightning/blob/fe9e5d55bf7991ba36b76d6adae9075b93dfcaa0/src/pytorch_lightning/trainer/trainer.py#L2208-L2219
  3. The code in the checkpoint callback would get simplified too: https://github.com/Lightning-AI/lightning/blob/fe9e5d55bf7991ba36b76d6adae9075b93dfcaa0/src/pytorch_lightning/callbacks/model_checkpoint.py#L595-L608
  4. It would immediately address these issues, as the LightningCLI can write to the experiment dir without ambiguity: #14162, #12748

Alternatives

Struggle

Additional context

Discussions in: https://github.com/Lightning-AI/lightning/issues/14054

Current blockers:

#12177



cc @borda @tchaton @justusschock @awaelchli @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy

kabouzeid commented 2 years ago

Sounds good to me. We would also have to think about what happens when multiple loggers are used. I can't find the code snippet right now, but currently the artifact path is built by concatenating the name and version of each logger to obtain a shared artifact location. This wouldn't really work with the proposed approach.
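A rough illustration of the multi-logger behavior described above, based only on this description and not on the actual Lightning source: names and versions of all loggers are merged into one shared location, which has no obvious counterpart once every logger reports its own experiment_dir.

import os

def shared_artifact_dir(save_dir, loggers):
    # Merge the name and version of every logger into a single path,
    # as described in the comment above (illustrative sketch only).
    name = "_".join(str(logger.name) for logger in loggers)
    version = "_".join(str(logger.version) for logger in loggers)
    return os.path.join(save_dir, name, version)

# With experiment_dir() there is no single name/version pair to merge;
# each logger would instead report a complete directory of its own.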

tshu-w commented 2 years ago

This is a neat idea, but I need to emphasize two details:

  1. The checkpoint callback's __resolve_ckpt_dir runs before setup, that is, before the actual logger object (returned by the experiment property) is initialized. As a result, experiment_dir may not be determined yet when the checkpoint directory is created; for example, we can't get the Logger's version (usually the run id or name).

  2. In cases where users don't use a Logger (admittedly not many), we still need an experiment_dir to hold checkpoints, the CLI config, etc. (putting them directly in the default_root_dir would cause conflicts). Maybe experiment_dir should be a property of the Trainer rather than the Logger, so that the no-logger and multi-logger cases and problem 1 are no issue; we would then just need to log experiment_dir with the Logger.

awaelchli commented 2 years ago

For 1) That's not correct. setup() runs after processes have spawned. The __resolve_ckpt_dir gets called in ModelCheckpoint.setup().

For 2)

As a result, experiment_dir may not be determined yet when the checkpoint directory is created

By design, this wouldn't be possible. By introducing experiment_dir, and the same way as with the other properties, there is a contract between the logger and the framework to return the right directory.

2) In cases where users don't use a Logger (admittedly not many), we still need an experiment_dir to hold checkpoints, the CLI config, etc. (putting them directly in the default_root_dir would cause conflicts)

We can't solve this problem. If you don't use a logger, i.e., no versioning of experiments, then we can't avoid conflicts. The LightningCLI can handle this by itself.

This proposal is not aiming to address these issues, not at all. These are completely separate concerns.

justusschock commented 2 years ago

+1 for this!

We can't solve this problem. If you don't use a logger, i.e., no versioning of experiments, then we can't avoid conflicts. The LightningCLI can handle this by itself.

I think it would be sensible to just use trainer.root_dir in these cases, as this is the only defined thing then.

Maybe experiment_dir should be a property of the Trainer rather than the Logger, so that the no-logger and multi-logger cases and problem 1 are no issue; we would then just need to log experiment_dir with the Logger.

A definite no from me here. We shouldn't force loggers into our behaviour; we have the trainer root dir, and everything in it is up to the callbacks (ModelCheckpoint) and loggers to use as they see fit. If we force them to log to a directory given by the trainer, they are no longer self-contained (which they should be). Also, this may again not work well with certain logger structures. I think it is best to have this on the logger and only implement an interface so that the trainer knows the directory for sure without special-casing.

tshu-w commented 2 years ago

@awaelchli Thanks for the explanation, it's a lot clearer to me now. I hope this gets introduced soon.

awaelchli commented 2 years ago

@kabouzeid Good point. We wanted to resolve this a while ago by redirecting to the first logger (#12177). So this issue here is based on the assumption that option B in #12177 gets implemented.

kabouzeid commented 2 years ago

I'm not quite sure why the versioning is coupled to the loggers anyway. With no loggers there is no version, and with multiple loggers the version becomes ambiguous. Also, checkpointing doesn't really have much to do with logging. Shouldn't Lightning provide its own versioning in the form of a random identifier for each run? Every run would get its own lightning directory with a unique identifier where the checkpoints, config.yaml, and possibly the stdout log are stored. The lightning directory location is then logged (like a hparam) to all connected loggers and/or symlinked to their respective log dirs.

Anecdotally, I only started using Lightning a few weeks ago, and the first time I tripped up was when it took me quite a while to figure out why my checkpoints weren't showing up. The docs say they're saved to the default_root_dir, and I sure did not even think about checking the logger directories.
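A sketch of the run-identifier idea suggested above; the helper name and directory layout are made up for illustration, this is not an existing Lightning feature:

import os
import uuid

def create_run_dir(root_dir="lightning_runs"):
    # One uniquely identified directory per run, independent of any logger.
    run_id = uuid.uuid4().hex[:8]
    run_dir = os.path.join(root_dir, run_id)
    os.makedirs(os.path.join(run_dir, "checkpoints"), exist_ok=True)
    return run_dir

run_dir = create_run_dir()
# The location could then be logged to every connected logger,
# e.g. logger.log_hyperparams({"lightning_run_dir": run_dir}),
# and/or symlinked into the loggers' own log directories.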

awaelchli commented 2 years ago

Lightning determines that a good default behavior is that all artifacts land in one place. If this is not desired, the user can change this easily by configuring the output directory manually.

Also checkpointing doesn't really have much to do with logging.

Some loggers are doing more than just "logging". They are more like experiment trackers. This can include archiving checkpoints.

The docs say they're saved to the default_root_dir and I sure did not even think about checking the logger directories.

That's correct, and the default is the current working directory.

tshu-w commented 2 years ago

Hi @awaelchli, here's a case where I'm not sure how to handle it.

Currently, when Logger.save_dir is None (DummyLogger, CometLogger in online mode, the deprecated LoggerCollection, etc.), Trainer.log_dir is None as well, and the checkpoints directory is default_root_dir/name/version. What should experiment_dir return in this case?

  1. None; then we still need dirty logic to determine where checkpoints should be saved.
  2. default_root_dir/name/version; but then how would the logger access trainer.default_root_dir?

A new idea: experiment_dir just returns a path relative to save_dir, and we join them in trainer.log_dir.

awaelchli commented 2 years ago

experiment_dir just returns a path relative to save_dir, and we join them in trainer.log_dir

Yes, it's one possible solution that I propose in the issue description above. It's not ideal, but it is currently the only way to respect the Trainer's default root dir.

If we don't care about respecting the default root dir for these edge cases, we can return None in DummyLogger.experiment_dir, and internally, in places where we need to access the experiment dir, we check for isinstance(trainer.logger, DummyLogger).
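A rough sketch of how such a resolution could look on the Trainer side; the function is illustrative only (not Lightning source code) and combines both options, i.e. relative paths joined onto the root dir and a None fallback for loggers without a directory:

import os

def resolve_experiment_dir(default_root_dir, logger):
    # Illustrative sketch only, not the actual Trainer logic.
    if logger is None:
        return default_root_dir
    experiment_dir = logger.experiment_dir()
    if experiment_dir is None:
        # Loggers without a real directory (e.g. a DummyLogger)
        # fall back to the Trainer's default root dir.
        return default_root_dir
    # Otherwise join the logger's relative path onto the root dir.
    return os.path.join(default_root_dir, experiment_dir)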

vedal commented 1 year ago

I found this issue because it was unclear to me how an experiment_dir should be set for storing checkpoints, hyperparameter configs etc. from model training.

It wasn't clear to me that in order to get the folder hierarchy suggested in https://github.com/Lightning-AI/lightning/issues/1207#issue-585620257, one has to leave both Trainer.default_root_dir and ModelCheckpoint.dirpath at their defaults and only set TensorBoardLogger.save_dir, see https://github.com/Lightning-AI/lightning/issues/1207#issuecomment-1355520316.

Suggestion: clarification in docs. Maybe, along with the update to logger.experiment_dir, the documentation for all three parameters Trainer.default_root_dir, ModelCheckpoint.dirpath, and TensorBoardLogger.save_dir should be updated with a link to a description of how to set them to get this nice default folder hierarchy?

At least for me it wasn't immediately obvious.
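For reference, the setup described above (default_root_dir and dirpath left at their defaults, only the TensorBoardLogger's save_dir set) looks roughly like this; the paths and names are illustrative:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# Only the logger's save_dir is set; checkpoints then end up under
# <save_dir>/<name>/<version>/checkpoints.
logger = TensorBoardLogger(save_dir="experiments", name="my_model")
checkpoint_cb = ModelCheckpoint()  # dirpath deliberately left unset

trainer = Trainer(logger=logger, callbacks=[checkpoint_cb])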

vedal commented 1 year ago

Why was this issue closed????? Seems very needed

tshu-w commented 1 year ago

Hello members of Lightning, I would like to know what the discussion or decision was behind closing this issue. According to the links, at least six other issues (https://github.com/Lightning-AI/lightning/issues/16461, https://github.com/Lightning-AI/lightning/issues/17168, https://github.com/Lightning-AI/lightning/issues/1207, https://github.com/Lightning-AI/lightning/issues/15085, https://github.com/Lightning-AI/lightning/issues/14739, https://github.com/Lightning-AI/lightning/issues/14054) and many PRs point towards this issue, with potentially more in the future. @lantiga opposed breaking changes in my PR https://github.com/Lightning-AI/lightning/pull/14640, but I still think we need to address this problem, as existing third-party loggers often lack corresponding experiment directories, which poses significant challenges for CLI configuration, the profiler, and checkpoint storage, as mentioned above.

lantiga commented 1 year ago

@tshu-w @vedal thanks for your comments

My concern here is breaking people’s code extensively, but I’m all for finding a solution that honors compatibility (including adding a new recommended way to specify location and keeping the old one around).

machur commented 1 year ago

Hi, I've just encountered a similar issue - I want to save all artifacts in a uniquely versioned experiment dir, but there are so many mechanisms for that in PL that it gets extremely difficult.

Many classes create their own "version" folders, and the code is not shared between them. Some of them rely strongly on the order of loggers. What is even more unfortunate, these classes are often instantiated at different steps of the processing pipeline (e.g. log dirs for loggers vs. the output model dir for the checkpoint callback), so it's quite difficult to make sure that all directories are set correctly.

Right now I'm getting my log_dir from the CSVLogger and sharing it on my own, but it would be really helpful to implement the changes proposed in this issue and make the output folders consistent across the PL framework.
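The workaround described above looks roughly like this (the paths and names are illustrative):

from pytorch_lightning.loggers import CSVLogger

logger = CSVLogger(save_dir="logs", name="my_experiment")
run_dir = logger.log_dir  # e.g. logs/my_experiment/version_0
# run_dir is then passed around manually, e.g. to ModelCheckpoint's dirpath
# or to a CLI config path, to keep all artifacts in one place.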

genghisun commented 5 months ago

Any progress now? It would be great to have a unified experiment dir.

tshu-w commented 5 months ago

I tried to submit a PR after some discussion, but apparently the Lightning team did not reach a consensus and lacked the energy to focus on this issue (;

If they confirm how to solve it, I am willing to submit the PR again.