awaelchli opened this issue 2 years ago
Sounds good to me. We would also have to think about what happens when multiple loggers are used. I can't find the code snippet right now, but currently the artifact path is built by concatenating the name and version of each logger to obtain a shared artifact location. This wouldn't really work with the proposed approach.
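For context, here is a minimal illustrative sketch of the kind of concatenation meant here; this is not the actual Lightning code, and the helper name `shared_artifact_location` is made up:

```python
import os

# Illustrative sketch only, not the actual Lightning implementation:
# the name/version of every attached logger is concatenated so that all
# loggers share a single artifact location.
def shared_artifact_location(default_root_dir, loggers):
    if not loggers:
        return default_root_dir
    save_dir = loggers[0].save_dir or default_root_dir
    name = "_".join(str(logger.name) for logger in loggers)
    version = "_".join(str(logger.version) for logger in loggers)
    return os.path.join(save_dir, name, version)
```

With per-logger experiment directories, there is no single obvious way to build such a shared location anymore.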
This is a neat idea, but I need to emphasize two details:
1. The code in the checkpoint callback __resolve_ckpt_dir runs before setup, that is, before the actual logger object (returned by the experiment property) is initialized. This means the experiment_dir may not be determined yet when the checkpoint directory is created; for example, we can't get the version of the Logger (usually the run id or name).
2. In cases where users don't use a Logger (though not many), we still need an experiment_dir to hold checkpoints, the CLI config, etc. (putting them in the default_root_dir would cause conflicts). Maybe experiment_dir should be the Trainer's property rather than the Logger's, so that there is also no problem with no/multiple loggers and with problem 1; we would just need to log the experiment_dir with the Logger.
For 1) That's not correct. setup() runs after processes have spawned. The __resolve_ckpt_dir gets called in ModelCheckpoint.setup().
For 2)
This means the experiment_dir may not be determined yet when the checkpoint directory is created
By design, this wouldn't be possible. By introducing experiment_dir, in the same way as with the other properties, there is a contract between the logger and the framework to return the right directory.
2) In cases where users don't use a Logger (though not many), we still need an experiment_dir to hold checkpoints, the CLI config, etc. (putting them in the default_root_dir would cause conflicts)
We can't solve this problem. If you don't use a logger, i.e., no versioning of experiments, then we can't avoid conflicts. The LightningCLI can handle this by itself.
This proposal is not aiming at addressing these issues, not at all. These are completely separate concerns.
+1 for this!
We can't solve this problem. If you don't use a logger, i.e., no versioning of experiments, then we can't avoid conflicts. The LightningCLI can handle this by itself.
I think it would be sensible to just use trainer.root_dir in these cases, as this is the only defined thing then.
Maybe experiment_dir should be the Trainer's property rather than the Logger's, so that there is also no problem with no/multiple loggers and with problem 1; we would just need to log the experiment_dir with the Logger.
A definite no from me here. We shouldn't force loggers into our behaviour: we have the trainer root dir, and everything in it is up to the callbacks (ModelCheckpoint) and loggers to use as they see fit. If we force them to log to a directory given by the trainer, they are no longer self-contained (which they should be). Also, this may again not work well with certain logger structures. I think it is best to have this on the logger and only implement an interface for the trainer to know the directory for sure without special-casing.
@awaelchli Thanks for the explanation, it's a lot clearer to me now. I hope this gets introduced soon.
@kabouzeid Good point. We wanted to resolve this a while ago by redirecting to the first logger (#12177). So this issue here is based on the assumption that option B in #12177 gets implemented.
I'm not quite sure why the versioning is coupled to the loggers anyway. With no loggers there is no version, and with multiple loggers the version becomes ambiguous. Also checkpointing doesn't really have much to do with logging. Shouldn't Lightning provide its own versioning in the form of random identifiers for each run? Every run would get its own Lightning directory with a unique identifier where the checkpoints, config.yaml, and possibly the stdout log are stored.
The Lightning directory location would then be logged (like a hparam) to all connected loggers and/or symlinked into their respective log dirs.
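A minimal sketch of this idea, assuming a hypothetical helper; none of these names are an existing Lightning API:

```python
import os
import uuid

def create_run_dir(root_dir: str) -> str:
    # Hypothetical: Lightning itself generates a unique identifier per run
    # and creates one directory for checkpoints, config.yaml, stdout log, ...
    run_id = uuid.uuid4().hex[:8]
    run_dir = os.path.join(root_dir, "lightning_runs", run_id)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir

run_dir = create_run_dir(os.getcwd())
# The location would then be reported to every attached logger, e.g.:
# for logger in trainer.loggers:
#     logger.log_hyperparams({"lightning_run_dir": run_dir})
```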
Anecdotally, I only started using Lightning a few weeks ago, and the first time I tripped was when it took me quite a while to figure out why my checkpoints weren't showing up. The docs say they're saved to the default_root_dir and I sure did not even think about checking the logger directories.
Lightning's position is that a good default behavior is for all artifacts to land in one place. If this is not desired, the user can easily change it by configuring the output directory manually.
Also checkpointing doesn't really have much to do with logging.
Some loggers are doing more than just "logging". They are more like experiment trackers. This can include archiving checkpoints.
The docs say they're saved to the default_root_dir and I sure did not even think about checking the logger directories.
That's correct, and the default is the current working directory.
Hi @awaelchli, here's a case I'm not sure how to handle. Currently, when the Logger.save_dir is None (DummyLogger, CometLogger(online), the deprecated LoggerCollection, etc.), the Trainer.log_dir is None as well, and the checkpoints directory is default_root_dir/name/version. What should experiment_dir return in this case?
- If it returns None, we still need dirty logic to determine where checkpoints should be saved.
- If it returns default_root_dir/name/version, how should I access the trainer.default_root_dir in the logger?
A new idea: experiment_dir just returns a relative path to save_dir and we join them in trainer.log_dir (sketched below).
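A rough sketch of this relative-path idea, with made-up names rather than the real Lightning API:

```python
import os

class OnlineOnlyLogger:
    """Toy example of a logger whose save_dir is None (everything lives online)."""

    name = "my_project"
    version = "run_0"

    @property
    def save_dir(self):
        return None

    @property
    def experiment_dir(self) -> str:
        # Only a relative path; the Trainer decides what to prepend.
        return os.path.join(self.name, str(self.version))


def resolve_log_dir(default_root_dir: str, logger) -> str:
    # Inside the Trainer: fall back to default_root_dir when the logger
    # has no local save_dir of its own.
    base = logger.save_dir if logger.save_dir is not None else default_root_dir
    return os.path.join(base, logger.experiment_dir)
```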
experiment_dir just returns a relative path to save_dir and we join them in trainer.log_dir
Yes, it's one possible solution that I propose in the issue description above. It's not ideal, but the only way currently to respect the Trainer default root dir.
If we don't care about respecting the default root dir for these edge cases, we can return None in DummyLogger.experiment_dir, and internally, in places where we need to access the experiment dir, we check for isinstance(trainer.logger, DummyLogger).
I found this issue because it was unclear to me how an experiment_dir should be set for storing checkpoints, hyperparameter configs, etc. from model training.
It wasn't clear to me that in order to get the folder hierarchy suggested in https://github.com/Lightning-AI/lightning/issues/1207#issue-585620257, one had to leave both Trainer.default_root_dir and ModelCheckpoint.dirpath at their defaults and only set TensorBoardLogger.save_dir; see https://github.com/Lightning-AI/lightning/issues/1207#issuecomment-1355520316.
Suggestion: clarification in docs
Maybe, along with the update to logger.experiment_dir, the documentation for all three parameters Trainer.default_root_dir, ModelCheckpoint.dirpath, and TensorBoardLogger.save_dir should be updated with a link to a description of how to set them to get this nice default folder hierarchy?
At least for me it wasn't immediately obvious.
Why was this issue closed????? Seems very needed
Hello members of Lightning, I would like to know what the discussion or decision was behind closing this issue. According to the links, at least six other issues (https://github.com/Lightning-AI/lightning/issues/16461, https://github.com/Lightning-AI/lightning/issues/17168, https://github.com/Lightning-AI/lightning/issues/1207, https://github.com/Lightning-AI/lightning/issues/15085, https://github.com/Lightning-AI/lightning/issues/14739, https://github.com/Lightning-AI/lightning/issues/14054) and many PRs point to this issue, with potentially more in the future. @lantiga opposed breaking changes in my PR https://github.com/Lightning-AI/lightning/pull/14640, but I still think we need to address this problem, as existing third-party loggers often lack corresponding experiment directories, which poses significant challenges for CLI configuration, the profiler, and checkpoint storage, as mentioned above.
@tshu-w @vedal thanks for your comments
My concern here is breaking people's code extensively, but I'm all for finding a solution that honors compatibility (including adding a new recommended way to specify the location while keeping the old one around).
Hi, I've just encountered a similar issue: I want to save all artifacts in a uniquely versioned experiment dir, but there are so many mechanisms for that in PL that it gets extremely difficult.
Many classes create their own "version" folders, and the code is not shared between them. Some of them rely strongly on the order of loggers. Even more unfortunately, these classes are often instantiated during different steps of the processing pipeline (e.g. log dirs for loggers vs. the output model dir for the checkpoint callback), so it's quite difficult to make sure that all directories are set correctly.
Right now I'm getting my log_dir from CSVLogger and sharing it on my own, but it would be really helpful to implement the changes proposed here and make the output folders consistent across the PL framework.
Any progress now? It would be great to have a unified experiment dir.
I tried to submit a PR after some discussion, but apparently the Lightning team did not reach a consensus and lacked the bandwidth to focus on this issue :(.
If they decide how to solve it, I am willing to submit the PR again.
🚀 Feature
Introduce a property/method experiment_dir on the abstract logger interface. All loggers must implement this and return a path to where their files are being saved.

Motivation
The Trainer currently determines where checkpoints are saved like so:
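Roughly, the logic looks like the following sketch (paraphrased, not a verbatim copy of the Lightning source):

```python
import os

def resolve_ckpt_dir(trainer) -> str:
    # Paraphrased sketch: the checkpoint directory is assembled from the first
    # logger's save_dir, name and version, falling back to default_root_dir.
    if trainer.loggers:
        logger = trainer.loggers[0]
        save_dir = logger.save_dir or trainer.default_root_dir
        version = logger.version if isinstance(logger.version, str) else f"version_{logger.version}"
        return os.path.join(save_dir, str(logger.name), version, "checkpoints")
    return os.path.join(trainer.default_root_dir, "checkpoints")
```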
As you can see, it assumes a rigid folder structure save_dir/name/version. Every logger has to follow this structure. However, at least one logger (wandb) struggles with this, because it has an additional concept of "project" that does not fit well within this layout (https://github.com/Lightning-AI/lightning/issues/14054). Furthermore, the "name" may change AFTER the files were written, so there is a disconnect between the folder structure and what is represented "online" in the wandb UI.

One observation is that the Trainer actually never cares about any of the subfolders. All it needs is the full path. This is why I believe that if we let the logger decide how to assemble the path, we will avoid many issues and disagreements with 3rd-party experiment management.
We are struggling with making all loggers agree to a standardized layout, so maybe we shouldn't try to do that, at least not in every aspect of how the loggers work (https://github.com/Lightning-AI/lightning/issues/12028#issuecomment-1088325894).
Pitch
Introduce the new property on the Logger base class.
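A rough sketch of what the interface could look like (the exact signature is open for discussion):

```python
from abc import ABC, abstractmethod

class Logger(ABC):
    @property
    @abstractmethod
    def experiment_dir(self) -> str:
        """Relative path under which this logger saves all of its files."""
```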
This should probably be a relative path, so that the Trainer can prepend a root dir if needed (default cwd). An implementation of a logger could then look like this:
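Building on the base-class sketch above, a simplified TensorBoard-style logger (illustrative only, not the real TensorBoardLogger) might keep the familiar name/version layout, while other loggers remain free to assemble the relative path however suits their backend:

```python
import os

class TensorBoardStyleLogger(Logger):
    def __init__(self, name: str = "lightning_logs", version: int = 0):
        self._name = name
        self._version = version

    @property
    def experiment_dir(self) -> str:
        # Keeps the name/version convention, but this is now an internal
        # decision of the logger rather than a framework-wide assumption.
        return os.path.join(self._name, f"version_{self._version}")
```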
The checkpointing in the Trainer would then save files under the directory returned by the logger, for example:
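A hedged sketch of that join; the helper name is made up:

```python
import os

def checkpoint_dir(default_root_dir: str, logger) -> str:
    # The Trainer never inspects the subfolders; it only joins its root dir
    # with whatever relative path the logger returns.
    return os.path.join(default_root_dir, logger.experiment_dir, "checkpoints")
```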
Implications

We would have to decide what happens with the existing save_dir and log_dir properties (whether to keep them or remove them from the base interface). Trainer.log_dir would probably get simplified: https://github.com/Lightning-AI/lightning/blob/fe9e5d55bf7991ba36b76d6adae9075b93dfcaa0/src/pytorch_lightning/trainer/trainer.py#L2208-L2219

Alternatives
Struggle
Additional context
Discussions in: https://github.com/Lightning-AI/lightning/issues/14054
Current blockers:
#12177
cc @borda @tchaton @justusschock @awaelchli @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy