This is my custom wrapper solution using scikit-learn's KFold that performs cross-validation with a datamodule:
```python
import logging

from pytorch_lightning.loggers import TensorBoardLogger
from sklearn.model_selection import KFold
from torch.utils.data import ConcatDataset, Subset, DataLoader

log = logging.getLogger(__name__)

# num_fold, shuffle, random_state, data_module, instantiator, task, trainer_conf
# and logger all come from the surrounding wrapper function.

# Merge the train/val/test datasets so the folds are drawn from all available data
data_set = ConcatDataset([data_module.train_dataloader().dataset,
                          data_module.val_dataloader().dataset,
                          data_module.test_dataloader().dataset])
kf = KFold(n_splits=num_fold, shuffle=shuffle, random_state=random_state)

for k, (train_index, test_index) in enumerate(kf.split(data_set)):
    # Save model and logs in a different folder for each fold
    if logger and isinstance(logger, TensorBoardLogger):
        logger._version = f'fold_{k + 1}'
    log.info(f'Starting fold {k + 1}/{num_fold} training...')

    train_data_loader = DataLoader(Subset(data_set, train_index.tolist()),
                                   collate_fn=data_module.collate_fn,
                                   batch_size=data_module.train_dataloader().batch_size)
    test_data_loader = DataLoader(Subset(data_set, test_index.tolist()),
                                  collate_fn=data_module.collate_fn,
                                  batch_size=data_module.train_dataloader().batch_size)

    # A fresh model and trainer are instantiated for every fold
    model = instantiator.model(task, model_data_kwargs=getattr(data_module, "model_data_kwargs", None))
    trainer = instantiator.trainer(trainer_conf, logger=logger)
    trainer.fit(model, train_dataloader=train_data_loader)
    trainer.test(model, test_dataloaders=test_data_loader)
```
Just sharing in case it may be useful to someone until the official feature is implemented.
Hi @GabrielePicco, can you please explain what the instantiator class does, given that it takes the task and the data module?
If possible, sharing a code snippet would be helpful. Thanks!
@Troublem1 instantiator is just an object for creating model and trainer instances from a Hydra configuration; you can find the implementation here: https://github.com/PyTorchLightning/lightning-transformers/blob/master/lightning_transformers/core/instantiator.py
You can replace those two lines with normal instance creation:
```python
model = YourPLModule(args)
trainer = Trainer(logger=logger, checkpoint_callback=checkpoint_callback)
```
@GabrielePicco how does the metrics calculation work? Does it just calculate 5 numbers, say, for 5 folds? I want the average over all folds.
@alexyalunin yes, it calculates metrics for each fold. You could access the fold metrics with `trainer.callback_metrics` at the end of the for loop and average across folds.
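Something like this (a minimal sketch; `test_acc` is just an example metric name, use whatever your `test_step` logs):

```python
# Collect one value per fold from trainer.callback_metrics, then average.
fold_scores = []
for k, (train_index, test_index) in enumerate(kf.split(data_set)):
    # ... build the dataloaders, model and trainer, then fit/test as above ...
    trainer.fit(model, train_dataloader=train_data_loader)
    trainer.test(model, test_dataloaders=test_data_loader)
    fold_scores.append(trainer.callback_metrics["test_acc"].item())

print(f"Average over {len(fold_scores)} folds: {sum(fold_scores) / len(fold_scores):.4f}")
```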
Hi, this feature is very important.
Any update on this??
@ogreyesp Based on @GabrielePicco's implementation, you could come up with various approaches to doing this until the feature is implemented in PL, be it with a custom dataset class or a datamodule.
@ogreyesp why don't you check out the linked PR https://github.com/PyTorchLightning/pytorch-lightning/pull/8715 and give comments and code review?
@Troublem1 Thanks, the solution of @GabrielePicco is straightforward and simple. I had a similar solution implemented. However, this kind of solution guarantees neither the correct aggregation of results across folds nor the logging of results in the different loggers.
I think that the solutions provided by @jbschiratti and @CarlosUziel are more elaborate.
@turian I would be glad to help, but unfortunately I don't have enough experience with PyTorch Lightning. I just started learning PyTorch and PyTorch Lightning two weeks ago. After years of using TensorFlow and Keras, I'm moving to PyTorch...
I've worked on this and can submit a PR in (almost) a week.
Hi @jbschiratti. That is great news. This feature is very important for research.
One question: is your future pull request related in some way to the pull request mentioned by @turian?
Hi @jbschiratti,
We would welcome a PR, and also definitely your thoughts on #8854 and #8715, which are some basic implementations based on different Lightning APIs. We are currently trying to figure out on top of which part of our API to build features like this.
Make sure to ping @tchaton, @awaelchli, @ananthsub and me (those are the people interested in the other approaches) when you open your PR with a (potentially) different approach, so that we can discuss the advantages and disadvantages of each of them :)
Hello,
I am also interested in this feature! I am happy to help with PRs or new proposals.
In particular, I would like to highlight my use case: cross-validated hyperparameter tuning with the Ray Tune - Lightning integration.
In this case, each tuning trial is a cross-validated run. It is not obvious how to integrate them nicely, e.g. which `val_loss` to report back to the tuner.
I feel like it should be possible to implement the CV so that it is transparent to Ray Tune -- or in general to any hyperparameter tuning. What do you think?
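For example, a rough sketch of what I mean (run all folds inside one trial and report only the aggregate; `run_fold` is a hypothetical helper that trains a single fold and returns its best `val_loss`):

```python
from ray import tune

def cross_validated_trainable(config):
    # Train each fold with the same hyperparameters, then report the mean,
    # so the tuner only ever sees a single val_loss per trial.
    fold_losses = [run_fold(config, fold) for fold in range(config["num_folds"])]
    tune.report(val_loss=sum(fold_losses) / len(fold_losses))

analysis = tune.run(
    cross_validated_trainable,
    config={"num_folds": 5, "lr": tune.loguniform(1e-4, 1e-1)},
)
```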
@lucmos yes, this should be possible. But to be honest, this won't be our main concern in the initial release phase. There we are mostly looking at the UI and integration within Lightning, and maybe later we'll have a look at how to integrate with other libraries :)
> I've worked on this and can submit a PR in (almost) a week.
Did you submit the PR?
Hey everyone,
I have a PoC for a KFoldLoop subclassing ExternalLoop: https://github.com/PyTorchLightning/pytorch-lightning/pull/8715/files.
Would really appreciate some feedback.
Best, T.C
Hi, any update on this topic??
hi everyone!
we have an example of implementing KFold inside Lightning here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/loop_examples/kfold.py
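In short (an abridged paraphrase; the linked file is the authoritative version), the example defines a datamodule interface with two hooks, plus a `KFoldLoop` that replaces `trainer.fit_loop`, calls `setup_fold_index(...)` on each advance, and re-runs the wrapped fit loop:

```python
from abc import ABC, abstractmethod
from pytorch_lightning import LightningDataModule

class BaseKFoldDataModule(LightningDataModule, ABC):
    @abstractmethod
    def setup_folds(self, num_folds: int) -> None:
        """Compute the fold indices once, before training starts."""

    @abstractmethod
    def setup_fold_index(self, fold_index: int) -> None:
        """Point the train/val dataloaders at the given fold."""
```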
Hi @rohitgr7, @tchaton,
Thanks a lot for this, it was a really useful starting point! I took this gist and modified it, but I ran into the following problem: I set `max_epochs=10`, which is reached in the first fold already. In the next fold, `trainer.current_epoch` starts from 9, and therefore it only does a single epoch's worth of training before going straight into testing. I tried resetting `self.trainer.current_epoch = 0` in the `on_advance_start` method, but that's a read-only property. Any ideas on how to solve this? Thanks!
For anyone else wondering, you can just add `self.fit_loop.current_epoch = 0` to the `_reset_fitting` method.
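In context, it would look roughly like this (paraphrasing the example's method; only the last line is the addition):

```python
def _reset_fitting(self) -> None:
    self.trainer.reset_train_dataloader()
    self.trainer.reset_val_dataloader()
    self.trainer.state.fn = TrainerFn.FITTING
    self.trainer.training = True
    self.fit_loop.current_epoch = 0  # let every fold train for the full max_epochs
```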
hey! thanks for raising this. will check this more. thanks @ltx-dan for the hint.
Hi @rohitgr7, @tchaton,
Another issue with your solution is that end-of-epoch logging seems to not work at all. Instead of getting a line plot in TensorBoard from the metrics, I get a single dot (with the latest value of the logged metric), each epoch overwriting the previous one. Any idea why this might be? Thanks!
Dear @ltx-dan,
Yes, there is a small bug. I have a fix here: https://github.com/PyTorchLightning/lightning-flash/pull/879/files#diff-86a5b4000afaacf440fd3b55201f5c9b7eb005b08977b3c2eb3608dcaab6d95fR100
If you are interested, you can make a PR with the fix to Lightning :)
Best, T.C
Hi @tchaton,
Thanks! I'm a bit confused: is this PR pending against PTL or some other project? Are you referring to the line

```python
if metrics:
    self.trainer.logger.log_metrics(metrics[0], step=self.trainer.global_step)
```

as a solution to the logging problem?
Thanks, d
Dear @ltx-dan,
Yes, this PR is pending toward Lightning Flash, and this line would be the solution. We need to replace `step=self.trainer.global_step` with the right step, and maybe hack around it to prevent overwriting the values.
If you are interested, you can give it a try; otherwise I will try to sort it out soon :)
Best, T.C
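For instance, something along these lines (purely a sketch; `self._accumulated_steps` is an illustrative counter that does not exist yet and would need to be bumped at the end of each fold):

```python
# Offset the fold-local global_step by the steps of all previous folds,
# so each fold's points land after the previous fold's in TensorBoard.
if metrics:
    step = self._accumulated_steps + self.trainer.global_step
    self.trainer.logger.log_metrics(metrics[0], step=step)
```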
Hi @tchaton,
I don't think

```python
if metrics:
    self.trainer.logger.log_metrics(metrics[0], step=self.trainer.global_step)
```

solves much (or anything, really); the whole logging is disrupted. I think the KFold loop is overwriting something fundamental in the fit_loop. I think this is a serious bug currently.
Dear @ltx-dan,
I would say it is not a fully supported feature yet :) I don't believe there is a consensus yet on how logging should look in such a use case. I think it is interesting to investigate if you are interested. Otherwise, I would work on this after v1.5.
Best, T.C
Further details: if you change the `__getattr__` to

```python
def __getattr__(self, key) -> Any:
    # requires to be overridden as attributes of the wrapped loop are being accessed
    if key not in self.__dict__:
        # if "global_step" in key:
        #     print(key, getattr(self.fit_loop, key))
        return getattr(self.fit_loop, key)
    return self.__dict__[key]
```
you can clearly see that `global_step` is not being accessed after the first epoch finishes. This seems wrong (although I don't know the inner workings of the fit loop well enough to be sure). Also, if I set `max_epochs=4`, for example, then at the end of four epochs `self.fit_loop.global_step` shows four epochs' worth of steps, but `self.trainer.global_step` only shows one epoch's worth of steps. What's going on here?
I'd be happy to update your example and solve this, but some hints or tips would be useful. Thanks!
@tchaton
I think I have a hacky workaround. The problem is that something (I genuinely can't tell what from the call stack) is setting `global_step` on `trainer.fit_loop`, which in our case is a wrapper class (`KFoldLoop`) around the real fit loop. Once `KFoldLoop.global_step = x` is set, the `__getattr__` override trick no longer works.
For the time being, I just added an explicit and hacky workaround to the class:

```python
@property
def global_step(self) -> int:
    return self.fit_loop.global_step

@global_step.setter
def global_step(self, value):
    self.fit_loop.global_step = value
```
Also, in my LightningModule, I'm logging per fold by doing

```python
def training_step(self, batch, batch_idx):
    loss, y_true, y_pred = self._calculate_loss(batch)
    self.log(f"loss_train/fold{self.trainer.fit_loop.current_fold}", loss)
    return loss  # return the loss so Lightning runs the backward pass
```

and that works quite nicely too.
Finally, I took your idea from the PR and added `self.fit_loop.epoch_progress = Progress()` after the `self.fit_loop.run()` to ensure the epoch count is reset after each fold.
I should also add that I'm using all of this with LightningCLI and it all works perfectly fine:

```python
cli = LightningCLI(
    model,
    datamodule,
    seed_everything_default=42,
    run=False,
    save_config_overwrite=True,
)
cli.trainer.logger = pl_loggers.TensorBoardLogger("logs/")
cli.trainer.fit_loop = KFoldLoop(
    n_folds=cli.datamodule.hparams.n_folds,
    fit_loop=cli.trainer.fit_loop,
    export_path=cli.trainer.logger.log_dir,
)
cli.trainer.fit(cli.model, cli.datamodule)
```
Dope! Mind making a PR? :)
I'm pretty swamped this week, so if it can wait till next week I'm happy to; otherwise, feel free to go ahead and add these tiny changes to your example.
Dear @ltx-dan,
It can wait until next week; however, we are releasing Lightning v1.5 next Tuesday. It would be awesome to get this example fixed by then if you can :)
Best, T.C
Hi,
I have a question. The original dataset is split into training and test sets.
When we do data augmentation on the training set, for example creating synthetic data, how can we apply K-fold cross-validation? It seems that we can only use the synthetic data for training, right?
Best wishes, Min
Worth noting that metrics from `on_train_epoch_end` and `on_validation_epoch_end` log with `global_step` rather than epoch (different to the behaviour of vanilla pytorch-lightning).
This is after the proposed changes by @ltx-dan. If I find anything that fixes this I will let you know.
(I assume @ltx-dan with your setup you have the epoch metrics logging as a zigzag/sharktooth against global step for each fold?)
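In the meantime, one possible workaround (untested, just an idea) is to log the epoch itself as a metric, so TensorBoard can chart per-epoch values against it regardless of which step the logger uses:

```python
# Inside the LightningModule: expose the epoch as a metric; TensorBoard's
# custom x-axis can then plot other metrics against "epoch" instead of step.
def on_train_epoch_end(self):
    self.log("epoch", float(self.current_epoch))
```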
Yepp, that's what I have, @jameschapman19. But that works for me because I wanted to ensure that the `max_epochs` param is respected per fold, and I don't really care about the total number of epochs across the folds.
Thanks @ltx-dan - nice job with the changes by the way!
Hey @ltx-dan,
Did you make a PR with the changes?
No, I'm afraid I haven't got around to it yet. But all the mods I had to do are outlined above, so feel free to add them to your PR.
Really hope this is still planned for milestone 1.6 as suggested by @awaelchli :)
Fixed the example recently. Are you guys still facing issues with this? https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/loop_examples/kfold.py
hey @ltx-dan! The example is updated to work with master; the new release is probably next month. To run it with the latest release, you need to make small adjustments. Check more here: https://github.com/PyTorchLightning/pytorch-lightning/issues/11565
Hi @rohitgr7, I see it's added here. Would `strategy=null` work too on the master branch?
For single-device training, the strategy will be `SingleDeviceStrategy`. It can never be null.
Hello @rohitgr7, may I ask if there's a plan to adopt the example implementation officially into the library, or will it just be left as an example (with copy & paste for usage)? Thanks!
hey @function2-llx
for now, AFAIK, it will stay as an example since the Loop API is still experimental.
what if we just integrate with sklearn cross-validation? this can be the start of supporting sklearn interop
I agree with @williamFalcon. For the case of datasets in memory, we should really just have a function (probably in bolts) that wraps `SklearnDataModule`, `Trainer` and `Module` into a module that can be used with sklearn. Then you could get all the sklearn goodies for free: `Pipeline` (e.g. for preprocessing or to add a sklearn predictor after an SSL module), cross-validation, hyperparameter tuning ... as sketched below.
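Something along these lines (a rough, unofficial sketch; `LitRegressor` and `SklearnWrapper` are made-up names, not bolts APIs):

```python
import numpy as np
import pytorch_lightning as pl
import torch
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score
from torch.utils.data import DataLoader, TensorDataset

class LitRegressor(pl.LightningModule):
    def __init__(self, in_dim):
        super().__init__()
        self.net = torch.nn.Linear(in_dim, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x).squeeze(-1), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-2)

class SklearnWrapper(BaseEstimator, RegressorMixin):
    """Adapt a LightningModule to sklearn's estimator protocol."""

    def __init__(self, max_epochs=5):
        self.max_epochs = max_epochs

    def fit(self, X, y):
        ds = TensorDataset(torch.as_tensor(X, dtype=torch.float32),
                           torch.as_tensor(y, dtype=torch.float32))
        self.model_ = LitRegressor(X.shape[1])
        pl.Trainer(max_epochs=self.max_epochs).fit(self.model_, DataLoader(ds, batch_size=32))
        return self

    def predict(self, X):
        with torch.no_grad():
            return self.model_.net(torch.as_tensor(X, dtype=torch.float32)).squeeze(-1).numpy()

# sklearn now handles the fold splitting, scoring and aggregation:
X, y = np.random.randn(100, 3), np.random.randn(100)
print(cross_val_score(SklearnWrapper(), X, y, cv=5))  # one R^2 score per fold
```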
> hey @function2-llx
> for now, AFAIK, it will stay as an example since the Loop API is still experimental.
I'm getting the following error when I try to use the `kfold.py` example:

```
self.trainer.strategy.setup_optimizers(self.trainer)
AttributeError: 'Trainer' object has no attribute 'strategy'
```

I'm using PL 1.5.10 and PyTorch 1.10.2.
Hi @rohitgr7,
FYI, we just discovered that this updated version of the example (running on master) does not work at all if early-stopping callbacks are used: after the first fold, the subsequent ones don't run at all. Do you have any idea why that might be?
What I found was that, in anticipation of the rollout of the `strategy` feature, the callbacks were updated (early stopping included), see here. So once early stopping sets the `should_stop` property of the trainer, it is never reset for the subsequent folds in kfold (after switching to the `strategy` backend for PL 1.6.0). Simply adding `self.trainer.should_stop = False` to the `_reset_fitting` method in your example makes it work again.
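Putting both fixes from this thread together, `_reset_fitting` ends up looking roughly like this (again paraphrasing the example's method; the last two lines are the additions):

```python
def _reset_fitting(self) -> None:
    self.trainer.reset_train_dataloader()
    self.trainer.reset_val_dataloader()
    self.trainer.state.fn = TrainerFn.FITTING
    self.trainer.training = True
    self.fit_loop.current_epoch = 0   # per-fold epoch reset (discussed earlier)
    self.trainer.should_stop = False  # clear the EarlyStopping flag from the previous fold
```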
🚀 Feature
Cross-validation is a crucial model validation technique for assessing how a model generalizes to new data.
Motivation
Research papers usually require cross-validation. From my point of view, this kind of feature would simplify the work of researchers.
Pitch
I want to pass a parameter to the Trainer object to specify that I want to train the model on K folds, as sketched below.
In case nobody else wants to make a PR, I can start working on it.
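For instance (purely illustrative; `num_folds` is a hypothetical Trainer argument, not an existing one):

```python
# Hypothetical API from this pitch: one flag on the Trainer enables K-fold CV.
trainer = Trainer(max_epochs=10, num_folds=5)
trainer.fit(model, datamodule=dm)  # would internally fit once per fold
```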