Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.35k stars 3.38k forks source link

Cross validation feature #839

Closed BraveDistribution closed 2 years ago

BraveDistribution commented 4 years ago

🚀 Feature

Cross-Validation is a crucial model validation techniques for assessing how the model generalizes on new data.

Motivation

Research papers usually require cross-validation. From my point of view, this kind of feature would simplify the work of researches.

Pitch

I want to pass a parameter to the Trainer object to specify that I want to train the model on K-folds.

In the case that nobody wants to make a PR, I can start working on that.

GabrielePicco commented 3 years ago

This is my custom wrapper solution using scikit-learn K-fold that perform cross validation with datamodule:

from sklearn.model_selection import KFold
from torch.utils.data import ConcatDataset, Subset, DataLoader

data_set = ConcatDataset([data_module.train_dataloader().dataset,
                              data_module.val_dataloader().dataset,
                              data_module.test_dataloader().dataset])

kf = KFold(n_splits=num_fold, shuffle=shuffle, random_state=random_state)
    for k, (train_index, test_index) in enumerate(kf.split(data_set)):

        # Save model, logs in different folders
        if logger and isinstance(logger, TensorBoardLogger):
            logger._version = f'fold_{k + 1}'
        log.info(f'Starting fold {k+1}/{num_fold} Training...')

        train_data_loader = DataLoader(Subset(data_set, train_index.tolist()), collate_fn=data_module.collate_fn,
                                       batch_size=data_module.train_dataloader().batch_size)
        test_data_loader = DataLoader(Subset(data_set, test_index.tolist()), collate_fn=data_module.collate_fn,
                                      batch_size=data_module.train_dataloader().batch_size)

        model = instantiator.model(task, model_data_kwargs=getattr(data_module, "model_data_kwargs", None))
        trainer = instantiator.trainer(trainer_conf, logger=logger)

        trainer.fit(model, train_dataloader=train_data_loader)
        trainer.test(model, test_dataloaders=test_data_loader)

Just sharing in case it may be useful to someone until official features is implemented

Troublem1 commented 3 years ago

Hi @GabrielePicco can you please explain what the instantiator class does as it takes task and the data model datasets?

If possible the sharing the code snippet would be helpful thanks

GabrielePicco commented 3 years ago

@Troublem1 instantiator is just an object for creating model and training instances from hydra configuration, you can find the implementation here: https://github.com/PyTorchLightning/lightning-transformers/blob/master/lightning_transformers/core/instantiator.py

you can replace the two rows with normale instance creation:

model = YourPLModule(args)
trainer = Trainer(logger=logger, checkpoint_callback=checkpoint_callback)
alexyalunin commented 3 years ago

@GabrielePicco how does the metrics calculation work, does it just calculate 5 numbers say for 5 folds? I want the average over all folds

GabrielePicco commented 3 years ago

@alexyalunin yes it calculates metrics for each fold. You could access the fold metric with trainer.callback_metrics at the end of the for loop and average across folds.

ogreyesp commented 3 years ago

Hi, this feature is very important.

Any update on this??

Troublem1 commented 3 years ago

@ogreyesp . based on @GabrielePicco implementation, you could come up with various approaches in doing this until this feature is implemented with PL be it with a custom dataset class or datamodule..

turian commented 3 years ago

@ogreyesp why don't you check out the linked PR https://github.com/PyTorchLightning/pytorch-lightning/pull/8715 and give comments and code review?

ogreyesp commented 3 years ago

@Troublem1 Thanks, the solution of @GabrielePicco is straightforward and simple. I had a similar solution implemented. However, this kind of solution does not guarantee neither the correct aggregation of results across folds nor the logging of results in the different loggers.

I think that the solutions provided by @jbschiratti and @CarlosUziel are more elaborated.

ogreyesp commented 3 years ago

@turian I would glad to help, but unfortunately I don't have the enough experience in Pytorch-lightning. I just started learning Pytorch and Pytorch-lightning two weeks ago. After years using Tensorflow and Keras, I'm moving to Pytorch...

jbschiratti commented 3 years ago

I've worked on this and can submit a PR in (almost) a week.

ogreyesp commented 3 years ago

Hi @jbschiratti . That is a great news. This feature is very important for researching.

One question, Is your future pull request related in some way with the pull request mentioned by @turian ?

justusschock commented 3 years ago

Hi @jbschiratti ,

We would welcome a PR and also definitely your thoughts on #8854 and #8715 which are some basic implementations based on different lightning APIs. We are currently trying to figure out on top of which part of our API to build features like this.

Make sure to ping @tchaton @awaelchli @ananthsub and me (those are the people interested in the other approaches) when you open the PR of yours with a (potentially) different approach so that we can discuss advantages and disadvantages of each of them :)

lucmos commented 3 years ago

Hello,

I am also interested in this feature! I am happy to help with PRs or new proposals.


In particular, I would like to highlight my use case: cross-validated hyperparameters tuning with the Ray Tune - Lightning integration.

In this case, each tuning trial is a cross-validated run. It is not obvious how to integrate them nicely since:

I feel like it should be possible to implement the CV so that it is transparent to ray tune -- or in general any hyperparameters tuning. What do you think?

justusschock commented 3 years ago

@lucmos yes this should be possible. But to be honest, this won't be our main concern in the initial release phase. There we are mostly looking at the UI and integration within lightning and maybe later we'll have a look on how to integrate with other libraries :)

ogreyesp commented 3 years ago

I've worked on this and can submit a PR in (almost) a week.

Did you submit the PR?

tchaton commented 3 years ago

Hey everyone,

I have a PoC for an KFoldLoop subclassing ExternalLoop: https://github.com/PyTorchLightning/pytorch-lightning/pull/8715/files.

Would really appreciate some feedbacks.

Best, T.C

ogreyesp commented 3 years ago

Hi, any update on this topic??

rohitgr7 commented 3 years ago

hi everyone!

we have an example to implement KFold inside lightning here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/loop_examples/kfold.py

danielhomola commented 3 years ago

Hi @rohitgr7 , @tchaton

Thanks a lot for this, it was a really useful starting point! I took this gist and modified it. I run into the following problem:

I tried resetting self.trainer.current_epich=0 in the on_advance_start method, but that's a read only property. Any ideas on how to solve this? Thanks!

ltx-dan commented 3 years ago

For anyone else wondering, you can just add

self.fit_loop.current_epoch = 0 to the _reset_fitting.

rohitgr7 commented 3 years ago

hey! thanks for raising this. will check this more. thanks @ltx-dan for the hint.

ltx-dan commented 3 years ago

Hi @rohitgr7 , @tchaton

Another issue with your solution is that end of epoch logging seems to not work at all.. Instead of getting a line plot in tensorboard from the metrics, I get a single dot (with the latest value of the logged metric), each epoch overwriting the previous one.. Any idea why this might be? Thanks!

tchaton commented 3 years ago

Dear @ltx-dan,

Yes, there is a small bug. I have a fix there: https://github.com/PyTorchLightning/lightning-flash/pull/879/files#diff-86a5b4000afaacf440fd3b55201f5c9b7eb005b08977b3c2eb3608dcaab6d95fR100

If you are interested, you can provide make a PR fix to Lightning :)

Best, T.C

ltx-dan commented 3 years ago

Hi @tchaton

Thanks! I'm a bit confused.. is this PR pending against PTL or some other project? Are you referring to the line

 if metrics:
   self.trainer.logger.log_metrics(metrics[0], step=self.trainer.global_step)

as a solution to the logging problem?

Thanks, d

tchaton commented 3 years ago

Dear @ltx-dan,

Yes, this PR is pending toward Lightning Flash and this line would be the solution. We need to replace the step=self.trainer.global_step by the right step and maybe hack around to prevent overriding the values.

If you are interested, you can give it a try, otherwise I would try to sort it out soon :)

Best, T.C

ltx-dan commented 3 years ago

HI @tchaton

I don't think

 if metrics:
   self.trainer.logger.log_metrics(metrics[0], step=self.trainer.global_step)

solves much (or anything really).. the whole logging is disrupted..

I think KFold loop is overwriting something fundamental in the fit_loop.. I think this is a serious bug currently..

tchaton commented 3 years ago

Dear @ltx-dan,

I would say it is not fully supported feature yet :) I don't believe there is already a consensus on how logging should look in such use case. I think it is interesting to investigate if you are interested. Otherwise, I would work on this after v1.5

Best, T.C

ltx-dan commented 3 years ago

Further details.. If you change the __getattr__ to

    def __getattr__(self, key) -> Any:
        # requires to be overridden as attributes of the wrapped loop are being accessed
        if key not in self.__dict__:
            # if "global_step" in key:
            #     print(key, getattr(self.fit_loop, key))
            return getattr(self.fit_loop, key)
        return self.__dict__[key]

you can clearly see that the global_step is not being accessed after the first epoch finishes.. This seems wrong (although I don't know the inner working of the fitloop to be sure). Also, if I set max_epochs=4 for example then at the end of four epochs self.fit_loop.global_step shows four epochs worth of steps but self.trainer.global_step only shows one epoch worth of steps.. What's going on here?

I'd be happy to update your example and solve this but some hints or tips would be useful.. Thanks!

ltx-dan commented 3 years ago

@tchaton

I think I have a hacky workaround.. The problem is that something (genuinely don't know from the call stack) is setting global_step on the trainer.fit_loop which in our case is a wrapper class (KFoldLoop) around the real fit_loop. Once KFoldLoop.global_step = x is set, the __getattr__ override trick doesn't work..

For the time being I just added an explicit and hacky workaround to the class.

    @property
    def global_step(self) -> int:
        return self.fit_loop.global_step

    @global_step.setter
    def global_step(self, value):
        self.fit_loop.global_step = value

Also, in my trainer, I'm logging per fold by doing

    def training_step(self, batch, batch_idx):
        loss, y_true, y_pred = self._calculate_loss(batch)
        self.log(f"loss_train/fold{self.trainer.fit_loop.current_fold}", loss)

and that works quite nicely too..

Finally, I took your idea from the PR and added self.fit_loop.epoch_progress = Progress() after the self.fit_loop.run() to ensure the epoch_count is reset after each fold.

I should also add, that I'm using all of this with LightningCLI and it all works perfectly fine

    cli = LightningCLI(
        model,
        datamodule,
        seed_everything_default=42,
        run=False,
        save_config_overwrite=True,
    )
    cli.trainer.logger = pl_loggers.TensorBoardLogger("logs/")
    cli.trainer.fit_loop = KFoldLoop(
        n_folds=cli.datamodule.hparams.n_folds,
        fit_loop=cli.trainer.fit_loop,
        export_path=cli.trainer.logger.log_dir,
    )
    cli.trainer.fit(cli.model, cli.datamodule)
tchaton commented 3 years ago

Dope ! Mind making a PR :) ?

ltx-dan commented 3 years ago

I'm pretty swamped this week.. so if it can wait till next week I'm happy to, otherwise, feel free to go ahead and add these tiny chages to your example.

tchaton commented 3 years ago

Dear @ltx-dan,

It can wait next week, however, we are releasing Lightning v1.5 next Tuesday. Would be awesome to get this example fixed if you can by then :)

Best, T.C

minwang-ai commented 2 years ago

Hi,

I have a question. The original datasets split by training and test set.

When we do data argumentation for example create synthetic data on training set, how can we apply K fold cross validation ? It seems that we can only use synthetic data for training, right?

Best wishes, Min

jameschapman19 commented 2 years ago

Worth noting that metrics from on_train_epoch_end and on_validation_epoch_end log with global_step rather than epoch (different to behaviour with vanilla pytorch-lightning).

This is after the proposed changes by @ltx-dan. If I find anything that fixes this I will let you know.

(I assume @ltx-dan with your set up you have the epochs metric logging as a zigzag/sharktooth against global step for each fold?)

ltx-dan commented 2 years ago

Yepp, that's what I have @jameschapman19 . But that works for me bc I wanted to ensure that the max_epoch param is abided by fold and I don't really care about the total epochs passed across the folds..

jameschapman19 commented 2 years ago

Thanks @ltx-dan - nice job with the changes by the way!

tchaton commented 2 years ago

Hey @ltx-dan,

Did you make a PR with the changes ?

ltx-dan commented 2 years ago

No I'm afraid I haven't got around to it yet.. But all the mods I had to do are outlined above so feel free to add them to your PR.

vedal commented 2 years ago

Really hope this is still planned for milestone 1.6 as suggested by @awaelchli :)

rohitgr7 commented 2 years ago

fixed the example recently. are you guys still facing issues with this? https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/loop_examples/kfold.py

ltx-dan commented 2 years ago

Hi @rohitgr7, I get

AttributeError: 'FitLoop' object has no attribute 'replace'

What is this line supposed to do? If I delete it, it all works fine btw..

Also, if you don't want to do distributed training (bc you have single GPU) this fails..

rohitgr7 commented 2 years ago

hey @ltx-dan ! the example is updated to work with master, new release is probably next month. to run it with latest release, you need to make small adjustments. check more here: https://github.com/PyTorchLightning/pytorch-lightning/issues/11565

ltx-dan commented 2 years ago

Hi @rohitgr7 I see it's added here. Would the strategy=null work too on the master branch?

rohitgr7 commented 2 years ago

for single device training, the strategy will be SingleDeviceStrategy. It can never be Null.

function2-llx commented 2 years ago

Hello @rohitgr7 , may I ask if there's some plan to adopt the example implementation officially into the library, or just leave it as an example (with copy & paste for usage)? Thanks!

rohitgr7 commented 2 years ago

hey @function2-llx

for now, AFAIK, it will stay as an example since Loop API is still experimental.

YannDubs commented 2 years ago

what if we just integrate with sklearn cross validation? this can be the start of supporting sklearn interop

I agree with @williamFalcon , for the case of datasets in memory we should really just have a function (probably in bolts) that wraps SklearnDatamodule Trainer Module to a module that can be used with sklearn. Then you could get all the sklearn goodies for free: Pipeline (eg for preprocessing or to add a sklearn predictor after a SSL module), cross validation, hyperparameter tuning ...

chanshing commented 2 years ago

hey @function2-llx

for now, AFAIK, it will stay as an example since Loop API is still experimental.

I'm getting the following error when I try to use the kfold.py example:

    self.trainer.strategy.setup_optimizers(self.trainer)
AttributeError: 'Trainer' object has no attribute 'strategy'

I'm using PL 1.5.10 and Pytorch 1.10.2

ltx-dan commented 2 years ago

hi @rohitgr7

FYI, we just discovered that this updated version of the example (running on master) does not work at all if early_stopping callbacks are used.. after the first fold, the subsequent ones don't run at all.. do you have any idea why that might be?

What I found was that in anticipation of the rollout of the strategy feature, callbacks were updated (early_stopping included), see here

So for some reason, once early_stopping is updating the should_stop property of the trainer, it is never reset (for the subsequent folds) in kfold (after switching to strategy backend for ptl1.6.0. So, simply adding

self.trainer.should_stop = False

to the _reset_fitting method in your example makes it work again.