Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Support all iterator modes for fit/validate/test/predict #16830

Open carmocca opened 1 year ago

carmocca commented 1 year ago

Description & Motivation

trainer.fit only works with CombinedLoader(..., mode="max_size_cycle"|"min_size")

trainer.{validate,test,predict} only works with CombinedLoader(..., mode="sequential")

This constraint is checked in the top-level loops:
https://github.com/Lightning-AI/lightning/blob/0009cde1db1a9ab4e2f1e0a9f69a4affb59d5134/src/lightning/pytorch/loops/fit_loop.py#L351-L354
https://github.com/Lightning-AI/lightning/blob/0009cde1db1a9ab4e2f1e0a9f69a4affb59d5134/src/lightning/pytorch/loops/evaluation_loop.py#L182-L183
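
For illustration, a minimal sketch of the current constraint with toy loaders (names and data are made up):

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch.utilities import CombinedLoader

# Two toy loaders of different lengths.
loader_a = DataLoader(TensorDataset(torch.arange(4.0)), batch_size=2)
loader_b = DataLoader(TensorDataset(torch.arange(8.0)), batch_size=2)

# Accepted by trainer.fit today: dict batches, shorter loaders are cycled.
fit_loader = CombinedLoader({"a": loader_a, "b": loader_b}, mode="max_size_cycle")

# Accepted by trainer.validate/test/predict today: loaders run one after another.
eval_loader = CombinedLoader([loader_a, loader_b], mode="sequential")

# Swapping the two (e.g. passing eval_loader to trainer.fit) currently raises
# an error from the checks linked above.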

Pitch

Have all trainer functions support all modes

TODO:

Alternatives

Not do it

Additional context

This builds on top of https://github.com/Lightning-AI/lightning/pull/16726

cc @borda @justusschock @awaelchli

mees commented 1 year ago

I am migrating my code to PL 2, and it seems that getting a val batch of the form {"key_a": batch_dataloader_a, "key_b": batch_dataloader_b} from the val dataloader is not implemented in PL 2 yet. Here is my old code for reference:

# In PL 1.x, CombinedLoader lived in pytorch_lightning.trainer.supporters;
# the PL 2 import path is shown here.
from torch.utils.data import DataLoader
from lightning.pytorch.utilities import CombinedLoader

def val_dataloader(self):
    # One DataLoader per dataset; with "max_size_cycle", the combined loader
    # yields dict batches keyed the same way as self.val_datasets.
    val_dataloaders = {
        key: DataLoader(
            dataset,
            batch_size=dataset.batch_size,
            shuffle=False,
            num_workers=dataset.num_workers,
            pin_memory=False,
        )
        for key, dataset in self.val_datasets.items()
    }
    combined_val_loaders = CombinedLoader(val_dataloaders, "max_size_cycle")
    return combined_val_loaders
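
For context, a hypothetical validation_step consuming such a combined batch could look like this (compute_val_loss is an illustrative helper, not Lightning API):

def validation_step(self, batch, batch_idx):
    # With mode="max_size_cycle", `batch` arrives as a dict keyed like
    # val_dataloaders: {"key_a": batch_from_loader_a, "key_b": batch_from_loader_b}
    total = 0.0
    for key, sub_batch in batch.items():
        total = total + self.compute_val_loss(key, sub_batch)
    self.log("val_loss", total)
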
carmocca commented 1 year ago

@mees I added support for that in #17163, if you want to give it a try. The PR only implements it for validation and testing.

bkmi commented 1 year ago

> @mees I added support for that in #17163, if you want to give it a try. The PR only implements it for validation and testing.

really helpful! I hope this gets into "stable" soon.... or even the next release!

FarzanT commented 1 year ago

I really wish there was sequential support in the training loop. Right now, it's not clear how one should handle batches of potentially different sizes in training_step: we'd have to collate inside training_step and divide the configured batch size by the number of dataloaders to keep gradient accumulation consistent, and so on. It gets pretty ugly. @carmocca Thank you for your work on this issue. Not to rush you, but is there any update on sequential support in the training loop? Thanks again!
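
For illustration, that workaround looks roughly like this under mode="max_size_cycle" (compute_loss and the plain averaging are assumptions, not a recommendation):

import torch

def training_step(self, batch, batch_idx):
    # `batch` is a dict of per-dataloader batches that may differ in size.
    losses = [self.compute_loss(name, sub_batch) for name, sub_batch in batch.items()]
    # Combine the losses manually; keeping the effective batch size and
    # gradient accumulation consistent across loaders is the ugly part.
    return torch.stack(losses).mean()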

carmocca commented 1 year ago

Unfortunately, I don't have the bandwidth to work on this right now. If somebody wants to try, I can help get the PR merged. You can follow the structure in the EvaluationLoop. The training hooks will need an optional dataloader_idx argument.
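
For anyone picking this up, the hook change might look like the sketch below (nothing here is merged API; the optional argument simply mirrors the existing evaluation hooks):

def training_step(self, batch, batch_idx, dataloader_idx=0):
    # Under a hypothetical sequential mode for fit, each batch would come from
    # exactly one dataloader, identified by dataloader_idx, so no manual
    # collation across differently sized batches would be needed.
    return self.compute_loss(dataloader_idx, batch)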

surya-narayanan commented 1 year ago

> @mees I added support for that in #17163, if you want to give it a try. The PR only implements it for validation and testing.

> really helpful! I hope this gets into "stable" soon.... or even the next release!

Me too! Is there any release timeline / nightly version with this supported? I can't use lightning without this and really would like to leverage its other features!

spfrommer commented 1 year ago

Ditto! FYI for others: pulling the nightly will get you the feature: https://github.com/Lightning-AI/lightning/pull/17163
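
For reference, installing from the repository's master branch picked up unreleased changes like this one (assuming a source install is acceptable):

pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip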

chenhaomingbob commented 1 year ago

Thanks! I also need this great feature.

johnathanchiu commented 1 year ago

+1, please release this feature asap!

lukas-folle-snkeos commented 3 months ago

Is this feature currently worked on?

carmocca commented 3 months ago

As far as I know, nobody is currently working on it, Lukas.

astirn commented 1 month ago

I would also really like this feature. I use CombinedLoader to encapsulate modality-specific DataLoaders and recycle the modalities with less data than our largest modality. For this reason, I use CombinedLoader with "max_size_cycle"/"min_size" for train/validation, and would like to be able to do the same for predict. Thanks for considering it!
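
A minimal sketch of that setup (dataset names and sizes are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch.utilities import CombinedLoader

# Stand-ins for modality-specific datasets of unequal size.
image_ds = TensorDataset(torch.randn(1000, 3))
audio_ds = TensorDataset(torch.randn(100, 3))  # smaller modality, recycled

combined = CombinedLoader(
    {"image": DataLoader(image_ds, batch_size=32),
     "audio": DataLoader(audio_ds, batch_size=32)},
    mode="max_size_cycle",
)
# trainer.fit(model, combined) accepts this today, but
# trainer.predict(model, combined) does not: predict requires mode="sequential".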