Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Unable to access dataloaders within `configure_optimizers` #10430

Closed: rohitgr7 closed this issue 1 year ago

rohitgr7 commented 3 years ago

Proposed refactoring or deprecation

Before v1.5, the dataloader hooks were patched onto the model and were easily accessible within configure_optimizers, e.g. to set up the total number of training steps for a scheduler. But now that they are no longer patched, https://github.com/PyTorchLightning/pytorch-lightning/blob/0ed5e3dc8abcec40aacd64cc9175590bb1409759/pytorch_lightning/trainer/connectors/data_connector.py#L213-L224 they are no longer available directly if a datamodule is used or the dataloaders are passed directly to .fit. Nor can they be accessed via self.trainer.train_dataloaders, because the dataloaders are loaded inside the Fit Loop. https://github.com/PyTorchLightning/pytorch-lightning/blob/0ed5e3dc8abcec40aacd64cc9175590bb1409759/pytorch_lightning/loops/fit_loop.py#L194-L197
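
For illustration, here is a minimal sketch of the kind of pattern this affects (the model, scheduler, and hyperparameters are placeholders, not from the issue):

    import torch
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
            # Before v1.5 this call worked even when the loaders came from a datamodule
            # or were passed to .fit() directly, because the *_dataloader hooks were
            # patched onto the model; now it only works if the hook is defined on the
            # LightningModule itself.
            train_loader = self.train_dataloader()
            total_steps = len(train_loader) * self.trainer.max_epochs
            scheduler = torch.optim.lr_scheduler.OneCycleLR(
                optimizer, max_lr=0.1, total_steps=total_steps
            )
            return [optimizer], [{"scheduler": scheduler, "interval": "step"}]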

Motivation

I'd suggest these dataloaders should be available to users, no matter how they are passed to .fit.

Pitch

If possible, we should call configure_optimizers after loading the dataloaders for the first time within the fit loop. I'm not sure whether this will introduce complications or failures, because we load the optimizers differently for DeepSpeed.

Additional context

As always, alternative suggestions/thoughts would be appreciated :)

cc: @karthikrangasai @awaelchli


awaelchli commented 3 years ago

@rohitgr7 Good observation.

Yes, this was indeed a somewhat breaking change, and it was not fully intentional. However, one could also argue this was a bug, or at least undefined behavior. Previously, if we called train_dataloader(), e.g. through the trainer, it would still have led to the Trainer calling it a second time later on when it got reset, if I recall correctly.

For the purpose of documentation, here is the sequence of hooks we execute around configure_optimizers and *_dataloader calls:

https://github.com/PyTorchLightning/pytorch-lightning/blob/7fb277f260bf91137af1ac599ed2751bbf1d49e3/tests/models/test_hooks.py#L648-L663
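
Roughly (from memory; the linked test is the source of truth), the relevant portion of that order is:

    prepare_data
    setup("fit")
    configure_optimizers    # optimizers are created here, before any dataloader is requested
    on_fit_start
    ...                     # pretrain routine, sanity check, etc.
    train_dataloader()      # requested later, when the fit loop resets the dataloaders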

With this issue request, either the configure_optimizers call would have to move several hooks later or the dataloader reset call would have to move earlier.

What would make this configuration more favorable than the current one? Isn't it equally valid to request having the optimizers available in the dataloader methods? If both need to be the case, then this is hitting the fundamental limits of the hook system in Lightning. A similar issue was faced in #8485.

All I'm saying is, if the hook order changes then we need to be 100% sure that the order we choose is more favorable than the other in most cases, and we need to be aware of the limitations.

romovpa commented 3 years ago

Hi! @rohitgr7 @awaelchli I have a use case where I need to access the training data loader in configure_optimizers(), not vice versa.

I'm working on integrating Lightning with Opacus (it enables training with differential privacy). This pull request demonstrates how a PL user can add privacy by just wrapping the optimizer in configure_optimizers().

The problem is that we need to access the training data to configure noise generation and privacy accounting properly. We could pass some stats about the training data as model parameters, but this can easily lead to an inconsistent optimizer configuration and, as a result, incorrect privacy accounting.

My current solution touches _data_connector, which is a dirty hack:

    def configure_optimizers(self):
        optimizer = optim.SGD(self.parameters(), lr=self.lr, momentum=0)

        if self.enable_dp:
            # dirty introspection of the trainer instance to get the training data
            data_loader = (
                self.trainer._data_connector._train_dataloader_source.dataloader()
            )
            # transform (model, optimizer, dataloader) to DP-versions
            model, optimizer, dataloader = self.privacy_engine.make_private(
                self,
                optimizer,
                data_loader,
                ...
            )

        return optimizer

cc @ananthsub (re: our discussion about using TrainingTypePlugin: I think this way of enabling DP by modifying configure_optimizers can be useful for advanced users and DP researchers; I would prefer to have both options on the table)

rohitgr7 commented 3 years ago

> With this issue request, either the configure_optimizers call would have to move several hooks later or the dataloader reset call would have to move earlier.

I think moving configure_optimizers later would be better, since moving the dataloader initialization call outside the FitLoop is inconvenient; also, we only initialize optimizers during .fit, so moving this initialization inside the Fit Loop might be better. That said, it will require a lot of changes around sharded/DeepSpeed, since they do their initialization and wrapping in a different way. cc @SeanNaren

> Isn't it equally valid to request having the optimizers available in the dataloader methods?

Do we have any use case/example for it? @awaelchli

awaelchli commented 3 years ago

@rohitgr7 Would you like to try to invert the order in a draft? Failing tests should then reveal potential challenges we have not considered yet, if there are any.

rohitgr7 commented 3 years ago

yep! we do have some failing tests.

andreimargeloiu commented 2 years ago

Any update?

rohitgr7 commented 2 years ago

hey @andreimargeloiu !

the current workaround is to initialize the dataloader yourself, before Lightning does, using:

    def configure_optimizers(self):
        # Trigger Lightning's dataloader setup early, then grab the loaders from the trainer.
        self.trainer.reset_train_dataloader()
        loaders = self.trainer.train_dataloader.loaders  # access them here
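
Note that this relies on Trainer internals, reset_train_dataloader and the CombinedLoader.loaders attribute, so the workaround may stop working in future releases.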

2533245542 commented 2 years ago

A bit confused here. So in the current PyTorch Lightning version, if I want to access the train dataloader, for example in my model's forward method, do I do train_loader = self.trainer.datamodule.train_dataloader()?

Looks like if I do this in the model's def setup(self), the train dataloader will be loaded because it was not loaded before. But if I do this in the model's forward method, will the dataloader be reloaded, or is it just using a cached one?

Would be happy to see if there is a more elegant way of doing this.

LarsHill commented 1 year ago

The above workaround by @rohitgr7 does not seem to work anymore in pytorch-lightning >= 2.0.0, since the reset_train_dataloader() function no longer exists. I receive the error AttributeError: 'Trainer' object has no attribute 'reset_train_dataloader'.

Is there another workaround, or any plans to fix this? I think it is quite common to access the dataset in configure_optimizers in order to easily get data loading parameters like the batch size, the total number of samples, etc. This information is often necessary for initializing learning rate schedulers.

carmocca commented 1 year ago

In 2.0, you can do

    def configure_optimizers(self):
        self.trainer.fit_loop.setup_data()
        dataloader = self.trainer.train_dataloader

        ...

You could also call self.train_dataloader() if your LightningModule defines it.
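
For the original use case in this thread (sizing an LR scheduler from the training data), here is a minimal sketch of how that might look; the optimizer, scheduler, and hyperparameters are placeholders, and gradient accumulation / multi-device sharding are ignored for brevity:

    import torch
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        # ... layers, training_step, etc. ...

        def configure_optimizers(self):
            optimizer = torch.optim.AdamW(self.parameters(), lr=3e-4)
            # Ask the fit loop to build and attach the dataloaders now,
            # instead of waiting for them to be set up later.
            self.trainer.fit_loop.setup_data()
            dataloader = self.trainer.train_dataloader
            total_steps = len(dataloader) * self.trainer.max_epochs
            scheduler = torch.optim.lr_scheduler.LinearLR(
                optimizer, start_factor=1.0, end_factor=0.1, total_iters=total_steps
            )
            return [optimizer], [{"scheduler": scheduler, "interval": "step"}]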

carmocca commented 1 year ago

Closing, as I don't think there's anything actionable. Feel free to post further questions, though.