Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Clarity of the differences between `prepare_data` and `setup` of the `LightningDataModule` #11528

Closed · shabie closed this 2 years ago

shabie commented 2 years ago

📚 Documentation

The current explanation of prepare_data and setup seems to me a bit unsatisfactory.

While the docs do go somewhat into the technical differences (i.e. prepare_data runs as part of the main process while setup runs in each GPU process), they leave a lot to be desired.

Things that I think the docs ought to do:

  1. Provide a more realistic example, with an explanation, than single-word pseudo-code like tokenize() of why tokenization should be done as part of this process and not in the per-GPU ones.
  2. Explain why defining state (i.e. self.x = y) in prepare_data is a bad idea, since this is precisely what MLOps-Basics, a popular repo introducing people to MLOps, does. See here.
  3. In my opinion, it should go as far as providing some guidelines on how to tell where the common preprocessing steps belong, and the reasons for doing so.

Maybe I have exaggerated the need for explanation, but right now I feel lost with the existing docs.

cc @borda @rohitgr7

rohitgr7 commented 2 years ago

hey @shabie !

prepare_data and setup can both be used to configure anything as long as you are using a single-device strategy (no distributed training), but in multi-device settings it becomes a problem. That's why we recommend using them as described in the docs, so that even if you switch your Trainer settings to multi-device, no code change will be required.

Provide a more realistic example, with an explanation, than single-word pseudo-code like tokenize() of why tokenization should be done as part of this process and not in the per-GPU ones.

Because setup is called on each process/device, tokenizing the same data on every device just repeats the same work and is a waste of compute/time.
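
For illustration, here is a minimal sketch (a hypothetical DemoDataModule, assuming the pytorch_lightning import path) showing how often each hook runs:

    import pytorch_lightning as pl

    class DemoDataModule(pl.LightningDataModule):
        def prepare_data(self):
            # called once, in a single process: a safe place for downloads / one-off work
            print("prepare_data: runs once")

        def setup(self, stage=None):
            # called on every process/device: heavy work here gets repeated per device
            print(f"setup: runs on each process (stage={stage})")

On a single node with 2 GPUs under DDP, the setup message would print twice while the prepare_data message prints only once.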

Explain why defining state (i.e. self.x = y) in prepare_data is a bad idea, since this is precisely what MLOps-Basics, a popular repo introducing people to MLOps, does. See here.

you can ask the author of the repo or send a PR over there to update the code. It's slightly incorrect if you are using a multi-device setting. I have seen a few issues on that repo before and have asked him to fix them: https://github.com/graviraja/MLOps-Basics/issues?q=is%3Aissue+author%3Arohitgr7+is%3Aclosed

In my opinion, it should go as far as providing some guidelines on how to tell where the common preprocessing steps belong, and the reasons for doing so.

I think the docs are pretty clear there about why we recommend that:

Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures the prepare_data() is called only within a single process, so you can safely add your downloading logic within.

But if you think this section can be improved, feel free to send a PR with an improved version :)

shabie commented 2 years ago

Thanks a lot @rohitgr7! Your answer sheds some important light on the differences.

I think the room for clarity is definitely there. I'll wrap my head around it and make a PR, because I still feel that a new reader in a hurry to use the framework would end up unsure of their choices.

I'll keep the issue open for now if that's OK.

rohitgr7 commented 2 years ago

I still feel that a new reader in a hurry to use the framework would end up unsure of their choices.

totally valid point! feel free to send a PR anytime :)

ananthsub commented 2 years ago

Some other topics that the docs could clarify with prepare_data:

  * I've never seen prepare_data used in a production ML pipeline. Typically, the data is prepared before the training job even starts. Then all sorts of data checks are run, and only after these pass is a training job scheduled. Given prepare_data is optional to implement, I believe Lightning could be clearer around when it really should be implemented vs deferred to other systems entirely.

shabie commented 2 years ago

BTW, I am still not sure how the code should look, if not the way it was done in the MLOps-Basics repository.

The example in the docs you point to does the following:

Step 1) An explicit download step done in prepare_data (a throwaway class instantiation that triggers the download of MNIST into data_dir).
Step 2) The same MNIST dataset class is re-instantiated in setup, this time pointing to the downloaded folder, with the intention of storing state in self.mnist_train, self.mnist_val, etc.
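
Roughly, that pattern looks like the sketch below (simplified from memory; the exact transforms and split sizes in the docs may differ):

    import pytorch_lightning as pl
    from torch.utils.data import random_split
    from torchvision import transforms
    from torchvision.datasets import MNIST

    class MNISTDataModule(pl.LightningDataModule):
        def __init__(self, data_dir="./data"):
            super().__init__()
            self.data_dir = data_dir

        def prepare_data(self):
            # Step 1: throwaway instantiations just to trigger the download into data_dir
            MNIST(self.data_dir, train=True, download=True)
            MNIST(self.data_dir, train=False, download=True)

        def setup(self, stage=None):
            # Step 2: re-instantiate from the downloaded folder and store state on each process
            if stage == "fit" or stage is None:
                mnist_full = MNIST(self.data_dir, train=True, transform=transforms.ToTensor())
                self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])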

Now in the MLOps-Basics example, the data, on account of being small, is loaded directly into memory, and the downloading step is taken over by the datasets library. We could do the same, but it shouldn't be necessary.

So if I were to do this in an analogous way, I'd also call the load_dataset function in prepare_data only to let it download, without saving the result to any variable, and call it again in setup (without explicitly giving the download directory, since HF's datasets library will look in the usual places before downloading again), but this time using it to declare the variables (i.e. store state) containing the train and validation splits.

Edit: This still leaves open the question of when to do tokenization, since it is recommended to do it in prepare_data, and yet, since I am not maintaining state there, doing so would be entirely pointless.

shabie commented 2 years ago

I've never seen prepare_data used in a production ML pipeline. Typically, the data is prepared before the training job even starts. Then all sorts of data checks are run, and only after these pass is a training job scheduled. Given prepare_data is optional to implement, I believe Lightning could be clearer around when it really should be implemented vs deferred to other systems entirely

*wipes tears...* 😋 thank you @ananthsub!

rohitgr7 commented 2 years ago

@shabie

BTW, I am still not sure how the code should look, if not the way it was done in the MLOps-Basics repository.

something like, in an ideal case:

    from datasets import load_dataset  # HF `datasets` library

    def prepare_data(self):
        # download only; don't assign any state here
        load_dataset("glue", "cola")

    def setup(self, stage=None):
        # we set up only the relevant datasets when a stage is specified
        if stage == "fit" or stage is None:
            cola_dataset = load_dataset("glue", "cola")  # loads from the cache populated above
            self.train_data = cola_dataset["train"]
            self.val_data = cola_dataset["validation"]
            ...

this can be improved a little further: we can tokenize the data inside prepare_data itself and save the tokenized data to disk, and load it back inside setup.

    def prepare_data(self):
        cola_dataset = load_dataset(...)
        train_dataset = ...
        val_dataset = ...
        # tokenize
        # save it to disk

    def setup(self, stage=None):
        # load it back here
        ...

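Fleshed out a little, a sketch of that idea (not tested; it assumes HF datasets' save_to_disk/load_from_disk, a transformers tokenizer, and made-up model name and paths):

    from datasets import load_dataset, load_from_disk
    from transformers import AutoTokenizer

    def prepare_data(self):
        # runs once: download, tokenize, and persist to disk
        cola = load_dataset("glue", "cola")
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
        tokenized = cola.map(
            lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length"),
            batched=True,
        )
        tokenized.save_to_disk("data/cola_tokenized")  # placeholder path

    def setup(self, stage=None):
        # runs on every process: only load the already-tokenized data back
        tokenized = load_from_disk("data/cola_tokenized")
        if stage == "fit" or stage is None:
            self.train_data = tokenized["train"]
            self.val_data = tokenized["validation"]
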
shabie commented 2 years ago

Now I haven't gotten around to making the PR yet, but this thread provides far more clarity than is available in the docs regarding the differences between the two methods :)

For Lightning to give people the lightning effect in their work, the docs need to outshine everything else. Generally speaking, I think this is where Lightning needs to do a bit more work, and I find how well Transformers is documented a real inspiration. Sure, it is partly due to their simpler API and the more limited set of cases they cover, but they go in depth on each and every function call + parameter.

rohitgr7 commented 2 years ago

Now I haven't gotten around to making the PR yet, but this thread provides far more clarity than is available in the docs regarding the differences between the two methods :)

For Lightning to give people the lightning effect in their work, the docs need to outshine everything else. Generally speaking, I think this is where Lightning needs to do a bit more work, and I find how well Transformers is documented a real inspiration. Sure, it is partly due to their simpler API and the more limited set of cases they cover, but they go in depth on each and every function call + parameter.

thanks for the feedback!

yes, we are constantly improving our docs. But yeah, this section might need more clarification. We have covered the recommendation, but not explained in detail why it is recommended.

vitalwarley commented 2 years ago

Guys, here is how I did it for a classification task with labels from a .csv file.

Inspired by @rohitgr7's answer, in prepare_data I have

    def prepare_data(self):
        # load data
        ...
        # split data
        ...
        # save splits
        ...

and in setup

    def setup(self, stage: Optional[str] = None):
        if stage in (None, "fit"):
            train_arr = np.load(self.train_save_path, allow_pickle=True)
            val_arr = np.load(self.val_save_path, allow_pickle=True)
            self.train_ds = MS1MDataset(
                self.data_dir, transform=self.train_transform, seq=train_arr
            )
            self.val_ds = MS1MDataset(
                self.data_dir, transform=self.val_transform, seq=val_arr
            )

The main process will prepare (load, split, and save to disk) and set up the dataset (load the splits, instantiate MS1MDataset for each split). That is, the main process (with GPU 0) will do prepare_data, and every GPU will do setup. Each process (across devices or not) gets a different subset of the training/validation data via the DistributedSampler that PTL adds.
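
As a usage sketch (hypothetical names: MyModel and MS1MDataModule are placeholders for my LightningModule and the datamodule above; newer Lightning Trainer arguments assumed), switching to multi-GPU then needs no change in the datamodule itself:

    import pytorch_lightning as pl

    model = MyModel()                      # placeholder LightningModule
    dm = MS1MDataModule(data_dir="data/")  # placeholder datamodule wrapping the hooks above

    # single device: prepare_data and setup each run once
    trainer = pl.Trainer(accelerator="gpu", devices=1)

    # multi device (DDP): prepare_data still runs once, setup runs in every process,
    # and Lightning wraps the dataloaders with a DistributedSampler automatically
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
    trainer.fit(model, datamodule=dm)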

Please let me know if my understanding is right.

Thanks!

rohitgr7 commented 2 years ago

@vitalwarley !

yes, just one thing to clarify:

That is, the main process (with GPU 0) will do prepare_data

the main process isn't GPU 0, it's just the main process on the CPU.

adosar commented 7 months ago


Guys, here is how I did it for a classification task with labels from a .csv file.

Inspired by @rohitgr7's answer, in prepare_data I have

    def prepare_data(self):
        # load data
        ...
        # split data
        ...
        # save splits
        ...

and in setup

    def setup(self, stage: Optional[str] = None):
        if stage in (None, "fit"):
            train_arr = np.load(self.train_save_path, allow_pickle=True)
            val_arr = np.load(self.val_save_path, allow_pickle=True)
            self.train_ds = MS1MDataset(
                self.data_dir, transform=self.train_transform, seq=train_arr
            )
            self.val_ds = MS1MDataset(
                self.data_dir, transform=self.val_transform, seq=val_arr
            )

The main process will prepare (load, split, and save to disk) and set up the dataset (load the splits, instantiate MS1MDataset for each split). That is, the main process (with GPU 0) will do prepare_data, and every GPU will do setup. Each process (across devices or not) gets a different subset of the training/validation data via the DistributedSampler that PTL adds.

Please let me know if my understanding is right.

Thanks!

Could we move the loading of the arrays (np.load) inside __init__, to avoid loading the arrays multiple times for each stage?