hey @shabie!

`prepare_data` and `setup` can both be used to configure anything as long as you are using single-device strategies (no distributed training), but when it comes to multi-device settings, it will be a problem. That's why we recommend using them as described in the docs, to ensure that even if you change your Trainer settings to multi-device, no code change will be required.
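To make that concrete, here is a minimal sketch of the recommended split, modeled on the MNIST docs example discussed later in this thread (the class name, data path, and split sizes are illustrative assumptions, not something from this thread):

```python
import pytorch_lightning as pl
from torch.utils.data import random_split
from torchvision.datasets import MNIST


class MNISTDataModule(pl.LightningDataModule):
    def prepare_data(self):
        # runs on a single process only: download / write to disk,
        # but never assign to self.* here
        MNIST("data", train=True, download=True)

    def setup(self, stage=None):
        # runs on every process/device: safe to assign state here
        full = MNIST("data", train=True)
        self.mnist_train, self.mnist_val = random_split(full, [55000, 5000])
```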
> Provide a more realistic example with explanation than a single-word pseudo-code like `tokenize()` on why tokenization should be done as a part of this process and not the ones on GPU.
Because `setup` is called on each process/device, if you tokenize the same data on different devices you are just repeating the same work, which is a waste of compute/time.
> Explain why defining state (i.e. `self.x=y`) in `prepare_data` is a bad idea since this is precisely what a popular repo called MLOps-Basics introducing people to MLOps is doing. See here.
You can ask the author of the repo or send a PR over there to update the code. It's slightly incorrect if you are using a multi-device setting. I have seen a few issues before on this repo and have asked him to fix them: https://github.com/graviraja/MLOps-Basics/issues?q=is%3Aissue+author%3Arohitgr7+is%3Aclosed
> It should, in my opinion, go as far as to provide some guidelines on how to tell where the common preprocessing steps belong and the reason for doing so.
I think it's pretty clear over there why we recommend that:
> Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures the `prepare_data()` is called only within a single process, so you can safely add your downloading logic within.
But if you think this section can be improved, feel free to send a PR with an improved version :)
Thanks a lot @rohitgr7! Your answer sheds some important light on the differences.
I think the room for clarity is definitely there. I'll wrap my head around it and make a PR because I still feel that a new reader in a hurry to use the framework would still end up being unsure of his choices.
I'll keep the issue open for now if that's OK.
> I still feel that a new reader in a hurry to use the framework would still end up being unsure of his choices.
totally valid point! feel free to send a PR anytime :)
Some other topics that the docs could clarify with `prepare_data`:

- The time `prepare_data` can take is restricted to the collective timeout. That's because we add a barrier after `prepare_data` to ensure all processes proceed to `setup` only after the data is actually prepared, which means super-expensive data processing pipelines (e.g. more than 30 minutes) cannot run in this step.
- `prepare_data_per_node`, which points out that we don't always have a single process globally that downloads the data. Sometimes we have one process per node.
- I've never seen `prepare_data` used in a production ML pipeline. Typically, the data is prepared before the training job even starts. Then all sorts of data checks are run, and only after these pass is a training job scheduled. Given `prepare_data` is optional to implement, I believe Lightning could be clearer around when it really should be implemented vs deferred to other systems entirely.

BTW, I am still not sure how the code should be if not like how it was done in the MLOps-Basics repository.
The example the docs point to does the following:

Step 1) An explicit download step done in `prepare_data` (using a throwaway class initialization that triggers the downloading of MNIST, which is stored in the `data_dir`).

Step 2) The same MNIST dataset class is reinitialized in `setup`, this time pointing to the downloaded folder, with the intention of storing state in `self.mnist_train`, `self.mnist_val`, etc.
Now in the MLOps-Basics example, the data, on account of being small, is loaded directly into memory, and the downloading step is taken over by the `datasets` library. We could do the same, but it shouldn't be necessary.
So if I were to do this in an analogous way, I'd also call the `load_dataset` function only to let it download, without saving the result to any variable, and then call it again in `setup` (without explicitly giving the download directory, since HF's `datasets` library will look in familiar places before downloading again), but this time using it to declare variables (i.e. storing state) containing the train and validation splits.
Edit: This still leaves open the question of when to do tokenization, since it is recommended to do it in `prepare_data`, and yet, since I am not maintaining state there, doing so would be entirely pointless.
> I've never seen `prepare_data` used in a production ML pipeline. Typically, the data is prepared before the training job even starts. Then all sorts of data checks are run, and only after these pass is a training job scheduled. Given `prepare_data` is optional to implement, I believe Lightning could be clearer around when it really should be implemented vs deferred to other systems entirely.
*wipes tears...* 😋 thank you @ananthsub!
@shabie

> BTW, I am still not sure how the code should be if not like how it was done in the MLOps-Basics repository.
something like, in an ideal case:
```python
from datasets import load_dataset  # HF datasets library


def prepare_data(self):
    # download only; don't set any state here
    load_dataset("glue", "cola")

def setup(self, stage=None):
    # we set up only the relevant datasets when a stage is specified
    if stage == "fit" or stage is None:
        cola_dataset = load_dataset("glue", "cola")
        self.train_data = cola_dataset["train"]
        self.val_data = cola_dataset["validation"]
    ...
```
This can be improved a little further: we can tokenize the data inside `prepare_data` itself and save the tokenized data to disk, then load it back inside `setup`.
```python
def prepare_data(self):
    cola_dataset = load_dataset(...)
    train_dataset = ...
    val_dataset = ...
    # tokenize
    # save it to disk

def setup(self, stage=None):
    # load it back here
    ...
```
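To make that improved flow concrete, here is one hedged way it could look using HF `datasets` and `transformers` (the model name, save path, and the mapping over the `sentence` column are illustrative assumptions, not something prescribed by the thread):

```python
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer


def prepare_data(self):
    # runs once on the main process: download, tokenize, persist; no self.* state
    cola_dataset = load_dataset("glue", "cola")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
    tokenized = cola_dataset.map(
        lambda batch: tokenizer(batch["sentence"], truncation=True),
        batched=True,
    )
    tokenized.save_to_disk("data/cola_tokenized")  # assumed path

def setup(self, stage=None):
    # runs on every process: just load the prepared data back and assign state
    tokenized = load_from_disk("data/cola_tokenized")
    self.train_data = tokenized["train"]
    self.val_data = tokenized["validation"]
```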
Now I haven't gotten around to making the PR yet, but this thread provides far more clarity than is available in the docs regarding the differences between the two methods :)

For Lightning to give people the lightning effect in their work, the docs need to outshine everything else. Generally speaking, I think this is where Lightning needs to do a bit more work, and I find how well Transformers is documented a real inspiration. Sure, it is partly due to their simpler API, given the more limited range of cases they cover, but they go in depth on each and every function call + parameter.
thanks for the feedback!

Yes, we are constantly improving our docs, but this section might need more clarification. We have covered the recommendation, but not, in enough detail, why it is recommended.
Guys, here is how I did it for a classification task with labels from a `.csv` file.

Inspired by @rohitgr7's answer, in `prepare_data` I have:
```python
def prepare_data(self):
    # load data
    ...
    # split data
    ...
    # save splits
    ...
```
and in `setup`:
```python
def setup(self, stage: Optional[str] = None):
    if stage in (None, "fit"):
        # load the saved splits back and build a dataset from each
        train_arr = np.load(self.train_save_path, allow_pickle=True)
        val_arr = np.load(self.val_save_path, allow_pickle=True)
        self.train_ds = MS1MDataset(
            self.data_dir, transform=self.train_transform, seq=train_arr
        )
        self.val_ds = MS1MDataset(
            self.data_dir, transform=self.val_transform, seq=val_arr
        )
```
The main process will prepare the data (load, split, and save to disk) and set up the dataset (load the splits and instantiate `MS1MDataset` for each split). That is, the main process (with GPU 0) will do `prepare_data`, and every GPU will do `setup`. Each process (across devices or not) gets a different subset of the training/validation data with PTL via `DistributedSampler` (see the sketch below).

Please let me know if my understanding is right. Thanks!
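As a side note on that last point, here is a minimal sketch of how the sampler part plays out (the batch size and worker count are illustrative assumptions): with a DDP strategy, Lightning automatically wraps a plain `DataLoader` like this in a `DistributedSampler`, so each rank draws a distinct shard of the data.

```python
from torch.utils.data import DataLoader


def train_dataloader(self):
    # no sampler given: under DDP, Lightning swaps in a DistributedSampler
    # (shuffling is preserved through the sampler), so each process
    # sees its own subset of self.train_ds
    return DataLoader(self.train_ds, batch_size=64, num_workers=4, shuffle=True)
```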
@vitalwarley!

Yes, just one thing to clarify:

> That is, the main process (with GPU 0) will do `prepare_data`

The main process isn't GPU 0 but just the main process on the CPU.
Could we move the loading of arrays (`np.load*`) inside `__init__`, to avoid loading the arrays multiple times for each `stage`?
📚 Documentation

The current explanation of `prepare_data` and `setup` seems to me a bit unsatisfactory. While they do go somewhat into the technical differences (i.e. `prepare_data` runs as a part of the main process while `setup` runs on each GPU process), it leaves a lot to be desired.

Things that I think the docs ought to do:

- Provide a more realistic example with explanation than a single-word pseudo-code like `tokenize()` on why tokenization should be done as a part of this process and not the ones on GPU.
- Explain why defining state (i.e. `self.x=y`) in `prepare_data` is a bad idea, since this is precisely what a popular repo called MLOps-Basics, introducing people to MLOps, is doing. See here.
- It should, in my opinion, go as far as to provide some guidelines on how to tell where the common preprocessing steps belong and the reason for doing so.

Maybe I have exaggerated the need for explanation, but I feel right now lost with the existing docs.

cc @borda @rohitgr7