Open cnellington opened 2 years ago
We have decided to close the multi-process data loading issue for now because it was found that, for iterable-style datasets, using multiple workers does not improve training over its single-process counterpart. We suspect that this occurs because of the worker instantiation slowdown.
When `num_workers > 0`, each time a `DataLoader`'s iterator is created (i.e. at the beginning of `for batch in dataloader:`), `num_workers` worker processes are instantiated. Moreover, the `dataset`, `collate_fn`, and `worker_init_fn` are passed to each worker. When the whole dataset has been consumed, the data loader shuts down the workers. In the following epoch, a new set of workers is created, and so forth.
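To make this concrete, here is a minimal sketch of an iterable-style dataset used with `num_workers > 0`. The `StreamDataset` class and the `worker_init_fn` below are hypothetical illustrations, not part of this repo; they just show that each worker process receives its own copy of the dataset and that `worker_init_fn` runs in every freshly spawned worker, i.e. once per epoch unless `persistent_workers=True`.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class StreamDataset(IterableDataset):  # hypothetical dataset, for illustration only
    def __init__(self, n=1000):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            start, step = 0, 1                        # single-process loading (num_workers=0)
        else:
            start, step = info.id, info.num_workers   # each worker has its own copy; shard to avoid duplicates
        for i in range(start, self.n, step):
            yield torch.tensor([float(i)])

def worker_init_fn(worker_id):
    # Runs once in every freshly spawned worker process, i.e. at the start of
    # *each* epoch unless persistent_workers=True.
    print(f"worker {worker_id} initialized")

loader = DataLoader(StreamDataset(), batch_size=32,
                    num_workers=2, worker_init_fn=worker_init_fn)

for epoch in range(2):
    # A new iterator is created here, so 2 workers are (re)instantiated per epoch.
    for batch in loader:
        pass
```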
The argument `persistent_workers` in `DataLoader`, if set to `True`, prevents the data loader from killing the workers after the dataset has been consumed (i.e. after each epoch). So, I used this property to validate my claim: I trained the ContextualizedRegression model for 1, 2, 3, and 4 epochs. The results can be summarized as follows:
- With `num_workers=1`, we observe a significant training slowdown over `num_workers=0` (i.e. iterations/second in the former decreases significantly, so it slows down the training).
- Models with `persistent_workers=True` train faster than the ones with `persistent_workers=False`. Moreover, the former has comparable iterations/second to the single-process model.
- Given that, one might expect the overall training time to improve with `persistent_workers=True`. However, that is not the case, because the first worker instantiation (i.e. when the first `for batch in dataloader:` is run) is very slow, so the overall training time still does not improve over the single-process models (see the timing sketch after this list).
- Here is a table with the details: https://docs.google.com/document/d/16jgHfB0o_QflFug02cz5pWI-lhavwEpLoaKctF_5f-w/edit?usp=sharing
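For reference, a rough way to reproduce this timing pattern (not the exact ContextualizedRegression benchmark) is to time full passes over the loader for each configuration. The `time_epochs` helper below is an assumption for illustration and reuses the hypothetical `StreamDataset` sketched above; the first epoch with workers includes their instantiation cost, and `persistent_workers=True` only pays that cost once.

```python
import time
from torch.utils.data import DataLoader

def time_epochs(dataset, epochs=4, **loader_kwargs):
    """Rough per-epoch timing of iterating a DataLoader with the given config."""
    loader = DataLoader(dataset, batch_size=32, **loader_kwargs)
    times = []
    for _ in range(epochs):
        start = time.perf_counter()
        for _ in loader:   # iterator creation (and worker spawn, if any) is included
            pass
        times.append(time.perf_counter() - start)
    return times

# Using the StreamDataset sketched above:
print(time_epochs(StreamDataset(), num_workers=0))                            # single-process baseline
print(time_epochs(StreamDataset(), num_workers=1))                            # workers respawned every epoch
print(time_epochs(StreamDataset(), num_workers=1, persistent_workers=True))   # only epoch 0 pays the spawn cost
```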
Thanks @juwaynilucman, great work digging into this. It seems that the most important info here is that even with persistent workers (no instantiation time in epochs > 1) the iterations per second is similar or slightly lower when using distributed data loaders. Combined with the extra instantiation time for distributed data loaders on epoch 0, there doesn't seem to be any use-case where distributed data loaders are desirable right now. Hopefully we can revisit this in the future, but for now there seem to be many well-documented examples online confirming your results on the slowness of distributed data loaders in most cases.
Make the torch `IterableDataset` `num_workers` arg work with the current data iterators