cnellington / Contextualized

An SKLearn-style toolbox for estimating and analyzing models, distributions, and functions with context-specific parameters.
http://contextualized.ml/
GNU General Public License v3.0
65 stars · 9 forks

Enable distributed data loading on vCPUs and GPUs #62

Open cnellington opened 2 years ago

cnellington commented 2 years ago

Make the torch `IterableDataset` `num_workers` argument work with the current data iterators.
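For context, a minimal sketch of what multi-worker iteration over an `IterableDataset` requires: unlike map-style datasets, each worker receives a full copy of the iterator, so the dataset must shard itself via `torch.utils.data.get_worker_info` to avoid duplicated samples. The `ShardedIterable` class below is a hypothetical toy, not Contextualized's actual data iterator.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class ShardedIterable(IterableDataset):
    """Toy iterable dataset that splits its index range across DataLoader workers."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield everything.
            start, step = 0, 1
        else:
            # Each worker yields a disjoint, strided slice of the data.
            start, step = info.id, info.num_workers
        for i in range(start, self.n, step):
            yield torch.tensor(i)


# batch_size=None disables automatic batching for this per-sample iterator.
loader = DataLoader(ShardedIterable(8), batch_size=None, num_workers=0)
print(sorted(int(x) for x in loader))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

With `num_workers > 0`, every sample is still produced exactly once because the strided slices are disjoint; without the `get_worker_info` branch, each worker would emit the whole dataset.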

juwaynilucman commented 2 years ago

We have decided to close the multi-process data loading issue for now, because we found that, for iterable-style datasets, using multiple workers does not improve training speed over single-process loading. We suspect this is due to the overhead of instantiating the workers.

When `num_workers > 0`, each time a DataLoader's iterator is created (i.e., at the start of `for batch in dataloader:`), `num_workers` worker processes are instantiated, and the dataset, `collate_fn`, and `worker_init_fn` are passed to each worker. When the dataset is fully consumed, the data loader shuts the workers down. In the following epoch, a new set of workers is created, and so on.
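This lifecycle can be observed directly: with the default `persistent_workers=False`, the `worker_init_fn` below runs once per worker on every epoch, because the workers are torn down and re-spawned each time the loader is iterated. This is an illustrative sketch, not code from the repository.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def init_fn(worker_id):
    # With persistent_workers=False (the default), this runs once per
    # worker on *every* epoch, because workers are re-spawned each time.
    print(f"worker {worker_id} started")


ds = TensorDataset(torch.arange(8).float())
loader = DataLoader(ds, batch_size=4, num_workers=2, worker_init_fn=init_fn)

n_batches = 0
for epoch in range(2):
    # Workers are spawned here and shut down when the loop finishes.
    for (batch,) in loader:
        n_batches += 1
# 8 samples / batch_size 4 = 2 batches per epoch, over 2 epochs.
print(n_batches)  # -> 4
```

The repeated spawn/teardown is the "worker instantiation slowdown" suspected above: for short epochs, it can dominate the time actually spent loading data.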

The `persistent_workers` argument of `DataLoader`, if set to `True`, prevents the data loader from killing the workers after the dataset has been consumed (i.e., after each epoch). I used this property to validate my claim: I trained the `ContextualizedRegression` model for 1, 2, 3, and 4 epochs. The results can be summarized as follows:

Here is a table with the details: https://docs.google.com/document/d/16jgHfB0o_QflFug02cz5pWI-lhavwEpLoaKctF_5f-w/edit?usp=sharing
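The shape of that comparison can be sketched as below: time multiple epochs with `persistent_workers` off and on, so that the one-time worker spawn cost is isolated from the steady-state iteration speed. The dataset, sizes, and timing loop here are placeholders, not the benchmark from the linked document.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the real training set.
ds = TensorDataset(torch.randn(256, 4))

for persistent in (False, True):
    loader = DataLoader(
        ds,
        batch_size=32,
        num_workers=2,
        persistent_workers=persistent,
    )
    start = time.perf_counter()
    for epoch in range(3):
        # With persistent_workers=True, workers are spawned only on the
        # first epoch and reused afterwards; with False, every epoch
        # pays the spawn cost again.
        for (batch,) in loader:
            pass
    elapsed = time.perf_counter() - start
    print(f"persistent_workers={persistent}: {elapsed:.2f}s")
```

If the claim holds, the persistent variant should only recover the spawn overhead of epochs > 1, while per-iteration throughput stays roughly the same as single-process loading.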

cnellington commented 2 years ago

Thanks @juwaynilucman, great work digging into this. The key finding is that even with persistent workers (i.e., no instantiation cost in epochs > 1), the iterations per second are similar to or slightly lower than single-process loading. Combined with the extra instantiation time for distributed data loaders on epoch 0, there doesn't seem to be any use case where distributed data loaders are desirable right now. Hopefully we can revisit this in the future, but for now there are many well-documented reports online confirming your results on the slowness of distributed data loaders in most cases.