Open cnellington opened 2 years ago
We have decided to close the multi-process data loading issue for now because it was found that, for iterable-style datasets, using multiple workers does not improve training over its single-process counterpart. We suspect that this occurs because of the worker instantiation slowdown.
When `num_workers > 0`, each time a `DataLoader`'s iterator is created (i.e. at the beginning of `for batch in dataloader:`), `num_workers` worker processes are instantiated. Moreover, the `dataset`, `collate_fn`, and `worker_init_fn` are passed to each worker. When the whole dataset has been consumed, the data loader shuts down the workers. In the following epoch, a new set of workers is created, and so forth.
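To make this concrete, here is a minimal sketch of an iterable-style dataset used with `num_workers > 0`. The `StreamDataset` class and the `worker_init_fn` below are hypothetical illustrations, not part of this repo; they just show that each worker process receives its own copy of the dataset and that `worker_init_fn` runs in every freshly spawned worker, i.e. once per epoch unless `persistent_workers=True`.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class StreamDataset(IterableDataset):  # hypothetical dataset, for illustration only
    def __init__(self, n=1000):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            start, step = 0, 1                        # single-process loading (num_workers=0)
        else:
            start, step = info.id, info.num_workers   # each worker has its own copy; shard to avoid duplicates
        for i in range(start, self.n, step):
            yield torch.tensor([float(i)])

def worker_init_fn(worker_id):
    # Runs once in every freshly spawned worker process, i.e. at the start of
    # *each* epoch unless persistent_workers=True.
    print(f"worker {worker_id} initialized")

loader = DataLoader(StreamDataset(), batch_size=32,
                    num_workers=2, worker_init_fn=worker_init_fn)

for epoch in range(2):
    # A new iterator is created here, so 2 workers are (re)instantiated per epoch.
    for batch in loader:
        pass
```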
The argument `persistent_workers` in `DataLoader`, if set to `True`, prevents the data loader from killing the workers after the dataset has been consumed (i.e. after each epoch). So, I used this property to validate my claim: I trained the ContextualizedRegression model for 1, 2, 3, and 4 epochs. The results can be summarized as follows:
- With `num_workers=1`, we observe a significant training slowdown over `num_workers=0` (i.e. iterations/second in the former decreases significantly, so it slows down the training).
- Models with `persistent_workers=True` train faster than the ones with `persistent_workers=False`. Moreover, the former has comparable iterations/second to the single-process model.
- Given that, one might expect the overall training time to improve with `persistent_workers=True`. However, that is not the case, because the first worker instantiation (i.e. when the first `for batch in dataloader:` is run) is very slow, so the overall training time still does not improve over the single-process models (see the timing sketch after this list).
- Here is a table with the details: https://docs.google.com/document/d/16jgHfB0o_QflFug02cz5pWI-lhavwEpLoaKctF_5f-w/edit?usp=sharing
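For reference, a rough way to reproduce this timing pattern (not the exact ContextualizedRegression benchmark) is to time full passes over the loader for each configuration. The `time_epochs` helper below is an assumption for illustration and reuses the hypothetical `StreamDataset` sketched above; the first epoch with workers includes their instantiation cost, and `persistent_workers=True` only pays that cost once.

```python
import time
from torch.utils.data import DataLoader

def time_epochs(dataset, epochs=4, **loader_kwargs):
    """Rough per-epoch timing of iterating a DataLoader with the given config."""
    loader = DataLoader(dataset, batch_size=32, **loader_kwargs)
    times = []
    for _ in range(epochs):
        start = time.perf_counter()
        for _ in loader:   # iterator creation (and worker spawn, if any) is included
            pass
        times.append(time.perf_counter() - start)
    return times

# Using the StreamDataset sketched above:
print(time_epochs(StreamDataset(), num_workers=0))                            # single-process baseline
print(time_epochs(StreamDataset(), num_workers=1))                            # workers respawned every epoch
print(time_epochs(StreamDataset(), num_workers=1, persistent_workers=True))   # only epoch 0 pays the spawn cost
```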
Thanks @juwaynilucman, great work digging into this. It seems that the most important info here is that even with persistent workers (no instantiation time in epochs > 1) the iterations per second is similar or slightly lower when using distributed data loaders. Combined with the extra instantiation time for distributed data loaders on epoch 0, there doesn't seem to be any use-case where distributed data loaders are desirable right now. Hopefully we can revisit this in the future, but for now there seem to be many well-documented examples online confirming your results on the slowness of distributed data loaders in most cases.
Make the torch `IterableDataset` `num_workers` arg work with the current data iterators