NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Multi-GPU training with Tensorflow Distributed #774

Open viswa-nvidia opened 3 years ago

karlhigley commented 3 years ago

According to the TF docs for tf.data.Dataset, the main option for converting from KerasSequenceLoader to a Dataset compatible with tf.distribute is Dataset.from_generator(), which comes with some significant caveats:

Note: The current implementation of Dataset.from_generator() uses tf.numpy_function and inherits the same constraints. In particular, it requires the dataset and iterator related operations to be placed on a device in the same process as the Python program that called Dataset.from_generator(). The body of generator will not be serialized in a GraphDef, and you should not use this method if you need to serialize your model and restore it in a different environment.
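For concreteness, a minimal sketch of what that conversion might look like is below. The loader object and its output structure (a dict of feature tensors plus a label tensor) are assumptions for illustration; tf.data.Dataset.from_generator() itself is standard TensorFlow.

```python
import tensorflow as tf

# Hedged sketch: expose an existing batch iterator (e.g. a KerasSequenceLoader)
# as a tf.data.Dataset via from_generator(). The shape/dtype of what the loader
# yields is assumed, not taken from NVTabular.

def to_tf_dataset(loader, feature_signature, label_signature):
    """Wrap `loader` in a tf.data.Dataset.

    feature_signature / label_signature are tf.TensorSpec structures describing
    the batches the loader yields, e.g.
        {"col_a": tf.TensorSpec([None], tf.float32)}, tf.TensorSpec([None, 1], tf.float32)
    """
    return tf.data.Dataset.from_generator(
        lambda: iter(loader),  # re-create the iterator each time the dataset is consumed
        output_signature=(feature_signature, label_signature),
    )
```

Even wrapped this way, the quoted constraints still apply: the generator runs in the same process as the Python program and isn't serialized into the GraphDef.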

See also the tf.keras.utils.Sequence docs:

Notes: Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

There's also the issue that tf.keras.utils.Sequence uses multiple processes for parallelism (as opposed to tf.data.Dataset's thread-based parallelism), which works better when input processing is handled by a non-TF library like NVTabular, because it sidesteps contention for Python's Global Interpreter Lock.
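To make that distinction concrete, here is a rough, illustrative sketch of the two feeding styles (toy code, not NVTabular's implementation): a Sequence whose __getitem__ can be fanned out to worker processes, versus a tf.data pipeline whose Python preprocessing runs on threads under the GIL.

```python
import numpy as np
import tensorflow as tf

# (1) Process-based parallelism: Keras can call __getitem__ from multiple worker
#     processes (workers=N, use_multiprocessing=True in model.fit), so Python-level
#     preprocessing isn't serialized by the GIL.
class ToyBatches(tf.keras.utils.Sequence):
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        # arbitrary Python / non-TF preprocessing would happen here
        x = np.random.rand(32, 16).astype("float32")
        y = np.zeros((32, 1), dtype="float32")
        return x, y

# model.fit(ToyBatches(), workers=4, use_multiprocessing=True)

# (2) Thread-based parallelism: tf.data runs map() calls on a thread pool, so any
#     Python code wrapped in tf.py_function still contends for the GIL.
ds = tf.data.Dataset.range(100).map(
    lambda i: tf.py_function(lambda j: j * 2, [i], tf.int64),
    num_parallel_calls=tf.data.AUTOTUNE,
)
```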

karlhigley commented 3 years ago

Additionally, (unless there have been significant changes in the last year) sharding with TF Datasets looks pretty bad:

[Screenshot from 2021-05-03: sharding with TF Datasets]

karlhigley commented 3 years ago

Taking all that into account, I'm starting to think we should drop this from our roadmap and just make Horovod the standard for multi-GPU training with NVTabular (as it is for other NVIDIA libraries). That would likely also imply dropping PyTorch Distributed support (#775), since it wouldn't make much sense to support native distributed training on one framework and not the other when we already have an approach that covers both frameworks.
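For reference, the Horovod route would follow the standard Horovod Keras recipe, roughly as sketched below (a minimal sketch; the data-loading and per-worker sharding side is left abstract, and nothing here is NVTabular-specific):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Standard Horovod + Keras setup: one process per GPU, launched with horovodrun.
hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers via allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [
    # Broadcast initial variables from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# `train_data` would be each worker's shard of the dataset.
# model.fit(train_data, epochs=1, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```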