viswa-nvidia opened this issue 3 years ago
Additionally (unless there have been significant changes in the last year), sharding with TF Datasets looks pretty bad:
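(Not from the original comment: a minimal sketch of what element-level sharding with `tf.data` looks like, to illustrate the concern above. The worker counts and the toy pipeline are made up; the point is that each worker still runs the full upstream pipeline and keeps only its slice unless sharding can happen at the file level.)

```python
import tensorflow as tf

# Hypothetical worker topology (illustrative only).
num_workers, worker_index = 4, 0

# Stand-in for per-element preprocessing; with shard() placed after it,
# every worker executes this map over the whole dataset and then discards
# all but every num_workers-th element.
full = tf.data.Dataset.range(1_000_000).map(lambda x: x * 2)
shard = full.shard(num_shards=num_workers, index=worker_index)

print(list(shard.take(3).as_numpy_iterator()))
```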
Taking all that into account, I'm starting to think we should drop this from our roadmap and just make Horovod the standard for multi-GPU training with NVTabular (as it is for other NVIDIA libraries). That would likely also imply dropping PyTorch Distributed support (#775), since it wouldn't make much sense to support native distributed training in one framework and not the other when we already have an approach that covers both frameworks.
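(For reference, a sketch of what the Horovod route looks like with the existing data loader. This is not from the issue; it assumes the `nvtabular.loader.tensorflow.KerasSequenceLoader` import path and its `global_size`/`global_rank` arguments, and the paths, column names, and hyperparameters are hypothetical. Details may differ between NVTabular versions.)

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf
from nvtabular.loader.tensorflow import KerasSequenceLoader

hvd.init()

# Pin each Horovod process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Each worker reads its own slice of the data; rank-aware sharding is handled
# by the loader rather than by tf.data / tf.distribute.
train_loader = KerasSequenceLoader(
    "train/*.parquet",              # hypothetical path
    engine="parquet",
    batch_size=65536,
    label_names=["label"],          # hypothetical columns below
    cat_names=["cat_0"],
    cont_names=["cont_0"],
    shuffle=True,
    global_size=hvd.size(),
    global_rank=hvd.rank(),
)

# Minimal toy model that accepts the loader's dict-of-features batches.
inputs = {
    "cont_0": tf.keras.Input(name="cont_0", shape=(1,), dtype=tf.float32),
    "cat_0": tf.keras.Input(name="cat_0", shape=(1,), dtype=tf.int64),
}
emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(1000, 8)(inputs["cat_0"]))
x = tf.keras.layers.Concatenate()([inputs["cont_0"], emb])
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, out)

# Standard Horovod wiring: wrap the optimizer and broadcast initial weights.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="binary_crossentropy")
model.fit(
    train_loader,
    epochs=1,
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
)
```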
According to the TF docs for `tf.data.Dataset`, the main possibility for converting from `KerasSequenceLoader` to a `Dataset` compatible with `tf.distribute` is `Dataset.from_generator()`, which comes with some significant caveats:

See also the `tf.keras.utils.Sequence` docs:
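(A rough sketch of the `from_generator()` route described above; the generator here is a stand-in for iterating a `KerasSequenceLoader`-style batch loader, and the shapes and column names are made up. Per the TF documentation, the generator body runs as a Python callback, so the resulting dataset is bound to the process that created it, is subject to the GIL, and cannot be serialized like a native `tf.data` pipeline.)

```python
import numpy as np
import tensorflow as tf

def batch_generator():
    # Stand-in for pulling batches from an existing Python-side loader.
    for _ in range(100):
        features = {"cont_0": np.random.rand(1024, 1).astype("float32")}
        labels = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
        yield features, labels

# output_signature must describe the nested structure the generator yields.
dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        {"cont_0": tf.TensorSpec(shape=(None, 1), dtype=tf.float32)},
        tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
    ),
)
```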
There's also the issue that `tf.keras.utils.Sequence` uses multiple processes for parallelism, as opposed to `tf.data.Dataset`'s thread-based parallelism. Process-based parallelism works better when input processing is handled by a non-TF library like NVTabular, because it sidesteps contention for Python's Global Interpreter Lock.