NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[RMP] Refine Multi-GPU Data Parallel training for TensorFlow in Merlin Models #752

Open bschifferer opened 1 year ago

bschifferer commented 1 year ago

Problem:

In [#536](https://github.com/NVIDIA-Merlin/Merlin/issues/536), we added Horovod support to Merlin Models, along with features that automate the process on the Merlin Models side. However, the current functionality is not fully user-friendly, and there are still open questions about how a user can run multi-GPU data parallel training.

Goal:

Constraints:

Starting Point:

bschifferer commented 1 year ago

@EvenOldridge @viswa-nvidia - as we discussed, I created a follow-up roadmap ticket for the multi-GPU data parallel feature.