NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
[RMP] Refine Multi-GPU Data Parallel training for Tensorflow in Merlin Models #752
Problem:
In [#536](https://github.com/NVIDIA-Merlin/Merlin/issues/536), we added Horovod functionality to Merlin Models, including features that automate the process on the Merlin Models side. However, the current feature is not fully user-friendly, and there are still open questions about how a user should run multi-GPU data-parallel training.
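For context, a minimal sketch of what Horovod data-parallel training with Merlin Models roughly looks like today is below. It follows the Keras-style training loop from the Merlin Models examples; the dataset path, hyperparameters, and names such as `mm.DLRMModel`, `mm.Loader`, and `mm.BinaryClassificationTask` are illustrative and may not match the current API exactly.

```python
# Minimal sketch: Horovod data-parallel training with Merlin Models.
# Launch with e.g.: horovodrun -np 2 python train.py
# Paths and hyperparameters below are hypothetical.
import horovod.tensorflow.keras as hvd
import tensorflow as tf
import merlin.models.tf as mm
from merlin.io import Dataset

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

train = Dataset("train/*.parquet")  # hypothetical path

model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),
)

# Wrap the optimizer so gradients are all-reduced across workers, and
# scale the learning rate by the worker count (linear-scaling heuristic).
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.Adagrad(learning_rate=0.01 * hvd.size())
)
model.compile(optimizer=opt, metrics=[tf.keras.metrics.AUC()])

loader = mm.Loader(train, batch_size=65_536, shuffle=True)
model.fit(
    loader,
    epochs=1,
    # Broadcast worker 0's initial weights so all workers start identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
)
```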
Goal:
Improve the user experience of multi-GPU data-parallel training
Test multi-GPU data-parallel training: does AUC match single-GPU training, and how well does throughput scale up?
Constraints:
I am not sure whether the issue of unequal batch sizes in the data loader is solved: https://github.com/NVIDIA-Merlin/dataloader/issues/75:
-- If the solution is to generate the data correctly up front, how does that work?
-- How do we ensure it with NVTabular?
-- How about users who do NOT use NVTabular?
The unit test is written so that each worker runs through the FULL dataset per epoch, which is incorrect: with 1M data points and 2 GPUs, each GPU should process only 500k data points per epoch. I wrote the example so that NVTabular produces distinct files per worker. Is that the proposed workflow for a user? A sketch of both sharding options follows below.
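To make the open question concrete, here is a sketch of two ways a worker could be restricted to its own shard of the data. Option (a) is the distinct-files-per-worker workflow mentioned above; option (b) assumes the dataloader's `global_size`/`global_rank` arguments, which is exactly where the unequal-batch-size question from dataloader#75 comes in. The `train_dir` path is hypothetical, and the `merlin.dataloader` module path may differ by version.

```python
# Sketch: two options for giving each of N workers ~1/N of the data.
import glob
import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader  # module path may vary by version

hvd.init()

# (a) Distinct files per worker: round-robin the Parquet files that
#     NVTabular wrote, so each worker only ever reads its own files.
files = sorted(glob.glob("train_dir/*.parquet"))  # hypothetical path
worker_files = files[hvd.rank() :: hvd.size()]
train_shard = Dataset(worker_files)

# (b) Shard inside the dataloader: pass the worker topology and let the
#     loader partition the dataset. With 1M rows and 2 workers, each
#     worker should see ~500k rows per epoch.
loader = Loader(
    Dataset("train_dir/*.parquet"),
    batch_size=65_536,
    shuffle=True,
    global_size=hvd.size(),
    global_rank=hvd.rank(),
)
```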
Starting Point:
Analyze the scaling factor when using multiple GPUs: going from 1x GPU -> 2x GPUs -> 4x GPUs -> 8x GPUs, how much higher is the throughput?
Provide performance metrics (accuracy / AUC / etc.) to show that there is no negative effect on model performance
Provide guidance on how to set the global batch size, per-GPU batch size, and learning rate when scaling; a rule-of-thumb sketch follows below
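As a starting point for that guidance, a common rule of thumb (the linear-scaling heuristic from Goyal et al., 2017) is sketched below: hold the per-GPU batch size fixed, let the global batch size grow with the worker count, and scale the learning rate proportionally. The concrete numbers are placeholders, not recommendations.

```python
# Rule-of-thumb scaling sketch (linear-scaling heuristic); values are placeholders.
import horovod.tensorflow.keras as hvd

hvd.init()

per_gpu_batch_size = 16_384                           # held fixed per worker
global_batch_size = per_gpu_batch_size * hvd.size()   # grows with worker count
base_lr = 0.01                                        # tuned on a single GPU
learning_rate = base_lr * hvd.size()                  # scale with worker count
# When hvd.size() is large, consider warming up the learning rate over the
# first few epochs, since large global batches can destabilize early training.
```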