NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

Enabling data-parallel multi-GPU training #1188

Closed marcromeyn closed 1 year ago

marcromeyn commented 1 year ago

This PR enables multi-GPU training and adds auto-initialization of a Model.

It also introduces singlegpu and multigpu pytest markers so that the GPU CI GitHub Actions workflow can be split into two jobs: one for the 1-GPU runner and one for the 2-GPU (multi-GPU) runner.
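As a rough illustration of how such markers are usually registered, a conftest.py can declare them through pytest's pytest_configure hook. This is a minimal sketch, not the code from this PR: the marker descriptions and the MARKERS dictionary are assumptions made for the example.

```python
# conftest.py (sketch; marker descriptions are illustrative assumptions)

# Marker names come from the PR description; the text after each name is made up.
MARKERS = {
    "singlegpu": "tests that need exactly one GPU",
    "multigpu": "tests that need two or more GPUs",
}


def pytest_configure(config):
    """Register the custom markers so pytest does not warn about unknown marks."""
    for name, description in MARKERS.items():
        # "markers" is the ini option pytest uses for registered marker names.
        config.addinivalue_line("markers", f"{name}: {description}")
```

With markers registered this way, a test would be decorated with @pytest.mark.singlegpu or @pytest.mark.multigpu, and each CI job could select its subset with pytest -m singlegpu or pytest -m multigpu. How the actual workflow invokes pytest is not stated in this PR description.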

Follow-up: The test in tests/integration is incomplete. Lightning launches separate processes under the hood with the correct environment variables, such as LOCAL_RANK, but pytest stays in the main process and tests only the LOCAL_RANK=0 case. A follow-up should add a proper test that ensures the dataloader works correctly with, e.g., global_rank > 0.
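One possible direction for that follow-up is to simulate non-zero ranks inside the main pytest process by patching the environment. The sketch below only illustrates the idea: shard_for_rank is a hypothetical stand-in for the real dataloader, rank_from_env and collect_shards are invented helpers, and reading WORLD_SIZE from the environment is an assumption. It also treats LOCAL_RANK as the global rank, which holds only on a single node.

```python
import os
from unittest import mock


def rank_from_env():
    """Read (rank, world_size) from the environment; default to a single process."""
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    # WORLD_SIZE is an assumed variable name for this sketch.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, world_size


def shard_for_rank(items, rank, world_size):
    """Hypothetical loader stand-in: take every world_size-th item, offset by rank."""
    return list(items)[rank::world_size]


def collect_shards(items, world_size):
    """Simulate each rank in turn by patching os.environ, then gather its shard."""
    shards = []
    for rank in range(world_size):
        env = {"LOCAL_RANK": str(rank), "WORLD_SIZE": str(world_size)}
        # patch.dict restores the original environment when the block exits.
        with mock.patch.dict(os.environ, env):
            current_rank, current_world_size = rank_from_env()
            shards.append(shard_for_rank(items, current_rank, current_world_size))
    return shards
```

A test built on this could assert that the shards returned by collect_shards are disjoint and together cover the whole dataset. It would not replace a real multi-process run, since no process group or GPU is involved.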

github-actions[bot] commented 1 year ago

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1188