Enabling data-parallel multi-GPU training

This PR enables data-parallel multi-GPU training and adds auto-initialization of a Model.

It also introduces singlegpu and multigpu pytest markers for splitting the GPU CI GitHub Actions workflow into two jobs: one for the 1-GPU runner and one for the multi-GPU (2-GPU) runner.
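A minimal sketch of how such markers are typically registered and applied, assuming a `conftest.py`-style hook. The test names and marker descriptions below are hypothetical placeholders, not the tests added in this PR:

```python
import pytest


def pytest_configure(config):
    # Register both markers so `pytest --strict-markers` accepts them.
    # The description strings are illustrative, not taken from this PR.
    config.addinivalue_line("markers", "singlegpu: runs on the 1-GPU runner")
    config.addinivalue_line("markers", "multigpu: runs on the multi-GPU runner")


@pytest.mark.singlegpu
def test_fit_on_one_gpu():
    # Hypothetical placeholder for a single-GPU training test.
    pass


@pytest.mark.multigpu
def test_fit_data_parallel():
    # Hypothetical placeholder for a data-parallel training test.
    pass
```

Each CI job can then select its own subset with `pytest -m singlegpu` or `pytest -m multigpu`.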

Follow-up: The test in tests/integration is incomplete. Lightning launches separate processes under the hood with the correct environment variables, such as LOCAL_RANK, but pytest stays in the main process and tests only the LOCAL_RANK=0 case. A follow-up should add a proper test that ensures the dataloader works correctly for other ranks, e.g., global_rank > 0.
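One possible shape for that follow-up test is to simulate each rank's environment inside the main pytest process. The sketch below is an assumption-laden illustration: `shard_for_rank` is a hypothetical stand-in for a rank-aware dataloader (it is not the Merlin dataloader), `WORLD_SIZE` is assumed to be set alongside `LOCAL_RANK`, and a single node is assumed so that the local rank equals the global rank.

```python
import os
from unittest import mock


def shard_for_rank(num_rows: int) -> range:
    """Hypothetical rank-aware loader: returns the row indices for this rank.

    Reads LOCAL_RANK and WORLD_SIZE from the environment (WORLD_SIZE is an
    assumption; single-node, so local rank == global rank).
    """
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    # Strided split: rank r takes rows r, r + world_size, r + 2 * world_size, ...
    return range(rank, num_rows, world_size)


def rows_seen_by_all_ranks(num_rows: int, world_size: int) -> list:
    """Run the loader once per simulated rank and collect each rank's rows."""
    shards = []
    for rank in range(world_size):
        env = {"LOCAL_RANK": str(rank), "WORLD_SIZE": str(world_size)}
        # patch.dict restores os.environ when the block exits.
        with mock.patch.dict(os.environ, env):
            shards.append(list(shard_for_rank(num_rows)))
    return shards


def test_every_rank_gets_a_disjoint_shard():
    shards = rows_seen_by_all_ranks(num_rows=10, world_size=2)
    all_rows = [row for shard in shards for row in shard]
    # No row is duplicated across ranks and no row is dropped.
    assert sorted(all_rows) == list(range(10))
```

This exercises the rank > 0 paths without spawning processes, so it complements rather than replaces an end-to-end run on the multi-GPU runner.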

NVIDIA-Merlin / models

Enabling data-parallel multi-GPU training #1188
