NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

How to set the t4r PyTorch trainer to use multiple GPUs with DataParallel [QST] #435

Closed lightsailpro closed 2 years ago

lightsailpro commented 2 years ago

❓ Questions & Help

Details

This documentation page - https://nvidia-merlin.github.io/Transformers4Rec/main/training_eval.html - mentions training on a single machine with multiple GPUs using the DataParallel approach, but it does not explain how to enable it; the only difference from the example code in https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/main/examples/getting-started-session-based appears to be "fp16=True". I have two V100 GPUs, and I noticed that only one GPU is used by the trainer by default. Do I need to do something like "model = nn.DataParallel(model)", or is there a setting to turn on in T4RecTrainingArguments? Any example code would be much appreciated.

sararb commented 2 years ago

@lightsailpro, thank you for your question! We are leveraging the multi-GPU settings defined in the HuggingFace library (you can check this documentation to get more details about the supported strategies).

The most straightforward strategy is DataParallel. If you are using a notebook for training, you should just set the visible GPUs at the beginning by running:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

If you are using a Python script, you should launch it with: CUDA_VISIBLE_DEVICES=0,1 python $YOUR_SCRIPT --{arguments}
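One detail worth noting (an assumption based on how CUDA device masking generally works, not stated in the thread): the environment variable must be set before CUDA is initialized, i.e. before the first `import torch` or any T4Rec import, otherwise the mask is ignored. A minimal sketch of the notebook approach:

```python
import os

# Must run before torch/CUDA is first initialized in the process,
# or the device mask has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# From here on, only GPUs 0 and 1 are visible to this process. The
# HuggingFace Trainer (which the T4Rec trainer builds on) detects that
# more than one GPU is visible and wraps the model in nn.DataParallel
# automatically -- no manual nn.DataParallel(model) call is needed.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

With the variable set this way, the existing training code from the getting-started example should use both V100s without further changes.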

lightsailpro commented 2 years ago

Thanks for the quick response. Will try later.