Closed lightsailpro closed 2 years ago
@lightsailpro, thank you for your question! We are leveraging the multi-GPU settings defined in the HuggingFace library (you can check this documentation to get more details about the supported strategies).
The most straightforward strategy is DataParallel. If you are using a notebook for training, just set the visible GPUs at the beginning by running:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
If you are using a Python script, you should run the following command:
CUDA_VISIBLE_DEVICES=0,1 python $YOUR_SCRIPT --{arguments}
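To summarize the notebook approach above, here is a minimal sketch. The GPU indices `0,1` are assumptions for a two-GPU machine (such as the asker's two V100s); confirm your actual device IDs with `nvidia-smi`:

```python
import os

# Restrict this process to the first two GPUs. This must run BEFORE any
# CUDA library (e.g. torch) is imported/initialized, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# The HuggingFace Trainer derives its device count from the GPUs left
# visible by this variable; parsing it shows what the framework will see.
visible_gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible_gpus))  # number of GPUs available to the Trainer
```

With both GPUs visible, the Trainer wraps the model in DataParallel on its own, so no manual `model = nn.DataParallel(model)` call is needed.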
Thanks for the quick response. Will try later.
❓ Questions & Help
Details
In this documentation page - https://nvidia-merlin.github.io/Transformers4Rec/main/training_eval.html - training on a single machine with multiple GPUs using the DataParallel approach is mentioned. But the doc does not seem to explain how to enable it, beyond adding "fp16=True" to the example code in https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/main/examples/getting-started-session-based. I have two V100 GPUs, and I noticed that only one GPU is used by the trainer by default. Do I need to do something like "model = nn.DataParallel(model)", or is there a setting to turn on in T4RecTrainingArguments? Any example code would be much appreciated.