frankxyy opened 1 year ago
GPU usage is almost 100% all the time in both the 2-GPU and the 4-GPU case. My assumption was that throughput should scale roughly linearly with the amount of compute used.
The phenomenon occurs with different parallel methods, including data parallel, shard parallel, and pipeshard parallel.
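One caveat: nvidia-smi's utilization figure only reports whether any kernel was active during the sample window, so near-100% does not necessarily mean the GPUs are doing useful compute; communication kernels count as well. A minimal sketch of how per-GPU utilization could be sampled during a run (assuming the pynvml package is installed; the 10-sample loop and 1 s interval are arbitrary):

```python
# Sample SM and memory-controller utilization for all visible GPUs.
# Note: ~100% "GPU util" here means "some kernel was running", not "compute-bound".
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
for _ in range(10):                                   # sample for ~10 seconds
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
    print([(r.gpu, r.memory) for r in rates])         # (SM util %, mem util %)
    time.sleep(1)
pynvml.nvmlShutdown()
```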
Could you share your scripts? Did you increase the batch size when you increased the number of devices?
@merrymercy
https://gist.github.com/frankxyy/efbb128fc41b5f11ae756874ec73b28c
Hi @merrymercy, this is the script I used for training; it is a modified version of the ViT example script in the official repository. The dataset used is Imagenette.
This is the training command:
NCCL_SOCKET_IFNAME=bond0 XLA_PYTHON_CLIENT_PREALLOCATE=false CUDA_VISIBLE_DEVICES=4,5,6,7 python run_image_classification.py \
--output_dir ./vit-base-patch16-imagenette \
--train_dir="/home/xuyangyang/imagenette2/train" \
--validation_dir="/home/xuyangyang/imagenette2/val" \
--num_train_epochs 50 \
--num_micro_batches 1 \
--learning_rate 1e-3 \
--per_device_train_batch_size 6 \
--per_device_eval_batch_size 2 \
--overwrite_output_dir \
--preprocessing_num_workers 2 \
--num_devices_per_node 4 \
--num_of_nodes 1 \
--model_name_or_path google/vit-base-patch16-224-in21k
When using more GPUs, the num_devices_per_node argument is modified accordingly. My understanding is that the global batch size also changes along with num_devices_per_node.
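If the example follows the usual convention that the global batch size is per_device_train_batch_size times the total number of devices (this is an assumption; I have not checked how run_image_classification.py builds the data loader), the runs above are weak-scaling runs:

```python
# Implied global batch sizes, assuming global = per_device * num_devices
# (assumption; verify against how the example script constructs batches).
per_device_train_batch_size = 6
for num_devices in (2, 4):
    print(num_devices, "GPUs -> global batch size",
          per_device_train_batch_size * num_devices)
# 2 GPUs -> global batch size 12
# 4 GPUs -> global batch size 24
```

With only 6 samples per device, each step may be too small to keep an A100 busy with useful work, so adding devices could mainly add communication overhead; that is one possible explanation, not a confirmed diagnosis.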
I am quite confused by this problem and would appreciate your analysis.
I have seen something similar, but with NLP. I ran the OPT example with the 1.3B GPT-Neo model. With one node of 8 2080 Ti GPUs, I get Throughput: 3664.00 token/s, 3.74 TFLOP/s. When I scaled to 8 nodes (8*8 = 64 GPUs), it drops to Throughput: 6703.90 token/s, 0.86 TFLOP/s. The node info from the Ray dashboard is attached below.
Is there any way I can find the bottleneck? Currently, I would guess the network speed is the reason.
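Normalizing the reported numbers per GPU makes the drop easier to see (this just restates the figures above):

```python
# Per-GPU throughput from the reported numbers.
single_node = 3664.00 / 8    # ~458 token/s per GPU on 1 node (8 GPUs)
eight_nodes = 6703.90 / 64   # ~105 token/s per GPU on 8 nodes (64 GPUs)
print(single_node, eight_nodes, eight_nodes / single_node)  # ~0.23x per-GPU rate
```

A drop like that when going from one node to many is at least consistent with inter-node bandwidth being the bottleneck, so checking the network path (e.g., which interface NCCL is actually using) seems like a reasonable next step.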
I ran Alpa for ViT-Large model training. On my machine with NVIDIA A100 GPUs, the throughput is 368.2 samples/s with 2 GPUs and 424.7 samples/s with 4 GPUs. It seems that training throughput barely increases with the number of devices.
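For reference, the scaling efficiency implied by those two numbers (simple arithmetic on the reported throughputs):

```python
# Speedup and efficiency going from 2 to 4 A100s.
t2, t4 = 368.2, 424.7           # samples/s
speedup = t4 / t2               # ~1.15x instead of the ideal 2x
efficiency = t4 / (2 * t2)      # ~0.58
print(speedup, efficiency)
```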