alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

Training throughput does not scale with increasing number of devices (ViT training) #824

Open frankxyy opened 1 year ago

frankxyy commented 1 year ago

I ran Alpa for ViT-Large model training. On my machine with NVIDIA A100 GPUs, the throughput is 368.2 samples/s with 2 GPUs and 424.7 samples/s with 4 GPUs. It seems that training efficiency barely improves as the number of devices increases.
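
For reference, these numbers correspond to a scaling efficiency of only about 58% when going from 2 to 4 GPUs. A quick back-of-the-envelope check (my own arithmetic, just to quantify the gap):

t2, t4 = 368.2, 424.7          # reported samples/s at 2 and 4 GPUs
ideal_t4 = 2 * t2              # perfect linear scaling would double throughput
print(t4 / ideal_t4)           # ~0.58, i.e. ~58% scaling efficiency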

frankxyy commented 1 year ago

GPU utilization is almost 100% the whole time in both the 2-GPU and the 4-GPU case. My assumption is that throughput should scale roughly linearly with the compute being used.

The phenomenon occurs with different parallel methods, including data-parallel, shard-parallel, and pipeshard-parallel.
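
For context, this is roughly how I switch between the methods when parallelizing the train step (a minimal sketch following the Alpa docs; constructor arguments may differ across versions):

import alpa

def train_step(state, batch):
    ...  # forward/backward pass and optimizer update

method = alpa.DataParallel()                            # pure data parallelism
# method = alpa.ShardParallel()                         # intra-operator sharding
# method = alpa.PipeshardParallel(num_micro_batches=1)  # pipeline + sharding

p_train_step = alpa.parallelize(train_step, method=method)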

merrymercy commented 1 year ago

Could you share your scripts? Did you increase the batch size when you increase the number of devices?

frankxyy commented 1 year ago

@merrymercy

https://gist.github.com/frankxyy/efbb128fc41b5f11ae756874ec73b28c

Hi @merrymercy , this is the script I used for training, which is a modified version of the ViT example script in the official repository. The dataset used is Imagenette.

This is the training command:

NCCL_SOCKET_IFNAME=bond0 XLA_PYTHON_CLIENT_PREALLOCATE=false CUDA_VISIBLE_DEVICES=4,5,6,7  python run_image_classification.py \
    --output_dir ./vit-base-patch16-imagenette \
    --train_dir="/home/xuyangyang/imagenette2/train" \
    --validation_dir="/home/xuyangyang/imagenette2/val" \
    --num_train_epochs 50 \
    --num_micro_batches 1 \
    --learning_rate 1e-3 \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 2 \
    --overwrite_output_dir \
    --preprocessing_num_workers 2 \
    --num_devices_per_node 4 \
    --num_of_nodes 1 \
    --model_name_or_path google/vit-base-patch16-224-in21k

When running with more GPUs, only the num_devices_per_node argument is changed. My assumption is that the global batch size also changes along with num_devices_per_node.
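
In other words, my understanding of the effective global batch size under these flags is as follows (an assumption about how the example script derives it, not something I verified in the code):

per_device_train_batch_size = 6
for num_devices in (2, 4):
    global_batch_size = per_device_train_batch_size * num_devices
    print(num_devices, "GPUs -> global batch size", global_batch_size)  # 12, then 24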

I am quite confused by this problem and would appreciate your analysis.

dumpmemory commented 1 year ago

I see similar behavior, but with NLP. I ran the OPT example with the 1.3B GPT-Neo model. On one node with 8 2080 Ti GPUs I get Throughput: 3664.00 token/s, 3.74 TFLOP/s. When I scale to 8 nodes (8 × 8 = 64 GPUs), it drops to Throughput: 6703.90 token/s, 0.86 TFLOP/s. The node info from the Ray dashboard is shown below.

[Screenshot: Ray dashboard node info, 2022-12-29]

Is there any way I can find the bottleneck? Currently, my guess is that the network speed between nodes is the reason.
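
One way I could try to isolate communication is to time a bare all-reduce and compare it against the step time. A minimal single-node sketch in plain JAX (the buffer size and iteration count are arbitrary; for the multi-node case, the all_reduce_perf benchmark from nccl-tests would be the more direct check):

import time
import jax
import jax.numpy as jnp

n = jax.local_device_count()
# ~64 MB of float32 per device
x = jnp.ones((n, 16 * 1024 * 1024), dtype=jnp.float32)

all_reduce = jax.pmap(lambda v: jax.lax.psum(v, axis_name="i"), axis_name="i")
all_reduce(x).block_until_ready()  # warm-up / compile

start = time.time()
for _ in range(10):
    out = all_reduce(x)
out.block_until_ready()
print(f"{n} devices: {(time.time() - start) / 10 * 1e3:.1f} ms per 64 MB all-reduce")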