THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

[BUG/Help] How to improve p-tuning training speed on multiple GPUs #605

Open zhengw1024 opened 10 months ago

zhengw1024 commented 10 months ago

Is there an existing issue for this?

Current Behavior

Training with train.sh on a machine with 8 V100 GPUs and NUM_GPUS=8, I find the speed is basically no different from single-GPU training: both take about 2 hours. Have I configured something incorrectly?

PRE_SEQ_LEN=128
LR=2e-2
NUM_GPUS=1
date=$(date +"%Y%m%d%H%M")

torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
    --do_train \
    --train_file formated_q_a.json \
    --validation_file formated_q_a.json \
    --preprocessing_num_workers 10 \
    --prompt_column input \
    --response_column output \
    --overwrite_cache \
    --model_name_or_path /data/vege/llm/wenda/model/chatglm2-6b \
    --output_dir /data/vege/llm/wenda/model/chatglm2-6b-ptuning-example-output/adgen-chatglm2-6b-pt-$PRE_SEQ_LEN-$LR-$date \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 128 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4
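
For context: under torchrun data parallelism, each of the NUM_GPUS worker processes consumes its own per_device_train_batch_size samples per optimizer step, while max_steps counts optimizer steps, not samples. With max_steps held fixed, the total amount of data processed therefore scales linearly with the GPU count. A minimal back-of-the-envelope sketch, using the flag values from the script above and NUM_GPUS=8 as described in the report:

NUM_GPUS=8
PER_DEVICE_BATCH=16      # --per_device_train_batch_size
GRAD_ACCUM=1             # --gradient_accumulation_steps
MAX_STEPS=3000           # --max_steps

# Samples consumed per optimizer step across all ranks:
SAMPLES_PER_STEP=$((NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM))   # 128
# Total samples seen over the whole run:
TOTAL_SAMPLES=$((MAX_STEPS * SAMPLES_PER_STEP))                  # 384000
echo "samples per step: $SAMPLES_PER_STEP, total samples: $TOTAL_SAMPLES"
# A single GPU processes 3000 * 16 * 1 = 48000 samples, so the 8-GPU run
# processes 8x the data in roughly the same wall-clock time.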

Expected Behavior

No response

Steps To Reproduce

  1. Simply follow the official p-tuning fine-tuning instructions.

Environment

- OS: Ubuntu 20.04
- Python: 3.10.11
- Transformers: 4.29.2
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True

Anything else?

No response

Harris-Xie commented 9 months ago

Dummy, didn't you notice that the number of epochs grew 8x?
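
In other words, because max_steps is fixed at 3000, the 8-GPU run simply does 8x the work in about the same wall-clock time. To actually finish faster while covering the same total amount of data, one option (a sketch, not something suggested in this thread) is to scale max_steps down by the GPU count:

NUM_GPUS=8
# Keep the total number of samples constant when scaling out to 8 GPUs:
MAX_STEPS=$((3000 / NUM_GPUS))   # 375 steps * 8 GPUs * batch 16 = 48000
                                 # samples, same as 3000 steps on 1 GPU
# ...then pass "--max_steps $MAX_STEPS" in the torchrun command above,
# leaving the other flags unchanged. Note the global batch per step also
# grows 8x, so the learning rate may need adjusting as well.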