[BUG/Help] <Running ds_train_finetune.sh fails with torch.cuda.OutOfMemoryError>

xwdreamer commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

Contents of ds_train_finetune.sh:

cat ds_train_finetune.sh 

LR=1e-4

MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --num_gpus=1 --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --test_file AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /data/test/model/chatglm2-6b \
    --output_dir ./output/adgen-chatglm2-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 5000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16

Run the fine-tuning:

 bash ds_train_finetune.sh

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.26 GiB (GPU 0; 79.35 GiB total capacity; 58.15 GiB already allocated; 20.51 GiB free; 58.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-08-22 12:53:56,295] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6591
[2023-08-22 12:53:56,295] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=0', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', 'AdvertiseGen/train.json', '--test_file', 'AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '/data/test/xuwei32/model/chatglm2-6b', '--output_dir', './output/adgen-chatglm2-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '5000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = 1
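
The figures in the error message can be checked against each other. The following is a minimal sketch, not part of the original report: the helper name `parse_oom`, the regular expression, and the 1.5x threshold are all hypothetical and written only for this message format. It extracts the sizes, computes how far the request exceeds the free memory, and compares reserved with allocated memory, since the message itself suggests `max_split_size_mb` only when reserved memory is much larger than allocated memory.

```python
import re

# Error text copied from the log above (only the part that carries numbers).
MSG = (
    "CUDA out of memory. Tried to allocate 23.26 GiB (GPU 0; 79.35 GiB total capacity; "
    "58.15 GiB already allocated; 20.51 GiB free; 58.15 GiB reserved in total by PyTorch)"
)

def parse_oom(msg):
    """Hypothetical helper: pull the GiB figures out of an OOM message in this format."""
    pattern = (
        r"Tried to allocate ([\d.]+) GiB .*?([\d.]+) GiB total capacity; "
        r"([\d.]+) GiB already allocated; ([\d.]+) GiB free; ([\d.]+) GiB reserved"
    )
    m = re.search(pattern, msg)
    if m is None:
        raise ValueError("message does not match the expected format")
    keys = ("requested", "total", "allocated", "free", "reserved")
    return dict(zip(keys, map(float, m.groups())))

stats = parse_oom(MSG)

# How much more memory the failed allocation needed than was free.
shortfall = stats["requested"] - stats["free"]

# Arbitrary illustrative threshold for "reserved >> allocated".
fragmentation_likely = stats["reserved"] > 1.5 * stats["allocated"]

print(f"shortfall: {shortfall:.2f} GiB")
print("fragmentation likely:", fragmentation_likely)
```

Here reserved and allocated memory are both 58.15 GiB, so fragmentation is probably not the cause, and tuning `max_split_size_mb` alone seems unlikely to help.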

Expected Behavior

No response

Steps To Reproduce

.

Environment

- OS:Ubuntu 20.04
- Python:

python3 --version
Python 3.8.10

Transformers:

pip list | grep trans
transformer-engine      0.6.0
transformers            4.30.2

PyTorch:

pip list | grep torch
pytorch-quantization    2.1.2
torch                   2.0.0a0+1767026
torch-tensorrt          1.4.0.dev0
torchtext               0.13.0a0+fae8e8c
torchvision             0.15.0a0

CUDA Support (python -c "import torch; print(torch.cuda.is_available())") :True

Anything else?

No response

xwdreamer commented 1 year ago

The GPUs used are A100s:

nvidia-smi
Tue Aug 22 12:59:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   32C    P0    66W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   33C    P0    66W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
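
The nvidia-smi output lists two A100s, while the script launches with `--num_gpus=1`. One option that may be worth trying is ZeRO with optimizer-state offload, possibly combined with `--num_gpus=2` so that optimizer states and gradients are partitioned across both cards. The contents of `deepspeed.json` are not shown in this thread, so the sketch below writes a separate, hypothetical config file (`deepspeed_offload.json` is an invented name). The keys are standard DeepSpeed options, but the combination is untested for this model and is an assumption, not a confirmed fix.

```python
import json

# Hypothetical DeepSpeed config: ZeRO stage 2 with optimizer states offloaded to CPU.
# The "auto" values are meant to be filled in by the Hugging Face Trainer
# integration from the command-line arguments (batch size, accumulation, --fp16).
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer states in host RAM instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Hypothetical file name; it would be passed via --deepspeed in place of deepspeed.json.
with open("deepspeed_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Offloading trades GPU memory for host RAM and slower optimizer steps, so throughput would likely drop compared with a run that fits entirely on the GPU.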

renwenlong-github commented 12 months ago

Same problem here. I am using 4×V100 (32 GB) and get torch.cuda.OutOfMemoryError: CUDA out of memory.

937739823 commented 11 months ago

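Has this been solved? I ran into the same problem. How much GPU memory does full-parameter fine-tuning actually need? With fp16 training, even 80 GB runs out of memory (OOM).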
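
A rough estimate of the memory needed can be made from the log in the original post. The sketch below rests on two assumptions that the log does not state: that the failed 23.26 GiB allocation was a single fp32 buffer covering all model parameters, and that training follows the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam (fp16 weights and gradients, plus fp32 master weights, momentum, and variance). Activations and temporary buffers are ignored, so the real requirement would be higher.

```python
GIB = 2**30

# Assumption: the 23.26 GiB request in the log was one fp32 (4-byte) tensor
# spanning every parameter, so the parameter count follows from its size.
requested_gib = 23.26
n_params = requested_gib * GIB / 4

# Rule-of-thumb bytes per parameter for mixed-precision Adam without ZeRO partitioning.
bytes_per_param = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

# Model states only: no activations, no temporary buffers.
total_gib = n_params * sum(bytes_per_param.values()) / GIB

print(f"inferred parameters: {n_params / 1e9:.2f} B")
print(f"model states: {total_gib:.1f} GiB (GPU capacity in the log: 79.35 GiB)")
```

Under these assumptions the model states alone come to roughly 93 GiB, which would exceed a single 80 GB card before any activations are counted. Partitioning or offloading the optimizer states (for example with DeepSpeed ZeRO) is the usual way to bring the per-GPU figure down.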

THUDM / ChatGLM2-6B