PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Help needed, chatglm2 single-GPU SFT out of memory #8612

Closed · yidu0924 closed this issue 2 days ago

yidu0924 commented 3 months ago

Please describe your question

The error is as follows:

Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 428.000000MB memory on GPU 0, 79.153320GB memory has been allocated and available memory is only 175.562500MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model. (at ../paddle/fluid/memory/allocation/cuda_allocator.cc:86)

I have already reduced the batch size to the minimum. The training config is:

    {
      "model_name_or_path": "/home/duyi/paddle",
      "dataset_name_or_path": "/home/duyi/ChatGLM2-6B/ptuning/AdvertiseGen",
      "output_dir": "./checkpoints/chatglm2_sft_ckpts",
      "per_device_train_batch_size": 1,
      "gradient_accumulation_steps": 4,
      "per_device_eval_batch_size": 1,
      "eval_accumulation_steps": 16,
      "num_train_epochs": 3,
      "learning_rate": 3e-05,
      "warmup_steps": 30,
      "logging_steps": 1,
      "evaluation_strategy": "epoch",
      "save_strategy": "epoch",
      "src_length": 1024,
      "max_length": 2048,
      "fp16": true,
      "fp16_opt_level": "O2",
      "do_train": true,
      "do_eval": true,
      "disable_tqdm": true,
      "load_best_model_at_end": true,
      "eval_with_do_generation": false,
      "metric_for_best_model": "accuracy",
      "recompute": true,
      "save_total_limit": 1,
      "sharding_parallel_degree": 4,
      "sharding": "stage3",
      "zero_padding": false,
      "use_flash_attention": false
    }

Environment: paddlenlp=2.8, python=3.9.19

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A800-SXM...  On   | 00000000:3D:00.0 Off |                    0 |
    | N/A   32C    P0    67W / 400W |  21083MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A800-SXM...  On   | 00000000:42:00.0 Off |                    0 |
    | N/A   28C    P0    61W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A800-SXM...  On   | 00000000:61:00.0 Off |                    0 |
    | N/A   28C    P0    59W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A800-SXM...  On   | 00000000:67:00.0 Off |                    0 |
    | N/A   32C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   4  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
    | N/A   31C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   5  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
    | N/A   28C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   6  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
    | N/A   27C    P0    60W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   7  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
    | N/A   31C    P0    64W / 400W |      3MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

The memory overflow already happens while the model is being loaded, before fine-tuning even starts.
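
Since the nvidia-smi output above shows roughly 21 GiB already allocated on GPU 0 while the other cards sit at 3 MiB, hint 1 in the error message (another process holding GPU 0) is worth ruling out first. A minimal shell sketch, assuming nvidia-smi is available on the machine:

    # List compute processes and their memory usage on GPU 0 only
    nvidia-smi --id=0 --query-compute-apps=pid,process_name,used_memory --format=csv
    # Inspect any PID that shows up before stopping it, e.g.:
    # ps -fp <PID>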

DrownFish19 commented 3 months ago

We tried several times in the same environment locally and could not reproduce the issue. The reproduction command is:

 python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" finetune_generation.py chatglm2/sft_argument.json

Please try the following before running again:

  1. Update Paddle to the develop build:
    python -m pip install paddlepaddle-gpu==0.0.0.post120 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html
  2. Confirm the dataset is placed at the correct location

If the problem persists, please upload the full log and the reproduction command so we can debug it.
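
A quick way to sanity-check both suggestions before relaunching is sketched below; the train.json/dev.json file names are only what the PaddleNLP llm fine-tuning examples usually expect, so adjust them to your dataset:

    # Confirm the installed Paddle version and that the GPU backend works
    python -c "import paddle; print(paddle.__version__); paddle.utils.run_check()"
    # Confirm the dataset directory referenced in sft_argument.json exists and holds the expected files
    ls -l /home/duyi/ChatGLM2-6B/ptuning/AdvertiseGen   # expect e.g. train.json and dev.json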

yidu0924 commented 2 months ago

> We tried several times in the same environment locally and could not reproduce the issue. The reproduction command is:
>
>  python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" finetune_generation.py chatglm2/sft_argument.json
>
> Please try the following before running again:
>
>   1. Update Paddle to the develop build:
>     python -m pip install paddlepaddle-gpu==0.0.0.post120 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html
>   2. Confirm the dataset is placed at the correct location
>
> If the problem persists, please upload the full log and the reproduction command so we can debug it.

I'm using run_finetune.py; there is no finetune_generation.py under the llm directory. I also tried running on multiple GPUs, but it cannot connect to the port:

======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0715 07:49:49.644336  2696 tcp_utils.cc:181] The server starts to listen on IP_ANY:46524

After that there is no further output. My launch command:

    python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py config/chatglm2/sft_argument.json

Full log:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2024-07-15 07:49:45,444 -----------  Configuration  ----------------------
LAUNCH INFO 2024-07-15 07:49:45,445 auto_parallel_config: None
LAUNCH INFO 2024-07-15 07:49:45,445 auto_tuner_json: None
LAUNCH INFO 2024-07-15 07:49:45,445 devices: 0,1,2,3,4,5,6,7
LAUNCH INFO 2024-07-15 07:49:45,445 elastic_level: -1
LAUNCH INFO 2024-07-15 07:49:45,445 elastic_timeout: 30
LAUNCH INFO 2024-07-15 07:49:45,445 enable_gpu_log: True
LAUNCH INFO 2024-07-15 07:49:45,445 gloo_port: 6767
LAUNCH INFO 2024-07-15 07:49:45,445 host: None
LAUNCH INFO 2024-07-15 07:49:45,445 ips: None
LAUNCH INFO 2024-07-15 07:49:45,445 job_id: default
LAUNCH INFO 2024-07-15 07:49:45,445 legacy: False
LAUNCH INFO 2024-07-15 07:49:45,445 log_dir: log
LAUNCH INFO 2024-07-15 07:49:45,445 log_level: INFO
LAUNCH INFO 2024-07-15 07:49:45,445 log_overwrite: False
LAUNCH INFO 2024-07-15 07:49:45,445 master: None
LAUNCH INFO 2024-07-15 07:49:45,445 max_restart: 3
LAUNCH INFO 2024-07-15 07:49:45,445 nnodes: 1
LAUNCH INFO 2024-07-15 07:49:45,445 nproc_per_node: None
LAUNCH INFO 2024-07-15 07:49:45,445 rank: -1
LAUNCH INFO 2024-07-15 07:49:45,445 run_mode: collective
LAUNCH INFO 2024-07-15 07:49:45,445 server_num: None
LAUNCH INFO 2024-07-15 07:49:45,445 servers: 
LAUNCH INFO 2024-07-15 07:49:45,445 sort_ip: False
LAUNCH INFO 2024-07-15 07:49:45,445 start_port: 6070
LAUNCH INFO 2024-07-15 07:49:45,445 trainer_num: None
LAUNCH INFO 2024-07-15 07:49:45,445 trainers: 
LAUNCH INFO 2024-07-15 07:49:45,445 training_script: run_finetune.py
LAUNCH INFO 2024-07-15 07:49:45,445 training_script_args: ['config/chatglm2/sft_argument.json']
LAUNCH INFO 2024-07-15 07:49:45,445 with_gloo: 1
LAUNCH INFO 2024-07-15 07:49:45,445 --------------------------------------------------
LAUNCH INFO 2024-07-15 07:49:45,449 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-07-15 07:49:45,451 Run Pod: ojhcwq, replicas 8, status ready
LAUNCH INFO 2024-07-15 07:49:45,607 Watching Pod: ojhcwq, replicas 8, status running
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2024-07-15 07:49:48,214] [ WARNING] - if you run ring_flash_attention.py, please ensure you install the paddlenlp_ops by following the instructions provided at https://github.com/PaddlePaddle/PaddleNLP/blob/develop/csrc/README.md
[2024-07-15 07:49:49,643] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0715 07:49:49.644336  2696 tcp_utils.cc:181] The server starts to listen on IP_ANY:46524

yidu0924 commented 2 months ago

I found that run_check() connects to 127.0.0.1 successfully, but as soon as I run the training script it tries to reach a different address and never connects:

I0715 07:52:00.361024 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.
I0715 07:54:13.480999 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.

yidu0924 commented 2 months ago

> I found that run_check() connects to 127.0.0.1 successfully, but as soon as I run the training script it tries to reach a different address and never connects:
> I0715 07:52:00.361024 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.
> I0715 07:54:13.480999 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.

Is there anywhere in Paddle where I can manually change this address to 127.0.0.1?

DrownFish19 commented 2 months ago

  1. You can specify the master explicitly (a single-node variant is sketched after this list):

    # Use a standalone etcd service as the master
    python -m paddle.distributed.launch --master=etcd://10.11.60.193:2379 --nnodes=4 --devices=1,2,3  train.py

    or

    # Use an HTTP master, given as a training-node IP and a free port
    python -m paddle.distributed.launch --master=10.11.60.193:2379 --nnodes=4 --devices=1,2,3  train.py

  2. If the problem still persists, follow the instructions on the official website to choose a configuration that fits your machine and install the develop build.
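
Applied to the single-node command earlier in this thread, suggestion 1 might look like the sketch below. The loopback address and port 6070 (the start_port shown in the launch log) are assumptions to adapt; any address and free port belonging to the local machine will do:

    # Pin the master to the local machine so the workers do not try to reach 172.31.3.19
    python -m paddle.distributed.launch --master=127.0.0.1:6070 --nnodes=1 \
        --gpus "0,1,2,3,4,5,6,7" run_finetune.py config/chatglm2/sft_argument.json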

DrownFish19 commented 2 months ago

> I found that run_check() connects to 127.0.0.1 successfully, but as soon as I run the training script it tries to reach a different address and never connects:
> I0715 07:52:00.361024 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.
> I0715 07:54:13.480999 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.

If 172.31.3.19 is not this machine's own IP, please check the machine environment: at startup Paddle pulls environment variables and launches using the local machine's IP.
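
A quick way to see which addresses this machine actually owns, and whether an inherited environment variable is steering the launcher toward 172.31.3.19, might be the sketch below (the variable names in the grep pattern are common candidates, not ones confirmed in this thread):

    # Addresses assigned to this machine
    hostname -I
    # The address Python resolves for the local hostname
    python -c "import socket; print(socket.gethostbyname(socket.gethostname()))"
    # Distributed-launch related variables already set in the current shell
    env | grep -iE 'paddle|pod_ip|master|trainer'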

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 2 days ago

This issue was closed because it has been inactive for 14 days since being marked as stale.