OpenMOSS / MOSS

An open-source tool-augmented conversational language model from Fudan University
https://txsun1997.github.io/blogs/moss.html
Apache License 2.0

ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer #272

Open · Daniel-1997 opened this issue 1 year ago

Daniel-1997 commented 1 year ago

GPU setup: 2× V100 32 GB (4 in total, but 2 are currently in use by others; once they free up I can use all 4 V100s). With the default accelerate config, training fails with CUDA out of memory. I noticed that in the default config both offload_optimizer_device and offload_param_device are set to none, so following the accelerate tutorial I changed both to cpu, after which I get this error:

(screenshot of the error: ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer)

My accelerate config is as follows:

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_save_16bit_model: true
  zero_stage: 3
  zero3_init_flag: true
  zero_force_ds_cpu_optimizer: False
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

After talking with others, I learned that on 8× A100 (40 GB) the official script trains as-is, without touching the CPU offload parameters. Given my limited hardware, how can I enable CPU offload without hitting this error? Do I need to provide a separate DeepSpeed config file (.json)?
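For context, here is what I am considering trying on my side. This is just a sketch under my own assumptions, not the repo's fine-tuning script: since ZeRO-Offload wants DeepSpeed's own CPU-capable optimizer, replacing the client-side torch.optim.AdamW with deepspeed.ops.adam.DeepSpeedCPUAdam before accelerator.prepare might sidestep the exception. The model and learning rate below are placeholders.

```python
# Sketch only, not the official fine-tuning script: build the optimizer with
# DeepSpeedCPUAdam instead of torch.optim.AdamW so that ZeRO-Offload gets a
# CPU-capable optimizer. Model and lr are placeholders.
import torch
from accelerate import Accelerator
from deepspeed.ops.adam import DeepSpeedCPUAdam

accelerator = Accelerator()  # picks up the DeepSpeed settings from `accelerate config`

model = torch.nn.Linear(1024, 1024)                        # placeholder for the real MOSS model
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-5)  # offload-friendly Adam variant

model, optimizer = accelerator.prepare(model, optimizer)
```

An alternative I have seen in the accelerate docs is to define the optimizer inside a DeepSpeed .json config and pass accelerate.utils.DummyOptim from the training script, but I have not verified that path myself.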

insist93 commented 1 year ago

I'm running into the same problem. Did you manage to solve it?

insist93 commented 1 year ago

> I'm running into the same problem. Did you manage to solve it?

It turned out to be a DeepSpeed version issue. Downgrading to 0.8.2 fixed it.
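For anyone else landing here, a quick way to confirm which DeepSpeed version is actually installed before rerunning the fine-tune (just a sketch; the version comparison is only a hint, not something from this repo):

```python
# Sketch: check the installed DeepSpeed version, since the exception in this
# thread went away after downgrading to 0.8.2.
from importlib.metadata import version

ds_version = version("deepspeed")
print(f"deepspeed == {ds_version}")
if ds_version != "0.8.2":
    print("Installed version differs from 0.8.2; consider pinning deepspeed==0.8.2.")
```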

Daniel-1997 commented 1 year ago

> I'm running into the same problem. Did you manage to solve it?
>
> It turned out to be a DeepSpeed version issue. Downgrading to 0.8.2 fixed it.

Okay, I'll give that a try as well. Did your fine-tuning succeed? If so, could you share what the peak memory usage reached?