OpenMOSS / MOSS

An open-source tool-augmented conversational language model from Fudan University
https://txsun1997.github.io/blogs/moss.html
Apache License 2.0

ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer #272

Open · Daniel-1997 opened this issue 1 year ago

Daniel-1997 commented 1 year ago

GPU setup: 2× V100 32 GB (4 in total, but 2 are currently in use by others; once they free up I can use all 4 V100s). With the default accelerate config, training fails with CUDA out of memory. I noticed that in the default config both offload_optimizer_device and offload_param_device are set to none, so following the accelerate tutorial I changed both to cpu, after which I get this error:

(screenshot of the error: ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer)

My accelerate config is as follows:

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_save_16bit_model: true
  zero_stage: 3
  zero3_init_flag: true
  zero_force_ds_cpu_optimizer: False
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

After talking with others, I learned that on 8× A100 (40 GB) the official script trains as-is, without touching the CPU offload parameters. Given my limited hardware, how can I enable CPU offload without hitting this error? Do I need to provide a separate DeepSpeed config file (.json)?
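For context, here is what I am considering trying on my side. This is just a sketch under my own assumptions, not the repo's fine-tuning script: since ZeRO-Offload wants DeepSpeed's own CPU-capable optimizer, replacing the client-side torch.optim.AdamW with deepspeed.ops.adam.DeepSpeedCPUAdam before accelerator.prepare might sidestep the exception. The model and learning rate below are placeholders.

```python
# Sketch only, not the official fine-tuning script: build the optimizer with
# DeepSpeedCPUAdam instead of torch.optim.AdamW so that ZeRO-Offload gets a
# CPU-capable optimizer. Model and lr are placeholders.
import torch
from accelerate import Accelerator
from deepspeed.ops.adam import DeepSpeedCPUAdam

accelerator = Accelerator()  # picks up the DeepSpeed settings from `accelerate config`

model = torch.nn.Linear(1024, 1024)                        # placeholder for the real MOSS model
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-5)  # offload-friendly Adam variant

model, optimizer = accelerator.prepare(model, optimizer)
```

An alternative I have seen in the accelerate docs is to define the optimizer inside a DeepSpeed .json config and pass accelerate.utils.DummyOptim from the training script, but I have not verified that path myself.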

insist93 commented 1 year ago

I'm running into the same problem. Did you manage to solve it?

insist93 commented 1 year ago

> I'm running into the same problem. Did you manage to solve it?

It turned out to be a DeepSpeed version issue. Downgrading to 0.8.2 fixed it.
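For anyone else landing here, a quick way to confirm which DeepSpeed version is actually installed before rerunning the fine-tune (just a sketch; the version comparison is only a hint, not something from this repo):

```python
# Sketch: check the installed DeepSpeed version, since the exception in this
# thread went away after downgrading to 0.8.2.
from importlib.metadata import version

ds_version = version("deepspeed")
print(f"deepspeed == {ds_version}")
if ds_version != "0.8.2":
    print("Installed version differs from 0.8.2; consider pinning deepspeed==0.8.2.")
```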

Daniel-1997 commented 1 year ago

> I'm running into the same problem. Did you manage to solve it?
>
> It turned out to be a DeepSpeed version issue. Downgrading to 0.8.2 fixed it.

Okay, I'll give that a try as well. Did your fine-tuning succeed? If so, could you share what the peak memory usage reached?