Fine-tuning is very slow after enabling fp8

In finetune_moss.py I changed the Accelerator construction as follows:

    accelerator = Accelerator(mixed_precision='fp8')
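Since the GPU model is not stated here, one thing worth ruling out is whether the card has FP8 tensor cores at all. The sketch below assumes that FP8 hardware support starts at CUDA compute capability 8.9 (Ada) and 9.0 (Hopper); the helper names `supports_fp8` and `report_local_gpus` are hypothetical and not part of MOSS or Accelerate. It falls back to an empty report when PyTorch or CUDA is unavailable.

```python
# Assumption: FP8 tensor cores exist on compute capability 8.9 (Ada) and 9.0+ (Hopper).
FP8_MIN_CAPABILITY = (8, 9)


def supports_fp8(capability):
    """Return True if a (major, minor) compute capability meets the assumed FP8 minimum."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


def report_local_gpus():
    """List (name, capability, fp8_ok) for each visible CUDA device, or [] if none."""
    try:
        import torch  # optional: only needed to query the local hardware
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    results = []
    for index in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(index)
        capability = torch.cuda.get_device_capability(index)  # (major, minor)
        results.append((name, capability, supports_fp8(capability)))
    return results


if __name__ == "__main__":
    for name, capability, fp8_ok in report_local_gpus():
        print(f"{name}: sm_{capability[0]}{capability[1]}, FP8 capable: {fp8_ok}")
```

If the card reports a capability below the assumed minimum, fp8 mixed precision cannot be expected to speed up training on it.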

The environment is NVIDIA's container nvcr.io/nvidia/pytorch:23.06-py3 (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).

Because the GPU does not have enough memory, I use DeepSpeed with CPU offload.

I modified sft.yaml as follows:

    command_file: null
    commands: null
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      gradient_accumulation_steps: 1
      gradient_clipping: 1.0
      offload_optimizer_device: cpu
      offload_param_device: cpu
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    gpu_ids: null
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    megatron_lm_config: {}
    mixed_precision: fp8
    num_machines: 1
    num_processes: 1
    rdzv_backend: static
    same_network: true
    tpu_name: null
    tpu_zone: null
    use_cpu: false

After switching fine-tuning to fp8, training became slower. What is causing this?
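To put a number on the slowdown, the same training step could be timed once per mixed_precision setting and the averages compared. This is only a minimal sketch: `mean_step_time` is a hypothetical helper, and `step_fn` stands in for one forward, backward, and optimizer step of the actual training loop, which is not shown here.

```python
import time


def mean_step_time(step_fn, warmup=3, steps=10):
    """Return the average wall-clock seconds per call of step_fn.

    step_fn is a placeholder for one full training step. For GPU work it
    should end with torch.cuda.synchronize(), otherwise asynchronous kernel
    launches make the step look faster than it is.
    """
    # Warm-up calls are excluded so one-time setup cost does not skew the mean.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps


if __name__ == "__main__":
    # Stand-in workload; replace with a real training step when measuring.
    seconds = mean_step_time(lambda: sum(range(100_000)))
    print(f"mean step time: {seconds:.6f} s")
```

Running this once with mixed_precision set to fp8 and once with another setting, with everything else unchanged, would show how large the difference per step actually is.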

DeepSpeed v0.9.5 FP8 unittest for H100 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/3731

Could it be that, with DeepSpeed offloading to the CPU, the slowdown is caused by the CPU not supporting fp8? My CPU is an Intel® Xeon® w9-3495X Processor.
