[big-refactor] Accelerate launch FSDP Runtime Error

Has anyone run `accelerate launch` with FSDP successfully? If so, please share your accelerate config. Thanks.

I run into the following error:

RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

I am running eval on 4 GPUs, and the same error message is replicated on the GPUs. Typically, one batch completes on one of the GPUs before the run errors out.

Accelerate config:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
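To narrow down which setting triggers the error, one option is to flip a single boolean FSDP option at a time and re-run. The following is a minimal sketch of that idea, not a confirmed diagnosis: the helper name `single_flip_variants` is hypothetical, the dictionary simply mirrors the `fsdp_config` values from the config above, and Accelerate may reject some of the resulting combinations.

```python
import copy

# FSDP section copied from the accelerate config above.
FSDP_CONFIG = {
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
    "fsdp_cpu_ram_efficient_loading": True,
    "fsdp_forward_prefetch": True,
    "fsdp_offload_params": False,
    "fsdp_sharding_strategy": 1,
    "fsdp_state_dict_type": "SHARDED_STATE_DICT",
    "fsdp_sync_module_states": True,
    "fsdp_use_orig_params": True,
}


def single_flip_variants(config):
    """Yield (flipped_key, variant) pairs; each variant differs from
    `config` in exactly one boolean option. (Hypothetical helper.)"""
    for key, value in config.items():
        # Only true booleans are flipped; ints such as the sharding
        # strategy and string options are left untouched.
        if isinstance(value, bool):
            variant = copy.deepcopy(config)
            variant[key] = not value
            yield key, variant


if __name__ == "__main__":
    for key, variant in single_flip_variants(FSDP_CONFIG):
        print(f"{key}: {FSDP_CONFIG[key]} -> {variant[key]}")
```

Each variant can be written back into the `fsdp_config` section of a copy of the config file and passed to `accelerate launch` to see whether the error still appears.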

EleutherAI / lm-evaluation-harness, issue #1003