EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

[big-refactor] Accelerate launch FSDP Runtime Error #892

Closed adamjackson2357 closed 6 months ago

adamjackson2357 commented 11 months ago

Hi, when running accelerate launch with FSDP I run into the following error:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

I am running eval on 2 GPUs, and the error message is replicated on both. Typically one batch completes on one of the GPUs before erroring out.
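For context, this error reproduces outside the harness whenever `torch.embedding` receives a 1-D weight. FSDP stores wrapped parameters flattened into 1-D shards, so a forward pass through an embedding whose weight is still in its sharded (flattened) form fails in exactly this way. A minimal sketch, independent of FSDP:

```python
import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(10, 4)  # weight is 2-D: (num_embeddings, embedding_dim)
ids = torch.tensor([1, 2, 3])

out = F.embedding(ids, emb.weight)  # fine: returns a (3, 4) tensor

# FSDP flattens parameters into 1-D shards; passing such a weight fails:
flat = emb.weight.detach().flatten()
try:
    F.embedding(ids, flat)
except RuntimeError as e:
    print(e)  # 'weight' must be 2-D
```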

StellaAthena commented 11 months ago

What is the exact command you are running?

yurinoviello commented 10 months ago

Same issue on Nvidia L4 x 2

Command: accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks arc_challenge --batch_size 1 --num_fewshot=25

Accelerate conf:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Error:

    ......
    File "/opt/conda/envs/eval/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

When I change num_processes to 1, it works.

Thanks

yurinoviello commented 10 months ago

Using SIZE_BASED_WRAP it works (though the memory allocated on each GPU is higher); is that normal?

I thought it was possible to use Llama-2 with TRANSFORMER_BASED_WRAP.
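For reference, the SIZE_BASED_WRAP variant of the config above swaps the auto-wrap policy and adds a parameter-count threshold. The threshold value below is an illustrative choice, not taken from this thread:

```yaml
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_min_num_params: 100000000  # wrap submodules with >= 1e8 params; tune per model
  # remaining fsdp_config keys unchanged from the TRANSFORMER_BASED_WRAP config above
```

With SIZE_BASED_WRAP, modules smaller than the threshold stay unsharded on every rank, which would explain the higher per-GPU memory compared to wrapping each transformer block.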

sayan1101 commented 10 months ago

Facing the same issue with 8x A6000.

NanoCode012 commented 9 months ago

I have this issue on the main branch (release 4.0) on 8x A100 40GB when trying to eval 70B models.

Trying SIZE_BASED_WRAP gets me another issue:

  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/transformers/models/llama/modeling_llama.py", line 107, in forward
    return self.weight * hidden_states.to(input_dtype)
RuntimeError: The size of tensor a (0) must match the size of tensor b (8192) at non-singleton dimension 2

With no wrap policy it just OOMs.

My config is almost identical to the one above, apart from the number of GPUs.

mstallone commented 9 months ago

Same issue here as well with TRANSFORMER_BASED_WRAP: RuntimeError: 'weight' must be 2-D

SIZE_BASED_WRAP seems to work, but then NCCL times out (30 minutes) on the last request batch; it hangs during some processing.

haileyschoelkopf commented 6 months ago

We now recommend using vLLM instead of FSDP for fast / big model generation where possible.

#1520 may also fix the NCCL timeouts due to a padding bug?
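For the vLLM route suggested above, an invocation along these lines should work on the 2-GPU setups in this thread. The exact model args are assumptions based on the harness's vLLM backend, not taken from this thread:

```shell
# Hypothetical sketch: vLLM backend with tensor parallelism instead of FSDP.
# tensor_parallel_size shards the model across 2 GPUs; no accelerate launch needed.
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size auto
```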