EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

[big-refactor] Accelerate launch FSDP Runtime Error #892

Closed: adamjackson2357 closed this issue 8 months ago

adamjackson2357 commented 1 year ago

Hi when running accelerate launch with FSDP I run into the following error:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

I am running eval on 2 GPUs, and the same error message is replicated on both. Typically one batch completes on one of the GPUs before it errors out.
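For context on why this error can appear under FSDP: FSDP flattens each wrapped module's parameters into a 1-D "flat parameter" while sharded, so if a module's forward runs while its weight is still in the flattened state, `torch.embedding` receives a 1-D tensor and raises exactly this error. A minimal sketch reproducing the message (assuming PyTorch is installed; the tensor shapes here are illustrative, not from the issue):

```python
import torch
import torch.nn.functional as F

vocab, dim = 10, 4
# A 1-D buffer standing in for an FSDP FlatParameter (vocab * dim elements).
flat_weight = torch.randn(vocab * dim)
ids = torch.tensor([1, 2, 3])

try:
    # Embedding lookup against a 1-D weight fails: weight must be (vocab, dim).
    F.embedding(ids, flat_weight)
except RuntimeError as e:
    print(e)  # 'weight' must be 2-D

# With the 2-D view restored, the same lookup succeeds.
out = F.embedding(ids, flat_weight.view(vocab, dim))
print(out.shape)  # torch.Size([3, 4])
```

This is why the error is sensitive to the auto-wrap policy: whether a given embedding module is inside an FSDP unit (and gets unflattened before its forward) depends on how the model is wrapped.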

StellaAthena commented 1 year ago

What is the exact command you are running?

yurinoviello commented 11 months ago

Same issue on Nvidia L4 x 2

Command: accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks arc_challenge --batch_size 1 --num_fewshot=25

Accelerate conf:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Error:

    ......
    File "/opt/conda/envs/eval/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

When I change num_processes to 1, it works.

Thanks

yurinoviello commented 11 months ago


Using SIZE_BASED_WRAP it works (though the memory allocated on each GPU is higher); is that normal?

I thought it was possible to use Llama 2 with TRANSFORMER_BASED_WRAP.
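For reference, the workaround described here corresponds to changing the auto-wrap policy in the accelerate config. A sketch of the relevant fsdp_config fragment (the fsdp_min_num_params threshold is an illustrative value, not taken from this thread):

```yaml
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  # Modules with at least this many parameters get their own FSDP unit;
  # smaller modules are grouped together. Tune this per model.
  fsdp_min_num_params: 100000000
```

Size-based wrapping can shard parameters less evenly than transformer-block wrapping, which would be consistent with the higher per-GPU memory reported above.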

sayan1101 commented 11 months ago

facing the same issue with 8xA6000

NanoCode012 commented 11 months ago

I have this issue on the main branch (release 4.0) on 8x A100s (40GB) when trying to eval 70B models.

Trying SIZE_BASED_WRAP gets me a different error:

File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/transformers/models/llama/modeling_llama.py", line 107, in forward
    return self.weight * hidden_states.to(input_dtype)
RuntimeError: The size of tensor a (0) must match the size of tensor b (8192) at non-singleton dimension 2

No wrap would just OOM.

My config is almost identical to the one above, apart from the number of GPUs.

mstallone commented 11 months ago

Same issue here as well with TRANSFORMER_BASED_WRAP: RuntimeError: 'weight' must be 2-D

SIZE_BASED_WRAP seems to work, but then NCCL times out (30 minutes) on the last request batch; it hangs during some processing step.

haileyschoelkopf commented 8 months ago

We now recommend using vLLM instead of FSDP for fast, large-model generation where possible.

#1520 may also fix the NCCL timeouts, which appear to be due to a padding bug.
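For readers landing here, a sketch of the vLLM route on a 2-GPU machine (flag names follow the harness's documented vllm backend; adjust tensor_parallel_size to your GPU count, and gpu_memory_utilization to taste):

```shell
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size auto
```

No accelerate launch is needed here: vLLM manages tensor parallelism across the GPUs itself, which sidesteps the FSDP wrapping issues above.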