bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

[Urgent Issue] Cannot run HumanEval benchmarking on CodeLlama model #202

Closed cosmo3769 closed 3 months ago

cosmo3769 commented 4 months ago

I was using bigcode-evaluation-harness to run the HumanEval benchmark on the CodeLlama model.

I am using this command from the docs README:

accelerate launch main.py \
  --model codellama/CodeLlama-7b-hf \
  --max_length_generation 200 \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

It successfully downloads the model shards, but while loading the checkpoint shards I get the error `raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)`. Whether I use a T4 in Colab or 2×T4 in Kaggle, I still get this error.

Full error log:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2024-02-29 16:01:55.718819: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-29 16:01:55.718877: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-29 16:01:55.725025: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-29 16:01:57.780304: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['humaneval']
Loading model in fp32
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Loading checkpoint shards:   0% 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'main.py', '--model', 'codellama/CodeLlama-7b-hf', '--max_length_generation', '200', '--tasks', 'humaneval', '--temperature', '0.2', '--limit', '50', '--n_samples', '200', '--batch_size', '10', '--allow_code_execution']' died with <Signals.SIGKILL: 9>.

While investigating, I found that even with the GPU turned on, the run is still only using system RAM.
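
For reference, a generic way to check whether PyTorch can see the GPU at all:

# Generic sanity check: does PyTorch detect a CUDA device?
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"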

Here is my colab link.

How do I resolve this? Is there any more setup I need to do to run this successfully? Thank you!

loubnabnl commented 4 months ago

Can you run `accelerate config` to enable GPU usage instead of CPU?
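
If the interactive prompt is awkward in a notebook, one alternative (assuming your `accelerate` version provides the `default` subcommand) is to write a default config non-interactively:

# Sketch: write a default accelerate config without the interactive prompt
# (assumes `accelerate config default` is available in your accelerate version)
accelerate config default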

cosmo3769 commented 4 months ago

Hi @loubnabnl,

I ran `accelerate config` with these settings:

[Screenshot: accelerate config settings]

But still getting the same problem:

[Screenshot: same error output]

Is there any setting I am choosing wrong? Also, when checking my GPU with `nvidia-smi`, I can see the GPU is there but it is in the Off state. Does this matter? If it does, how do I turn it on? Thank you!

[Screenshot: nvidia-smi output]

loubnabnl commented 4 months ago

It seems that in a Jupyter environment you need to manually create the YAML config file and reference it in your `accelerate launch` command.

1- Create `config.yaml`:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: no
num_machines: 1
num_processes: 1
gpu_ids: all
use_cpu: false

2- Run:

!accelerate launch --config_file config.yaml main.py \
  --model bigcode/starcoderbase-1b \
  --max_length_generation 512 \
  --tasks humaneval \
  --temperature 0.2 \
  --limit 50 \
  --n_samples 20 \
  --batch_size 20 \
  --use_auth_token \
  --allow_code_execution

This works for me: colab
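
For reference, one way to create that `config.yaml` from a notebook rather than a terminal is a shell heredoc (a sketch; run it in a `%%bash` cell in Colab, though any method that writes the YAML above works):

%%bash
# Write the accelerate config shown above to config.yaml
cat > config.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: no
num_machines: 1
num_processes: 1
gpu_ids: all
use_cpu: false
EOF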

cosmo3769 commented 4 months ago

Thanks a lot @loubnabnl. This worked for me too.

Also, I have one question: if I quantize the model to GGUF format, is it possible to run the benchmark on that format too?

loubnabnl commented 4 months ago

Hi, we don't support the GGUF format, but you can try using 4-bit or 8-bit precision to reduce the memory footprint when loading the model, e.g. with the flag `--load_in_4bit`.

Btw, `--limit` shouldn't impact memory; it's just the number of HumanEval problems to use. What you should lower is `--batch_size`, the flag that determines how many samples (out of `n_samples`) fit in a batch; try setting it to 1 for the lowest memory consumption (but eval will be slower).

For `n_samples`: if you're using greedy decoding (`do_sample` False), set it to 1 because you don't sample; if you use sampling (`do_sample` True and temperature 0.2), then 20 should be enough for an accurate number.
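
Putting that together, here's a sketch of a lower-memory run (flag names as mentioned above; 4-bit loading additionally assumes `bitsandbytes` is installed in the environment):

# Low-memory HumanEval run: 4-bit weights, batch size 1, 20 samples per problem
accelerate launch --config_file config.yaml main.py \
  --model codellama/CodeLlama-7b-hf \
  --max_length_generation 200 \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 1 \
  --load_in_4bit \
  --allow_code_execution
# For greedy decoding, pass --do_sample False and --n_samples 1 instead.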

cosmo3769 commented 3 months ago

Thank you for the clarification.