Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

llama2 70B causes OOM #83

Open · congdamaS opened this issue 1 year ago

congdamaS commented 1 year ago

When testing with llama2 70B, the required memory is too large (>250 GB). This issue does not occur in the original (English) lm-evaluation-harness. How should I configure the harness to evaluate llama2 (70B)?

mkshing commented 11 months ago

@congdamaS we're planning to test 70B models soon, so we'll get back to you after that. Thanks!

yumemio commented 9 months ago

+1 on this. I'm evaluating an unquantized 7B model (stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b), but this eval script is eating 26GB of VRAM. Running inference with the same model using a bare-minimum transformers snippet consumes about 15GB. Does this script load anything other than the model itself onto the GPU?
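For what it's worth, here's a quick way to check the peak allocation from inside the process (just a convenience snippet, not part of the harness; nvidia-smi reports the total reserved by the process, which is usually a bit higher):

import torch

# Reset the peak-allocation counter before running the workload,
# then read it back once evaluation/generation has finished.
torch.cuda.reset_peak_memory_stats()
# ... run the model / evaluation code here ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")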

dakotamahan-stability commented 9 months ago

Not sure which branch you're on, but

python main.py --model hf-causal-experimental --model_args pretrained=meta-llama/Llama-2-70b-chat-hf,dtype=float16,use_accelerate=True --no_cache --num_fewshot=25 --tasks arc_challenge

works just fine with 70B-parameter models on an A40 node.
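As I understand it, dtype=float16,use_accelerate=True tells the harness to load the weights in half precision and shard them across the visible GPUs via accelerate. As an illustration only (my own rough sketch, not the harness's actual loading code), that corresponds roughly to:

import torch
from transformers import AutoModelForCausalLM

# Rough sketch of what dtype=float16,use_accelerate=True asks for:
# half-precision weights (~2 bytes/param instead of 4), dispatched
# across the node's GPUs by accelerate's device_map logic.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)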

yumemio commented 9 months ago

Hi @dakotamahan-stability and thanks for the reply!

Sorry for the lack of information. I'm on the jp-stable branch (commit effdbea). Here's an example notebook that reproduces the issue:

Gist (example notebook)

I'm running this notebook on a Colab Pro+ VM. The eval script throws an OOM error when run on a V100 GPU (16 GB of VRAM):

Running loglikelihood requests
  0% 0/5595 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/lm-evaluation-harness/main.py", line 121, in <module>
    results = main(args, description_dict_path, output_path)
  File "/content/lm-evaluation-harness/main.py", line 96, in main
    results = evaluator.simple_evaluate(**eval_args)
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 87, in simple_evaluate
    results = evaluate(
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 287, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 980, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 193, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 303, in _loglikelihood_tokens
    self._model_call(batched_inps), dim=-1
  File "/content/lm-evaluation-harness/lm_eval/models/gpt2.py", line 120, in _model_call
    return self.gpt2(inps)[0]
...
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


What's strange is that the code below uses only around 14.3 GB of VRAM on the exact same machine:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Setup model and tokenizer
model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")

def format_prompt(input_text):
    # Llama-2-style chat template; the system prompt (Japanese) says
    # "You are a helpful assistant." and the instruction says
    # "Please answer the user's question."
    prompt_template = """<s>[INST] <<SYS>>\nあなたは役立つアシスタントです。\n<<SYS>>\n\nユーザの質問に答えてください。\n\n{input}[/INST]"""
    return prompt_template.format(input=input_text)

def generate_text(input_text):
    formatted_prompt = format_prompt(input_text)
    input_ids = tokenizer.encode(
        formatted_prompt,
        add_special_tokens=False,
        return_tensors="pt"
    )

    # Set seed for reproducibility
    seed = 23
    torch.manual_seed(seed)

    tokens = model.generate(
        input_ids.to(device=model.device),
        max_new_tokens=1024,
        temperature=0.99,
        top_p=0.95,
        do_sample=True,
    )

    # Remove the input tokens from the generated tokens before decoding
    output_tokens = tokens[0][len(input_ids[0]):]
    return tokenizer.decode(output_tokens, skip_special_tokens=True)

prompt = "もう冬ですね。最近は寝室が寒くて寝られません。どうすればいいですか?"  # "Winter is here. Lately my bedroom is too cold to sleep in. What should I do?"
generated_text = generate_text(prompt)
print(generated_text)


I'm wondering whether I've misconfigured the eval script, or whether the script is preloading the evaluation dataset onto the GPU (which could explain the gap, since the prompt in my snippet above is short).
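One more back-of-the-envelope guess (not a confirmed diagnosis): 26 GB is roughly what 7B parameters occupy in float32, while the snippet above loads the weights in float16, so the harness may simply be loading the model in full precision unless told otherwise:

# Weight memory alone, ignoring activations and the KV cache.
n_params = 7e9
print(f"fp32: {n_params * 4 / 2**30:.1f} GiB")  # ~26.1 GiB, close to what the eval script uses
print(f"fp16: {n_params * 2 / 2**30:.1f} GiB")  # ~13.0 GiB, close to the bare transformers run

If that's the cause, passing dtype=float16 in --model_args (as in the command above) might be all that's missing.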