Open congdamaS opened 1 year ago
@congdamaS we're trying to test 70B models soon. So, we will get back to you after that. Thanks!
+1 on this. I'm evaluating an unquantized 7B model (stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b), but this eval script is eating 26GB of VRAM. Running inference with the same model using a bare-minimum transformers snippet consumes about 15GB. Does this script load anything other than the model itself onto the GPU?
not sure which branch you're on but
python main.py --model hf-causal-experimental --model_args pretrained=meta-llama/Llama-2-70b-chat-hf,dtype=float16,use_accelerate=True --no_cache --num_fewshot=25 --tasks arc_challenge
works just fine with 70B-parameter models on an A40 node
Hi @dakotamahan-stability and thanks for the reply!
Sorry for the lack of information. I'm on the jp-stable branch (commit effdbea). Here's an example notebook that reproduces the issue:
I'm running this notebook on a Colab Pro+ VM. The eval script throws an OOM error when run with a V100 GPU (w/ 16.0 GB of VRAM):
Running loglikelihood requests
0% 0/5595 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/lm-evaluation-harness/main.py", line 121, in <module>
    results = main(args, description_dict_path, output_path)
  File "/content/lm-evaluation-harness/main.py", line 96, in main
    results = evaluator.simple_evaluate(**eval_args)
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 87, in simple_evaluate
    results = evaluate(
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/evaluator.py", line 287, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 980, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 193, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "/content/lm-evaluation-harness/lm_eval/base.py", line 303, in _loglikelihood_tokens
    self._model_call(batched_inps), dim=-1
  File "/content/lm-evaluation-harness/lm_eval/models/gpt2.py", line 120, in _model_call
    return self.gpt2(inps)[0]
...
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
What's strange is that the code below uses only about 14.3 GB of VRAM on the exact same machine:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Setup model and tokenizer
model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)


def format_prompt(input_text):
    # System prompt: "You are a helpful assistant."
    # Instruction: "Please answer the user's question."
    prompt_template = """<s>[INST] <<SYS>>\nあなたは役立つアシスタントです。\n<<SYS>>\n\nユーザの質問に答えてください。\n\n{input}[/INST]"""
    return prompt_template.format(input=input_text)


def generate_text(input_text):
    formatted_prompt = format_prompt(input_text)
    input_ids = tokenizer.encode(
        formatted_prompt,
        add_special_tokens=False,
        return_tensors="pt"
    )
    # Set seed for reproducibility
    seed = 23
    torch.manual_seed(seed)
    tokens = model.generate(
        input_ids.to(device=model.device),
        max_new_tokens=1024,
        temperature=0.99,
        top_p=0.95,
        do_sample=True,
    )
    # Remove the input tokens from the generated tokens before decoding
    output_tokens = tokens[0][len(input_ids[0]):]
    return tokenizer.decode(output_tokens, skip_special_tokens=True)


# Prompt: "It's already winter. Lately my bedroom is so cold I can't sleep. What should I do?"
prompt = "もう冬ですね。最近は寝室が寒くて寝られません。どうすればいいですか?"
generated_text = generate_text(prompt)
print(generated_text)
I'm wondering whether I've misconfigured the eval script, or whether the script is prefetching/preloading the dataset onto the GPU (which would make sense, given that the prompt in the snippet is short).
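A back-of-envelope estimate may help narrow this down. The sketch below is not taken from the harness. It assumes that the scoring path in the traceback (`_loglikelihood_tokens` calling `_model_call`) materialises a full `[batch, seq_len, vocab]` logits tensor on the GPU, and the batch size, sequence length and vocabulary size are illustrative placeholders, not values read from the model config or the task.

```python
GB = 1e9  # decimal gigabytes, to match the figures quoted in this thread


def weight_bytes(n_params, bytes_per_param=2):
    """Memory for the model weights alone (2 bytes per parameter for float16)."""
    return n_params * bytes_per_param


def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_elem=2):
    """Memory for one [batch, seq_len, vocab] logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


# 7B parameters in float16: close to the ~14.3 GB seen with the bare snippet.
weights_gb = weight_bytes(7e9) / GB

# Hypothetical evaluation batch: 8 sequences of 2048 tokens, 50,000-token vocab.
logits_gb = logits_bytes(batch_size=8, seq_len=2048, vocab_size=50_000) / GB

print(f"weights: {weights_gb:.1f} GB")
print(f"logits:  {logits_gb:.2f} GB per copy")
```

Under these assumed numbers a single logits tensor is already more than 1.6 GB, and a softmax over it would allocate a second tensor of the same shape, so long few-shot contexts or a larger batch size could plausibly account for several GB on top of the weights. Whether that explains the full gap here would need to be confirmed by measuring, for example with `torch.cuda.max_memory_allocated()`.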
When testing with Llama 2 70B, the memory required is too large (>250GB). This issue does not occur in the original English lm-evaluation-harness. How should I configure the script to test Llama 2 (70B)?
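For reference, a figure above 250 GB is about what the weights alone would need if they were loaded in float32. That is only a guess about this particular setup; the sketch below simply works out the weight footprint per dtype and ignores activations and any framework overhead.

```python
# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weights_gb(n_params, dtype):
    """Approximate weight memory in decimal GB for a model with n_params parameters."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16", "int8"):
    print(f"{dtype:>8}: {weights_gb(70e9, dtype):.0f} GB")
```

If the branch in use accepts the `dtype=float16` and `use_accelerate=True` model arguments shown in the command earlier in this thread, half precision would bring the weights down to roughly 140 GB, which would still need to be spread across several GPUs.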