abacusai / Long-Context

This repository contains code and tooling for the Abacus.AI LLM Context Expansion project. Also included are evaluation scripts and benchmark tasks that evaluate a model’s information retrieval capabilities with context expansion. We also include key experimental results and instructions for reproducing and building on them.

[OOM Error] Out of Memory with 32k tokens #5

Open · JL-Cheng opened this issue 1 year ago

JL-Cheng commented 1 year ago

Thank you for your valuable contribution! I have been experimenting with your evaluation code on the LongChat-Lines dataset, but I encountered an out-of-memory error when the token length reached 32k.

I am fortunate to have multiple 80 GB A100 GPUs at my disposal. However, I noticed that the evaluation code does not incorporate any parallelism, so only one GPU is utilized during evaluation.

I would greatly appreciate it if you could provide more information about the resources used in the experimental section of your paper. Additionally, I am curious if you implemented any form of parallelization to enhance the evaluation process.

Thank you once again for your assistance!

JL-Cheng commented 1 year ago

You can consider using DeepSpeed-Inference to solve this problem; it may also speed up inference. It only requires slight modifications to the model initialization, the input data handling, and the launch method.

First, modify the model initialization and the input handling in ./python/eval/longeval/utils.py:

## ./python/eval/longeval/utils.py

```diff
  import deepspeed

  # AT LINE 82
- model = model.cuda()
- model.eval()
+ model = deepspeed.init_inference(
+     model=model,
+     mp_size=int(os.getenv("WORLD_SIZE", "1")),
+     dtype=torch.float16,
+     replace_with_kernel_inject=False,
+     max_out_tokens=35,
+ )
```

```diff
  # AT LINE 106
- input = tokenizer(prompt, return_tensors="pt")
- prompt_length = input.input_ids.shape[-1]
-
- output = model.generate(input_ids=input.input_ids.to(model.device), min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]
+ local_rank = int(os.getenv("LOCAL_RANK", "0"))
+ inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
+ prompt_length = inputs.shape[-1]
+
+ output = model.generate(inputs, min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]
```
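
Putting the two changes together, a self-contained sketch of the resulting flow looks roughly like the following (my own assembly rather than the repository's code; `model_path` and `prompt` are placeholders, and the final decode/print is only for illustration):

```python
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "xxx"  # placeholder, same as PRETRAINED_MODEL_DIR below
prompt = "..."      # placeholder; the eval script builds the actual LongChat-Lines prompt

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Shard the model across WORLD_SIZE GPUs with DeepSpeed tensor parallelism
# instead of calling model.cuda()/model.eval() on a single device.
model = deepspeed.init_inference(
    model=model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),
    dtype=torch.float16,
    replace_with_kernel_inject=False,
    max_out_tokens=35,
)

# Each launched process tokenizes the prompt and moves the ids to its own GPU.
local_rank = int(os.getenv("LOCAL_RANK", "0"))
inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
prompt_length = inputs.shape[-1]

output = model.generate(inputs, min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]
print(tokenizer.decode(output[prompt_length:], skip_special_tokens=True))
```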

Then you may need to add one line to ./python/eval/longeval/eval.py so that the script accepts the --local_rank argument that the DeepSpeed launcher passes to each process:

## ./python/eval/longeval/eval.py

```diff
  # AT LINE 58
+ parser.add_argument("--local_rank", type=int, default=0, help="local rank")
```

Finally, use the deepspeed launcher to run inference on multiple GPUs:

## run.sh

```bash
#!/bin/bash

PRETRAINED_MODEL_DIR=xxx

deepspeed --num_gpus 2 ./eval/longeval/eval.py \
    --model-name-or-path $PRETRAINED_MODEL_DIR \
    --scale-context 1.0 \
    --base-model
```
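
One practical detail to watch for when every GPU runs the same evaluation loop (not part of the original patch, just a hedged sketch): you may want to restrict file writing to a single rank so the processes do not overwrite or duplicate each other's result files. The names below are placeholders for whatever eval.py already uses:

```python
import json
import os

# Hypothetical guard around the result-saving step: only the process with
# global rank 0 writes the output file, so N ranks do not produce N copies.
results = {"accuracy": 0.0}                   # placeholder for the collected results
output_path = "longchat_lines_results.json"   # placeholder file name

if int(os.getenv("RANK", "0")) == 0:
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)
```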
mces89 commented 12 months ago

Hi, for this line: `inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")`, does it mean that every GPU (local_rank) will encode the same inputs? Also, which model-name-or-path do you use? Thanks.

JL-Cheng commented 11 months ago

> Hi, for this line: `inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")`, does it mean that every GPU (local_rank) will encode the same inputs? Also, which model-name-or-path do you use? Thanks.

There is a blog post about DeepSpeed-Inference that may help you understand more clearly how DeepSpeed accelerates inference.

For the first question: DeepSpeed uses tensor parallelism, so the model weights are sharded across the GPUs and the ranks produce the output together through communication. Each GPU therefore does not redundantly run the full model over the same inputs on its own. For more details, you can refer to this issue: https://github.com/microsoft/DeepSpeed/issues/4154.
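
If you want to see the sharding concretely, here is a minimal standalone sketch (my own, not from this thread; it assumes access to the gated `meta-llama/Llama-2-7b-hf` checkpoint and that the script is launched with something like `deepspeed --num_gpus 2 check_shard.py`) that prints the per-rank GPU memory after `init_inference`:

```python
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Load the full fp16 model in each process, then let DeepSpeed shard it.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
model = deepspeed.init_inference(
    model=model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)

# With tensor parallelism, each GPU should report roughly (weights / world_size)
# of allocated memory rather than the full ~13 GB of fp16 Llama-2-7B weights.
print(f"rank {local_rank}: {torch.cuda.memory_allocated(local_rank) / 1e9:.2f} GB allocated")
```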

For the second question: model-name-or-path refers to the local path where the Llama-2-7B model is stored. You can download it from Hugging Face.
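
For example, one way to fetch the weights into a local directory (a hedged sketch; the meta-llama repository on the Hugging Face Hub is gated, so your account and token need approved access) and then point --model-name-or-path at that directory:

```python
from huggingface_hub import snapshot_download

# Download the Llama-2-7B weights locally; pass the returned directory
# as --model-name-or-path when launching eval.py.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="./llama2-7b-hf",
)
print(local_dir)
```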

wutong4012 commented 5 months ago

Hi @JL-Cheng, how did you get data with 32k tokens? As far as I know, in https://huggingface.co/datasets/abacusai/LongChat-Lines/viewer/default/100 the maximum data length is 26k.