bytedance / ByteMLPerf

AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
https://bytemlperf.ai/
Apache License 2.0
188 stars, 50 forks

【Issue Help】 run bs=8, input_len=2048 perf error #78

Closed: DeepTecher closed this issue 2 months ago

DeepTecher commented 3 months ago

We ran the full test suite on an A100-40G with the configuration below:

{
    "model": "chatglm2-torch-fp16-6b",
    "test_accuracy": true,
    "test_perf": true,
    "min_new_tokens": 128,
    "max_new_tokens": 256,
    "tp_sizes": [1, 2],
    "batch_sizes":[1, 2, 4, 8],
    "input_tokens": [1024, 2048],
    "dataset": "llm_perf/datasets/merged_52_test.csv",
    "perf_time": 180
}

However, when it runs with bs=8 and input_len=2048, it raises this error:

2024-06-01 12:58:21.950 reporter.py:136 [INFO]: Update reporter meta: TP=1, BS=8, Inputs=2048
  0%|          | 0/180 [00:00<?, ?s/s]
2024-06-01 12:58:23.451 bench.py:157 [ERROR]: PERFORMANCE bench_4 error: local variable 'res' referenced before assignment
Process Process-28:
Traceback (most recent call last):
  File "/home/XXX/miniconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/XXX/miniconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/XXX/ByteMLPerf/byte_infer_perf/llm_perf/benchmark/bench.py", line 158, in benchmark
    raise e
  File "/home/XXX/ByteMLPerf/byte_infer_perf/llm_perf/benchmark/bench.py", line 155, in benchmark
    bench_performance(stub, index, workload, input_tokens, result_queue)
  File "/home/XXX/ByteMLPerf/byte_infer_perf/llm_perf/benchmark/bench.py", line 118, in bench_performance
    prompt_tokens = res["usage"]["prompt_tokens"]
UnboundLocalError: local variable 'res' referenced before assignment
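For context, this `UnboundLocalError` is the classic pattern of reading a loop variable after a loop that ran zero times. A minimal sketch (hypothetical names; the real code is in `llm_perf/benchmark/bench.py`), assuming `bench_performance` iterates over a streamed server response:

```python
# Minimal reproduction of the failure pattern (assumed, not the actual code).
# If the server returns no response chunks (e.g. it crashed or OOM-ed and
# closed the connection), the loop body never runs, so `res` is never bound
# before the line that reads res["usage"]["prompt_tokens"].
def bench_performance(stream):
    for res in stream:  # zero iterations when the server sends nothing
        pass            # normally this would consume per-chunk results
    # reading `res` here fails if the stream was empty
    return res["usage"]["prompt_tokens"]


try:
    bench_performance(iter([]))  # empty stream simulates a failed request
except UnboundLocalError as e:
    print(type(e).__name__)
```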
suisiyuan commented 3 months ago

It looks like the server ran out of memory and didn't return a result. The default implementation is the original model implementation, without any additional memory optimization.
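One defensive option (a sketch only, not the actual patch) is to initialize the result before the loop and raise a descriptive error when the server returns nothing, so an OOM on the server side surfaces as a clear message instead of an `UnboundLocalError`:

```python
# Hypothetical hardened version of the failing loop in bench_performance:
# detect an empty response stream explicitly instead of crashing later.
def bench_performance(stream):
    res = None
    for res in stream:
        pass  # normally consumes per-chunk results from the server
    if res is None:
        # the server sent no chunks at all, likely an OOM or crash
        raise RuntimeError("server returned no response (possible OOM)")
    return res["usage"]["prompt_tokens"]
```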

suisiyuan commented 2 months ago

TP, KV cache, and a separate scheduler are now implemented in the latest commit.