【llm_perf issue】using byte_infer_perf/llm_perf/launch.py to test chatglm, but meet multi-process competing

danielhua23 commented 1 month ago

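Error description

Machine: h100-80g-hbm3. Testing with the chatglm-6b-xxx.json configuration below reports OOM at tp1, bs24, inputlen1024.

After changing the JSON configuration to the following, so that the run starts directly from tp1, bs24, inputlen1024, that same configuration (tp1, bs24, inputlen1024) runs normally.

Judging from the code at https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/launch.py#L260, my guess is that the code does not wait, as intended, for each configuration's subprocess to finish before launching the next one. The subprocesses then compete for the GPU, and a configuration that runs fine on its own hits OOM under that contention.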
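To illustrate the behavior I expected, here is a minimal sketch of launching one child process per configuration and blocking until it exits before starting the next. This is not the code in launch.py; `run_sequentially` and `demo_commands` are hypothetical names, and the child commands are short sleeps standing in for the real benchmark runs.

```python
import subprocess
import sys
import time


def run_sequentially(commands, timeout=None):
    """Run each command to completion before starting the next one.

    Returns a list of (start, end, returncode) tuples, one per command.
    """
    results = []
    for cmd in commands:
        start = time.monotonic()
        proc = subprocess.Popen(cmd)
        try:
            # Block until the child exits, so it has released its resources
            # (GPU memory, in the scenario above) before the next launch.
            returncode = proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            returncode = proc.wait()  # reap the killed child
        end = time.monotonic()
        results.append((start, end, returncode))
    return results


# Hypothetical stand-ins for the per-configuration benchmark runs.
demo_commands = [
    [sys.executable, "-c", "import time; time.sleep(0.2)"],
    [sys.executable, "-c", "import time; time.sleep(0.2)"],
]

if __name__ == "__main__":
    for start, end, code in run_sequentially(demo_commands):
        print(f"exit={code} duration={end - start:.2f}s")
```

If the runs were started without waiting on each child, their lifetimes would overlap, which matches the GPU contention I suspect here.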

Reproduction steps

step1 launch container
docker run --net=host --pid=host --ipc=host --shm-size 64g --privileged -it --gpus all -v xxx:xxx --name xxxx nvcr.io/nvidia/pytorch:24.08-py3

step2 enter dir of launch.py
pip install -r requirements.txt

step3 modify workloads/chatglm2-torch-fp16-6b.json as shown above

step4 run
python3 launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b

danielhua23 commented 1 month ago

@suisiyuan Hi, could you take a look at this when you have time?

suisiyuan commented 1 month ago

OK, I'll take a look on my side. It is probably a process-management problem.

danielhua23 commented 1 month ago

Thanks for your time.

bytedance / ByteMLPerf #112
