Tencent / PatrickStar

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP and democratizes AI for everyone.
BSD 3-Clause "New" or "Revised" License

RuntimeError: chunk move failed. #308

Closed ouyangliqi closed 2 years ago

ouyangliqi commented 2 years ago

While training a GPT3_6B model on 4x V100 GPUs, the program stopped with a runtime error at step 47. The exception looks like this:

RuntimeError: chunk move failed. cpu has not 385.875968 MB memory space. Free space is 320.948224 MB. The reason may be that the overall memory of CPU and GPU is not enough for the model.
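To make the size of the gap concrete, here is a minimal sketch that pulls the two figures out of the message above and computes the shortfall. The helper name `parse_chunk_move_error` and the regular expression are illustrative assumptions written against this one message; they are not part of PatrickStar.

```python
import re

# The message text is copied from the report above.
ERROR = (
    "RuntimeError: chunk move failed. cpu has not 385.875968 MB memory space. "
    "Free space is 320.948224 MB. The reason may be that the overall memory "
    "of CPU and GPU is not enough for the model."
)

def parse_chunk_move_error(msg):
    """Hypothetical helper: return (device, needed_mb, free_mb) from the message."""
    m = re.search(
        r"(\w+) has not ([\d.]+) MB memory space\. Free space is ([\d.]+) MB",
        msg,
    )
    if m is None:
        raise ValueError("message does not match the expected format")
    return m.group(1), float(m.group(2)), float(m.group(3))

device, needed_mb, free_mb = parse_chunk_move_error(ERROR)
shortfall_mb = needed_mb - free_mb  # how much more space the move needed
print(f"{device}: needed {needed_mb} MB, free {free_mb} MB, "
      f"short by {shortfall_mb:.6f} MB")
```

The move fell short by roughly 65 MB, so the request was for a single large block rather than a gradual exhaustion of memory.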

However, the training process only uses about 60% of the CPU memory, and overall_cpu_mem_ratio is set to 0.9.

feifeibear commented 2 years ago

Hello, what is your chunk size setting? Some optimization options affect both the model scale that can be trained and the execution efficiency. The factors that may be useful are listed here: https://github.com/Tencent/PatrickStar/blob/master/doc/optimization_options.md

ouyangliqi commented 2 years ago

Since the problem was running out of CPU memory, I have tried different settings with --with_mem_cache and --with_static_partition. The overall_cpu_mem_ratio is 0.9. Below are my training settings.

export MODEL_NAME="GPT3_6B"
export BS=16
export CS=184
export CPU_EBD=0
export SP=0
export ACT_OFFLOAD=0
export NO_RETRY=0
export SKIP_LOG_EXSIT=0
export AMM=1
export MSC=1
export CACHE=1
export GPU_NUM=4
export RES_CHECK=0
export MEM_PROF=1
export LOCAL_WORLD_SIZE=4
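As a rough cross-check of the chunk size question, the sketch below relates CS=184 to the 385.875968 MB quoted in the error message. It assumes that CS counts units of 1024 * 1024 elements, that each element is a 2-byte fp16 value, and that the message reports MB as 10^6 bytes. These assumptions are inferred from the numbers and are not confirmed anywhere in this thread.

```python
# Assumptions (not confirmed in the thread):
#   - CS is expressed in units of 1024 * 1024 elements
#   - elements are fp16, i.e. 2 bytes each
#   - the error message reports MB as 10**6 bytes
CS = 184
ELEMS_PER_CS_UNIT = 1024 * 1024
BYTES_PER_ELEM = 2

chunk_elems = CS * ELEMS_PER_CS_UNIT
chunk_bytes = chunk_elems * BYTES_PER_ELEM
chunk_mb = chunk_bytes / 10**6

print(f"chunk elements: {chunk_elems}")
print(f"chunk bytes:    {chunk_bytes}")
print(f"chunk size:     {chunk_mb} MB")
```

Under these assumptions one chunk occupies 385,875,968 bytes, which matches the size of the failed move. That would mean the CPU could not find room for a single whole chunk, and a smaller CS would lower the free space each move requires.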

feifeibear commented 2 years ago

I have contacted ouyang offline. The issue is closed.