Closed ouyangliqi closed 2 years ago
Hello, what is your chunk size setting? Some optimization options affect the maximum model scale and execution efficiency. The factors listed here may be useful: https://github.com/Tencent/PatrickStar/blob/master/doc/optimization_options.md
Since the problem is running out of CPU memory, I have tried different settings with --with_mem_cache and --with_static_partition. The overall_cpu_mem_ratio is 0.9. Below is my training setting.
export MODEL_NAME="GPT3_6B"
export BS=16
export CS=184
export CPU_EBD=0
export SP=0
export ACT_OFFLOAD=0
export NO_RETRY=0
export SKIP_LOG_EXSIT=0
export AMM=1
export MSC=1
export CACHE=1
export GPU_NUM=4
export RES_CHECK=0
export MEM_PROF=1
export LOCAL_WORLD_SIZE=4
I have contacted ouyang offline. The issue is closed.
While training a GPT3_6B model on 4x V100, the program stops with a runtime error at step 47. The exception looks like this:
RuntimeError: chunk move failed. cpu has not 385.875968 MB memory space. Free space is 320.948224 MB. The reason may be that the overall memory of CPU and GPU is not enough for the model.
However, the training process was only using about 60% of the CPU memory, and the overall_cpu_mem_ratio is 0.9.
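As a rough sanity check on the numbers in the error message, the sketch below relates the requested 385.875968 MB to the chunk size setting CS=184. It assumes that CS is expressed in units of 2**20 elements and that the chunk being moved holds fp16 values (2 bytes each). Neither assumption is confirmed in this thread, although the arithmetic matches the message exactly. The helper name chunk_size_mb is hypothetical and not part of PatrickStar.

```python
# Assumption: CS counts elements in units of 2**20 (not confirmed in this thread).
ELEMENTS_PER_CS_UNIT = 1024 * 1024
# Assumption: the chunk that failed to move stores fp16 values.
FP16_BYTES = 2


def chunk_size_mb(cs, bytes_per_element=FP16_BYTES):
    """Return the footprint of one chunk in decimal megabytes (10**6 bytes)."""
    return cs * ELEMENTS_PER_CS_UNIT * bytes_per_element / 1e6


needed_mb = chunk_size_mb(184)      # CS=184 from the training setting
free_mb = 320.948224                # free CPU space reported in the error
shortfall_mb = needed_mb - free_mb  # how much more space one chunk would need

print(f"one chunk needs {needed_mb:.6f} MB, free {free_mb:.6f} MB, "
      f"short by {shortfall_mb:.6f} MB")
```

Under these assumptions, the failure means the chunk manager's CPU budget could not fit one more whole chunk, which can happen even when system-wide CPU usage looks well below 100%. A smaller CS would lower the space needed per move.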