lambert-x closed this 1 month ago
Hi @lambert-x. Could you show me the error information? Maybe I can help you with it.
It exits with error code -9 and no detailed error message. I figured out the issue: the line above causes a CPU memory peak and requires ~400 GB of CPU memory to initialize the model. I fixed it by creating a job instance with more CPU memory (about 600 GB).
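For anyone else who hits a bare -9: assuming the launcher reports exit codes the way Python's `subprocess` does (a negative code -N means the process was terminated by signal N), -9 is SIGKILL, which on Linux is commonly what the kernel OOM killer sends when host memory runs out. A minimal sketch for decoding such a code is below; `describe_exit_code` is a hypothetical helper written for illustration, not part of the SEED-Story codebase, and it assumes a Unix-like system.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process exit code using the subprocess convention.

    A negative code -N means the child was terminated by signal N.
    Hypothetical helper for illustration only.
    """
    if code >= 0:
        return f"exited normally with status {code}"
    try:
        name = signal.Signals(-code).name
    except ValueError:
        # Not a signal number known on this platform.
        name = f"unknown signal {-code}"
    # SIGKILL cannot be caught, so the process dies without a traceback.
    hint = " (often the kernel OOM killer)" if -code == signal.SIGKILL else ""
    return f"killed by {name}{hint}"


print(describe_exit_code(-9))
```

If the code decodes to SIGKILL and there is no Python traceback, checking host (CPU) memory usage during model initialization is a reasonable first step, which matches what happened here.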
Training always fails for me at this line: https://github.com/TencentARC/SEED-Story/blob/c1c08a09bfbfdfd3b1f568fc4420c6ccf83f2db5/src/train/train_clm_sft.py#L204. Could you please build a fresh environment and try training with the current codebase?