TencentARC / SEED-Story

SEED-Story: Multimodal Long Story Generation with Large Language Model
https://arxiv.org/abs/2407.08683

Failed training with DDP on single-node 8 GPU #6

Closed lambert-x closed 1 month ago

lambert-x commented 1 month ago

Training always fails at this line: https://github.com/TencentARC/SEED-Story/blob/c1c08a09bfbfdfd3b1f568fc4420c6ccf83f2db5/src/train/train_clm_sft.py#L204. Could you please build a fresh environment and try training with the current codebase?

AndysonYs commented 1 month ago

Hi @lambert-x. Could you show me the error information? Maybe I can help you with it.

lambert-x commented 1 month ago

The process exits with code -9 and no detailed error information. I figured out the issue: the line above causes a CPU memory peak and needs ~400 GB of CPU memory to initialize the model. I fixed it by creating a job instance with more CPU memory, e.g. 600 GB.
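
For anyone who hits the same symptom, here is a minimal sketch of a pre-flight check. It is not part of the SEED-Story codebase. In Python's subprocess convention, a return code of -9 means the process was killed by signal 9 (SIGKILL), which is the signal the Linux OOM killer sends, so a silent -9 is often a host-memory problem rather than a bug in the training script. The function names are hypothetical, and the 400 GB default is only the estimate reported in this thread, not a measured requirement.

```python
import os
import signal


def describe_exit_code(returncode: int) -> str:
    """Turn a subprocess-style return code into a readable message.

    Negative values mean the process was terminated by that signal number.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name}"
    return f"exited with status {returncode}"


def total_host_ram_gb() -> float:
    """Total physical CPU RAM in GB (POSIX systems only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1e9


def has_enough_host_ram(available_gb: float, required_gb: float = 400.0) -> bool:
    """Compare available host RAM against an assumed peak requirement.

    required_gb defaults to the ~400 GB figure reported in this thread;
    treat it as an assumption and adjust it for your own setup.
    """
    return available_gb >= required_gb


if __name__ == "__main__":
    print(describe_exit_code(-9))
    ram_gb = total_host_ram_gb()
    if not has_enough_host_ram(ram_gb):
        print(f"Only {ram_gb:.0f} GB of host RAM; model init may be OOM-killed.")
```

Running this on the node before launching the job gives an early warning instead of an unexplained kill during model initialization.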