mesllo opened this issue 2 years ago
The complete training has four stages; the third stage has 200 epochs and takes the longest. The training code prints the estimated remaining time, so you can run it briefly and check.
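For context, an ETA like that is typically computed by extrapolating the average per-step time. The sketch below is generic and the names (`train_with_eta`, `step_fn`) are placeholders, not this repo's API:

```python
import time

def train_with_eta(dataloader, num_epochs, step_fn):
    """Training loop that prints an estimated time to completion."""
    total_steps = num_epochs * len(dataloader)
    start = time.time()
    done = 0
    for epoch in range(num_epochs):
        for batch in dataloader:
            step_fn(batch)  # one forward/backward/optimizer step
            done += 1
            # Extrapolate average seconds-per-step to the remaining steps.
            eta_h = (time.time() - start) / done * (total_steps - done) / 3600
            print(f"epoch {epoch} | step {done}/{total_steps} | ETA {eta_h:.1f} h")
```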
Thank you for your reply! I have tried it, and even the first stage takes about 24 hours to train on 4 GPUs, which is not feasible for me. I could also take a multi-node approach and run on 8 GPUs across 2 nodes. Is the code already optimized for multiple nodes, or do I have to do this myself? From what I can tell, the current code only considers a single node.
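For reference, what I would try on my side is the standard `torchrun` multi-node launch. This is a generic DDP sketch, assuming the repo uses `torch.distributed` with the NCCL backend (the script name `train.py` is a placeholder):

```python
# One launch per node (train.py is a placeholder name):
#   node 0: torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
#           --master_addr=<node0-ip> --master_port=29500 train.py
#   node 1: torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
#           --master_addr=<node0-ip> --master_port=29500 train.py
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process,
    # so the same script works on one node or several.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# After setup, the model is wrapped as usual:
# model = torch.nn.parallel.DistributedDataParallel(
#     model.cuda(), device_ids=[local_rank])
```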
If I train with 8 GPUs, a batch size of 4, and 100 epochs (50 at the normal learning rate, 50 with the decaying learning rate), how long would it take? I'm asking because I only have 4 GPUs, and I am using a shared environment where I won't be able to train for more than 8 hours per run.
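As a rough back-of-envelope check (my own assumption of near-linear scaling with GPU count, which is optimistic because of multi-node communication overhead):

```python
def scaled_hours(baseline_hours, baseline_gpus, target_gpus, efficiency=0.8):
    """Estimate runtime after changing the GPU count.

    efficiency < 1.0 is a guessed penalty for multi-node communication
    overhead; it is not a measured number.
    """
    speedup = (target_gpus / baseline_gpus) * efficiency
    return baseline_hours / speedup

# First stage took ~24 h on 4 GPUs (measured above):
print(f"{scaled_hours(24.0, 4, 8):.0f} h")  # ~15 h at 80% scaling efficiency
```

Even under that estimate, a stage would exceed my 8-hour window, so I would also need to checkpoint and resume across runs.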