@TeddLi Your logs suggest that 6 of the 8 processes have resumed the data loop, but the other two haven't. Your script got stuck somewhere while reading the data, and then the barrier timed out. The problem is not with the barrier or with Fabric. You should investigate the data reading in this loop: https://github.com/jzhang38/TinyLlama/blob/11a02ce085c1670bd009e6d4385701ff06a7f6cf/pretrain/tinyllama.py#L198-L208 and check why the for-loop isn't progressing on that rank.
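For reference, the resume path in that loop looks roughly like this (a paraphrase, not a verbatim copy of the linked code, and the variable names are approximate): every rank re-iterates over batches it has already seen and skips them one by one until it catches up with the saved iteration, and only then reaches a barrier. If the data reading on one rank stalls or is much slower, the other ranks sit at that barrier until the collective times out.

```python
# Rough paraphrase of the resume logic in the linked loop (names approximate)
for train_data in train_dataloader:
    if resume:
        if curr_iter < initial_iter:
            # skip batches that were already consumed before the checkpoint
            curr_iter += 1
            continue
        # caught up with the saved iteration: stop skipping and sync all ranks
        resume = False
        fabric.barrier()  # a rank still stuck in the skipping phase blocks everyone here
    # ... regular training step ...
```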
The logic to resume training in the original TinyLlama is quite expensive; in our version we replaced it by loading the state of the dataloader directly: https://github.com/Lightning-AI/lit-gpt/blob/00defdee53f9b19511057a51499e23af2b1558a3/pretrain/tinyllama.py#L112 (but this only works with the streaming dataset, since our data processing is different from theirs).
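A rough sketch of what that looks like (names are approximate; it assumes a streaming dataloader whose position can be saved and restored through its state dict):

```python
# Sketch: include the dataloader in the checkpointed state so resuming
# restores its position directly instead of replaying already-seen batches.
state = {
    "model": model,
    "optimizer": optimizer,
    "train_dataloader": train_dataloader,  # streaming dataloader with state_dict support
    "iter_num": 0,
    "step_count": 0,
}
if resume:
    fabric.load(resume, state)  # restores model, optimizer, and dataloader position in one call
```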
@TeddLi Any luck there investigating this?
Nope, I didn't find out why. If I set the step to 20000, then it works. But if I set it a bit longer, e.g. 200000 steps, then it freezes. I suspect it might be a GPU sync issue...
It might just be that resuming the data takes more than 30 minutes, and some processes are slower than the others and hit the timeout at the 30-minute mark (the NCCL default). In that case, one option is to increase the timeout to something higher:
```python
from datetime import timedelta

from lightning.fabric.strategies import FSDPStrategy

# configure the timeout in the FSDPStrategy
strategy = FSDPStrategy(
    timeout=timedelta(minutes=120),  # default is 30
    ...
)
```
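For completeness, a minimal sketch of wiring that strategy into Fabric (the accelerator and device count here are just placeholders):

```python
import lightning as L

fabric = L.Fabric(accelerator="cuda", devices=8, strategy=strategy)
fabric.launch()
```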
The better way would be to reimplement the resuming logic as we did in our version of TinyLlama, as pointed out in the previous comment.
Ah, looking closer at the error, it's actually timing out in all-reduce. The title of this issue misled me into thinking it's the barrier. The dataloader resuming part is actually fine.
If it's failing at all-reduce, that's probably at `.backward()`. Can you confirm that? What changes have you made to the script?
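If it helps to confirm which collective hangs and on which rank, one option (my suggestion, not something already in the script) is to enable the standard NCCL / torch.distributed debug logging before launching:

```python
import os

# Verbose distributed logging to help pinpoint which rank/collective hangs.
# These must be set before the process group is initialized (i.e. before fabric.launch()).
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logs (init, topology, errors)
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks for mismatched collectives across ranks
```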
@awaelchli Hey, thanks for taking a close look at that. Honestly, I just switched machines... The new server provider uses Docker. I suspect it's a hardware issue.
Also, I did try extending the timeout from 30 minutes to 8 hours, but still no luck getting it to run properly. I'm not sure extending the time would solve it anyway.
Is it working on the new server after switching machines, or do you still see the issue?
The issue is gone. I'm still holding onto the machine that had the issue, though. If I just train from scratch, it doesn't hit any issue.
@awaelchli If you want to look into it, I can provide the info you need. Just close the ticket for now.
Ok thanks. If it happens again in the future, let me know and we can do some debugging :)
Bug description
I hit this when resuming from my checkpoint:
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
Environment
More info
No response