Thank you, Wojciech, for submitting this issue! The root cause is the same as in https://github.com/Lightning-AI/lightning-thunder/issues/292, and the problem is fixed by https://github.com/Lightning-AI/lightning-thunder/commit/860273231b64da7d9c5b242a62183c637823e3aa. The error occurs because the container you're using is from the 7th of June; using the latest container should resolve it.
The error is also raised in single-GPU runs, so I'll remove the distributed label.
Thanks, Ivan! We will re-test the configurations with the new container next week; I will also check one of the above configurations with the new container today as a quick sanity check. I will then close this issue if the error is no longer present.
By the way, I noted the labels you assigned to the issue; I will keep them in mind for future reports. Sorry for not adding them this time.
Don't worry about the labels! I don't even know if it's possible to add them without "Collaborator" status.
@IvanYashchuk I had some time to re-test it, and the issue is now resolved with the latest container version. Thanks for the support! All the best, Voytec
🐛 Bug
There is an error when running Thunder with the Inductor executor for lit-gpt models in the following configurations:

- dolly-v2-3b; 2 nodes; 8 GPUs per node; FSDP zero2 & zero3
- phi-2; 1 & 2 nodes; 8 GPUs per node; FSDP zero2 & zero3
- phi-2; 1 node; 1 GPU per node
- phi-2; DDP; 1 & 2 nodes; 8 GPUs per node
To Reproduce
Steps to reproduce the behavior: this was tested on a Slurm cluster. You can contact me on Slack for more details.
Create a file script.sh:
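The original script contents were not preserved in this report. Below is a minimal sketch of what such a batch script could look like; the `thunder/benchmarks/benchmark_litgpt.py` entry point, its flag names, and the pyxis-style `--container-image` option are assumptions based on the lightning-thunder benchmark harness and a typical containerized Slurm setup, so verify them against your environment and the script's `--help` output.

```bash
#!/bin/bash
#SBATCH --job-name=thunder-litgpt
#SBATCH --nodes=2                # 1 or 2 nodes, per configuration
#SBATCH --ntasks-per-node=8      # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

# Placeholder: substitute the container image under test.
CONTAINER="<thunder-container-image>"

# One process per GPU; --container-image assumes the pyxis Slurm plugin.
srun --container-image="$CONTAINER" \
    python thunder/benchmarks/benchmark_litgpt.py \
        --model_name phi-2 \
        --compile thunder_inductor \
        --distributed_mode fsdp \
        --shard_mode zero2
```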
Adjust `--model_name`, the node and GPU counts, and `--shard_mode` to match the other configurations listed above.
After you are logged into the Slurm cluster, run:
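The exact submission command was not preserved; presumably it is the standard sbatch call:

```bash
sbatch script.sh
```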
For the 1-node, 1-GPU case, it is enough to run the training script directly:
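A sketch of the single-process invocation, under the same assumptions about the benchmark entry point and flags (no distributed launcher is needed for one GPU):

```bash
python thunder/benchmarks/benchmark_litgpt.py \
    --model_name phi-2 \
    --compile thunder_inductor
```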
Expected behavior
We should be able to run the training.
Environment
As in the Docker image; tested on H100 GPUs.
Additional Info
It should be possible to reproduce the error by using torchrun on a multi-GPU machine, as sketched below.
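For example, a single-node torchrun launch on an 8-GPU machine could look like this, with the same assumed entry point and flags as in the sketches above:

```bash
torchrun --standalone --nproc-per-node=8 \
    thunder/benchmarks/benchmark_litgpt.py \
    --model_name phi-2 \
    --compile thunder_inductor \
    --distributed_mode ddp
```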
cc @apaz-cli @carmocca @crcrpar