Teddy-Xiong closed this 8 months ago
Hi Teddy,
This seems strange (the warnings above are expected and are not the cause of the error). Can you set some breakpoints (or add line-by-line prints) to see where in the code the hang comes from?
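One possible way to do this without a debugger is sketched below. It combines flushed progress prints with the standard-library `faulthandler` module, which can dump every thread's stack periodically so a stuck call shows up in the traceback. The helper names `watch_for_hang` and `checkpoint` are made up for illustration, and the model-loading call is only a placeholder comment, not this repository's actual code.

```python
import faulthandler
import sys
import time


def watch_for_hang(timeout_s: float = 60.0) -> None:
    """Dump all thread stacks to stderr every `timeout_s` seconds until cancelled."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=sys.stderr)


def checkpoint(label: str) -> float:
    """Print a flushed marker; the last label printed brackets where the hang is."""
    now = time.time()
    print(f"[{now:.1f}] reached: {label}", flush=True)
    return now


if __name__ == "__main__":
    watch_for_hang(timeout_s=60.0)
    checkpoint("before model load")
    # The real model-loading call would go here (placeholder, not shown).
    checkpoint("after model load")
    # Stop the periodic dumps once the suspicious section has finished.
    faulthandler.cancel_dump_traceback_later()
```

In a multi-GPU launch every rank prints its own markers, so it can help to compare which rank stops first.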
Best Haoning
Hi Haoning,
Thanks for your fast reply. Following your suggestion, I found that the program gets stuck when loading the mPLUG-Owl2 model:
Regards, Teddy
Hi Teddy,
Have you installed flash_attn?
Or maybe you can try changing the attn_implementation to "eager" here and see whether that helps.
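As a rough sketch of that fallback, the helper below picks an attention backend based on whether the `flash_attn` package can be found. The function name `pick_attn_implementation` is hypothetical. The strings `"flash_attention_2"` and `"eager"` are values that Hugging Face Transformers accepts for `attn_implementation` in `from_pretrained`. How this repository passes the option may differ.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "eager".

    find_spec only checks that the package is present; it does not prove
    that the compiled kernels match the installed torch/CUDA build.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


if __name__ == "__main__":
    # The printed value could then be passed as, for example,
    # from_pretrained(model_path, attn_implementation=...).
    print(pick_attn_implementation())
```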
Best Haoning
Hi Haoning,
Yes, I have installed it following your instructions:
pip install flash_attn --no-build-isolation
Unfortunately, the program still hangs after I change it to "eager".
P.S. I found that the program successfully loads the mPLUG-Owl2 model and no longer hangs if I enable only a single GPU, so the problem seems to be related to multi-GPU usage. However, I then run into a CUDA out-of-memory error on a single 40G A100 GPU. Do you have any suggestions on the multi-GPU issue, and is there a way to train the model using a single GPU?
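For reference, one common way to limit a run to specific GPUs is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below sets it from Python. The helper name `restrict_visible_gpus` is made up for illustration, and the variable has to be set before CUDA is initialized (in practice, before importing torch).

```python
import os
from typing import Sequence


def restrict_visible_gpus(gpu_ids: Sequence[int]) -> str:
    """Expose only the given GPU indices to CUDA for this process."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Single-GPU run on device 0; use [0, 1] to test a two-GPU setup.
    print(restrict_visible_gpus([0]))
```

Stepping from one GPU to two, and then to more, can help narrow down whether the hang appears as soon as inter-GPU communication starts.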
Regards, Teddy
Hi Teddy,
Sorry about the late reply. Training with a single 40G GPU is not achievable right now.
As for the hang, this looks like a DeepSpeed issue. Can you share your environment versions?
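A short script along these lines could collect the most relevant versions in one place. The package list is an assumption about what matters here, not an official list, and distribution names may need adjusting on some installs (for example `flash-attn` versus `flash_attn`).

```python
import platform
from importlib import metadata

# Assumed set of packages relevant to this hang; adjust as needed.
PACKAGES = ("torch", "transformers", "deepspeed", "accelerate", "flash_attn")


def collect_versions(packages=PACKAGES) -> dict:
    """Return Python, kernel, and installed package versions as a dict."""
    report = {
        "python": platform.python_version(),
        "kernel": platform.release(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```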
Best Haoning
Hi Haoning,
Sorry for the late reply. Here's the full pip list of my environment. I simply created the environment following the instructions you provided:
Regards, Teddy
This seems okay.
Recently I noticed that this hang might come from the kernel version, as I have seen this warning
Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
on some of my devices.
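To check a machine against that threshold directly, a minimal sketch is shown below. The 5.5.0 minimum is taken from the warning text above. The function names are hypothetical, and the parsing assumes a Linux-style release string such as `5.4.143` or `5.15.0-91-generic`.

```python
import platform
import re


def parse_kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def kernel_below_minimum(release: str, minimum: tuple = (5, 5, 0)) -> bool:
    """Return True if the kernel release is older than `minimum`."""
    return parse_kernel_version(release) < minimum


if __name__ == "__main__":
    if platform.system() == "Linux":
        release = platform.release()
        print(release, "below 5.5.0:", kernel_below_minimum(release))
```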
Hi Haoning,
The program magically works after I pulled the latest version. Maybe it was some random version conflict. Thanks for your help.
Regards, Teddy
Hi, thanks for sharing this amazing work. However, when I try to do full training from the start, the program hangs after outputting the following:
Could you suggest any possible solution?