Q-Future / Q-Align

[ICML2024] [IQA, IAA, VQA] All-in-one Foundation Model for visual scoring. Can be efficiently fine-tuned on downstream datasets.
https://q-align.github.io

Program hangs when training from the start #7

Closed Teddy-Xiong closed 8 months ago

Teddy-Xiong commented 9 months ago

Hi, thanks for sharing this amazing work. However, when I try to run full training from the start, the program hangs after printing the following output:

[Image: Screenshot 2024-02-04 at 11 11 58 PM]

Could you suggest any possible solution?

teowu commented 9 months ago

Hi Teddy,

This seems strange (the warnings shown above are expected and are not the cause of the problem). Can you set some breakpoints (or add line-by-line prints) to see where in the code the hang comes from?

Best, Haoning
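
As an alternative to manual breakpoints, the standard-library `faulthandler` module can dump the stack of every thread when a block runs longer than expected, which shows the line a hung process is sitting on. This is only a minimal sketch: `hang_watchdog` and its parameters are hypothetical names for illustration and are not part of Q-Align.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=60.0, label="block", file=None):
    """Dump all thread stacks to `file` every `timeout_s` seconds
    while the wrapped block is still running."""
    out = file if file is not None else sys.stderr
    print(f"[watchdog] entering {label}", file=out, flush=True)
    start = time.monotonic()
    # Arms a background timer; the dump repeats until it is cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        yield
    finally:
        # Disarm the timer once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - start
        print(f"[watchdog] left {label} after {elapsed:.1f}s", file=out, flush=True)
```

Wrapping the suspect call, for example `with hang_watchdog(120, "model load"): ...`, prints a traceback every 120 seconds for as long as the call does not return. In a multi-GPU run each process prints its own traceback, so it helps to compare them across ranks.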

Teddy-Xiong commented 9 months ago

Hi Haoning,

Thanks for your fast reply. Following your suggestion, I found the program stuck when loading the mPLUG-Owl2 model:

[Image: Screenshot 2024-02-04 at 11 46 59 PM]

Regards, Teddy

teowu commented 9 months ago

Hi Teddy,

Have you installed flash_attn? Alternatively, you can try changing attn_implementation to "eager" here and see whether that helps.

Best, Haoning
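
A small sketch of how the attention backend could be chosen at runtime: use FlashAttention when the `flash_attn` package is present and fall back to "eager" otherwise. The function name `pick_attn_implementation` is hypothetical, and the exact place where Q-Align sets `attn_implementation` is not shown in this thread. The returned string is meant for the `attn_implementation` argument that Hugging Face Transformers accepts in `from_pretrained`.

```python
import importlib.util


def pick_attn_implementation(force_eager=False):
    """Return "flash_attention_2" if flash_attn is installed, else "eager"."""
    if force_eager:
        return "eager"
    # find_spec only checks that the package can be located; it does not
    # prove that the compiled extension matches the local CUDA/PyTorch build.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"
```

Passing `force_eager=True` reproduces the experiment suggested above without uninstalling anything.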

Teddy-Xiong commented 9 months ago

Hi Haoning,

Yes, I installed it following your instructions: pip install flash_attn --no-build-isolation

Unfortunately, the program still hangs after I changed it to "eager".

P.S. I found that the program successfully loads the mPLUG-Owl2 model and no longer hangs if I enable only a single GPU, so the problem seems to be related to multi-GPU usage. However, with a single 40 GB A100 GPU I then run into a CUDA out-of-memory error. Do you have any suggestions for the multi-GPU issue, and is there a way to train the model on a single GPU?

Regards, Teddy
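
The thread does not say how the single-GPU run was set up. One common way, shown here only as an assumption, is to limit the devices visible to CUDA through the `CUDA_VISIBLE_DEVICES` environment variable before any CUDA library is initialized. `restrict_to_gpus` is a hypothetical helper name.

```python
import os


def restrict_to_gpus(indices):
    """Expose only the given GPU indices to CUDA in this process.

    Must run before CUDA is initialized (in practice, before importing
    torch or making any CUDA call), otherwise it has no effect.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value
```

Calling `restrict_to_gpus([0])` at the very top of the training script leaves only the first GPU visible. Distributed launchers may manage this variable themselves, so with a launcher it is safer to use the launcher's own device-selection option.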

teowu commented 9 months ago

Hi Teddy,

Sorry about the late reply. Training on a single 40 GB GPU is not feasible right now.

As for the hang, this looks like a DeepSpeed issue. Can you share your environment versions?

Best, Haoning
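
A text report is often easier to share than screenshots. The sketch below collects the Python version, kernel release, and the installed versions of a few packages using only the standard library. The package list is an assumption about what is relevant to this issue, and `collect_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata

# Assumed list of packages relevant to a multi-GPU training hang.
PACKAGES = ("torch", "deepspeed", "transformers", "accelerate", "flash_attn")


def collect_versions(packages=PACKAGES):
    """Return a dict mapping component names to version strings."""
    report = {
        "python": platform.python_version(),
        "kernel": platform.release(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report
```

Printing the result with `for k, v in collect_versions().items(): print(k, v)` gives a short block that can be pasted directly into an issue.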

Teddy-Xiong commented 8 months ago

Hi Haoning,

Sorry for the late reply. Here's the full pip list of my environment. I simply created the environment following the instructions you provided:

[Image: Screenshot 2024-02-28 at 10 10 57 AM]
[Image: Screenshot 2024-02-28 at 10 13 33 AM]
[Image: Screenshot 2024-02-28 at 10 13 56 AM]

Regards, Teddy

teowu commented 8 months ago

This seems okay.

Recently I noticed that this hang might come from the kernel version. On some of my devices I see this warning: "Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher."
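
To check a machine against the minimum quoted in that warning before starting a long run, the kernel release string can be parsed and compared with 5.5.0. This is a minimal sketch: the function names are hypothetical, and the threshold is taken directly from the warning text above.

```python
import platform
import re

# Minimum kernel version quoted in the warning above.
MIN_KERNEL = (5, 5, 0)


def parse_kernel_version(release):
    """Parse a release string such as "5.4.143" or "5.15.0-91-generic"
    into a (major, minor, patch) tuple; a missing patch counts as 0."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in m.groups())


def kernel_is_below_minimum(release=None, minimum=MIN_KERNEL):
    """Return True if the kernel is older than `minimum`.
    Defaults to the running system's release string."""
    if release is None:
        release = platform.release()
    return parse_kernel_version(release) < minimum
```

For the version in the warning, `kernel_is_below_minimum("5.4.143")` is true, because tuple comparison orders (5, 4, 143) before (5, 5, 0).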

Teddy-Xiong commented 8 months ago

Hi Haoning,

The program magically works after I pulled the latest version. Maybe it was some random version conflict. Thanks for your help.

Regards, Teddy