PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

{'loss': 0.0, 'learning_rate': 1.6877637130801689e-07, 'epoch': 0.0} #1

Closed · whalefa1I closed this issue 7 months ago

whalefa1I commented 7 months ago
{'loss': 2.1559, 'learning_rate': 8.438818565400844e-08, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.6877637130801689e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.5316455696202533e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 3.3755274261603377e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 4.219409282700422e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 5.063291139240507e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 5.907172995780591e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 6.751054852320675e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 7.59493670886076e-07, 'epoch': 0.0}

During the pre-training phase the loss converges correctly. In the fine-tuning stage, however, every loss after the first iteration is 0.0. Could you please tell me where the problem might be?

LinB203 commented 7 months ago

Could you tell us which version of transformers you are using and what your run command is? Were there any warnings during the run?

whalefa1I commented 7 months ago

> Could you tell us which version of transformers you are using and what your run command is? Were there any warnings during the run?

transformer-engine            1.2.0+c9f2b5e
transformers                  4.37.1
transformers-stream-generator 0.0.4

Command:

bash scripts/v1_5/phi2/finetune.sh
or
bash scripts/v1_5/qwen/finetune.sh

No warnings except this one:

[2024-01-30 09:49:32,552] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.

LinB203 commented 7 months ago

I'm sorry, we weren't able to reproduce your error. By the way, when did you pull the code? We made some minor bug fixes in the last few days; can you pull the latest code and try again?

[screenshot attached]

LinB203 commented 7 months ago

Closing due to inactivity. Feel free to reopen.

whalefa1I commented 7 months ago

> I'm sorry, we weren't able to reproduce your error. By the way, when did you pull the code? We made some minor bug fixes in the last few days; can you pull the latest code and try again?

> [screenshot attached]

The problem was resolved after correcting the deepspeed and accelerate versions; the previous deepspeed version was the latest release.
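
For anyone hitting the same symptom, a quick sanity check is to print the versions that Python actually imports before comparing them against the repo's requirements. A minimal sketch, not part of the repo:

```python
# Quick environment check (not part of the MoE-LLaVA repo): print the versions
# that are actually imported, to make any mismatch with the requirements obvious.
import accelerate
import deepspeed
import torch
import transformers

print("torch        :", torch.__version__)
print("transformers :", transformers.__version__)
print("deepspeed    :", deepspeed.__version__)
print("accelerate   :", accelerate.__version__)
```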

whalefa1I commented 7 months ago

I got an error like this:

AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer

Should I do anything before running stage 3, such as converting the model?

LinB203 commented 7 months ago

ZeRO-3 cannot be used with MoE; see https://github.com/microsoft/DeepSpeed/issues/2870. You can use zero2_offload.json to support a bigger batch size.
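
For context, ZeRO-2 shards only optimizer states and gradients, so the expert parameters stay whole and DeepSpeed's MoE param groups still work; offloading the optimizer states to CPU is what frees room for a bigger batch. A rough sketch of what such a config looks like, written here as a Python dict (the repo's actual scripts/zero2_offload.json may differ in its details):

```python
# Sketch of a ZeRO-2 + optimizer-offload DeepSpeed config (the repo's actual
# scripts/zero2_offload.json may differ); "auto" values are resolved by the
# HuggingFace Trainer's DeepSpeed integration from the training arguments.
zero2_offload_config = {
    "bf16": {"enabled": "auto"},
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,                    # shard optimizer states and gradients only
        "offload_optimizer": {         # move optimizer states to CPU memory
            "device": "cpu",
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```

It is normally selected by pointing the training script's `--deepspeed` argument at this JSON file instead of scripts/zero2.json.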

whalefa1I commented 7 months ago

> ZeRO-3 cannot be used with MoE; see microsoft/DeepSpeed#2870. You can use zero2_offload.json to support a bigger batch size.

But I'm using this config: scripts/zero2.json.

LinB203 commented 7 months ago

Can you check whether the model enters this if branch?

It should be executed when MoE is turned on.

https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/moellava/train/llava_trainer.py#L221
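
For reference, the assertion quoted above is raised by DeepSpeed when a model contains MoE layers but no optimizer param group carries 'moe': True. A minimal sketch of the kind of grouping that branch is expected to set up (not the repo's actual trainer code; it assumes DeepSpeed's deepspeed.moe.utils.is_moe_param helper):

```python
# Sketch (not MoE-LLaVA's trainer code): build optimizer param groups so that
# expert parameters are flagged with 'moe': True, which is exactly what the
# AssertionError above checks for before the DeepSpeed optimizer is created.
import torch
from deepspeed.moe.utils import is_moe_param  # helper shipped with DeepSpeed


def build_param_groups(model: torch.nn.Module, weight_decay: float = 0.0):
    dense, expert = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        (expert if is_moe_param(param) else dense).append(param)
    return [
        {"params": dense, "weight_decay": weight_decay},
        # The 'moe' flag is what DeepSpeed looks for when the model has MoE layers.
        {"params": expert, "weight_decay": weight_decay, "moe": True, "name": "expert_params"},
    ]
```

DeepSpeed also ships split_params_into_different_moe_groups_for_optimizer in deepspeed.moe.utils for the same purpose, so if that if branch never runs, the groups never get the flag and the assertion fires.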

whalefa1I commented 7 months ago

> Can you check whether the model enters this if branch?
>
> It should be executed when MoE is turned on.
>
> https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/moellava/train/llava_trainer.py#L221

[Screenshot 2024-02-02 22:10:02 attached]

I think it does reach that line.

LinB203 commented 7 months ago

Sorry, this issue has already been closed. Could you open a new issue and post your command, transformers version, deepspeed version, and torch version? Thanks.

whalefa1I commented 7 months ago

> Sorry, this issue has already been closed. Could you open a new issue and post your command, transformers version, deepspeed version, and torch version? Thanks.

It's okay, I'll check my code first. Happy Friday!

LinB203 commented 7 months ago

> > Sorry, this issue has already been closed. Could you open a new issue and post your command, transformers version, deepspeed version, and torch version? Thanks.
>
> It's okay, I'll check my code first. Happy Friday!

That was our mistake. Please check this issue: https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/17#issuecomment-1925237894

QAQdev commented 4 months ago

> > I'm sorry, we weren't able to reproduce your error. By the way, when did you pull the code? We made some minor bug fixes in the last few days; can you pull the latest code and try again? [screenshot attached]

> The problem was resolved after correcting the deepspeed and accelerate versions; the previous deepspeed version was the latest release.

Sorry to interrupt. I met a similar issue (loss going to 0) on StableLM stage 3, and I don't quite get the solution from this issue: which version of deepspeed did you downgrade to? @LinB203 @whalefa1I Thanks a lot!