Closed: whalefa1I closed this issue 7 months ago.
I would like to know what version of transformers you use and what's your run command? Any warnings during the run?
transformer-engine 1.2.0+c9f2b5e
transformers 4.37.1
transformers-stream-generator 0.0.4

Command:
bash scripts/v1_5/phi2/finetune.sh or bash scripts/v1_5/qwen/finetune.sh
No warnings except
[2024-01-30 09:49:32,552] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
I'm sorry we weren't able to reproduce your error. By the way, when did you pull the code? We made some minor bug fixes in the last few days. Can you pull the latest code and try again?
Closing due to inactivity. Feel free to reopen.
The problem was resolved after correcting the deepspeed and accelerate versions. The previous deepspeed version was the latest release.
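For anyone hitting the same mismatch, here is a quick way to see which versions are actually installed in the environment. This is a minimal sketch using only the standard library; the specific working versions are not stated in this thread.

```python
# Print the installed versions of the packages discussed in this thread.
# Illustrative only: the exact versions that resolve the issue are not given here.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("transformers", "deepspeed", "accelerate", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")
```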
I got an error like this:
AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer
Should I do anything before running stage 3, such as converting the model?
ZeRO-3 cannot be used with MoE. https://github.com/microsoft/DeepSpeed/issues/2870
You can use zero2_offload.json to support a bigger batch size.
But I'm using this config: scripts/zero2.json
Can you check whether the model goes into this if branch? It should be executed when MoE is turned on.
https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/moellava/train/llava_trainer.py#L221
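For context, the DeepSpeed assertion above is raised when none of the optimizer's param groups carry 'moe': True. Below is a minimal sketch of what such a branch typically does, assuming DeepSpeed's deepspeed.moe.utils helper; this is not the repo's exact code.

```python
# Sketch: split optimizer param groups so that MoE expert parameters end up in
# separate groups flagged with 'moe': True, which the DeepSpeed engine checks for.
# Assumes deepspeed.moe.utils.split_params_into_different_moe_groups_for_optimizer.
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer


def build_moe_param_groups(model, weight_decay=0.0):
    # Start from one ordinary group holding all trainable parameters; the helper
    # then moves parameters created by deepspeed.moe layers into MoE-marked groups.
    base_group = {
        "params": [p for p in model.parameters() if p.requires_grad],
        "weight_decay": weight_decay,
    }
    return split_params_into_different_moe_groups_for_optimizer(base_group)
```

If that branch is never reached (for example, the MoE flag is not passed through to the trainer), the optimizer only sees ordinary groups and the assertion fires.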
I think it does reach that line.
Sorry, this issue has already been closed. Could you open a new issue and post your command, transformers version, deepspeed version, and torch version? Thanks.
It's okay. I'll check my code first. Happy Friday!
That was our mistake. Please check out this issue: https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/17#issuecomment-1925237894
Sorry to interrupt. I met a similar issue (loss going to 0) on StableLM stage 3, and I don't quite get the solution from this issue. Which version of deepspeed did you downgrade to? @LinB203 @whalefa1I Thanks a lot!
During the pre-training phase, I obtain correct loss convergence. However, in the fine-tuning stage, except for the first iteration, all the losses are 0.0. Could you please tell me where the problem might be?