Hello, I have also encountered this bug. I am referring to the documentation here: NeMo Framework PEFT with Llama2, Mixtral-8x7B, and Nemotron 4 340B.
When training Llama2 with NeMo, an error occurs if the following parameters are set: TP=2, sequence_parallel=True, and model.peft.peft_scheme="lora".
The error is a RuntimeError: "The size of tensor A (2256) must match the size of tensor B (4512) at non-singleton dimension 0." Note that 2256 is exactly 4512 / 2, which is consistent with the sequence dimension being split across the two tensor-parallel ranks.
Describe the bug
I was trying to run SFT on the Mixtral-8x7B-Instruct model with tensor parallel size = 4 (sequence_parallel=True) and LoRA (target_modules=[all]). It reports that the output dims of the original module and the corresponding LoRA adapter module do not match, so they cannot be added together.
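For intuition, here is a minimal standalone sketch of the mismatch in plain PyTorch (no distributed launch; the hidden size is made up, and the concrete sequence numbers are borrowed from the TP=2 Llama2 report above). With sequence parallelism, each rank only holds seq_len / TP rows of the [s, b, h] activation, while the adapter output here covers the full sequence:

```python
import torch

# Illustrative shapes: sequence length 4512 split across tp=2 ranks,
# so each rank's base-module output holds 4512 / 2 = 2256 rows.
seq_len, batch, hidden, tp = 4512, 1, 8, 2

base_out = torch.randn(seq_len // tp, batch, hidden)  # local shard: [2256, 1, 8]
lora_out = torch.randn(seq_len, batch, hidden)        # full sequence: [4512, 1, 8]

# The residual add between the frozen path and the adapter path fails:
try:
    out = base_out + lora_out
except RuntimeError as e:
    print(e)  # The size of tensor a (2256) must match the size of
              # tensor b (4512) at non-singleton dimension 0
```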
Steps/Code to reproduce bug
I used the recommended Docker image nvcr.io/nvidia/nemo:24.07 and my script is as follows:
It then fails with an error like:
I tried to fix this by modifying the following two .py files:
/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/mcore_mixins.py
/opt/NeMo/nemo/collections/nlp/modules/common/megatron/adapters/parallel_adapters.py
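For reference, the gist of my edits (I am not confident this is the right place to fix it) was to make the adapter path produce the same sequence shard as the frozen path before the residual add. Below is a minimal standalone sketch of that idea in plain PyTorch; scatter_to_local_shard is my own illustrative helper, not a NeMo or Megatron API (the real code would presumably use the sequence-parallel mapping utilities in megatron.core.tensor_parallel, e.g. scatter_to_sequence_parallel_region):

```python
import torch

def scatter_to_local_shard(full_seq: torch.Tensor, rank: int, tp: int) -> torch.Tensor:
    """Hypothetical helper: keep only this rank's sequence shard.

    Mirrors what a sequence-parallel scatter does: split dim 0
    (the sequence dimension in the [s, b, h] layout) into `tp`
    equal chunks and keep chunk `rank`.
    """
    shard = full_seq.shape[0] // tp
    return full_seq[rank * shard:(rank + 1) * shard]

# Same illustrative shapes as in the sketch above.
seq_len, batch, hidden, tp, rank = 4512, 1, 8, 2, 0
base_out = torch.randn(seq_len // tp, batch, hidden)  # [2256, 1, 8]
lora_out = torch.randn(seq_len, batch, hidden)        # [4512, 1, 8]

# After scattering the adapter output to the local shard,
# the residual add is well-formed again.
out = base_out + scatter_to_local_shard(lora_out, rank, tp)
print(out.shape)  # torch.Size([2256, 1, 8])
```

Whether the adapter output should be scattered like this, or the base path gathered instead, likely depends on whether the wrapped projection is column- or row-parallel; that is exactly the part I am not sure I got right.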
After the modifications, I can run SFT with LoRA together with tensor and sequence parallelism, but I am not sure it runs correctly. I hope you can provide an elegant solution for it.
Expected behavior
LoRA can be used together with tensor and sequence parallelism.
Environment overview (please complete the following information)
Docker (image: nvcr.io/nvidia/nemo:24.07)
Environment details
I used the default environment of the NeMo Docker image.