Hi, I tried to run the stage 3 training script of videochat2_mistral, but I got NaN loss after the first iteration. I tried using a smaller lr, but the loss remained NaN. Any ideas about this issue? Thanks!
Interesting. It runs normally for me. Could you try to use `bf16`?
Thanks for your timely reply. How should I change to bf16? Is it correct to change the following places (see the sketch after this list)?

- In `videochat2_it_mistral.py`: set `torch_dtype=torch.bfloat16`
- In `config_7b_stage3.py`: set `fp16=False`
- In `train_pt.py` or `train_it.py`: change `with torch.cuda.amp.autocast(enabled=config.fp16):` to `with torch.cuda.amp.autocast(enabled=config.fp16, dtype=torch.bfloat16):`
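A minimal sketch of that autocast change, with hypothetical stand-ins for `config` and the model (the real ones come from the repo's training scripts):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; the actual config and model are built by the repo.
class Config:
    fp16 = True  # keeps autocast enabled; the dtype below selects bf16

config = Config()
model = nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")

# The proposed change: pass dtype=torch.bfloat16 so autocast runs in bf16
# instead of the default fp16.
with torch.cuda.amp.autocast(enabled=config.fp16, dtype=torch.bfloat16):
    y = model(x)
    print(y.dtype)  # torch.bfloat16
```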
Ok, many thanks, let me have a try.
BTW, what is the use of `fp16=True` in `config_7b_stage3.py`? In my experiment mentioned above, setting `fp16=True` resulted in an error, which was solved by changing it to `fp16=False`. What value should I set here?
`fp16=True` will use mixed precision via `with torch.cuda.amp.autocast(enabled=config.fp16)`.
Using bfloat16 solves my problem, thanks! Is it correct to set `fp16=False` when using bfloat16?
It may not be correct to set `fp16=False`, since that disables mixed-precision training and requires more GPU memory. However, both `bf16` and `fp32` give stable training, so it's okay if you have enough GPU memory.
Thanks for clarifying that for me! The code throws another error if I set `fp16=True` together with bfloat16:
`RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'`
Is it the same case on your side?
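For context: this error is raised from `GradScaler`'s unscale step, whose CUDA kernel is only implemented for fp16 gradients. Since bf16 has the same exponent range as fp32, loss scaling isn't needed for it, so a common workaround is to disable the scaler for bf16 runs. A minimal sketch under that assumption, with a stand-in model and data:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

use_bf16 = True
# GradScaler.unscale_ is what hits
# "_amp_foreach_non_finite_check_and_unscale_cuda" on bf16 gradients,
# so disable scaling entirely when training in bf16.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)

x = torch.randn(4, 16, device="cuda")
with torch.cuda.amp.autocast(
    enabled=True, dtype=torch.bfloat16 if use_bf16 else torch.float16
):
    loss = model(x).mean()

scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
scaler.step(optimizer)         # steps the optimizer directly when disabled
scaler.update()
```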
Can you provide the full log? Mixed-precision training with `bf16` runs normally for me.
train_log.txt Please refer to the log. It might not be straightforward to read due to the use of DDP.
Can you try adding `model.bfloat16()` in `setup_model()` in `share_utils.py`? Please refer to the code here, since it works for me.
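For reference, `nn.Module.bfloat16()` casts all floating-point parameters and buffers to `torch.bfloat16` in place; a minimal illustration with a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)              # stand-in for the real model
model.bfloat16()                       # in-place cast of params and buffers
print(next(model.parameters()).dtype)  # torch.bfloat16
```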
Thanks for your suggestions. I tried them, but they didn't solve the issue. Finally I found that the versions of the `peft` and `transformers` packages had been changed by the installation of other packages. Using the correct package versions solved all the errors mentioned above.
Sorry for the effort and time you spent on my issue. It's very kind of you to help me so much.
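A quick way to check which versions are actually importable in the training environment (the correct pinned versions are in the repo's requirements, not guessed here):

```python
# Compare these against the versions pinned by the repo's requirements file.
import peft
import transformers

print("peft:", peft.__version__)
print("transformers:", transformers.__version__)
```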
BTW, I tested the memory usage with `fp16=True` and `fp16=False`, and found the GPU memory usage is similar. Is it better to keep `fp16=True` while training?
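One simple way to compare peak GPU memory between the two settings (a generic measurement sketch, not code from the repo):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training iteration here ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak allocated: {peak_gib:.2f} GiB")
```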
Thanks for trying! I think both are okay, because the model calls `with self.maybe_autocast():` in its forward function for mixed-precision training.
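For reference, a hedged sketch of what a `maybe_autocast` helper typically looks like (patterned after BLIP-2-style models; the actual VideoChat2 implementation may differ):

```python
import contextlib
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    @property
    def device(self):
        return next(self.parameters()).device

    def maybe_autocast(self, dtype=torch.float16):
        # Autocast only applies on CUDA; on CPU, fall back to a no-op context.
        if self.device != torch.device("cpu"):
            return torch.cuda.amp.autocast(dtype=dtype)
        return contextlib.nullcontext()
```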
Thanks for your suggestions :)