**Open** · starship006 opened this issue 3 months ago
I am working on a modified version of this repository with slight changes, so I am trying to see whether this is an error on my side or not. I am running a distributed multi-GPU setup using Accelerate, and I am having some issues loading in the `lm_optimizer`. Here is my current saving and loading code inside of `trainer.py`:
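(The original snippet is not reproduced here; the following is a stand-in sketch of the pattern being described, assuming plain `torch.save`/`torch.load` checkpointing under Accelerate. `checkpoint_dir`, the file names, and the `self.accelerator` attribute are placeholder assumptions, and the `lm_optimizer` load lines appear commented out, as described below.)

```python
import os
import torch

def save_checkpoint(self, checkpoint_dir: str) -> None:
    # Let all processes sync up, then write optimizer state from the main one.
    self.accelerator.wait_for_everyone()
    if self.accelerator.is_main_process:
        torch.save(self.lm_optimizer.state_dict(),
                   os.path.join(checkpoint_dir, "lm_optimizer.pt"))
        torch.save(self.critic_optimizer.state_dict(),
                   os.path.join(checkpoint_dir, "critic_optimizer.pt"))

def load_checkpoint(self, checkpoint_dir: str) -> None:
    self.critic_optimizer.load_state_dict(
        torch.load(os.path.join(checkpoint_dir, "critic_optimizer.pt")))
    # The lines in question: commented out, training runs fine;
    # uncommented, lm_optimizer.step() raises the RuntimeError below.
    # self.lm_optimizer.load_state_dict(
    #     torch.load(os.path.join(checkpoint_dir, "lm_optimizer.pt")))
```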
The code above works fine, but it isn't loading in `lm_optimizer`. However, when I uncomment those lines of code, everything works until `self.lm_optimizer` tries to perform `lm_optimizer.step()`. The code errors with:

```
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32/64 notwithstanding
```
I'm currently pretty lost as to what the bug might be. I don't think I've changed any code that would be relevant to `lm_optimizer`. If this is something that you recognize/notice, I would very much appreciate it!

---

I'm sorry, I don't think I've run into an issue like this before. Do you think it might be because the device gets mixed up somewhere during loading and checkpointing? Is it the case that only `lm_optimizer` crashes while `critic_optimizer` works fine? It sounds weird to me too.

---

Yup, `critic_optimizer` works but `lm_optimizer` crashes when stepping. It might be worth noting that we are currently trying to use `bfloat16`. But so far I'm mostly unsure of what's going on; I might check and see whether this replicates on this repo itself.
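For what it's worth, that RuntimeError appears to be the check that PyTorch's multi-tensor (foreach) optimizer paths apply when tensors within one parameter group's state end up on different devices or dtypes, which would fit both the device theory and the `bfloat16` point above. A minimal sketch, assuming a standard `torch.optim` optimizer (the function name and the remapping policy are my own, not part of this repo), of how one might inspect and remap restored state after `load_state_dict`:

```python
import torch

def remap_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    # For every parameter the optimizer tracks, force its floating-point
    # state tensors (e.g. Adam's exp_avg / exp_avg_sq) onto the parameter's
    # own device and dtype. The `step` counter is exempt: the error message
    # itself says it may stay on CPU as float32/64.
    for group in optimizer.param_groups:
        for param in group["params"]:
            state = optimizer.state.get(param)
            if not state:
                continue
            for name, value in state.items():
                if name == "step" or not torch.is_tensor(value):
                    continue
                if value.device != param.device or value.dtype != param.dtype:
                    state[name] = value.to(device=param.device, dtype=param.dtype)
```

Running this once after loading and before the first `step()` would either fix the mismatch or, with the `.to(...)` line swapped for a `print`, show exactly which state tensor is on the wrong device or dtype. Since `torch.load` by default restores tensors onto the devices they were saved from, passing `map_location` there is another common way to avoid the mismatch in the first place.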