Thanks for your great work! During training, I found that all parameters are trainable and cast to fp32 precision, since the ChameleonXLLMXForConditionalGeneration class doesn't define a get_trainable_params method. Do all parameters need to be trained during the 3-stage FP-SFT, and is fp32 precision necessary for all of them? The relevant code can be found at https://github.com/Alpha-VLLM/Lumina-mGPT/blob/104abe453ec1acca5863698629c4db2111b0b3fc/xllmx/solvers/finetune/finetune.py#L286-L294
For a parameter, as long as it is trainable (requires_grad=True), an fp32 copy is necessary because parameter updates have to be performed in full precision. If the parameter is frozen, we can simply keep its 16-bit version.
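As a minimal sketch of that rule (not the repo's actual code), the dtype could be chosen per parameter based on requires_grad; the two-layer model and the choice of which layer to freeze are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

# Hypothetical split: freeze the first layer, train the second.
for p in model[0].parameters():
    p.requires_grad = False

for name, param in model.named_parameters():
    if param.requires_grad:
        # Trainable parameters keep an fp32 master copy so optimizer
        # updates happen in full precision.
        param.data = param.data.to(torch.float32)
    else:
        # Frozen parameters are never updated, so a 16-bit copy is enough.
        param.data = param.data.to(torch.bfloat16)

for name, param in model.named_parameters():
    print(name, param.dtype, param.requires_grad)
```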
In our experiments we keep all parameters trainable throughout the SFT process, but you may try different settings by adding a "get_trainable_params" method.
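If you want to experiment with partial freezing, below is a self-contained toy sketch of the idea. The method name get_trainable_params comes from this thread; how the finetune script actually consumes its return value is an assumption here, so check the linked finetune.py for the expected format before adapting it.

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.lm_head = nn.Linear(16, 32)

    def get_trainable_params(self):
        # Example policy: only tune the head; everything else stays frozen.
        return {name for name, _ in self.named_parameters() if name.startswith("lm_head")}

model = ToyModel()

# If the model defines get_trainable_params, freeze everything not listed;
# otherwise fall back to training all parameters (the default described above).
trainable = model.get_trainable_params() if hasattr(model, "get_trainable_params") else None
for name, param in model.named_parameters():
    param.requires_grad = trainable is None or name in trainable

print([n for n, p in model.named_parameters() if p.requires_grad])
# -> ['lm_head.weight', 'lm_head.bias']
```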