Open DingQiang2018 opened 1 year ago
Thanks for your kind feedback. save_model() has been updated according to your advice. FYI: https://github.com/OpenLMLab/LOMO/commit/06e50c07bb324be2863fe208012f8a9d6852b961
It's my pleasure to see my advice accepted. May I also ask whether you have any thoughts on why save_model went wrong previously? I have not figured it out yet.
Hi, I want to know the answer to this question because the implementation of the LOMO optimizer and the original save_model code assume the same layout of partitioned parameters in deepspeed, namely that each parameter is flattened and divided into chunks, with the i-th chunk assigned to the i-th process. I'm not sure deepspeed actually partitions parameters that way. The save_model code I provided above therefore does not use this assumption; it only uses deepspeed.zero.GatheredParameters, provided by deepspeed, to gather the parameters automatically. To my surprise, this change fixed the bug. I therefore suspect the bug may lie in the wrong assumption about parameter partitioning, which has shaken my confidence in the correctness of the LOMO optimizer implementation. I hope the author can address my doubts.
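For concreteness, here is a minimal sketch of the gather-based saving approach described above. It assumes a deepspeed ZeRO-3 engine called `engine` and an `output_dir` chosen by the caller; the function name and file path are illustrative, not the actual code from the repository.

```python
import os

import torch
import deepspeed


def save_model_gathered(engine, output_dir):
    """Minimal sketch of the gather-based save described above: collect full
    parameters with deepspeed.zero.GatheredParameters instead of assuming how
    ZeRO-3 lays the shards out across ranks."""
    rank = torch.distributed.get_rank()
    state_dict = {}
    for name, param in engine.module.named_parameters():
        # Inside this context the parameter is reassembled from its per-rank
        # ZeRO-3 shards; gathering one parameter at a time keeps memory low.
        with deepspeed.zero.GatheredParameters(param):
            if rank == 0:
                state_dict[name] = param.detach().cpu().clone()
    if rank == 0:
        os.makedirs(output_dir, exist_ok=True)
        torch.save(state_dict, os.path.join(output_dir, "pytorch_model.bin"))
```

Since nothing here depends on how the shards are laid out, this avoids the flatten-and-chunk assumption entirely.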
@DingQiang2018 Hello, I noticed that the author modified LOMOTrainer and LOMOLoRaTrainer according to your suggestion. LOMOTrainer runs without problems, but LOMOLoRaTrainer reports an error at self.model.optimizer.partition_all_parameters(). Have you encountered the same problem? Thanks!
Yeah I am having this issue, did you find any solution?
Not solved yet…
I also can't get the same results after merging LLaMA with LoRA, which is strange.
Hi, lomo_lora_trainer has an extra optimizer for LoRA, so DeepSpeedZeRoOffload cannot be reached through model.optimizer. For now I have reverted save_model() in lomo_lora_trainer.py to the previous version.
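As a purely hypothetical illustration of that failure mode (not the repository's actual fix): when a separate LoRA optimizer is handed to deepspeed, engine.optimizer no longer exposes the ZeRO-3 method partition_all_parameters(), so calling it fails. A defensive check might look like this:

```python
def partition_params_if_possible(engine):
    """Hypothetical guard, not the repository's fix: only call
    partition_all_parameters() when the wrapped optimizer actually exposes it
    (it does not when an extra LoRA optimizer is handed to deepspeed)."""
    optimizer = getattr(engine, "optimizer", None)
    if optimizer is not None and hasattr(optimizer, "partition_all_parameters"):
        optimizer.partition_all_parameters()
        return True
    # Caller should fall back to the previous save path in this case.
    return False
```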
Hi, I'd like to know how much the loss differs between the two saving methods when using ChatGLM2; do you still have a record of that? BTW, does LLaMA have the same problem?
Hi, I noticed the revert, but I still can't get the merged model to produce the same eval results…
I used LOMO (and ZeRO-3) to fine-tune chatglm2-6b on 8 NVIDIA 3090 GPUs and saved the model with LOMOTrainer's save_model method. After reloading the model checkpoint, I found that the validation loss differed from the validation loss measured at the end of training. I rewrote save_model with reference to DeepSpeed's official model-saving code (rewritten code below) and found that this resolved the bug. This indicates that the original version of save_model has a bug, but I have not yet found the specific cause of the error.