Closed: bao-xiaoyi closed this issue 10 months ago
I looked at the saved results. The model weights and optimizer parameters were saved successfully, but the _save_other execution failed. The saved model can be used for inference, but it generates a large number of repetitions.
In addition, I also found that the .model.pth file saved by ft (11G) is smaller than the initial .model.pth file (14G). Is this normal?
The saved model can be used for inference, but it will generate a large number of repetitions.
Do you mean there are a lot of .model.pth files? If yes, this is expected behavior, because the model is split through model (i.e. tensor) parallelism into n parts, and each part is saved separately.
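For reference, a minimal sketch of what per-rank saving looks like under tensor parallelism; the function and file-name pattern below are illustrative, not the repository's actual code:

```python
import os
import torch

def save_model_shard(model, save_dir, mp_rank, mp_world_size):
    # Each model-parallel rank holds only its slice of the weights, so a
    # checkpoint directory ends up with mp_world_size separate .model.pth files.
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(
        save_dir, f"consolidated.{mp_rank:02d}-of-{mp_world_size:02d}.model.pth"
    )
    torch.save(model.state_dict(), path)
```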
I looked at the saved results. The model weights and optimizer parameters were successfully saved. But _save_other execution failed.
Could you please check whether the _save_optimizer call is indeed completed on all processes? For example, you should see the same number of .optim.pth files as .model.pth files, and all the .optim.pth files should have equal size. I am afraid the problem you are seeing may not come from the _save_other call, because _save_other involves no NCCL communication.
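A quick way to run that check (the checkpoint path here is hypothetical):

```python
import glob
import os

ckpt_dir = "output/epoch0"  # hypothetical checkpoint directory
model_files = sorted(glob.glob(os.path.join(ckpt_dir, "*.model.pth")))
optim_files = sorted(glob.glob(os.path.join(ckpt_dir, "*.optim.pth")))

# One file per model-parallel rank, so the two counts should match ...
print(f"{len(model_files)} .model.pth files, {len(optim_files)} .optim.pth files")

# ... and every .optim.pth file should have the same size.
for f in optim_files:
    print(f, os.path.getsize(f))
```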
the .model.pth file saved by ft (11G) is smaller than the initial .model.pth file (14G). Is this normal?
It is not normal. Is the number of .model.pth files (i.e. the model parallel size) the same in both cases? If not, it is expected that the size of each individual file will differ; if yes, please check the load_state_dict result to see whether all params match perfectly. If they match, there should not be a big problem.
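One way to check the match is to compare keys and shapes across corresponding shards of the two checkpoints. This sketch assumes each shard file is a flat state dict of tensors; the paths are placeholders:

```python
import torch

# Placeholder paths: corresponding shards of the initial and fine-tuned checkpoints.
orig = torch.load("initial/consolidated.00.model.pth", map_location="cpu")
ft = torch.load("finetuned/consolidated.00.model.pth", map_location="cpu")

missing = sorted(set(orig) - set(ft))
unexpected = sorted(set(ft) - set(orig))
mismatched = [k for k in set(orig) & set(ft) if orig[k].shape != ft[k].shape]

print("keys missing from the ft shard:", missing)
print("keys only in the ft shard:", unexpected)
print("keys with different shapes:", mismatched)
```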
To answer the first question: the large amount of repetition I described means that inference keeps generating up to the maximum length (the training data is nowhere near that long) and never terminates early.
For the second question: the final print statement in _save_optimizer was not reached, but the files that should have been saved were saved. In other words, both the .model.pth and .optim.pth files have been saved and the counts are correct, but the rank-specific files on one node were not saved.
For the third question: for the initial model, the sizes are as shown below. Surprisingly, inference is normal, and no errors were found other than the excessive generation length, and the printed total parameter count shows the model is loaded correctly.
The figure above shows the checkpoints saved by ft, while the figure below shows the initial model.
My thinking is that no NCCL communication should be involved during checkpoint saving. However, the model did hit an NCCL timeout during or after saving the checkpoint.
And will not terminate prematurely.
If you use our default conversation template, each response should end with \n###. Do you see this in your response? If it exists in the response, I guess you are using MetaModel.generate without passing the additional_stop_symbols argument; please set additional_stop_symbols=['\n###'].
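For example (the prompts and max_gen_len arguments below are placeholders):

```python
# Illustrative call; only additional_stop_symbols is the point here.
results = model.generate(
    prompts,
    max_gen_len=512,
    additional_stop_symbols=['\n###'],  # stop at the template's response terminator
)
```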
My thinking is that no NCCL communication should be involved during checkpoint saving. However, the model did hit an NCCL timeout during or after saving the checkpoint.
If the model.pth and optim.pth files are saved correctly, you may comment out the save_other and save_rank_specific calls until you figure out the cause of the NCCL error. The bug is weird.
Can I also comment out the saving of the optim.pth file? I don't think the inference process will involve it.
The optim.pth files are only useful for resuming training. Without them your optimizer states will be lost. If you can bear this, it's okay to comment that out as well.
Okay, I will keep trying when there are sufficient resources in the future. So the remaining question is that the model size changed before and after training. What do you think is the reason? Or do we even need to care about it?
I checked the models we trained before, and the same phenomenon indeed appears, so it should be fine. The reason is probably that the format-conversion script saves some extra data; for example, when tensor b = a[:2] is saved as a view of a, the whole of a actually gets saved. In any case, you should not need to worry about this.
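A small sketch of that behaviour; this is standard torch.save behaviour, not specific to the conversion script:

```python
import io
import torch

a = torch.randn(1_000_000)  # stand-in for a large weight tensor
b = a[:2]                   # a view into a

def saved_size(t):
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getbuffer().nbytes

# Saving the view keeps the whole underlying storage of a, so the file is large;
# cloning first keeps only the two sliced elements.
print(saved_size(b))          # roughly the size of all of a
print(saved_size(b.clone()))  # only a few hundred bytes
```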
Print output from the worker nodes:
Specifically, during the first save of the model, the master node prints the following information: