Closed: mynewstart closed this issue 9 months ago
Are you asking why NVIDIA's implementation has the swiglu argument and ours does not? The short answer is here: https://github.com/epfLLM/Megatron-LLM/blob/main/megatron/model/transformer.py#L108 -- we support other GLU-like activations. If you are asking why the ffn_hidden_size variable is not adjusted by the additional factor of 2/3, I am not sure; it may be an oversight. If you can tell me what your goal is, I can probably give you a better answer.
Hi, thanks for replying! I want to pre-train the llama model, but the ffn_hidden_size in the HF checkpoint file is not the same as config.ffn_hidden_size, because it is adjusted by the factor of 2/3. So I pruned the model weights, but I don't think that's a good approach.
In your code, I saw that hf_to_megatron.py does not apply this adjustment, and arguments.py does not adjust args.ffn_hidden_size either. Is this parameter adjustment made for the sake of training efficiency, so that leaving it unchanged has no impact?
llama_s2layer = {7: 32, 13: 40, 30: 60, 34: 48, 65: 80, 70: 80}
llama_s2heads = {7: 32, 13: 40, 30: 52, 34: 64, 65: 64, 70: 64}
llama_s2dense = {7: 11008, 13: 13824, 30: 17920, 34: 22016, 65: 22016,
70: 28672} # should be (2/3)*4*d, but it isn't exactly that
llama_s2hidden = {7: 4096, 13: 5120, 30: 6656, 34: 8192, 65: 8192, 70: 8192}
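The "it isn't exactly that" note above seems to come from LLaMA's own rounding rule: Meta's reference implementation takes 2/3 of 4*h and then rounds *up* to a multiple of 256 (the `multiple_of` constant in their model code), whereas a naive 2/3 scaling lands between multiples. A minimal sketch, assuming that rule (the 70B entry differs because Llama 2 70B additionally applies an `ffn_dim_multiplier`):

```python
def llama_ffn_hidden_size(hidden_size, multiple_of=256):
    """Reproduce LLaMA's FFN sizing: 2/3 of 4*h, rounded UP to a
    multiple of `multiple_of` (256 in Meta's reference code)."""
    dim = int(4 * hidden_size * 2 / 3)
    return multiple_of * ((dim + multiple_of - 1) // multiple_of)

# Matches the 7B, 13B, 30B, 34B, and 65B entries in the dicts above:
#   llama_ffn_hidden_size(4096) -> 11008
#   llama_ffn_hidden_size(5120) -> 13824
#   llama_ffn_hidden_size(6656) -> 17920
#   llama_ffn_hidden_size(8192) -> 22016
```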
Sorry, I'm not quite sure what the problem is. It's probably my fault, since I'm not too familiar with this part of the code. But: I don't think I've encountered any problems with checkpoints not loading due to wrong dimensions. Is this a question about our conventions? Do you think we've made some problematic error?
I apologize for the unclear wording. My question is that the following code from Megatron-LM's arguments.py is not present in your codebase. Can you give the reason for removing it? Is it solely related to training efficiency?
if args.swiglu:
    # reduce the dimension for MLP since the projection happens on
    # two linear layers. this keeps the number of parameters in
    # the same ballpark as the counterpart with 4*h size
    # we keep it a multiple of 64, which means the actual tensor size
    # will be a multiple of 64 / tp_size
    args.ffn_hidden_size = int((4 * args.hidden_size * 2 / 3) / 64) * 64
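A side observation (mine, not from the thread): even restoring this block verbatim would not reproduce the HF LLaMA checkpoint dimensions, because Megatron-LM rounds down to a multiple of 64 while LLaMA rounds up to a multiple of 256. A quick check:

```python
hidden_size = 4096  # LLaMA-7B
megatron_ffn = int((4 * hidden_size * 2 / 3) / 64) * 64
print(megatron_ffn)  # 10880, whereas the LLaMA-7B checkpoint uses 11008
```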
Hi, sorry also. I'm probably being needlessly confusing because I don't 100% understand the context.
But I think the short answer is: we dropped the swiglu
argument because we generalized to different GLU-like activations. Then, strictly speaking, this logic did not work anymore so it probably got dropped, though it could have / should have been generalized also. So: it might be a bug. But I'm kind of surprised that we're able to load checkpoints, etc. if the dimensions mismatch. What would be the consequence of mistakenly omitting this tweak?
I'm pretty certain it was not for training efficiency.
Hi, I think this PR solved my question. Maybe we can follow the same logic to calculate ffn_hidden_size when args.ffn_hidden_size is None and args.glu_activation == "swiglu".
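A minimal sketch of what that fallback could look like. Names and the rounding rule are assumptions on my part, not the PR's actual code:

```python
def default_ffn_hidden_size(hidden_size, glu_activation=None):
    """Hypothetical fallback for when --ffn_hidden_size is not given."""
    if glu_activation == "swiglu":
        # SwiGLU splits the up-projection into two matrices, so shrink
        # 4*h by 2/3; round UP to a multiple of 256 as LLaMA does
        # (Megatron-LM instead rounds DOWN to a multiple of 64).
        dim = int(4 * hidden_size * 2 / 3)
        return 256 * ((dim + 255) // 256)
    return 4 * hidden_size  # classic non-GLU MLP

# e.g. in arguments.py:
#   if args.ffn_hidden_size is None:
#       args.ffn_hidden_size = default_ffn_hidden_size(
#           args.hidden_size, args.glu_activation)
```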
Hi, I found that the following code in arguments.py was deleted in your implementation; can I ask why?