epfLLM / Megatron-LLM

distributed trainer for LLMs

[Swiglu] question about swiglu #64

Closed: mynewstart closed this issue 9 months ago

mynewstart commented 10 months ago

Hi, I found that the following code from arguments.py was removed in your implementation. Can I ask why?

    if args.swiglu:
        # reduce the dimension for the MLP since the projection happens on
        # two linear layers; this keeps the number of parameters in
        # the same ballpark as the counterpart with 4*h size.
        # we keep it a multiple of 64, which means the actual tensor size
        # will be a multiple of 64 / tp_size
        args.ffn_hidden_size = int((4 * args.hidden_size * 2 / 3) / 64) * 64
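
For example, with the 7B model's hidden size of 4096, this rule works out to:

    >>> int((4 * 4096 * 2 / 3) / 64) * 64
    10880

whereas the HF 7B checkpoint uses an MLP size of 11008, so the two conventions do not quite line up.
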
kylematoba commented 10 months ago

Are you asking why the NVIDIA implementation has the swiglu argument and ours does not? The short answer is here: https://github.com/epfLLM/Megatron-LLM/blob/main/megatron/model/transformer.py#L108 -- we support other GLU-like activations. If you are asking why the ffn_hidden_size variable is not adjusted by the additional factor of 2/3, I am not sure; it may be an oversight. If you can tell me what your goal is, I can probably give you a better answer.

mynewstart commented 10 months ago

Hi, thanks for replying! I want to pre-train the llama model, but the ffn_hidden_size in the HF checkpoint file is not the same as config.ffn_hidden_size, since the latter is adjusted by the factor of 2/3. So I pruned the model weights, but I don't think that is a good approach.

In your code, I saw that hf_to_megatron.py does not perform this adjustment, and arguments.py does not adjust args.ffn_hidden_size either. Is this parameter adjustment only made for the sake of training efficiency, so that leaving it out has no impact?

llama_s2layer = {7: 32, 13: 40, 30: 60, 34: 48, 65: 80, 70: 80}
llama_s2heads = {7: 32, 13: 40, 30: 52, 34: 64, 65: 64, 70: 64}
llama_s2dense = {7: 11008, 13: 13824, 30: 17920, 34: 22016, 65: 22016,
                 70: 28672}  # should be (2/3)*4*d, but it isn't exactly that
llama_s2hidden = {7: 4096, 13: 5120, 30: 6656, 34: 8192, 65: 8192, 70: 8192}
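
For what it's worth, I think the reference LLaMA code rounds (2/3)*4*h up to a multiple of 256 rather than truncating down to a multiple of 64. A quick sketch (my own check, not from either codebase, using the dictionaries above) reproduces the dense sizes for the 7B-65B models:

    def megatron_ffn(hidden_size):
        # rule quoted above: truncate (2/3)*4*h down to a multiple of 64
        return int((4 * hidden_size * 2 / 3) / 64) * 64

    def round_up_ffn(hidden_size, multiple_of=256):
        # assumed LLaMA-style rule: round (2/3)*4*h up to a multiple of 256
        dim = int(2 * (4 * hidden_size) / 3)
        return multiple_of * ((dim + multiple_of - 1) // multiple_of)

    for size, hidden in llama_s2hidden.items():
        print(size, megatron_ffn(hidden), round_up_ffn(hidden), llama_s2dense[size])
    # e.g. 7B: 10880 vs 11008 vs 11008; only the 70B row differs, which I
    # assume uses an extra ffn_dim_multiplier on top of this rule.
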
kylematoba commented 10 months ago

Sorry, I'm not quite sure what the problem is. It's probably my fault, since I'm not too familiar with this part of the code. But I don't think I've encountered any problems with checkpoints not loading due to wrong dimensions. Is this a question about our conventions? Do you think we've made some problematic error?

mynewstart commented 10 months ago

I apologize for being unclear. My question is that the following code is present in Megatron-LM but not in your codebase. Can you give a reason for removing it? Is it solely related to training efficiency?

    if args.swiglu:
        # reduce the dimension for the MLP since the projection happens on
        # two linear layers; this keeps the number of parameters in
        # the same ballpark as the counterpart with 4*h size.
        # we keep it a multiple of 64, which means the actual tensor size
        # will be a multiple of 64 / tp_size
        args.ffn_hidden_size = int((4 * args.hidden_size * 2 / 3) / 64) * 64
kylematoba commented 10 months ago

Hi, sorry also. I'm probably being needlessly confusing because I don't 100% understand the context.

But I think the short answer is: we dropped the swiglu argument because we generalized to different GLU-like activations. Then, strictly speaking, this logic no longer worked, so it probably got dropped, though it could have (and should have) been generalized as well. So it might be a bug. But I'm kind of surprised that we're able to load checkpoints, etc., if the dimensions mismatch. What would be the consequence of mistakenly omitting this tweak?

I'm pretty certain it was not for training efficiency.

mynewstart commented 9 months ago

Hi, I think this PR solved my question. Maybe we can follow the same logic to calculate ffn_hidden_size when args.ffn_hidden_size is None and args.glu_activation == swiglu.
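
Something along these lines in arguments.py might work (just a sketch of the idea, untested; the names are taken from the snippets above):

    if args.ffn_hidden_size is None:
        if args.glu_activation is not None:
            # GLU-style activations (e.g. swiglu) split the projection across
            # two linear layers, so shrink 4*h by 2/3 and keep a multiple of 64
            args.ffn_hidden_size = int((4 * args.hidden_size * 2 / 3) / 64) * 64
        else:
            args.ffn_hidden_size = 4 * args.hidden_size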