InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] change InternLM2 modeling to unified type #1224

Open yinfan98 opened 5 months ago

yinfan98 commented 5 months ago

Motivation

When doing W8A8 quantization in the PyTorch engine, I found that the InternLM2 modeling code looks like the following. It uses self.attention, self.feed_forward, ...


class InternLM2DecoderLayer(nn.Module):

    def __init__(self, config: InternLM2Config):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.attention = INTERNLM2_ATTENTION_CLASSES[config.attn_implementation](config=config)

        self.feed_forward = InternLM2MLP(config)
        self.attention_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.ffn_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

But the modeling code of other models looks like the following; they all use self.self_attn and self.mlp, so the attribute names differ from InternLM2's.

class LlamaDecoderLayer(nn.Module):
    """Decoder layer for Llama Model."""

    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = LlamaAttention(config=config)
        self.mlp = LlamaMLP(config)
        self.input_layernorm = LlamaRMSNorm(config.hidden_size,
                                            eps=config.rms_norm_eps)
        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size,
                                                     eps=config.rms_norm_eps)

Using unified names is important for quantization, and also for other features like Medusa or sparse GEMM (which need to replace the self.mlp layer). If we do not unify the names, we have to keep iterating over class names to find the layers that need to be quantized or replaced. This makes the code harder to understand and also adds extra runtime cost, as sketched below.
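To illustrate, here is a minimal sketch (not from the lmdeploy codebase) of layer lookup with per-model attribute names versus unified names; the ATTN_NAME_MAP table and both helper functions are hypothetical:

import torch.nn as nn

# Hypothetical per-model table: each model family exposes its
# attention/MLP submodules under a different attribute name.
ATTN_NAME_MAP = {
    'InternLM2DecoderLayer': ('attention', 'feed_forward'),
    'LlamaDecoderLayer': ('self_attn', 'mlp'),
}

def find_quant_targets(model: nn.Module):
    """Collect (attention, mlp) submodules that need W8A8 replacement."""
    targets = []
    for module in model.modules():
        names = ATTN_NAME_MAP.get(type(module).__name__)
        if names is None:
            continue
        attn_name, mlp_name = names
        targets.append((getattr(module, attn_name), getattr(module, mlp_name)))
    return targets

# With unified names, the per-model table disappears:
def find_quant_targets_unified(model: nn.Module):
    return [(m.self_attn, m.mlp) for m in model.modules()
            if hasattr(m, 'self_attn') and hasattr(m, 'mlp')]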

If necessary, I can work on a fix for this later.

Related resources

No response

Additional context

No response

HIT-cwh commented 5 months ago

Hi @yinfan98 ! Thank you for your advice.

In order to unify these names, it’s essential to ensure that the checkpoint being loaded before inference is also adjusted accordingly. Prior to the W8A8 inference, we need to smooth the activation outliers, which is handled through the script found at lmdeploy/lite/apis/smooth_quant.py.

This process reads the model weights via the 'from_pretrained' interface and then stores the refined model weights using the 'save_pretrained' interface. Note that we cannot change the weight names and module names of the model loaded by 'from_pretrained'. Hence, if there's a need to unify these module names, besides updating 'modeling.py', we will also have to rename all the modules before we save the weights.
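A rough sketch of what that renaming step could look like before calling 'save_pretrained' (the name pairs and the rename_state_dict helper are assumptions for illustration, not the actual lmdeploy implementation):

# Hypothetical mapping from InternLM2-style module names to the
# unified Llama-style names; the exact pairs are an assumption.
NAME_MAP = {
    '.attention.': '.self_attn.',
    '.feed_forward.': '.mlp.',
    '.attention_norm.': '.input_layernorm.',
    '.ffn_norm.': '.post_attention_layernorm.',
}

def rename_state_dict(state_dict):
    """Rewrite checkpoint keys so the saved weights match the unified
    module names used by the updated modeling file."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in NAME_MAP.items():
            new_key = new_key.replace(old, new)
        renamed[new_key] = value
    return renamed

# Usage sketch: smooth the model as usual, then rename before saving.
# state_dict = rename_state_dict(model.state_dict())
# model.save_pretrained(out_dir, state_dict=state_dict)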

Feel free to continue the conversation on this.