InternLM / InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Apache License 2.0

QLora fine tuning? #337

Open pbarker opened 4 months ago

pbarker commented 4 months ago

Hello, thank you for the amazing work, is it possible to use Qlora to fine tune the 4bit quant models?
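
For reference, the usual QLoRA recipe loads the base model in 4-bit with bitsandbytes and then attaches PEFT LoRA adapters on top. A minimal sketch follows; the model id and the target_modules names are assumptions and would need to be checked against the actual InternLM-XComposer module names (and whether this plays well with the PLoRA wrappers is exactly the open question in this thread):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Load the base model in 4-bit NF4 (bitsandbytes route, not the released GPTQ checkpoint)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-xcomposer2-vl-7b",   # model id is an assumption
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

# Attach LoRA adapters for QLoRA-style fine-tuning
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["wqkv", "wo", "w1", "w2", "w3"],  # assumed names; verify with print(model)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()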

nzomi commented 4 months ago

Hello, I tried using BitsAndBytesConfig to obtain and save a 4-bit model. However, I ran into an issue when generating a chat with the 4-bit model. Have you experienced anything similar? By the way, I followed the instructions with AutoGPTQ to obtain another 4-bit model, but I received a message stating that

'internlmxcomposer2 isn't supported yet.'

Has anyone else encountered this issue? How can I resolve it?

pbarker commented 4 months ago

@nzomi https://github.com/AutoGPTQ/AutoGPTQ/pull/619 and https://github.com/AutoGPTQ/AutoGPTQ/pull/189

nzomi commented 4 months ago

@nzomi AutoGPTQ/AutoGPTQ#619 and AutoGPTQ/AutoGPTQ#189

@pbarker Thank you for mentioning that. Indeed, I also created an issue in their repository and the problem was fixed. However, I tried to quantize the 4KHD model, but its structure is a bit different from the 7B version, which has become another challenge...

pbarker commented 4 months ago

Hey @nzomi we are going to try and quant the 4khd model next week if you want to share notes, also if there is a maintainer that can give any tips we would appreciate it!

nzomi commented 4 months ago

@pbarker Sure thing! I used the same method mentioned below to get the 4-bit model.

Hi, we used AutoGPTQ's default quantization method and did not introduce quantization-aware training: https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#quick-tour

Originally posted by @LightDXY in https://github.com/InternLM/InternLM-XComposer/issues/208#issuecomment-1985286884

However, I found some differences from the quick-start quantization demo. First of all, if layers_block_name is model.layers, the model will not be quantized, because none of the names in inside_layer_modules resolve to linear layers in this model. If you delete the .linear suffix from inside_layer_modules, then all linear layers will be quantized, specifically Plora_A and Plora_B, but these cannot simply be quantized with AutoGPTQ since their forward takes 'x' and 'im_mask' as inputs. You can quickly check the model structure; I've provided it below as well:

inside_layer_modules = [
        ["attention.wqkv.linear"],
        ["attention.wo.linear"],
        ["feed_forward.w1.linear", "feed_forward.w3.linear"],
        ["feed_forward.w2.linear"],
    ] 

[screenshot of the model structure] I think it is also impossible to quantize the vit module, so all we can do is quantize the vision_proj and output modules, which contain only simple Linear layers; but that is not our goal, as we aim to quantize the language model itself, not the other modules (vit, vision_proj, etc.).

I also checked the 4-bit model provided by the maintainers, and there is an extra linear layer inside model.layers, so they can simply quantize that linear layer, but how they achieved that is a mystery. [screenshot]

nzomi commented 4 months ago

@pbarker I noticed that in the VL-7B 4-bit model, the PLoRA class in build_mlp.py differs from the one used for the VL-7B-4KHD model. Specifically, the latter uses super().forward(x) in place of nn.Linear(). I believe that modifying it to use nn.Linear() and fine-tuning it from scratch might resolve the issue. [screenshots of the two PLoRA implementations]
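
For anyone following along, a rough, simplified sketch of the difference being described (not the actual build_mlp.py code; the PLoRA low-rank path is elided):

import torch.nn as nn

# 4KHD-style PLoRA: the base projection is the nn.Linear that PLoRA itself subclasses,
# so the base weight lives at e.g. "attention.wqkv.weight".
class PLoRA4KHD(nn.Linear):
    def forward(self, x, im_mask=None):
        res = super().forward(x)   # base projection uses this module's own weight
        # ... PLoRA low-rank path applied on the im_mask positions ...
        return res

# 4-bit-style PLoRA: the base projection is a child nn.Linear, so the base weight lives
# at "attention.wqkv.linear.weight" and AutoGPTQ can target "attention.wqkv.linear".
class PLoRA4bit(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, x, im_mask=None):
        res = self.linear(x)       # base projection uses the child module's weight
        # ... PLoRA low-rank path applied on the im_mask positions ...
        return res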

pbarker commented 4 months ago

@myownskyW7 could you give us a bit of direction on these pieces?

I also checked the 4-bit model provided by the maintainers, and there is an extra linear layer inside model.layers, so they can simply quantize that linear layer, but how they achieved that is a mystery.

and

I noticed that in the VL-7B 4-bit model, the PLoRA class in build_mlp.py differs from the one used for the VL-7B-4KHD model. Specifically, the latter uses super().forward(x) in place of nn.Linear(). I believe that modifying it to use nn.Linear() and fine-tuning it from scratch might resolve the issue.

Do you have any recommendations for producing a 4k quantized model?

nzomi commented 4 months ago

I noticed that in the VL-7B 4-bit model, the PLoRA class in build_mlp.py differs from the one used for the VL-7B-4KHD model. Specifically, the latter uses super().forward(x) in place of nn.Linear(). I believe that modifying it to use nn.Linear() and fine-tuning it from scratch might resolve the issue.

@pbarker Actually, this method failed, and I'm trying to locate the bug. To do this, you can check the model structure by printing the model (e.g., print(model)) and focus on the InternLM2Attention and InternLM2MLP classes; you might find the differences there. Additionally, check the build_mlp.py file to see how the PLoRA module differs between the 4KHD model and the 4-bit model provided by the developers: they use different linear layers in this module.

nzomi commented 4 months ago

@pbarker I successfully got the 4-bit model.

  1. I replaced all weight names in the .bin files from xx.weight to xx.linear.weight, where xx can be wo, wqkv, w1, w2, or w3. This was easily accomplished by loading the .bin file with torch.load and replacing all keys with their corresponding new keys (a quick sanity check for this rename is sketched right after this list).
  2. Similarly, I performed the same replacements in the pytorch_model.bin.index.json file.
  3. I modified the PLoRA class in build_mlp.py, replacing super().forward() with self.linear().
  4. Finally, I ran the AutoGPTQ quantization process as outlined in their repository, with the following config:
    class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
        layers_block_name = "model.layers"
        outside_layer_modules = [
            'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output',
        ]
        inside_layer_modules = [
            ["attention.wqkv.linear"],
            ["attention.wo.linear"],
            ["feed_forward.w1.linear", "feed_forward.w3.linear"],
            ["feed_forward.w2.linear"],
        ]
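
A quick sanity check for the key rename in step 1, assuming the standard two-shard checkpoint layout (the shard file name below is an assumption):

import torch

# After the rename, every attention / feed_forward base weight should carry the
# ".linear." infix; otherwise AutoGPTQ will log "... is not quantized" and skip it.
sd = torch.load("internlm-xcomposer2-4khd-7b/pytorch_model-00001-of-00002.bin",
                map_location="cpu")
print([k for k in sd if "attention.wqkv" in k][:5])
# the base weights should now read e.g. 'model.layers.0.attention.wqkv.linear.weight'
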
pbarker commented 4 months ago

Thanks @nzomi, we are working to recreate this; I guess we also have 2.5 to figure out 🙂

nzomi commented 4 months ago

@pbarker I hope this information helps you. Additionally, I found that the inference speed of the 4-bit model is not faster than the base model. If you encounter the same issue, please feel free to contact me.

zhuraromdev commented 4 months ago

Hello @nzomi, I hope you are doing well. I have a question regarding quantization of the model. I followed all the instructions you described above, however I still have an issue with the last step.

Code:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.modeling import BaseGPTQForCausalLM
import logging

# Set up logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S"
)

# Define the custom GPTQ class for the InternLM-XComposer2 model
class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
    layers_block_name = "model.layers"
    outside_layer_modules = [
        'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output',
    ]
    inside_layer_modules = [
        ["attention.wqkv.linear"],
        ["attention.wo.linear"],
        ["feed_forward.w1.linear", "feed_forward.w3.linear"],
        ["feed_forward.w2.linear"],
    ]

# Define model directories
local_model_dir = "internlm-xcomposer2-4khd-7b"#"internlm-xcomposer2-4khd-7b"
quantized_model_dir = "4bit-internlm-xcomposer2-4khd-7b"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(local_model_dir, use_fast=True, trust_remote_code=True) # here
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")
]

# Configure quantization
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may slightly bad
)

# Load and quantize the model using the custom class
model = InternLMXComposer2QForCausalLM.from_pretrained(local_model_dir, quantize_config, local_files_only=True, trust_remote_code=True) # here
model.quantize(examples)

# Save the quantized model
model.save_quantized(quantized_model_dir)
model.save_quantized(quantized_model_dir, use_safetensors=True)

# Load quantized model for inference
model = InternLMXComposer2QForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True) # here

# Inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# Or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[81], line 44
     37 quantize_config = BaseQuantizeConfig(
     38     bits=4,  # quantize model to 4-bit
     39     group_size=128,  # it is recommended to set the value to 128
     40     desc_act=False,  # set to False can significantly speed up inference but the perplexity may slightly bad
     41 )
     43 # Load and quantize the model using the custom class
---> 44 model = InternLMXComposer2QForCausalLMM.from_pretrained(local_model_dir, quantize_config, local_files_only=True, trust_remote_code=True) # here
     45 model.quantize(examples)
     47 # Save the quantized model

File ~/miniconda3/envs/intern/lib/python3.9/site-packages/auto_gptq/modeling/_base.py:752, in from_pretrained(cls, pretrained_model_name_or_path, quantize_config, max_memory, trust_remote_code, torch_dtype, **model_init_kwargs)

TypeError: internlmxcomposer2 isn't supported yet.

Do you have any suggestions how to fix it? Thank you in advance!

nzomi commented 4 months ago

@zhuraromdev AutoGPTQ doesn't support InternLM2 at the moment. The simplest workaround is to change the model_type in config.json from internlmxcomposer2 to internlm. Alternatively, you can add a new class (the same custom class you defined) for InternLM2 in the source code at AutoGPTQ/auto_gptq/modeling/internlmxcomposer2 (don't forget to add the model_type in _const.py and import the model in __init__.py if you choose this way!). Hopefully they will add support for this model type in the future. Another issue you might encounter is a NoneType error. The 4KHD model contains the plora_glb_GN and plora_sub_GN parameters, which don't have any name prefix. In AutoGPTQ, modules are selected for dispatch using the get_module_by_name_prefix function, which leads to a NoneType error for these two. I added two lines of code to avoid this problem, but I'm still looking for a more general solution. [screenshot]
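
A minimal sketch of the config.json workaround mentioned above (the local path is an assumption; keep a backup of the original file):

import json

cfg_path = "internlm-xcomposer2-4khd-7b/config.json"   # assumed local checkout
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["model_type"] = "internlm"   # lets AutoGPTQ's model-type check pass
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)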

zhuraromdev commented 4 months ago

@nzomi Thank you a lot for your help, however I still have issues with the quantization.

Steps taken:

  1. Installed AutoGPTQ from source. My current auto-gptq version is 0.8.0.dev0+cu121. I did not make any changes inside the AutoGPTQ repo.
  2. Ran snapshot_download() for the internlm/internlm-xcomposer2-4khd-7b repo.
  3. Updated the .bin files and the .json index file, set "model_type": "internlm" in config.json, and changed the PLoRA class in build_mlp.py.

Code:

import math
import torch
import torch.nn as nn

class PLoRA(nn.Linear):
    def __init__(self,
                 in_features: int,
                 out_features: int,
                 bias: bool = True,
                 device=None,
                 dtype=None,
                 lora_r=8,
                 lora_alpha=16,
                 lora_dropout=0.05,
                 lora_len=0,
                 **kwargs) -> None:
        super().__init__(in_features, out_features, bias, device, dtype)

        # Create a linear layer for self.linear
        self.linear = nn.Linear(in_features, out_features, bias, device=device, dtype=dtype)

        self.lora_r = lora_r
        self.lora_alpha = lora_alpha
        self.lora_len = lora_len
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x
        self.lora_scaling = self.lora_alpha / self.lora_r

        self.Plora_A = nn.Linear(in_features,
                                self.lora_r,
                                bias=False,
                                device=device,
                                dtype=dtype)
        self.Plora_B = nn.Linear(self.lora_r,
                                out_features,
                                bias=False,
                                device=device,
                                dtype=dtype)

        self.reset_parameters()

    def reset_parameters(self):
        if hasattr(self, 'Plora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.Plora_A.weight, a=math.sqrt(5))
            nn.init.zeros_(self.Plora_B.weight)

    def forward(self, x, im_mask=None):
        B, N, C = x.shape
        x = x.reshape(-1, C)
        if im_mask is not None:
            im_mask = im_mask.view(-1)
        res = self.linear(x) # use the newly defined self.linear
        if im_mask is not None:
            if torch.sum(im_mask) > 0:
                part_x = x[im_mask]
                res[im_mask] += self.Plora_B(self.Plora_A(
                    self.lora_dropout(part_x))) * self.lora_scaling
            else:
                part_x = x[:1]
                res[:1] += self.Plora_B(self.Plora_A(
                    self.lora_dropout(part_x))) * 0

        return res.reshape(B, N, -1)

The folder structure with the model files looks like this:

[screenshot of the model directory]
  4. Code and log for the quantization of the model:

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from auto_gptq.modeling import BaseGPTQForCausalLM
    import logging

    # Set up logging
    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=logging.INFO,
        datefmt="%Y-%m-%d %H:%M:%S"
    )

    # Define the custom class for the InternLM-XComposer2 model
    class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
        layers_block_name = "model.layers"
        outside_layer_modules = [
            'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output',
        ]
        inside_layer_modules = [
            ["attention.wqkv.linear"],
            ["attention.wo.linear"],
            ["feed_forward.w1.linear", "feed_forward.w3.linear"],
            ["feed_forward.w2.linear"],
        ]

    # Define model directories
    local_model_dir = "internlm-4khd-7b"
    quantized_model_dir = "4bit-internlm-4khd-7b"

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(local_model_dir, use_fast=True, trust_remote_code=True)
    examples = [
        tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")
    ]

    # Configure quantization
    quantize_config = BaseQuantizeConfig(
        bits=4,          # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
        desc_act=False,  # False speeds up inference, but perplexity may be slightly worse
    )

    # Load and quantize the model using the custom class
    model = InternLMXComposer2QForCausalLM.from_pretrained(local_model_dir, quantize_config, local_files_only=True, trust_remote_code=True)
    model.quantize(examples)

    # Save the quantized model
    model.save_quantized(quantized_model_dir)
    model.save_quantized(quantized_model_dir, use_safetensors=True)

Log: 

You are using a model of type internlm to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors. You are using a model of type internlm to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors. Set max length to 16384 Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.95s/it] Some weights of InternLMXComposer2ForCausalLM were not initialized from the model checkpoint at internlm-4khd-7b and are newly initialized: ['model.layers.4.feed_forward.w3.weight', 'vit.vision_tower.vision_model.post_layernorm.bias', 'model.layers.19.attention.wo.weight', 'model.layers.20.attention.wqkv.weight', 'model.layers.18.feed_forward.w3.weight', 'model.layers.28.attention.wqkv.weight', 'model.layers.27.feed_forward.w1.weight', 'model.layers.1.feed_forward.w2.weight', 'model.layers.29.attention.wo.weight', 'model.layers.9.attention.wqkv.weight', 'model.layers.18.feed_forward.w1.weight', 'model.layers.5.attention.wqkv.weight', 'model.layers.18.attention.wqkv.weight', 'model.layers.24.feed_forward.w3.weight', 'model.layers.11.feed_forward.w1.weight', 'model.layers.25.feed_forward.w2.weight', 'model.layers.27.attention.wqkv.weight', 'model.layers.4.feed_forward.w1.weight', 'model.layers.12.attention.wqkv.weight', 'model.layers.25.attention.wo.weight', 'model.layers.0.attention.wo.weight', 'model.layers.24.attention.wo.weight', 'model.layers.27.feed_forward.w2.weight', 'model.layers.21.attention.wo.weight', 'model.layers.15.feed_forward.w3.weight', 'model.layers.26.feed_forward.w1.weight', 'vit.vision_tower.vision_model.post_layernorm.weight', 'model.layers.29.feed_forward.w1.weight', 'model.layers.3.attention.wqkv.weight', 'model.layers.14.attention.wqkv.weight', 'model.layers.1.attention.wo.weight', 'model.layers.19.attention.wqkv.weight', 'model.layers.5.feed_forward.w2.weight', 'model.layers.5.attention.wo.weight', 'model.layers.15.feed_forward.w1.weight', 'model.layers.2.attention.wo.weight', 'model.layers.1.attention.wqkv.weight', 'model.layers.28.attention.wo.weight', 'model.layers.21.feed_forward.w1.weight', 'model.layers.27.feed_forward.w3.weight', 'model.layers.15.attention.wqkv.weight', 'model.layers.8.feed_forward.w1.weight', 'model.layers.27.attention.wo.weight', 'model.layers.23.attention.wqkv.weight', 'model.layers.14.feed_forward.w3.weight', 'model.layers.4.attention.wo.weight', 'model.layers.19.feed_forward.w1.weight', 'model.layers.12.attention.wo.weight', 'model.layers.9.attention.wo.weight', 'model.layers.21.feed_forward.w2.weight', 'model.layers.17.feed_forward.w3.weight', 'model.layers.17.feed_forward.w1.weight', 'model.layers.26.feed_forward.w3.weight', 'model.layers.31.feed_forward.w3.weight', 'model.layers.24.attention.wqkv.weight', 'model.layers.30.feed_forward.w2.weight', 'model.layers.18.feed_forward.w2.weight', 'model.layers.23.feed_forward.w3.weight', 'model.layers.6.feed_forward.w1.weight', 'model.layers.23.feed_forward.w2.weight', 'model.layers.16.feed_forward.w3.weight', 'model.layers.16.feed_forward.w1.weight', 'model.layers.6.attention.wqkv.weight', 'model.layers.16.attention.wqkv.weight', 'model.layers.12.feed_forward.w1.weight', 'model.layers.13.attention.wo.weight', 'model.layers.6.feed_forward.w3.weight', 'model.layers.13.feed_forward.w3.weight', 'model.layers.8.feed_forward.w2.weight', 'model.layers.29.feed_forward.w2.weight', 'model.layers.7.feed_forward.w3.weight', 
'model.layers.14.attention.wo.weight', 'model.layers.6.attention.wo.weight', 'model.layers.30.feed_forward.w3.weight', 'model.layers.28.feed_forward.w3.weight', 'model.layers.22.feed_forward.w2.weight', 'model.layers.5.feed_forward.w1.weight', 'model.layers.15.feed_forward.w2.weight', 'model.layers.31.attention.wo.weight', 'model.layers.22.feed_forward.w1.weight', 'model.layers.0.feed_forward.w2.weight', 'model.layers.3.feed_forward.w1.weight', 'model.layers.1.feed_forward.w3.weight', 'model.layers.10.attention.wo.weight', 'model.layers.3.feed_forward.w2.weight', 'model.layers.8.attention.wo.weight', 'model.layers.18.attention.wo.weight', 'model.layers.6.feed_forward.w2.weight', 'model.layers.7.feed_forward.w2.weight', 'model.layers.25.feed_forward.w3.weight', 'model.layers.4.attention.wqkv.weight', 'model.layers.10.attention.wqkv.weight', 'model.layers.20.feed_forward.w3.weight', 'model.layers.4.feed_forward.w2.weight', 'model.layers.14.feed_forward.w1.weight', 'model.layers.8.attention.wqkv.weight', 'model.layers.7.feed_forward.w1.weight', 'model.layers.9.feed_forward.w3.weight', 'model.layers.8.feed_forward.w3.weight', 'model.layers.31.feed_forward.w1.weight', 'model.layers.30.attention.wqkv.weight', 'model.layers.24.feed_forward.w1.weight', 'model.layers.30.feed_forward.w1.weight', 'model.layers.31.attention.wqkv.weight', 'model.layers.7.attention.wo.weight', 'model.layers.10.feed_forward.w1.weight', 'model.layers.20.attention.wo.weight', 'model.layers.22.attention.wo.weight', 'model.layers.26.feed_forward.w2.weight', 'model.layers.13.feed_forward.w2.weight', 'model.layers.17.attention.wqkv.weight', 'model.layers.12.feed_forward.w2.weight', 'model.layers.28.feed_forward.w1.weight', 'model.layers.3.feed_forward.w3.weight', 'model.layers.19.feed_forward.w2.weight', 'model.layers.23.feed_forward.w1.weight', 'model.layers.0.feed_forward.w1.weight', 'model.layers.10.feed_forward.w3.weight', 'model.layers.28.feed_forward.w2.weight', 'model.layers.30.attention.wo.weight', 'model.layers.14.feed_forward.w2.weight', 'model.layers.12.feed_forward.w3.weight', 'model.layers.11.attention.wqkv.weight', 'model.layers.29.feed_forward.w3.weight', 'model.layers.3.attention.wo.weight', 'model.layers.29.attention.wqkv.weight', 'model.layers.20.feed_forward.w2.weight', 'model.layers.31.feed_forward.w2.weight', 'model.layers.9.feed_forward.w1.weight', 'model.layers.24.feed_forward.w2.weight', 'model.layers.17.feed_forward.w2.weight', 'model.layers.9.feed_forward.w2.weight', 'model.layers.11.attention.wo.weight', 'model.layers.23.attention.wo.weight', 'model.layers.26.attention.wo.weight', 'model.layers.10.feed_forward.w2.weight', 'model.layers.0.feed_forward.w3.weight', 'model.layers.2.feed_forward.w2.weight', 'model.layers.21.feed_forward.w3.weight', 'model.layers.25.attention.wqkv.weight', 'model.layers.1.feed_forward.w1.weight', 'model.layers.19.feed_forward.w3.weight', 'model.layers.21.attention.wqkv.weight', 'model.layers.13.attention.wqkv.weight', 'model.layers.17.attention.wo.weight', 'model.layers.2.attention.wqkv.weight', 'model.layers.20.feed_forward.w1.weight', 'model.layers.11.feed_forward.w2.weight', 'model.layers.16.feed_forward.w2.weight', 'model.layers.25.feed_forward.w1.weight', 'model.layers.15.attention.wo.weight', 'model.layers.5.feed_forward.w3.weight', 'model.layers.22.feed_forward.w3.weight', 'model.layers.2.feed_forward.w3.weight', 'model.layers.22.attention.wqkv.weight', 'model.layers.0.attention.wqkv.weight', 'model.layers.26.attention.wqkv.weight', 
'model.layers.2.feed_forward.w1.weight', 'model.layers.11.feed_forward.w3.weight', 'model.layers.16.attention.wo.weight', 'model.layers.7.attention.wqkv.weight', 'model.layers.13.feed_forward.w1.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. INFO - Start quantizing layer 1/32 2024-07-11 11:29:43 INFO [auto_gptq.modeling._base] Start quantizing layer 1/32 INFO - Start quantizing layer 2/32 2024-07-11 11:29:44 INFO [auto_gptq.modeling._base] Start quantizing layer 2/32 INFO - Start quantizing layer 3/32 2024-07-11 11:29:44 INFO [auto_gptq.modeling._base] Start quantizing layer 3/32 INFO - Start quantizing layer 4/32 2024-07-11 11:29:45 INFO [auto_gptq.modeling._base] Start quantizing layer 4/32 INFO - Start quantizing layer 5/32 2024-07-11 11:29:45 INFO [auto_gptq.modeling._base] Start quantizing layer 5/32 INFO - Start quantizing layer 6/32 2024-07-11 11:29:46 INFO [auto_gptq.modeling._base] Start quantizing layer 6/32 INFO - Start quantizing layer 7/32 2024-07-11 11:29:46 INFO [auto_gptq.modeling._base] Start quantizing layer 7/32 INFO - Start quantizing layer 8/32 2024-07-11 11:29:47 INFO [auto_gptq.modeling._base] Start quantizing layer 8/32 INFO - Start quantizing layer 9/32 2024-07-11 11:29:48 INFO [auto_gptq.modeling._base] Start quantizing layer 9/32 INFO - Start quantizing layer 10/32 2024-07-11 11:29:48 INFO [auto_gptq.modeling._base] Start quantizing layer 10/32 INFO - Start quantizing layer 11/32 2024-07-11 11:29:49 INFO [auto_gptq.modeling._base] Start quantizing layer 11/32 INFO - Start quantizing layer 12/32 2024-07-11 11:29:49 INFO [auto_gptq.modeling._base] Start quantizing layer 12/32 INFO - Start quantizing layer 13/32 2024-07-11 11:29:50 INFO [auto_gptq.modeling._base] Start quantizing layer 13/32 INFO - Start quantizing layer 14/32 2024-07-11 11:29:51 INFO [auto_gptq.modeling._base] Start quantizing layer 14/32 INFO - Start quantizing layer 15/32 2024-07-11 11:29:51 INFO [auto_gptq.modeling._base] Start quantizing layer 15/32 INFO - Start quantizing layer 16/32 2024-07-11 11:29:52 INFO [auto_gptq.modeling._base] Start quantizing layer 16/32 INFO - Start quantizing layer 17/32 2024-07-11 11:29:53 INFO [auto_gptq.modeling._base] Start quantizing layer 17/32 INFO - Start quantizing layer 18/32 2024-07-11 11:29:53 INFO [auto_gptq.modeling._base] Start quantizing layer 18/32 INFO - Start quantizing layer 19/32 2024-07-11 11:29:54 INFO [auto_gptq.modeling._base] Start quantizing layer 19/32 INFO - Start quantizing layer 20/32 2024-07-11 11:29:55 INFO [auto_gptq.modeling._base] Start quantizing layer 20/32 INFO - Start quantizing layer 21/32 2024-07-11 11:29:55 INFO [auto_gptq.modeling._base] Start quantizing layer 21/32 INFO - Start quantizing layer 22/32 2024-07-11 11:29:56 INFO [auto_gptq.modeling._base] Start quantizing layer 22/32 INFO - Start quantizing layer 23/32 2024-07-11 11:29:57 INFO [auto_gptq.modeling._base] Start quantizing layer 23/32 INFO - Start quantizing layer 24/32 2024-07-11 11:29:57 INFO [auto_gptq.modeling._base] Start quantizing layer 24/32 INFO - Start quantizing layer 25/32 2024-07-11 11:29:58 INFO [auto_gptq.modeling._base] Start quantizing layer 25/32 INFO - Start quantizing layer 26/32 2024-07-11 11:29:59 INFO [auto_gptq.modeling._base] Start quantizing layer 26/32 INFO - Start quantizing layer 27/32 2024-07-11 11:29:59 INFO [auto_gptq.modeling._base] Start quantizing layer 27/32 INFO - Start quantizing layer 28/32 2024-07-11 11:30:00 INFO [auto_gptq.modeling._base] Start 
quantizing layer 28/32 INFO - Start quantizing layer 29/32 2024-07-11 11:30:00 INFO [auto_gptq.modeling._base] Start quantizing layer 29/32 INFO - Start quantizing layer 30/32 2024-07-11 11:30:01 INFO [auto_gptq.modeling._base] Start quantizing layer 30/32 INFO - Start quantizing layer 31/32 2024-07-11 11:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 31/32 INFO - Start quantizing layer 32/32 2024-07-11 11:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 32/32 2024-07-11 11:30:03 INFO [auto_gptq.modeling._utils] Packing model... 2024-07-11 11:30:03 INFO [auto_gptq.modeling._utils] Model packed.


  6. Created the folder with the quantized model. I also updated config.json there: "model_type": "internlm". [screenshot of the quantized model directory]

  7. Load the quantized model for inference:

    quantized_model_dir = "4bit-internlm-4khd-7b"

    # Load quantized model for inference
    model = InternLMXComposer2QForCausalLM.from_quantized(
        quantized_model_dir,
        device="cuda",
        local_files_only=True,
        trust_remote_code=True
    )  # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM

    # Inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

    # Or you can also use pipeline
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
    print(pipeline("auto-gptq is")[0]["generated_text"])


Error:

WARNING - Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source. 2024-07-11 11:41:30 WARNING [auto_gptq.modeling._base] Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source. WARNING - CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:

  1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
  2. You are using pytorch without CUDA support.
  3. CUDA and nvcc are not installed in your device. 2024-07-11 11:41:30 WARNING [auto_gptq.modeling._base] CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:
  4. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
  5. You are using pytorch without CUDA support.
  6. CUDA and nvcc are not installed in your device. You are using a model of type internlm to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors. WARNING - ignoring unknown parameter in quantize_config.json: quant_method. 2024-07-11 11:41:30 WARNING [auto_gptq.modeling._base] ignoring unknown parameter in quantize_config.json: quant_method. Could not locate the modeling_internlm_xcomposer2.py inside 4bit-internlm-4khd-7b.

    OSError Traceback (most recent call last) Cell In[16], line 4 1 quantized_model_dir = "4bit-internlm-4khd-7b" 3 # Load quantized model for inference ----> 4 model = InternLMXComposer2QForCausalLM.from_quantized( 5 quantized_model_dir, 6 device="cuda", 7 local_files_only=True, 8 trust_remote_code=True 9 ) # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM 11 # Inference with model.generate 12 print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

File ~/.local/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:999, in BaseGPTQForCausalLM.from_quantized(cls, model_name_or_path, device_map, max_memory, device, low_cpu_mem_usage, use_triton, use_qigen, use_marlin, torch_dtype, inject_fused_attention, inject_fused_mlp, use_cuda_fp16, quantize_config, model_basename, use_safetensors, trust_remote_code, warmup_triton, trainable, disable_exllama, disable_exllamav2, **kwargs) 996 init_contexts.append(accelerate.init_empty_weights(include_buffers=False)) 998 with ContextManagers(init_contexts): --> 999 model = AutoModelForCausalLM.from_config( 1000 config, trust_remote_code=trust_remote_code, torch_dtype=torch_dtype 1001 ) 1003 layers = find_layers(model) 1004 ignore_layers = [cls.lm_head_name] + cls.outside_layer_modules

File ~/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:437, in _BaseAutoModelClass.from_config(cls, config, kwargs) 435 else: 436 repo_id = config.name_or_path --> 437 model_class = get_class_from_dynamic_module(class_ref, repo_id, kwargs) 438 if os.path.isdir(config._name_or_path): 439 model_class.register_for_auto_class(cls.name)

File ~/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:485, in get_class_from_dynamic_module(class_reference, pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, code_revision, **kwargs) 483 code_revision = revision 484 # And lastly we get the class inside our newly created module --> 485 final_module = get_cached_module_file( 486 repo_id, 487 module_file + ".py", 488 cache_dir=cache_dir, 489 force_download=force_download, 490 resume_download=resume_download, 491 proxies=proxies, 492 token=token, 493 revision=code_revision, 494 local_files_only=local_files_only, 495 repo_type=repo_type, 496 ) 497 return get_class_in_module(class_name, final_module.replace(".py", ""))

File ~/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:292, in get_cached_module_file(pretrained_model_name_or_path, module_file, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, _commit_hash, **deprecated_kwargs) 289 new_files = [] 290 try: 291 # Load from URL or cache if already cached --> 292 resolved_module_file = cached_file( 293 pretrained_model_name_or_path, 294 module_file, 295 cache_dir=cache_dir, 296 force_download=force_download, 297 proxies=proxies, 298 resume_download=resume_download, 299 local_files_only=local_files_only, 300 token=token, 301 revision=revision, 302 repo_type=repo_type, 303 _commit_hash=_commit_hash, 304 ) 305 if not is_local and cached_module != resolved_module_file: 306 new_files.append(module_file)

File ~/.local/lib/python3.10/site-packages/transformers/utils/hub.py:400, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs) 398 if not os.path.isfile(resolved_file): 399 if _raise_exceptions_for_missing_entries: --> 400 raise EnvironmentError( 401 f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout " 402 f"'https://huggingface.co/{path_or_repo_id}/{revision}' for available files." 403 ) 404 else: 405 return None

OSError: 4bit-internlm-4khd-7b does not appear to have a file named modeling_internlm_xcomposer2.py. Checkout 'https://huggingface.co/4bit-internlm-4khd-7b/None' for available files.



Do you have any suggestions how to fix this issue? Also, I am not sure that the quantization process finished successfully, as in the log I see layers such as model.layers.22.attention.wo.weight, which should have been replaced with model.layers.22.attention.wo.linear.weight.
nzomi commented 4 months ago

@zhuraromdev Did you put the other files, like tokenization_internlm2.py, into the quantized dir? The whole dir should contain the files shown below. And I think maybe you just forgot to rename the layer 22 keys. [screenshot of the expected directory contents]
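
If it helps, a small sketch of copying the trust_remote_code files from the original checkout into the quantized directory (both paths are assumptions):

import glob, os, shutil

src = "internlm-xcomposer2-4khd-7b"         # original model dir
dst = "4bit-internlm-xcomposer2-4khd-7b"    # quantized output dir
for py in glob.glob(os.path.join(src, "*.py")):
    # modeling_internlm_xcomposer2.py, build_mlp.py, tokenization_internlm2.py, ...
    shutil.copy(py, dst)
# Depending on your setup, tokenizer and config assets may need to be copied as well.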

zhuraromdev commented 4 months ago

@nzomi Nope, I didn't. And should I replace the files which were created during quantization? [screenshot]

zhuraromdev commented 4 months ago

@nzomi I have run the code:

quantized_model_dir = "OLD_4bit-internlm-xcomposer2-4khd-7b"

# Load quantized model for inference
model = InternLMXComposer2QForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda",
    local_files_only=True,
    trust_remote_code=True
) # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM

# Inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# Or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

Now I am getting a lot of log output about layers that are not quantized:

...
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc1 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc1 is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc2 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc2 is not quantized.
INFO - The layer vision_proj.0 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vision_proj.0 is not quantized.
INFO - The layer vision_proj.2 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vision_proj.2 is not quantized.

And an error about NoneType:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 4
      1 quantized_model_dir = "OLD_4bit-internlm-xcomposer2-4khd-7b"
      3 # Load quantized model for inference
----> 4 model = InternLMXComposer2QForCausalLM.from_quantized(
      5     quantized_model_dir,
      6     device="cuda",
      7     local_files_only=True,
      8     trust_remote_code=True
      9 ) # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM
     11 # Inference with model.generate
     12 print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

File ~/.local/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:1246, in BaseGPTQForCausalLM.from_quantized(cls, model_name_or_path, device_map, max_memory, device, low_cpu_mem_usage, use_triton, use_qigen, use_marlin, torch_dtype, inject_fused_attention, inject_fused_mlp, use_cuda_fp16, quantize_config, model_basename, use_safetensors, trust_remote_code, warmup_triton, trainable, disable_exllama, disable_exllamav2, **kwargs)
   1243         inject_fused_attention = False
   1244         inject_fused_mlp = False
-> 1246 accelerate.utils.modeling.load_checkpoint_in_model(
   1247     model,
   1248     dtype=torch_dtype,  # This is very hacky but works due to https://github.com/huggingface/accelerate/blob/bd72a5f1a80d5146554458823f8aeda0a9db5297/src/accelerate/utils/modeling.py#L292
   1249     checkpoint=model_save_name,
   1250     device_map=device_map,
   1251     offload_state_dict=True,
   1252     offload_buffers=True,
   1253 )
   1255 # TODO: Why are we using this custom function and not dispatch_model?
   1256 model = simple_dispatch_model(model, device_map)

File ~/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py:1797, in load_checkpoint_in_model(model, checkpoint, device_map, offload_folder, dtype, offload_state_dict, offload_buffers, keep_in_fp32_modules, offload_8bit_bnb, strict)
   1795                 offload_weight(param, param_name, state_dict_folder, index=state_dict_index)
   1796         else:
-> 1797             set_module_tensor_to_device(
   1798                 model,
   1799                 param_name,
   1800                 param_device,
   1801                 value=param,
   1802                 dtype=new_dtype,
   1803                 fp16_statistics=fp16_statistics,
   1804             )
   1806 # Force Python to clean up.
   1807 del loaded_checkpoint

File ~/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py:382, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    375 device_quantization = None
    376 with torch.no_grad():
    377     # leave it on cpu first before moving them to cuda
    378     # # fix the case where the device is meta, we don't want to put it on cpu because there is no data =0
    379     if (
    380         param is not None
    381         and param.device.type != "cuda"
--> 382         and torch.device(device).type == "cuda"
    383         and param_cls.__name__ in ["Int8Params", "FP4Params", "Params4bit"]
    384     ):
    385         device_quantization = device
    386         device = "cpu"

TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
 * (torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!)
 * (str type, int index)

Can I fix the NoneType issue with this? [screenshot] And should I run pip install -vvv --no-build-isolation -e . after the update?

nzomi commented 4 months ago

@zhuraromdev Yes, only the model.layers.*.attention.* and model.layers.*.feed_forward.* layers will be quantized if you follow this custom class. And it's not necessary to rebuild if you have already changed the code in _utils. I also recommend adding some breakpoints to check where the NoneType comes from; for me it was plora_glb_GN and plora_sub_GN, so I added those two lines in that file. [screenshot]
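
The exact two lines are only visible in the screenshot, so the following is a purely hypothetical reconstruction of that kind of guard; the location (somewhere around where AutoGPTQ builds the device map before calling accelerate) and the variable names are assumptions:

# Hypothetical: give the bare top-level parameters an explicit device before
# accelerate's load_checkpoint_in_model resolves the device_map, so they no
# longer come back as None.
device_map.setdefault("plora_glb_GN", device or "cuda:0")
device_map.setdefault("plora_sub_GN", device or "cuda:0")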

zhuraromdev commented 4 months ago

@nzomi Thank you a lot! So, judging from this log, something went wrong during quantization, didn't it?

INFO - The layer model.layers.0.attention.wqkv is not quantized.
INFO - The layer model.layers.0.attention.wo is not quantized.
INFO - The layer model.layers.0.feed_forward.w1 is not quantized.
INFO - The layer model.layers.0.feed_forward.w3 is not quantized.
INFO - The layer model.layers.0.feed_forward.w2 is not quantized.
INFO - The layer model.layers.1.attention.wqkv is not quantized.
INFO - The layer model.layers.1.attention.wo is not quantized.
nzomi commented 4 months ago

@zhuraromdev Judging from this log, I guess you didn't replace the keys correctly; maybe you can replace them once more. By the way, the AWQ quantization provided by LMDeploy makes it easier to quantize the 4KHD model, and inference is faster! But then you can only run the quantized model through the LMDeploy pipeline.
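
For reference, the LMDeploy route mentioned above looks roughly like this on the inference side (the quantized-model path is an assumption; the AWQ step itself is done beforehand with LMDeploy's lite/auto_awq tooling, and depending on the version you may also need to pass an engine config with model_format='awq'):

from lmdeploy import pipeline

pipe = pipeline("./internlm-xcomposer2-4khd-7b-4bit")   # assumed output dir of the AWQ step
print(pipe(["Describe this checkpoint in one sentence."]))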

zhuraromdev commented 4 months ago

@nzomi It seems that pytorch_model.bin.index.json was replaced correctly. I will also check LMDeploy, thank you! I am also providing the code used for the replacement:

[screenshots of the replaced keys]

Code for replacement of keys in json:

import os
import json

def replace_keys_in_json(file_path):
    # Verify if the file exists
    if not os.path.isfile(file_path):
        print(f"File not found: {file_path}")
        print("Current working directory:", os.getcwd())
        print("Directory contents:", os.listdir(os.path.dirname(file_path) or '.'))
        return

    # Load the .json file
    with open(file_path, 'r') as f:
        data = json.load(f)

    # Define the replacement mapping
    replacements = {
        'wo.weight': 'wo.linear.weight',
        'wqkv.weight': 'wqkv.linear.weight',
        'w1.weight': 'w1.linear.weight',
        'w2.weight': 'w2.linear.weight',
        'w3.weight': 'w3.linear.weight'
    }

    # Replace keys within the weight_map dictionary
    weight_map = data.get('weight_map', {})
    new_weight_map = {}
    for key, value in weight_map.items():
        new_key = key
        for old, new in replacements.items():
            if old in key:
                new_key = key.replace(old, new)
                break
        new_weight_map[new_key] = value

    # Update the data dictionary with the new weight_map
    data['weight_map'] = new_weight_map

    # Save the modified dictionary back to a .json file
    new_file_path = file_path.replace('.json', '_modified.json')
    with open(new_file_path, 'w') as f:
        json.dump(data, f, indent=4)
    print(f"Modified .json file saved as {new_file_path}")

Code for replacement of keys in bin:

import torch
import os

def replace_keys_in_bin(file_path):
    # Ensure the path is absolute
    absolute_file_path = os.path.abspath(file_path)
    print(f"Using absolute file path: {absolute_file_path}")

    # Verify if the file exists
    if not os.path.isfile(absolute_file_path):
        print(f"File not found: {absolute_file_path}")
        print("Current working directory:", os.getcwd())
        print("Directory contents:", os.listdir(os.path.dirname(absolute_file_path) or '.'))
        return

    try:
        # Load the .bin file
        model_dict = torch.load(absolute_file_path)
    except Exception as e:
        print(f"Error loading file: {e}")
        return

    # Define the replacement mapping
    replacements = {
        'wo.weight': 'wo.linear.weight',
        'wqkv.weight': 'wqkv.linear.weight',
        'w1.weight': 'w1.linear.weight',
        'w2.weight': 'w2.linear.weight',
        'w3.weight': 'w3.linear.weight'
    }

    # Create a new dictionary with the replaced keys
    new_model_dict = {}
    for key, value in model_dict.items():
        new_key = key
        for old, new in replacements.items():
            if old in key:
                new_key = key.replace(old, new)
                break
        new_model_dict[new_key] = value

    # Save the modified dictionary back to a .bin file
    new_file_path = absolute_file_path.replace('.bin', '_modified.bin')
    torch.save(new_model_dict, new_file_path)
    print(f"Modified .bin file saved as {new_file_path}")
zhuraromdev commented 3 months ago

@nzomi Hey, thank you for recommending lmdeploy for quantization, it worked fine for me :) However, when I tried to use the quantized model for fine-tuning with InternLM-XComposer, it did not work, since the only way I can access the model is through lmdeploy inference.

So I decided to come back to quantization with AutoGPTQ. I changed the source code as you described above and the quantization process finished successfully. However, I ran into some issues while loading the quantized model:

Do you have any suggestions on how to resolve it?

nzomi commented 3 months ago

@zhuraromdev That is a bit weird. Did you check your .bin weight keys after the replacement?