pbarker opened this issue 4 months ago
Hello, I tried using BitsAndBytesConfig to obtain and save a 4-bit model. However, I encountered an issue when trying to generate a chat with the 4-bit model. Have you experienced a similar issue?
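For context, a minimal sketch of this kind of 4-bit load and save with BitsAndBytesConfig (the model id and the specific settings here are assumptions, not necessarily what was used in this issue):

```python
# Sketch: load an XComposer2 checkpoint in 4-bit with bitsandbytes and save it.
# Saving bnb 4-bit weights requires a recent transformers/bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "internlm/internlm-xcomposer2-vl-7b"  # assumed checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model.save_pretrained("xcomposer2-4bit-bnb")
tokenizer.save_pretrained("xcomposer2-4bit-bnb")
```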
By the way, I followed the instructions with AutoGPTQ
to obtain another 4-bit model, but I received a message stating that
'internlmxcomposer2 isn't supported yet.'
Has anyone else encountered this issue? How can I resolve it?
@nzomi AutoGPTQ/AutoGPTQ#619 and AutoGPTQ/AutoGPTQ#189
@pbarker Thank you for mentioning that. Indeed, I also created an issue in their repository and the problem was fixed. However, I tried to quantize the 4KHD model, but its structure is a bit different from the 7B version, which has become another challenge...
Hey @nzomi we are going to try and quant the 4khd model next week if you want to share notes, also if there is a maintainer that can give any tips we would appreciate it!
@pbarker Sure thing! I used the same method mentioned below to get the 4-bit model.
Hi, we used auto-gptq's default quantization method; no quantization-aware training was involved. https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#quick-tour
Originally posted by @LightDXY in https://github.com/InternLM/InternLM-XComposer/issues/208#issuecomment-1985286884
However, I found some differences from the quick-start quantization demo. First of all, if the layers_block_name is model.layers, the model will not be quantized, since there are no matching linear layers in the inside_layer_modules. If you delete the .linear suffix in the inside_layer_modules, then all linear layers will be quantized, specifically Plora_A and Plora_B, but these cannot be quantized simply with AutoGPTQ since they take both 'x' and 'im_mask' as inputs. You can quickly check the model structure (a quick way to do this is sketched after the list below); I've provided it below as well:
inside_layer_modules = [
["attention.wqkv.linear"],
["attention.wo.linear"],
["feed_forward.w1.linear", "feed_forward.w3.linear"],
["feed_forward.w2.linear"],
]
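A sketch of how one might list the Linear layers inside a decoder block to verify the names that inside_layer_modules must match (this assumes `model` is an already-loaded InternLM-XComposer2 model; the attribute path follows layers_block_name = "model.layers"):

```python
# Sketch: print the nn.Linear sub-modules of the first decoder block, so names
# such as "attention.wqkv.linear" can be checked against inside_layer_modules.
import torch.nn as nn

block = model.model.layers[0]
for name, module in block.named_modules():
    if isinstance(module, nn.Linear):
        print(name, tuple(module.weight.shape))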
I think it is also impossible to quantize the vit module, so all we can do is quantize the vision_proj and output modules, which just contain simple Linear layers; but that is not our goal, as we aim to quantize the model itself, not the other modules (vit, vision_proj, etc.).
I also checked the 4-bit model provided by the maintainers, and there is an extra linear layer inside model.layers, so they can simply quantize that linear layer, but how they achieved it is a mystery.
@pbarker I noticed that in the VL-7B 4-bit model, the PLoRA class in build_mlp.py differs from the one used for the VL-7B-4KHD model. Specifically, the latter uses super().forward(x) in place of nn.Linear(). I believe that modifying it to use nn.Linear() and fine-tuning from scratch might resolve the issue.
@myownskyW7 could you give us a bit of direction on these pieces?
I also checked the 4-bit model provided by the maintainers, and there is an extra linear layer inside model.layers, so they can simply quantize that linear layer, but how they achieved it is a mystery.
and
I noticed that in the VL-7B 4-bit model, the PLoRA class in build_mlp.py differs from the one used for the VL-7B-4KHD model. Specifically, the latter uses super().forward(x) in place of nn.Linear(). I believe that modifying it to use nn.Linear() and fine-tuning from scratch might resolve the issue.
Do you have any recommendations for producing a 4k quantized model?
I noticed that in the VL-7B 4-bit model, the PLoRA class in build_mlp.py differs from the one used for the VL-7B-4KHD model. Specifically, the latter uses super().forward(x) in place of nn.Linear(). I believe that modifying it to use nn.Linear() and fine-tuning from scratch might resolve the issue.
@pbarker Actually, this method failed, and I'm trying to locate the bug. To do this, you can check the model structure by printing the model (e.g., print(model)) and focus on the InternLM2Attention and InternLM2MLP classes. You might find the differences there. Additionally, check the build_mlp.py file to see the differences in the PLoRA module between the 4KHD model and the 4-bit model provided by the developers. They use different linear layers in this module.
@pbarker I successfully got the 4-bit model. Here are the steps:
1. Rename the weight keys from xx.weight to xx.linear.weight, where xx can be wo, wqkv, w1, w2, or w3. This was easily accomplished by loading the .bin file using torch.load and replacing all keys with their corresponding new keys.
2. Apply the same renaming in the pytorch_model.bin.index.json file.
3. Modify build_mlp.py by replacing super().forward() with self.linear().
4. Define the custom class for AutoGPTQ:
class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
layers_block_name = "model.layers"
outside_layer_modules = [
'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output',
]
inside_layer_modules = [
["attention.wqkv.linear"],
["attention.wo.linear"],
["feed_forward.w1.linear", "feed_forward.w3.linear"],
["feed_forward.w2.linear"],
]
Thanks @nzomi we are working to recreate, I guess we also have 2.5 to figure out 🙂
@pbarker I hope this information helps you. Additionally, I found that the inference speed of the 4-bit model is not faster than the base model. If you encounter the same issue, please feel free to contact me.
Hello @nzomi, I hope you are doing well. I have a question regarding quantization of the model. I have followed all the instructions you described above; however, I still have an issue with the last step.
Code:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.modeling import BaseGPTQForCausalLM
import logging
# Set up logging
logging.basicConfig(
format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
level=logging.INFO,
datefmt="%Y-%m-%d %H:%M:%S"
)
# Define the custom GPTQ class for InternLM-XComposer2
class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
layers_block_name = "model.layers"
outside_layer_modules = [
'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output',
]
inside_layer_modules = [
["attention.wqkv.linear"],
["attention.wo.linear"],
["feed_forward.w1.linear", "feed_forward.w3.linear"],
["feed_forward.w2.linear"],
]
# Define model directories
local_model_dir = "internlm-xcomposer2-4khd-7b"
quantized_model_dir = "4bit-internlm-xcomposer2-4khd-7b"
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(local_model_dir, use_fast=True, trust_remote_code=True) # here
examples = [
tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")
]
# Configure quantization
quantize_config = BaseQuantizeConfig(
bits=4, # quantize model to 4-bit
group_size=128, # it is recommended to set the value to 128
desc_act=False, # set to False can significantly speed up inference but the perplexity may slightly bad
)
# Load and quantize the model using the custom class
model = InternLMXComposer2QForCausalLM.from_pretrained(local_model_dir, quantize_config, local_files_only=True, trust_remote_code=True) # here
model.quantize(examples)
# Save the quantized model
model.save_quantized(quantized_model_dir)
model.save_quantized(quantized_model_dir, use_safetensors=True)
# Load quantized model for inference
model = InternLMXComposer2QForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True) # here
# Inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
# Or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[81], line 44
37 quantize_config = BaseQuantizeConfig(
38 bits=4, # quantize model to 4-bit
39 group_size=128, # it is recommended to set the value to 128
40 desc_act=False, # set to False can significantly speed up inference but the perplexity may slightly bad
41 )
43 # Load and quantize the model using the custom class
---> 44 model = InternLMXComposer2QForCausalLMM.from_pretrained(local_model_dir, quantize_config, local_files_only=True, trust_remote_code=True) # here
45 model.quantize(examples)
47 # Save the quantized model
File ~/miniconda3/envs/intern/lib/python3.9/site-packages/auto_gptq/modeling/_base.py:752, in from_pretrained(cls, pretrained_model_name_or_path, quantize_config, max_memory, trust_remote_code, torch_dtype, **model_init_kwargs)
TypeError: internlmxcomposer2 isn't supported yet.
Do you have any suggestions how to fix it? Thank you in advance!
@zhuraromdev AutoGPTQ doesn't support InternLM2 at the moment. The simplest workaround is to change the model_type in config.json from internlmxcomposer2 to internlm. Alternatively, you can add a new class (the same custom class you defined) for InternLM2 in the source code at this path: AutoGPTQ/auto_gptq/modeling/internlmxcomposer2 (don't forget to add the model_type in _const.py and import the model in __init__.py if you choose this way!). Hopefully, they will add support for this model type in the future.
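A minimal sketch of the first workaround, assuming the local checkpoint directory used earlier in this thread (the path is an assumption):

```python
# Sketch: switch model_type in the local checkpoint's config.json so AutoGPTQ accepts it.
import json

cfg_path = "internlm-xcomposer2-4khd-7b/config.json"  # assumed local path
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["model_type"] = "internlm"  # was "internlmxcomposer2"
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```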
Another issue you might encounter is a NoneType error. The 4KHD model contains the plora_glb_GN and plora_sub_GN layers, which don't have any name prefix. AutoGPTQ selects the modules to dispatch using the get_module_by_name_prefix function, which leads to a NoneType error for these two modules. I added two lines of code to avoid this problem, but I'm still looking for a more general solution.
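To see whether your checkpoint has these prefix-less parameters, a quick check along these lines can help (a sketch; it assumes `model` is the loaded 4KHD model):

```python
# Sketch: list the parameters whose names match the problematic prefix-less layers.
for name, param in model.named_parameters():
    if "plora_glb_GN" in name or "plora_sub_GN" in name:
        print(name, tuple(param.shape))
```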
@nzomi Thank you a lot for your help; however, I still have issues with quantization.
Steps that were taken:
1. Version: 0.8.0.dev0+cu121. Also, I did not make any changes inside the AutoGPTQ repo.
2. snapshot_download() of the internlm/internlm-xcomposer2-4khd-7b repo.
3. Set "model_type": "internlm".
4. Updated build_mlp.py; the PLoRA class was changed. Code:
class PLoRA(nn.Linear):
def __init__(self,
in_features: int,
out_features: int,
bias: bool = True,
device=None,
dtype=None,
lora_r=8,
lora_alpha=16,
lora_dropout=0.05,
lora_len=0,
**kwargs) -> None:
super().__init__(in_features, out_features, bias, device, dtype)
# Create a linear layer for self.linear
self.linear = nn.Linear(in_features, out_features, bias, device=device, dtype=dtype)
self.lora_r = lora_r
self.lora_alpha = lora_alpha
self.lora_len = lora_len
if lora_dropout > 0.:
self.lora_dropout = nn.Dropout(p=lora_dropout)
else:
self.lora_dropout = lambda x: x
self.lora_scaling = self.lora_alpha / self.lora_r
self.Plora_A = nn.Linear(in_features,
self.lora_r,
bias=False,
device=device,
dtype=dtype)
self.Plora_B = nn.Linear(self.lora_r,
out_features,
bias=False,
device=device,
dtype=dtype)
self.reset_parameters()
def reset_parameters(self):
if hasattr(self, 'Plora_A'):
# initialize A the same way as the default for nn.Linear and B to zero
nn.init.kaiming_uniform_(self.Plora_A.weight, a=math.sqrt(5))
nn.init.zeros_(self.Plora_B.weight)
def forward(self, x, im_mask=None):
B, N, C = x.shape
x = x.reshape(-1, C)
if im_mask is not None:
im_mask = im_mask.view(-1)
res = self.linear(x) # use the newly defined self.linear
if im_mask is not None:
if torch.sum(im_mask) > 0:
part_x = x[im_mask]
res[im_mask] += self.Plora_B(self.Plora_A(
self.lora_dropout(part_x))) * self.lora_scaling
else:
part_x = x[:1]
res[:1] += self.Plora_B(self.Plora_A(
self.lora_dropout(part_x))) * 0
return res.reshape(B, N, -1)
The structure of the folder with the model files looks like this: (screenshot)
5. Ran the quantization code:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.modeling import BaseGPTQForCausalLM
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S")

class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
    layers_block_name = "model.layers"
    outside_layer_modules = [
        'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output',
    ]
    inside_layer_modules = [
        ["attention.wqkv.linear"],
        ["attention.wo.linear"],
        ["feed_forward.w1.linear", "feed_forward.w3.linear"],
        ["feed_forward.w2.linear"],
    ]

local_model_dir = "internlm-4khd-7b"
quantized_model_dir = "4bit-internlm-4khd-7b"

tokenizer = AutoTokenizer.from_pretrained(local_model_dir, use_fast=True, trust_remote_code=True)
examples = [
    tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may slightly bad
)

model = InternLMXComposer2QForCausalLM.from_pretrained(local_model_dir, quantize_config, local_files_only=True, trust_remote_code=True)
model.quantize(examples)

model.save_quantized(quantized_model_dir)
model.save_quantized(quantized_model_dir, use_safetensors=True)
Log:
You are using a model of type internlm to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors. You are using a model of type internlm to instantiate a model of type internlm2. This is not supported for all configurations of models and can yield errors. Set max length to 16384 Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.95s/it] Some weights of InternLMXComposer2ForCausalLM were not initialized from the model checkpoint at internlm-4khd-7b and are newly initialized: ['model.layers.4.feed_forward.w3.weight', 'vit.vision_tower.vision_model.post_layernorm.bias', 'model.layers.19.attention.wo.weight', 'model.layers.20.attention.wqkv.weight', 'model.layers.18.feed_forward.w3.weight', 'model.layers.28.attention.wqkv.weight', 'model.layers.27.feed_forward.w1.weight', 'model.layers.1.feed_forward.w2.weight', 'model.layers.29.attention.wo.weight', 'model.layers.9.attention.wqkv.weight', 'model.layers.18.feed_forward.w1.weight', 'model.layers.5.attention.wqkv.weight', 'model.layers.18.attention.wqkv.weight', 'model.layers.24.feed_forward.w3.weight', 'model.layers.11.feed_forward.w1.weight', 'model.layers.25.feed_forward.w2.weight', 'model.layers.27.attention.wqkv.weight', 'model.layers.4.feed_forward.w1.weight', 'model.layers.12.attention.wqkv.weight', 'model.layers.25.attention.wo.weight', 'model.layers.0.attention.wo.weight', 'model.layers.24.attention.wo.weight', 'model.layers.27.feed_forward.w2.weight', 'model.layers.21.attention.wo.weight', 'model.layers.15.feed_forward.w3.weight', 'model.layers.26.feed_forward.w1.weight', 'vit.vision_tower.vision_model.post_layernorm.weight', 'model.layers.29.feed_forward.w1.weight', 'model.layers.3.attention.wqkv.weight', 'model.layers.14.attention.wqkv.weight', 'model.layers.1.attention.wo.weight', 'model.layers.19.attention.wqkv.weight', 'model.layers.5.feed_forward.w2.weight', 'model.layers.5.attention.wo.weight', 'model.layers.15.feed_forward.w1.weight', 'model.layers.2.attention.wo.weight', 'model.layers.1.attention.wqkv.weight', 'model.layers.28.attention.wo.weight', 'model.layers.21.feed_forward.w1.weight', 'model.layers.27.feed_forward.w3.weight', 'model.layers.15.attention.wqkv.weight', 'model.layers.8.feed_forward.w1.weight', 'model.layers.27.attention.wo.weight', 'model.layers.23.attention.wqkv.weight', 'model.layers.14.feed_forward.w3.weight', 'model.layers.4.attention.wo.weight', 'model.layers.19.feed_forward.w1.weight', 'model.layers.12.attention.wo.weight', 'model.layers.9.attention.wo.weight', 'model.layers.21.feed_forward.w2.weight', 'model.layers.17.feed_forward.w3.weight', 'model.layers.17.feed_forward.w1.weight', 'model.layers.26.feed_forward.w3.weight', 'model.layers.31.feed_forward.w3.weight', 'model.layers.24.attention.wqkv.weight', 'model.layers.30.feed_forward.w2.weight', 'model.layers.18.feed_forward.w2.weight', 'model.layers.23.feed_forward.w3.weight', 'model.layers.6.feed_forward.w1.weight', 'model.layers.23.feed_forward.w2.weight', 'model.layers.16.feed_forward.w3.weight', 'model.layers.16.feed_forward.w1.weight', 'model.layers.6.attention.wqkv.weight', 'model.layers.16.attention.wqkv.weight', 'model.layers.12.feed_forward.w1.weight', 'model.layers.13.attention.wo.weight', 'model.layers.6.feed_forward.w3.weight', 'model.layers.13.feed_forward.w3.weight', 'model.layers.8.feed_forward.w2.weight', 'model.layers.29.feed_forward.w2.weight', 'model.layers.7.feed_forward.w3.weight', 
'model.layers.14.attention.wo.weight', 'model.layers.6.attention.wo.weight', 'model.layers.30.feed_forward.w3.weight', 'model.layers.28.feed_forward.w3.weight', 'model.layers.22.feed_forward.w2.weight', 'model.layers.5.feed_forward.w1.weight', 'model.layers.15.feed_forward.w2.weight', 'model.layers.31.attention.wo.weight', 'model.layers.22.feed_forward.w1.weight', 'model.layers.0.feed_forward.w2.weight', 'model.layers.3.feed_forward.w1.weight', 'model.layers.1.feed_forward.w3.weight', 'model.layers.10.attention.wo.weight', 'model.layers.3.feed_forward.w2.weight', 'model.layers.8.attention.wo.weight', 'model.layers.18.attention.wo.weight', 'model.layers.6.feed_forward.w2.weight', 'model.layers.7.feed_forward.w2.weight', 'model.layers.25.feed_forward.w3.weight', 'model.layers.4.attention.wqkv.weight', 'model.layers.10.attention.wqkv.weight', 'model.layers.20.feed_forward.w3.weight', 'model.layers.4.feed_forward.w2.weight', 'model.layers.14.feed_forward.w1.weight', 'model.layers.8.attention.wqkv.weight', 'model.layers.7.feed_forward.w1.weight', 'model.layers.9.feed_forward.w3.weight', 'model.layers.8.feed_forward.w3.weight', 'model.layers.31.feed_forward.w1.weight', 'model.layers.30.attention.wqkv.weight', 'model.layers.24.feed_forward.w1.weight', 'model.layers.30.feed_forward.w1.weight', 'model.layers.31.attention.wqkv.weight', 'model.layers.7.attention.wo.weight', 'model.layers.10.feed_forward.w1.weight', 'model.layers.20.attention.wo.weight', 'model.layers.22.attention.wo.weight', 'model.layers.26.feed_forward.w2.weight', 'model.layers.13.feed_forward.w2.weight', 'model.layers.17.attention.wqkv.weight', 'model.layers.12.feed_forward.w2.weight', 'model.layers.28.feed_forward.w1.weight', 'model.layers.3.feed_forward.w3.weight', 'model.layers.19.feed_forward.w2.weight', 'model.layers.23.feed_forward.w1.weight', 'model.layers.0.feed_forward.w1.weight', 'model.layers.10.feed_forward.w3.weight', 'model.layers.28.feed_forward.w2.weight', 'model.layers.30.attention.wo.weight', 'model.layers.14.feed_forward.w2.weight', 'model.layers.12.feed_forward.w3.weight', 'model.layers.11.attention.wqkv.weight', 'model.layers.29.feed_forward.w3.weight', 'model.layers.3.attention.wo.weight', 'model.layers.29.attention.wqkv.weight', 'model.layers.20.feed_forward.w2.weight', 'model.layers.31.feed_forward.w2.weight', 'model.layers.9.feed_forward.w1.weight', 'model.layers.24.feed_forward.w2.weight', 'model.layers.17.feed_forward.w2.weight', 'model.layers.9.feed_forward.w2.weight', 'model.layers.11.attention.wo.weight', 'model.layers.23.attention.wo.weight', 'model.layers.26.attention.wo.weight', 'model.layers.10.feed_forward.w2.weight', 'model.layers.0.feed_forward.w3.weight', 'model.layers.2.feed_forward.w2.weight', 'model.layers.21.feed_forward.w3.weight', 'model.layers.25.attention.wqkv.weight', 'model.layers.1.feed_forward.w1.weight', 'model.layers.19.feed_forward.w3.weight', 'model.layers.21.attention.wqkv.weight', 'model.layers.13.attention.wqkv.weight', 'model.layers.17.attention.wo.weight', 'model.layers.2.attention.wqkv.weight', 'model.layers.20.feed_forward.w1.weight', 'model.layers.11.feed_forward.w2.weight', 'model.layers.16.feed_forward.w2.weight', 'model.layers.25.feed_forward.w1.weight', 'model.layers.15.attention.wo.weight', 'model.layers.5.feed_forward.w3.weight', 'model.layers.22.feed_forward.w3.weight', 'model.layers.2.feed_forward.w3.weight', 'model.layers.22.attention.wqkv.weight', 'model.layers.0.attention.wqkv.weight', 'model.layers.26.attention.wqkv.weight', 
'model.layers.2.feed_forward.w1.weight', 'model.layers.11.feed_forward.w3.weight', 'model.layers.16.attention.wo.weight', 'model.layers.7.attention.wqkv.weight', 'model.layers.13.feed_forward.w1.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. INFO - Start quantizing layer 1/32 2024-07-11 11:29:43 INFO [auto_gptq.modeling._base] Start quantizing layer 1/32 INFO - Start quantizing layer 2/32 2024-07-11 11:29:44 INFO [auto_gptq.modeling._base] Start quantizing layer 2/32 INFO - Start quantizing layer 3/32 2024-07-11 11:29:44 INFO [auto_gptq.modeling._base] Start quantizing layer 3/32 INFO - Start quantizing layer 4/32 2024-07-11 11:29:45 INFO [auto_gptq.modeling._base] Start quantizing layer 4/32 INFO - Start quantizing layer 5/32 2024-07-11 11:29:45 INFO [auto_gptq.modeling._base] Start quantizing layer 5/32 INFO - Start quantizing layer 6/32 2024-07-11 11:29:46 INFO [auto_gptq.modeling._base] Start quantizing layer 6/32 INFO - Start quantizing layer 7/32 2024-07-11 11:29:46 INFO [auto_gptq.modeling._base] Start quantizing layer 7/32 INFO - Start quantizing layer 8/32 2024-07-11 11:29:47 INFO [auto_gptq.modeling._base] Start quantizing layer 8/32 INFO - Start quantizing layer 9/32 2024-07-11 11:29:48 INFO [auto_gptq.modeling._base] Start quantizing layer 9/32 INFO - Start quantizing layer 10/32 2024-07-11 11:29:48 INFO [auto_gptq.modeling._base] Start quantizing layer 10/32 INFO - Start quantizing layer 11/32 2024-07-11 11:29:49 INFO [auto_gptq.modeling._base] Start quantizing layer 11/32 INFO - Start quantizing layer 12/32 2024-07-11 11:29:49 INFO [auto_gptq.modeling._base] Start quantizing layer 12/32 INFO - Start quantizing layer 13/32 2024-07-11 11:29:50 INFO [auto_gptq.modeling._base] Start quantizing layer 13/32 INFO - Start quantizing layer 14/32 2024-07-11 11:29:51 INFO [auto_gptq.modeling._base] Start quantizing layer 14/32 INFO - Start quantizing layer 15/32 2024-07-11 11:29:51 INFO [auto_gptq.modeling._base] Start quantizing layer 15/32 INFO - Start quantizing layer 16/32 2024-07-11 11:29:52 INFO [auto_gptq.modeling._base] Start quantizing layer 16/32 INFO - Start quantizing layer 17/32 2024-07-11 11:29:53 INFO [auto_gptq.modeling._base] Start quantizing layer 17/32 INFO - Start quantizing layer 18/32 2024-07-11 11:29:53 INFO [auto_gptq.modeling._base] Start quantizing layer 18/32 INFO - Start quantizing layer 19/32 2024-07-11 11:29:54 INFO [auto_gptq.modeling._base] Start quantizing layer 19/32 INFO - Start quantizing layer 20/32 2024-07-11 11:29:55 INFO [auto_gptq.modeling._base] Start quantizing layer 20/32 INFO - Start quantizing layer 21/32 2024-07-11 11:29:55 INFO [auto_gptq.modeling._base] Start quantizing layer 21/32 INFO - Start quantizing layer 22/32 2024-07-11 11:29:56 INFO [auto_gptq.modeling._base] Start quantizing layer 22/32 INFO - Start quantizing layer 23/32 2024-07-11 11:29:57 INFO [auto_gptq.modeling._base] Start quantizing layer 23/32 INFO - Start quantizing layer 24/32 2024-07-11 11:29:57 INFO [auto_gptq.modeling._base] Start quantizing layer 24/32 INFO - Start quantizing layer 25/32 2024-07-11 11:29:58 INFO [auto_gptq.modeling._base] Start quantizing layer 25/32 INFO - Start quantizing layer 26/32 2024-07-11 11:29:59 INFO [auto_gptq.modeling._base] Start quantizing layer 26/32 INFO - Start quantizing layer 27/32 2024-07-11 11:29:59 INFO [auto_gptq.modeling._base] Start quantizing layer 27/32 INFO - Start quantizing layer 28/32 2024-07-11 11:30:00 INFO [auto_gptq.modeling._base] Start 
quantizing layer 28/32 INFO - Start quantizing layer 29/32 2024-07-11 11:30:00 INFO [auto_gptq.modeling._base] Start quantizing layer 29/32 INFO - Start quantizing layer 30/32 2024-07-11 11:30:01 INFO [auto_gptq.modeling._base] Start quantizing layer 30/32 INFO - Start quantizing layer 31/32 2024-07-11 11:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 31/32 INFO - Start quantizing layer 32/32 2024-07-11 11:30:02 INFO [auto_gptq.modeling._base] Start quantizing layer 32/32 2024-07-11 11:30:03 INFO [auto_gptq.modeling._utils] Packing model... 2024-07-11 11:30:03 INFO [auto_gptq.modeling._utils] Model packed.
6. The folder with the quantized model was created. I also updated ```config.json``` there: ```"model_type": "internlm"```
<img width="568" alt="Screenshot 2024-07-11 at 13 57 43" src="https://github.com/InternLM/InternLM-XComposer/assets/78348856/e8310f67-874c-480d-804b-ff45e18c175e">
7. Load quantized model for inference
quantized_model_dir = "4bit-internlm-4khd-7b"

# Load quantized model for inference
model = InternLMXComposer2QForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda",
    local_files_only=True,
    trust_remote_code=True,
)  # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM

# Inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# Or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
Error:
WARNING - Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source. 2024-07-11 11:41:30 WARNING [auto_gptq.modeling._base] Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source. WARNING - CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:
OSError Traceback (most recent call last) Cell In[16], line 4 1 quantized_model_dir = "4bit-internlm-4khd-7b" 3 # Load quantized model for inference ----> 4 model = InternLMXComposer2QForCausalLM.from_quantized( 5 quantized_model_dir, 6 device="cuda", 7 local_files_only=True, 8 trust_remote_code=True 9 ) # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM 11 # Inference with model.generate 12 print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
File ~/.local/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:999, in BaseGPTQForCausalLM.from_quantized(cls, model_name_or_path, device_map, max_memory, device, low_cpu_mem_usage, use_triton, use_qigen, use_marlin, torch_dtype, inject_fused_attention, inject_fused_mlp, use_cuda_fp16, quantize_config, model_basename, use_safetensors, trust_remote_code, warmup_triton, trainable, disable_exllama, disable_exllamav2, **kwargs) 996 init_contexts.append(accelerate.init_empty_weights(include_buffers=False)) 998 with ContextManagers(init_contexts): --> 999 model = AutoModelForCausalLM.from_config( 1000 config, trust_remote_code=trust_remote_code, torch_dtype=torch_dtype 1001 ) 1003 layers = find_layers(model) 1004 ignore_layers = [cls.lm_head_name] + cls.outside_layer_modules
File ~/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:437, in _BaseAutoModelClass.from_config(cls, config, kwargs) 435 else: 436 repo_id = config.name_or_path --> 437 model_class = get_class_from_dynamic_module(class_ref, repo_id, kwargs) 438 if os.path.isdir(config._name_or_path): 439 model_class.register_for_auto_class(cls.name)
File ~/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:485, in get_class_from_dynamic_module(class_reference, pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, code_revision, **kwargs) 483 code_revision = revision 484 # And lastly we get the class inside our newly created module --> 485 final_module = get_cached_module_file( 486 repo_id, 487 module_file + ".py", 488 cache_dir=cache_dir, 489 force_download=force_download, 490 resume_download=resume_download, 491 proxies=proxies, 492 token=token, 493 revision=code_revision, 494 local_files_only=local_files_only, 495 repo_type=repo_type, 496 ) 497 return get_class_in_module(class_name, final_module.replace(".py", ""))
File ~/.local/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:292, in get_cached_module_file(pretrained_model_name_or_path, module_file, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, repo_type, _commit_hash, **deprecated_kwargs) 289 new_files = [] 290 try: 291 # Load from URL or cache if already cached --> 292 resolved_module_file = cached_file( 293 pretrained_model_name_or_path, 294 module_file, 295 cache_dir=cache_dir, 296 force_download=force_download, 297 proxies=proxies, 298 resume_download=resume_download, 299 local_files_only=local_files_only, 300 token=token, 301 revision=revision, 302 repo_type=repo_type, 303 _commit_hash=_commit_hash, 304 ) 305 if not is_local and cached_module != resolved_module_file: 306 new_files.append(module_file)
File ~/.local/lib/python3.10/site-packages/transformers/utils/hub.py:400, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs) 398 if not os.path.isfile(resolved_file): 399 if _raise_exceptions_for_missing_entries: --> 400 raise EnvironmentError( 401 f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout " 402 f"'https://huggingface.co/{path_or_repo_id}/{revision}' for available files." 403 ) 404 else: 405 return None
OSError: 4bit-internlm-4khd-7b does not appear to have a file named modeling_internlm_xcomposer2.py. Checkout 'https://huggingface.co/4bit-internlm-4khd-7b/None' for available files.
Do you have any suggestion how to fix this issue? Also, I am not sure that the quantization process finished successfully, as in the log I see layers like ```model.layers.22.attention.wo.weight```, which should have been replaced with ```model.layers.22.attention.wo.linear.weight```.
@zhuraromdev Did you put the other files, like tokenization_internlm2.py, into this quantized dir? The whole dir should contain these files as well. Also, I think maybe you just forgot to change layer.22.
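A sketch of copying the remote-code files from the original checkpoint next to the quantized weights, so that from_quantized can resolve modeling_internlm_xcomposer2.py and friends (the exact file list and paths are assumptions; copy whatever .py/tokenizer/config files ship with the original checkpoint):

```python
# Sketch: copy the trust_remote_code files into the quantized output dir.
import shutil

src = "internlm-4khd-7b"       # original checkpoint dir (assumed)
dst = "4bit-internlm-4khd-7b"  # quantized output dir (assumed)
for fname in [
    "modeling_internlm_xcomposer2.py",
    "modeling_internlm2.py",
    "build_mlp.py",
    "configuration_internlm_xcomposer2.py",
    "tokenization_internlm2.py",
]:
    shutil.copy(f"{src}/{fname}", f"{dst}/{fname}")
```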
@nzomi Nope, I didn't. Should I replace the files that were created during quantization?
@nzomi I have run the code:
quantized_model_dir = "OLD_4bit-internlm-xcomposer2-4khd-7b"
# Load quantized model for inference
model = InternLMXComposer2QForCausalLM.from_quantized(
quantized_model_dir,
device="cuda",
local_files_only=True,
trust_remote_code=True
) # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM
# Inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
# Or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
Now I am getting a lot of log output about layers that are not quantized:
...
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc1 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc1 is not quantized.
INFO - The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc2 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vit.vision_tower.vision_model.encoder.layers.23.mlp.fc2 is not quantized.
INFO - The layer vision_proj.0 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vision_proj.0 is not quantized.
INFO - The layer vision_proj.2 is not quantized.
2024-07-11 12:42:36 INFO [auto_gptq.modeling._base] The layer vision_proj.2 is not quantized.
And an error about NoneType:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[24], line 4
1 quantized_model_dir = "OLD_4bit-internlm-xcomposer2-4khd-7b"
3 # Load quantized model for inference
----> 4 model = InternLMXComposer2QForCausalLM.from_quantized(
5 quantized_model_dir,
6 device="cuda",
7 local_files_only=True,
8 trust_remote_code=True
9 ) # here: AutoGPTQForCausalLM, InternLMXComposer2QForCausalLM
11 # Inference with model.generate
12 print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
File ~/.local/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:1246, in BaseGPTQForCausalLM.from_quantized(cls, model_name_or_path, device_map, max_memory, device, low_cpu_mem_usage, use_triton, use_qigen, use_marlin, torch_dtype, inject_fused_attention, inject_fused_mlp, use_cuda_fp16, quantize_config, model_basename, use_safetensors, trust_remote_code, warmup_triton, trainable, disable_exllama, disable_exllamav2, **kwargs)
1243 inject_fused_attention = False
1244 inject_fused_mlp = False
-> 1246 accelerate.utils.modeling.load_checkpoint_in_model(
1247 model,
1248 dtype=torch_dtype, # This is very hacky but works due to https://github.com/huggingface/accelerate/blob/bd72a5f1a80d5146554458823f8aeda0a9db5297/src/accelerate/utils/modeling.py#L292
1249 checkpoint=model_save_name,
1250 device_map=device_map,
1251 offload_state_dict=True,
1252 offload_buffers=True,
1253 )
1255 # TODO: Why are we using this custom function and not dispatch_model?
1256 model = simple_dispatch_model(model, device_map)
File ~/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py:1797, in load_checkpoint_in_model(model, checkpoint, device_map, offload_folder, dtype, offload_state_dict, offload_buffers, keep_in_fp32_modules, offload_8bit_bnb, strict)
1795 offload_weight(param, param_name, state_dict_folder, index=state_dict_index)
1796 else:
-> 1797 set_module_tensor_to_device(
1798 model,
1799 param_name,
1800 param_device,
1801 value=param,
1802 dtype=new_dtype,
1803 fp16_statistics=fp16_statistics,
1804 )
1806 # Force Python to clean up.
1807 del loaded_checkpoint
File ~/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py:382, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
375 device_quantization = None
376 with torch.no_grad():
377 # leave it on cpu first before moving them to cuda
378 # # fix the case where the device is meta, we don't want to put it on cpu because there is no data =0
379 if (
380 param is not None
381 and param.device.type != "cuda"
--> 382 and torch.device(device).type == "cuda"
383 and param_cls.__name__ in ["Int8Params", "FP4Params", "Params4bit"]
384 ):
385 device_quantization = device
386 device = "cpu"
TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
* (torch.device device)
didn't match because some of the arguments have invalid types: (!NoneType!)
* (str type, int index)
Can I fix the NoneType issue with this? And should I run pip install -vvv --no-build-isolation -e . after the update?
@zhuraromdev Yes, only the model.layers.attention.xxx and model.layers.feed_forward.xxx layers will be quantized if you follow this custom class. And it's not necessary to build again if you have already changed the code in _utils. I also recommend adding some breakpoints to check where the NoneType comes from; for me it was the plora_glb_GN and plora_sub_GN layers, so I added those two lines in that file.
@nzomi Thank you a lot! So according to this log, something went wrong during quantization, didn't it?
INFO - The layer model.layers.0.attention.wqkv is not quantized.
INFO - The layer model.layers.0.attention.wo is not quantized.
INFO - The layer model.layers.0.feed_forward.w1 is not quantized.
INFO - The layer model.layers.0.feed_forward.w3 is not quantized.
INFO - The layer model.layers.0.feed_forward.w2 is not quantized.
INFO - The layer model.layers.1.attention.wqkv is not quantized.
INFO - The layer model.layers.1.attention.wo is not quantized.
@zhuraromdev From this log, I guess you didn't replace the keys correctly; maybe you can replace them once more. By the way, the AWQ quantization provided by LMDeploy makes it easier to quantize the 4KHD model, and inference is faster! But then you can only run the quantized model with the LMDeploy pipeline.
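For reference, a rough sketch of what the LMDeploy route could look like; the names used here (lmdeploy.pipeline, TurbomindEngineConfig, model_format="awq", and the output directory of the AWQ step) are taken as assumptions from the LMDeploy docs and should be checked against the installed version:

```python
# Sketch: run an LMDeploy-AWQ-quantized 4KHD model through the lmdeploy pipeline.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "internlm-xcomposer2-4khd-7b-4bit-awq",  # assumed output dir of the AWQ quantization step
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["Describe quantization in one sentence."]))
```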
@nzomi It seems that pytorch_model.bin.index.json was replaced correctly. I will also check LMDeploy, thank you! Here is the code I used for the replacement:
Code for replacement of keys in json:
import os
import json
def replace_keys_in_json(file_path):
# Verify if the file exists
if not os.path.isfile(file_path):
print(f"File not found: {file_path}")
print("Current working directory:", os.getcwd())
print("Directory contents:", os.listdir(os.path.dirname(file_path) or '.'))
return
# Load the .json file
with open(file_path, 'r') as f:
data = json.load(f)
# Define the replacement mapping
replacements = {
'wo.weight': 'wo.linear.weight',
'wqkv.weight': 'wqkv.linear.weight',
'w1.weight': 'w1.linear.weight',
'w2.weight': 'w2.linear.weight',
'w3.weight': 'w3.linear.weight'
}
# Replace keys within the weight_map dictionary
weight_map = data.get('weight_map', {})
new_weight_map = {}
for key, value in weight_map.items():
new_key = key
for old, new in replacements.items():
if old in key:
new_key = key.replace(old, new)
break
new_weight_map[new_key] = value
# Update the data dictionary with the new weight_map
data['weight_map'] = new_weight_map
# Save the modified dictionary back to a .json file
new_file_path = file_path.replace('.json', '_modified.json')
with open(new_file_path, 'w') as f:
json.dump(data, f, indent=4)
print(f"Modified .json file saved as {new_file_path}")
Code for replacement of keys in bin:
import torch
import os
def replace_keys_in_bin(file_path):
# Ensure the path is absolute
absolute_file_path = os.path.abspath(file_path)
print(f"Using absolute file path: {absolute_file_path}")
# Verify if the file exists
if not os.path.isfile(absolute_file_path):
print(f"File not found: {absolute_file_path}")
print("Current working directory:", os.getcwd())
print("Directory contents:", os.listdir(os.path.dirname(absolute_file_path) or '.'))
return
try:
# Load the .bin file
model_dict = torch.load(absolute_file_path)
except Exception as e:
print(f"Error loading file: {e}")
return
# Define the replacement mapping
replacements = {
'wo.weight': 'wo.linear.weight',
'wqkv.weight': 'wqkv.linear.weight',
'w1.weight': 'w1.linear.weight',
'w2.weight': 'w2.linear.weight',
'w3.weight': 'w3.linear.weight'
}
# Create a new dictionary with the replaced keys
new_model_dict = {}
for key, value in model_dict.items():
new_key = key
for old, new in replacements.items():
if old in key:
new_key = key.replace(old, new)
break
new_model_dict[new_key] = value
# Save the modified dictionary back to a .bin file
new_file_path = absolute_file_path.replace('.bin', '_modified.bin')
torch.save(new_model_dict, new_file_path)
print(f"Modified .bin file saved as {new_file_path}")
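A possible usage of the two helpers above (the paths and shard names are assumptions for a 2-shard checkpoint; both helpers write *_modified files, which would then replace the originals):

```python
# Example usage with assumed file names.
replace_keys_in_json("internlm-xcomposer2-4khd-7b/pytorch_model.bin.index.json")
replace_keys_in_bin("internlm-xcomposer2-4khd-7b/pytorch_model-00001-of-00002.bin")
replace_keys_in_bin("internlm-xcomposer2-4khd-7b/pytorch_model-00002-of-00002.bin")
```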
@nzomi Hey, thank you for advising me to use LMDeploy for quantization; it worked fine for me :) However, when I tried to use their quantized model for fine-tuning with InternLM-XComposer, it did not work, as the only way I can access the model is through LMDeploy inference.
So I decided to come back to quantization with AutoGPTQ. I changed the source code as you described above, and the quantization process finished successfully. However, while loading the quantized model, I have some issues:
INFO - The layer model.layers.0.attention.wo is not quantized.
I also see keys like model.layers.3.attention.wqkv.weight | model.layers.3.attention.wqkv.linear.weight. Did it happen because of this?
Some weights of InternLMXComposer2ForCausalLM were not initialized from the model checkpoint at internlm_xcomposer2_4khd_7b_repo and are newly initialized: ['model.layers.0.attention.wo.weight', 'model.layers.0.attention.wqkv.weight', 'model.layers.0.feed_forward.w1.weight', 'model.layers.0.feed_forward.w2.weight', 'model.layers.0.feed_forward.w3.weight', 'model.layers.1.attention.wo.weight', 'model.layers.1.attention.wqkv.weight', 'model.layers.1.feed_forward.w1.weight', 'model.layers.1.feed_forward.w2.weight', 'model.layers.1.feed_forward.w3.weight', 'model.layers.10.attention.wo.weight', 'model.layers.10.attention.wqkv.weight', 'model.layers.10.feed_forward.w1.weight', 'model.layers.10.feed_forward.w2.weight', 'model.layers.10.feed_forward.w3.weight', 'model.layers.11.attention.wo.weight', 'model.layers.11.attention.wqkv.weight', 'model.layers.11.feed_forward.w1.weight', 'model.layers.11.feed_forward.w2.weight', 'model.layers.11.feed_forward.w3.weight', 'model.layers.12.attention.wo.weight', 'model.layers.12.attention.wqkv.weight', 'model.layers.12.feed_forward.w1.weight', 'model.layers.12.feed_forward.w2.weight', 'model.layers.12.feed_forward.w3.weight', 'model.layers.13.attention.wo.weight', 'model.layers.13.attention.wqkv.weight', 'model.layers.13.feed_forward.w1.weight', 'model.layers.13.feed_forward.w2.weight', 'model.layers.13.feed_forward.w3.weight', 'model.layers.14.attention.wo.weight', 'model.layers.14.attention.wqkv.weight', 'model.layers.14.feed_forward.w1.weight', 'model.layers.14.feed_forward.w2.weight', 'model.layers.14.feed_forward.w3.weight', 'model.layers.15.attention.wo.weight', 'model.layers.15.attention.wqkv.weight', 'model.layers.15.feed_forward.w1.weight', 'model.layers.15.feed_forward.w2.weight', 'model.layers.15.feed_forward.w3.weight', 'model.layers.16.attention.wo.weight', 'model.layers.16.attention.wqkv.weight', 'model.layers.16.feed_forward.w1.weight', 'model.layers.16.feed_forward.w2.weight', 'model.layers.16.feed_forward.w3.weight', 'model.layers.17.attention.wo.weight', 'model.layers.17.attention.wqkv.weight', 'model.layers.17.feed_forward.w1.weight', 'model.layers.17.feed_forward.w2.weight', 'model.layers.17.feed_forward.w3.weight', 'model.layers.18.attention.wo.weight', 'model.layers.18.attention.wqkv.weight', 'model.layers.18.feed_forward.w1.weight', 'model.layers.18.feed_forward.w2.weight', 'model.layers.18.feed_forward.w3.weight', 'model.layers.19.attention.wo.weight', 'model.layers.19.attention.wqkv.weight', 'model.layers.19.feed_forward.w1.weight', 'model.layers.19.feed_forward.w2.weight', 'model.layers.19.feed_forward.w3.weight', 'model.layers.2.attention.wo.weight', 'model.layers.2.attention.wqkv.weight', 'model.layers.2.feed_forward.w1.weight', 'model.layers.2.feed_forward.w2.weight', 'model.layers.2.feed_forward.w3.weight', 'model.layers.20.attention.wo.weight', 'model.layers.20.attention.wqkv.weight', 'model.layers.20.feed_forward.w1.weight', 'model.layers.20.feed_forward.w2.weight', 'model.layers.20.feed_forward.w3.weight', 'model.layers.21.attention.wo.weight', 'model.layers.21.attention.wqkv.weight', 'model.layers.21.feed_forward.w1.weight', 'model.layers.21.feed_forward.w2.weight', 'model.layers.21.feed_forward.w3.weight', 'model.layers.22.attention.wo.weight', 'model.layers.22.attention.wqkv.weight', 'model.layers.22.feed_forward.w1.weight', 'model.layers.22.feed_forward.w2.weight', 'model.layers.22.feed_forward.w3.weight', 'model.layers.23.attention.wo.weight', 'model.layers.23.attention.wqkv.weight', 
'model.layers.23.feed_forward.w1.weight', 'model.layers.23.feed_forward.w2.weight', 'model.layers.23.feed_forward.w3.weight', 'model.layers.24.attention.wo.weight', 'model.layers.24.attention.wqkv.weight', 'model.layers.24.feed_forward.w1.weight', 'model.layers.24.feed_forward.w2.weight', 'model.layers.24.feed_forward.w3.weight', 'model.layers.25.attention.wo.weight', 'model.layers.25.attention.wqkv.weight', 'model.layers.25.feed_forward.w1.weight', 'model.layers.25.feed_forward.w2.weight', 'model.layers.25.feed_forward.w3.weight', 'model.layers.26.attention.wo.weight', 'model.layers.26.attention.wqkv.weight', 'model.layers.26.feed_forward.w1.weight', 'model.layers.26.feed_forward.w2.weight', 'model.layers.26.feed_forward.w3.weight', 'model.layers.27.attention.wo.weight', 'model.layers.27.attention.wqkv.weight', 'model.layers.27.feed_forward.w1.weight', 'model.layers.27.feed_forward.w2.weight', 'model.layers.27.feed_forward.w3.weight', 'model.layers.28.attention.wo.weight', 'model.layers.28.attention.wqkv.weight', 'model.layers.28.feed_forward.w1.weight', 'model.layers.28.feed_forward.w2.weight', 'model.layers.28.feed_forward.w3.weight', 'model.layers.29.attention.wo.weight', 'model.layers.29.attention.wqkv.weight', 'model.layers.29.feed_forward.w1.weight', 'model.layers.29.feed_forward.w2.weight', 'model.layers.29.feed_forward.w3.weight', 'model.layers.3.attention.wo.weight', 'model.layers.3.attention.wqkv.weight', 'model.layers.3.feed_forward.w1.weight', 'model.layers.3.feed_forward.w2.weight', 'model.layers.3.feed_forward.w3.weight', 'model.layers.30.attention.wo.weight', 'model.layers.30.attention.wqkv.weight', 'model.layers.30.feed_forward.w1.weight', 'model.layers.30.feed_forward.w2.weight', 'model.layers.30.feed_forward.w3.weight', 'model.layers.31.attention.wo.weight', 'model.layers.31.attention.wqkv.weight', 'model.layers.31.feed_forward.w1.weight', 'model.layers.31.feed_forward.w2.weight', 'model.layers.31.feed_forward.w3.weight', 'model.layers.4.attention.wo.weight', 'model.layers.4.attention.wqkv.weight', 'model.layers.4.feed_forward.w1.weight', 'model.layers.4.feed_forward.w2.weight', 'model.layers.4.feed_forward.w3.weight', 'model.layers.5.attention.wo.weight', 'model.layers.5.attention.wqkv.weight', 'model.layers.5.feed_forward.w1.weight', 'model.layers.5.feed_forward.w2.weight', 'model.layers.5.feed_forward.w3.weight', 'model.layers.6.attention.wo.weight', 'model.layers.6.attention.wqkv.weight', 'model.layers.6.feed_forward.w1.weight', 'model.layers.6.feed_forward.w2.weight', 'model.layers.6.feed_forward.w3.weight', 'model.layers.7.attention.wo.weight', 'model.layers.7.attention.wqkv.weight', 'model.layers.7.feed_forward.w1.weight', 'model.layers.7.feed_forward.w2.weight', 'model.layers.7.feed_forward.w3.weight', 'model.layers.8.attention.wo.weight', 'model.layers.8.attention.wqkv.weight', 'model.layers.8.feed_forward.w1.weight', 'model.layers.8.feed_forward.w2.weight', 'model.layers.8.feed_forward.w3.weight', 'model.layers.9.attention.wo.weight', 'model.layers.9.attention.wqkv.weight', 'model.layers.9.feed_forward.w1.weight', 'model.layers.9.feed_forward.w2.weight', 'model.layers.9.feed_forward.w3.weight', 'vit.vision_tower.vision_model.post_layernorm.bias', 'vit.vision_tower.vision_model.post_layernorm.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Do you have any suggestion how I can resolve it?
@zhuraromdev It is a bit weird; did you check your .bin weight keys after the replacement?
Hello, thank you for the amazing work. Is it possible to use QLoRA to fine-tune the 4-bit quantized models?