Open · milo157 opened 9 months ago

I am trying to use tensorizer to serialize/deserialize the following model on HF: TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ. However, I am getting an error that I am unsure how to resolve.

The model serializes correctly, but on deserialization I get the error:

```
KeyError: "attribute 'bias' already exists"
```

Code to reproduce:

```
pip install tensorizer accelerate transformers auto-gptq optimum
```

Error Trace:

---
The error here seems to come from `AutoModelForCausalLM.from_pretrained` and `AutoModelForCausalLM.from_config` yielding incompatible model structures for the same model, most likely due to special post-init code hooked into the GPTQ model loading process in `transformers` when using `from_pretrained`.
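For reference, the two skeletons below can be reproduced with something along these lines (a sketch; the model ID is taken from your issue, and `device_map="auto"` is an assumption about how the GPTQ checkpoint gets loaded):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ"

# from_pretrained runs the GPTQ post-init hooks, which replace
# the nn.Linear projections with QuantLinear modules
print(AutoModelForCausalLM.from_pretrained(model_id, device_map="auto"))

# from_config builds the plain architecture from the config alone,
# so no GPTQ hooks run and the projections stay as nn.Linear
print(AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_id)))
```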
`from_pretrained`:

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(64000, 7168, padding_idx=0)
    (layers): ModuleList(
      (0-59): 60 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): QuantLinear()
          (o_proj): QuantLinear()
          (q_proj): QuantLinear()
          (v_proj): QuantLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLU()
          (down_proj): QuantLinear()
          (gate_proj): QuantLinear()
          (up_proj): QuantLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=7168, out_features=64000, bias=False)
)
```
`from_config`:

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(64000, 7168, padding_idx=0)
    (layers): ModuleList(
      (0-59): 60 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=7168, out_features=7168, bias=False)
          (k_proj): Linear(in_features=7168, out_features=1024, bias=False)
          (v_proj): Linear(in_features=7168, out_features=1024, bias=False)
          (o_proj): Linear(in_features=7168, out_features=7168, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=7168, out_features=20480, bias=False)
          (up_proj): Linear(in_features=7168, out_features=20480, bias=False)
          (down_proj): Linear(in_features=20480, out_features=7168, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=7168, out_features=64000, bias=False)
)
```
For tensorizer's `load_into_module` method to work, the model skeleton being loaded into must match how the model appeared when it was serialized.
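Concretely, that requirement looks like this (a minimal sketch; `plaid_mode` mirrors the `plaid` flag in your script):

```python
from tensorizer import TensorDeserializer

# The skeleton passed in here must have the exact same module tree
# (including the QuantLinear modules) as the model that was serialized;
# a mismatch is presumably what triggers the KeyError about 'bias'.
deserializer = TensorDeserializer("./test.tensors", plaid_mode=True)
deserializer.load_into_module(model)
deserializer.close()
```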
I am not sure whether AutoGPTQ supports initializing the correct structure solely from a config, without initializing weights. A potential workaround is to save the structure directly using pickling:
```diff
  serialise_model(model, "./test.tensors")
+ from types import SimpleNamespace
+ # model.quantize_config is essentially a SimpleNamespace but missing pickle support
+ model.quantize_config = SimpleNamespace(**vars(model.quantize_config))
+ torch.save(model.to("meta"), "./test_model_structure.pt")
```
(See their source for the original definition of `quantize_config`.)
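A quick way to sanity-check the pickled skeleton (a hypothetical check, not part of the workaround itself):

```python
import torch

# Round-trip the pickled skeleton: it should print QuantLinear modules
# and hold only meta-device tensors, i.e. no materialized weights.
skeleton = torch.load("./test_model_structure.pt")
print(skeleton)
print(next(skeleton.parameters()).device)  # expected: meta
```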
And load the structure at deserialization time:
```diff
  def deserialise_saved_model(model_path, model_id, plaid=True):
-     config = AutoConfig.from_pretrained(model_id)
      print("Initialising empty model", file=sys.stderr)
      start = time.time()
-     with no_init_or_tensor():
-         model = AutoModelForCausalLM.from_config(config)
+     model = torch.load("./test_model_structure.pt")
      end_init = time.time() - start
```
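Put together, the patched loading path would look roughly like this (a sketch; everything after the timing line is a hypothetical completion, since the rest of your function is not shown):

```python
import sys
import time

import torch
from tensorizer import TensorDeserializer

def deserialise_saved_model(model_path, model_id, plaid=True):
    print("Initialising empty model", file=sys.stderr)
    start = time.time()
    # Load the pickled skeleton; it lives on the meta device,
    # so no real weight memory is allocated here
    model = torch.load("./test_model_structure.pt")
    end_init = time.time() - start
    print(f"Structure loaded in {end_init:.3f}s", file=sys.stderr)

    # tensorizer then streams the actual weights into the matching skeleton
    deserializer = TensorDeserializer(model_path, plaid_mode=plaid)
    deserializer.load_into_module(model)
    deserializer.close()
    return model
```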
This is, essentially, saving the complement of a `state_dict`, in that it saves everything but the weights. It will still make full use of tensorizer's optimized loading, as the `torch.load` step that loads the model structure only accounts for ~30–40 ms of metadata loading, while the `TensorDeserializer` does all the work of loading the actual weights. At the time of writing, your code runs fine with these patches applied.
Using pickling in this way is unsupported on the `transformers` side and likely brittle, so I would check the relevant `transformers`/`auto_gptq`/`optimum` documentation on GPTQ models to see whether they officially support instantiating models with uninitialized weights (for use with `TensorDeserializer.load_into_module`), or loading weights from a `state_dict` (for use with the `TensorDeserializer` mapping interface), as better options.
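For that second option, the mapping interface means a `TensorDeserializer` can stand in wherever a `state_dict`-like mapping of names to tensors is accepted (a minimal sketch, assuming `model` is an empty but materialized skeleton, not on the meta device):

```python
from tensorizer import TensorDeserializer

# TensorDeserializer implements the Mapping interface, so it can be
# materialized into an ordinary name -> tensor state dict.
deserializer = TensorDeserializer("./test.tensors", device="cuda")
state_dict = dict(deserializer)
model.load_state_dict(state_dict)
deserializer.close()
```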
Unfortunately, there is no bug fix to offer for this on the tensorizer side, since the issue lies in supported usage patterns of external libraries. I hope this helps.