coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License

Deserialisation issue: KeyError: "attribute 'bias' already exists" #67

Open milo157 opened 9 months ago

milo157 commented 9 months ago

I am trying to use tensorizer to serialise/deserialise the following model from HF: TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ, but I am getting an error that I am unsure how to resolve.

The model serialises correctly, but on deserialisation I get the error: KeyError: "attribute 'bias' already exists"

Code to reproduce:

pip install tensorizer accelerate transformers auto-gptq optimum

from transformers import AutoModelForCausalLM, AutoConfig
from tensorizer import TensorDeserializer, TensorSerializer
from tensorizer.utils import no_init_or_tensor
import time
import sys

model_name_or_path = "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             revision="main")

def serialise_model(model, save_path):
    try:
        serializer = TensorSerializer(save_path)
        start = time.time()
        serializer.write_module(model)
        end = time.time()
        print((f"Serialising model took {end - start} seconds"),  file=sys.stderr)
        serializer.close()
        return True
    except Exception as e:
        print("Serialisation failed with error: ", e,  file=sys.stderr)
        return False

serialise_model(model, "./test.tensors")

def deserialise_saved_model(model_path, model_id, plaid=True):
    config = AutoConfig.from_pretrained(model_id)

    print(("Initialising empty model"),  file=sys.stderr)
    start = time.time()
    with no_init_or_tensor():
        model = AutoModelForCausalLM.from_config(config)
    end_init = time.time() - start

    deserializer = TensorDeserializer(model_path, plaid_mode=plaid)

    print(("Loading model"),  file=sys.stderr)
    start = time.time()
    deserializer.load_into_module(model)
    end = time.time()
    deserializer.close()

    print(f"Initialising empty model took {end_init} seconds",  file=sys.stderr)
    print((f"\nDeserialising model took {end - start} seconds\n"),  file=sys.stderr)

    return model

model = deserialise_saved_model("./test.tensors", "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ")

Error Trace:

Initialising empty model
Loading model
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 model = deserialise_saved_model("./test.tensors", "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ")

Cell In[4], line 28, in deserialise_saved_model(model_path, model_id, plaid)
     26 print(("Loading model"),  file=sys.stderr)
     27 start = time.time()
---> 28 deserializer.load_into_module(model)
     29 end = time.time()
     30 deserializer.close()

File /usr/local/lib/python3.10/dist-packages/tensorizer/serialization.py:1855, in TensorDeserializer.load_into_module(self, m, filter_func, verify_hash)
   1853     module.register_parameter(attr, tensor)
   1854 elif entry.type is TensorType.BUFFER:
-> 1855     module.register_buffer(attr, tensor)
   1856 elif entry.type is TensorType.STATE_DICT:
   1857     raise NotImplementedError(
   1858         "This was serialized using"
   1859         " TensorSerializer.write_state_dict(), so it cannot be"
   (...)
   1862         " state_dict mapping instead."
   1863     )

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:538, in Module.register_buffer(self, name, tensor, persistent)
    536     raise KeyError("buffer name can't be empty string \"\"")
    537 elif hasattr(self, name) and name not in self._buffers:
--> 538     raise KeyError(f"attribute '{name}' already exists")
    539 elif tensor is not None and not isinstance(tensor, torch.Tensor):
    540     raise TypeError(f"cannot assign '{torch.typename(tensor)}' object to buffer '{name}' "
    541                     "(torch Tensor or None required)"
    542                     )

KeyError: "attribute 'bias' already exists"
Eta0 commented 4 months ago

The error seems to come from AutoModelForCausalLM.from_pretrained and AutoModelForCausalLM.from_config yielding incompatible model structures for the same model, most likely due to special post-init code hooked into the GPTQ model loading process in transformers when using from_pretrained.

Output of from_pretrained
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(64000, 7168, padding_idx=0)
    (layers): ModuleList(
      (0-59): 60 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): QuantLinear()
          (o_proj): QuantLinear()
          (q_proj): QuantLinear()
          (v_proj): QuantLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLU()
          (down_proj): QuantLinear()
          (gate_proj): QuantLinear()
          (up_proj): QuantLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=7168, out_features=64000, bias=False)
)
Output of from_config
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(64000, 7168, padding_idx=0)
    (layers): ModuleList(
      (0-59): 60 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=7168, out_features=7168, bias=False)
          (k_proj): Linear(in_features=7168, out_features=1024, bias=False)
          (v_proj): Linear(in_features=7168, out_features=1024, bias=False)
          (o_proj): Linear(in_features=7168, out_features=7168, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=7168, out_features=20480, bias=False)
          (up_proj): Linear(in_features=7168, out_features=20480, bias=False)
          (down_proj): Linear(in_features=20480, out_features=7168, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=7168, out_features=64000, bias=False)
)

For tensorizer's load_into_module method to work, the model skeleton being loaded into must match how it appeared when it was serialized.
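
As a quick diagnostic, the mismatch can be confirmed programmatically. Here is a minimal sketch (it instantiates both skeletons, so it needs enough memory for the from_pretrained copy) that diffs the two module trees by submodule class name:

from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer.utils import no_init_or_tensor

model_id = "TheBloke/Capybara-Tess-Yi-34B-200K-GPTQ"

pretrained = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True, revision="main"
)
with no_init_or_tensor():
    config_only = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained(model_id)
    )

# Map each submodule path to its class name, then report the differences.
a = {name: type(mod).__name__ for name, mod in pretrained.named_modules()}
b = {name: type(mod).__name__ for name, mod in config_only.named_modules()}
for name in sorted(set(a) | set(b)):
    if a.get(name) != b.get(name):
        # e.g. model.layers.0.self_attn.q_proj: QuantLinear != Linear
        print(f"{name}: {a.get(name)} != {b.get(name)}")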

I am not familiar with whether AutoGPTQ supports a way to initialize its correct structure solely from a config, without initializing weights. A potential workaround is to save the structure directly using pickling:

  serialise_model(model, "./test.tensors")
+ import torch
+ from types import SimpleNamespace
+ # model.quantize_config is essentially a SimpleNamespace but missing pickle support
+ model.quantize_config = SimpleNamespace(**vars(model.quantize_config))
+ torch.save(model.to("meta"), "./test_model_structure.pt")

(See their source for the original definition of quantize_config)
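
Before relying on the pickled skeleton, a quick sanity check (using the same hypothetical file path as above) that it round-trips with the quantized structure intact:

import torch

skeleton = torch.load("./test_model_structure.pt")
print(type(skeleton).__name__)  # expect LlamaForCausalLM
# The projections should be QuantLinear, matching the from_pretrained structure:
print(type(skeleton.model.layers[0].self_attn.q_proj).__name__)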

And load the structure at deserialization time:

  def deserialise_saved_model(model_path, model_id, plaid=True):
-     config = AutoConfig.from_pretrained(model_id)

      print(("Initialising empty model"),  file=sys.stderr)
      start = time.time()
-     with no_init_or_tensor():
-         model = AutoModelForCausalLM.from_config(config)
+     model = torch.load("./test_model_structure.pt")
      end_init = time.time() - start

This is, essentially, saving the complement of a state_dict: it saves everything but the weights. It still makes full use of tensorizer's optimized loading, since the torch.load step that restores the model structure only accounts for ~30–40 ms of metadata loading, while the TensorDeserializer does all the work of loading the actual weights. At the time of writing, your code runs fine with these patches applied.
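
For reference, here is the full loader with both patches applied, as a sketch (same assumptions and file paths as above):

import sys
import time
import torch
from tensorizer import TensorDeserializer

def deserialise_saved_model(model_path, structure_path, plaid=True):
    print("Initialising empty model", file=sys.stderr)
    start = time.time()
    # Load the pickled meta-device skeleton saved after serialisation.
    model = torch.load(structure_path)
    end_init = time.time() - start

    deserializer = TensorDeserializer(model_path, plaid_mode=plaid)

    print("Loading model", file=sys.stderr)
    start = time.time()
    deserializer.load_into_module(model)
    end = time.time()
    deserializer.close()

    print(f"Initialising empty model took {end_init} seconds", file=sys.stderr)
    print(f"\nDeserialising model took {end - start} seconds\n", file=sys.stderr)

    return model

model = deserialise_saved_model("./test.tensors", "./test_model_structure.pt")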

Using pickling in this way is unsupported on the transformers side and likely brittle, so I would check the relevant transformers/auto_gptq/optimum documentation on GPTQ models for better options: either a method that officially supports instantiating a model with uninitialized weights (for use with TensorDeserializer.load_into_module), or one that supports loading weights from a state_dict (for use with the TensorDeserializer mapping interface).
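
For the second option, here is a sketch of the mapping route. It assumes model is already a skeleton with the correct quantized structure; whether load_state_dict accepts the GPTQ buffers here is untested, so strict=False is used defensively:

from tensorizer import TensorDeserializer

# TensorDeserializer acts as a mapping of tensor names to tensors,
# so its contents can be handed to anything that accepts a state_dict.
deserializer = TensorDeserializer("./test.tensors")
state_dict = dict(deserializer.items())
missing, unexpected = model.load_state_dict(state_dict, strict=False)
deserializer.close()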

Unfortunately, there is no fix to offer on the tensorizer side, since the issue lies in the usage patterns supported by external libraries. I hope this helps.