Cornell-RelaxML / quip-sharp


Exception: Saved weights version (0) does not match the codebook version (1). #31

Closed KnutJaegersberg closed 8 months ago

KnutJaegersberg commented 8 months ago

I started quantizing an LLM with a slightly older version of your library. Computing the 8k-context Hessians took a while, so by the time it finished, the library version I had used was already outdated. I get the exception below when I try to run inference with the model on your newer library version; inference still works with the older version. Are my weights, which took a week to compute, now effectively deprecated?

https://huggingface.co/KnutJaegersberg/Tess-M-34B-2bit

```
Traceback (most recent call last):
  File "/home/knut/New Folder/quip-sharp/hfize_llama.py", line 126, in <module>
    main(args)
  File "/home/knut/New Folder/quip-sharp/hfize_llama.py", line 112, in main
    outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 1606, in generate
    return self.greedy_search(
  File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
    outputs = self(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 1056, in forward
    outputs = self.model(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 943, in forward
    layer_outputs = decoder_layer(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 652, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 453, in forward
    query_states, key_states, value_states = self.qkv_proj(hidden_states.to(torch.float32))
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/lib/linear/fused_quantized_linear.py", line 20, in forward
    fused_output = super(FusedQuantizedLinear, self).forward(input)
  File "/home/knut/New Folder/quip-sharp/lib/linear/quantized_linear.py", line 86, in forward
    raise Exception(
Exception: Saved weights version (0) does not match the codebook version (1). Please download the latest weights from https://huggingface.co/relaxml
```

tsengalb99 commented 8 months ago

Not really. The main difference between the two E8P versions is that version 1 packs the codebook, uses a different indexing scheme, and uses the new Fused/QuantizedLinear classes. You could try to do an index lookup and repack your existing weights, but it would probably be faster to requantize your model. How long does requantizing take? I'm guessing most of your week was spent generating Hessians, and quantizing the model itself should only take a few hours.
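
To illustrate what "packing" means here: small codebook indices get stored densely in machine words instead of one per element. A minimal sketch, assuming plain 8-bit indices packed four to a 32-bit word; the names and layout are illustrative, not quip-sharp's actual version-1 E8P format.

```python
import torch

# Illustrative only: pack four 8-bit codebook indices into one word.
# quip-sharp's real version-1 packing and indexing scheme is different.
def pack_indices(idx: torch.Tensor, bits: int = 8) -> torch.Tensor:
    per_word = 32 // bits
    idx = idx.reshape(-1, per_word).to(torch.int64)
    shifts = torch.arange(per_word, dtype=torch.int64) * bits
    return (idx << shifts).sum(dim=1)  # one packed word per group

def unpack_indices(words: torch.Tensor, bits: int = 8) -> torch.Tensor:
    per_word = 32 // bits
    shifts = torch.arange(per_word, dtype=torch.int64) * bits
    mask = (1 << bits) - 1
    return ((words.unsqueeze(1) >> shifts) & mask).reshape(-1)

idx = torch.randint(0, 256, (16,))
assert torch.equal(unpack_indices(pack_indices(idx)), idx)
```

Converting old weights would mean unpacking under the old scheme, remapping the indices, and repacking under the new one, which is why requantizing from saved Hessians is usually the simpler path.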

KnutJaegersberg commented 8 months ago

Yeah, I had the same thought this morning. Maybe I can reuse the Hessians. I'll try that once I have the compute again.
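
Reuse should be possible in principle: the proxy Hessian that QuIP-style methods use for a linear layer is built only from calibration activations, not from the codebook. A rough sketch of that quantity (function and variable names are mine, not quip-sharp's API):

```python
import torch

def proxy_hessian(activations: torch.Tensor) -> torch.Tensor:
    """Accumulate H ~ E[x x^T] over the inputs one linear layer sees.

    activations: (num_tokens, in_features). Since H depends only on the
    model and calibration data -- not on the codebook -- it can be reused
    when requantizing with a newer codebook version.
    """
    n, d = activations.shape
    H = torch.zeros(d, d, dtype=torch.float64)
    for chunk in activations.split(4096):  # chunk to bound memory use
        x = chunk.to(torch.float64)
        H += x.T @ x
    return H / n
```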

KnutJaegersberg commented 8 months ago

Quantizing and converting to HF format takes less than 24 hours. To be clear, I'm here out of practical interest, to explore how to squeeze the most out of my hardware; I haven't yet fully understood how QuIP# works.

KnutJaegersberg commented 8 months ago

Is this a good introduction? https://www.youtube.com/watch?v=6wEVz0wkhCM

tsengalb99 commented 8 months ago

QuIP# uses the incoherence processing from the original QuIP (https://openreview.net/forum?id=xrk9g5vcXR) together with lattice codebooks to do quantization. I'd recommend reading the QuIP# blog post (https://cornell-relaxml.github.io/quip-sharp/) to learn more about QuIP#. To learn more about the original QuIP, you can read the paper and/or look at the NeurIPS video and poster (https://neurips.cc/virtual/2023/poster/69982).
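
For intuition, here is a minimal sketch of the incoherence-processing idea, using a randomized Hadamard transform as the random orthogonal matrix; this is illustrative code, not quip-sharp's implementation:

```python
import torch

def random_hadamard(n: int) -> torch.Tensor:
    # Sylvester construction of an orthonormal Hadamard matrix; n must be a power of 2.
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)

def incoherence_process(W: torch.Tensor):
    # Multiply the weight matrix on both sides by random orthogonal matrices
    # (here H diag(s) with random +/-1 signs s). This spreads out large
    # entries so the transformed weights have no outliers and quantize well;
    # the transforms are inverted (or folded into adjacent ops) at inference.
    m, n = W.shape
    s_left = torch.randint(0, 2, (m,)) * 2.0 - 1.0
    s_right = torch.randint(0, 2, (n,)) * 2.0 - 1.0
    U = random_hadamard(m) * s_left   # H_m diag(s_left)
    V = random_hadamard(n) * s_right  # H_n diag(s_right)
    return U @ W @ V.T, U, V

W = torch.randn(128, 256)
W_inc, U, V = incoherence_process(W)
# Orthogonality makes the transform lossless: U.T @ W_inc @ V recovers W.
assert torch.allclose(U.T @ W_inc @ V, W, atol=1e-4)
```

After this transform the entries of W_inc look roughly i.i.d. Gaussian with no large outliers, which is what makes aggressive lattice quantization down to 2 bits workable.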

KnutJaegersberg commented 8 months ago

Thanks. I'm requantizing the weights with the Hessians I already have, using the latest version. I've finished the orca-70b model and am uploading it to HF now. Once that's done, I'll also upload the Hessians :)

Thanks for making this!