Closed KnutJaegersberg closed 8 months ago
Not really. The main difference between the two E8P versions is that version 1 packs the codebook, uses a different indexing scheme, and uses the new Fused/QuantizedLinear classes. You could try to do an index lookup and repack your existing weights, but it would probably be faster to requantize your model. How much time does requantizing take? I'm guessing most of your week was spent generating Hessians; quantizing the model should only take a few hours.
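To illustrate why reusing the Hessians makes sense: the slow part of this family of methods is accumulating a per-layer Hessian proxy over calibration activations, while quantizing a layer given that Hessian is comparatively cheap. A minimal sketch of that split, using simple GPTQ-style error feedback as a stand-in for QuIP#'s actual lattice rounding (all names here are hypothetical, not quip-sharp's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def accumulate_hessian(X):
    # X: (num_samples, in_features) calibration activations for one layer.
    # This pass over the data is the expensive, reusable artifact.
    return X.T @ X / len(X)

def quantize_with_hessian(W, H, scale=0.1):
    # Simplified GPTQ-style rounding: quantize one input column at a time and
    # spread the rounding error onto not-yet-quantized columns via the
    # (damped) inverse Hessian. Cheap relative to building H.
    Hinv = np.linalg.inv(H + 1e-4 * np.eye(len(H)))
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = np.round(W[:, i] / scale) * scale
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q

X = rng.standard_normal((1024, 16))   # calibration activations
W = rng.standard_normal((16, 16))     # layer weights (out_features, in_features)
Q = quantize_with_hessian(W, accumulate_hessian(X))
```

Since only the rounding scheme changed between versions, saved Hessians stay valid and only the second, fast step needs rerunning.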
Yeah, I had the same thought this morning too. Maybe I can reuse the Hessians. I'll try that once I get the compute again.
Quantizing and conversion to HF format takes less than 24 hours. It's important to note that I'm just here out of practical interest, to explore how to squeeze the most out of my hardware. I haven't yet fully understood how QuIP# works.
Is this a good introduction? https://www.youtube.com/watch?v=6wEVz0wkhCM
QuIP# uses incoherence from original QuIP (https://openreview.net/forum?id=xrk9g5vcXR) and lattice codebooks to do quantization. I would recommend reading the QuIP# blog post (https://cornell-relaxml.github.io/quip-sharp/) to learn more about QuIP#. To learn more about original QuIP, you can read the paper and/or look at the NeurIPS video + poster (https://neurips.cc/virtual/2023/poster/69982).
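As a rough intuition for the incoherence idea: the weight matrix is multiplied by random orthogonal matrices so that no individual entry dominates, quantized in that rotated basis, and the rotations are then undone (or fused into adjacent operations). A minimal NumPy sketch of this idea, with naive uniform rounding standing in for the lattice codebook (this is an illustration, not the quip-sharp implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_incoherent(W, levels=4):
    U = random_orthogonal(W.shape[0])
    V = random_orthogonal(W.shape[1])
    W_inc = U @ W @ V.T                    # rotated, "incoherent" weights
    scale = np.abs(W_inc).max() / (levels / 2)
    W_q = np.round(W_inc / scale) * scale  # naive rounding stand-in for the
                                           # E8P lattice codebook
    return U.T @ W_q @ V, U, V             # undo the rotations

W = rng.standard_normal((8, 8))
W_hat, U, V = quantize_incoherent(W)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

The orthogonal transforms preserve the layer's function up to the rotations while flattening outliers, which is what lets a very coarse codebook work at 2 bits.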
Thanks. I'm requantizing the weights with the Hessians I got, using the latest version. I've finished the Orca 70B model and am uploading it to HF now. Once that's done, I'll also upload the Hessians :)
Thanks for making this!
I started quantizing an LLM with a slightly older version of your library. It took a while to calculate 8k-context Hessians, so by the time it finally finished, the library version I had used was already outdated. I get the exception below when I try to do inference with the model on your newer library version. Inference does work with the older version. Are my weights, which took a week to calculate, now quasi-deprecated?
https://huggingface.co/KnutJaegersberg/Tess-M-34B-2bit
Traceback (most recent call last):
  File "/home/knut/New Folder/quip-sharp/hfize_llama.py", line 126, in <module>
    main(args)
  File "/home/knut/New Folder/quip-sharp/hfize_llama.py", line 112, in main
    outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 1606, in generate
    return self.greedy_search(
  File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
    outputs = self(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 1056, in forward
    outputs = self.model(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 943, in forward
    layer_outputs = decoder_layer(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 652, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/model/llama.py", line 453, in forward
    query_states, key_states, value_states = self.qkv_proj(hidden_states.to(torch.float32))
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/knut/New Folder/quip-sharp/lib/linear/fused_quantized_linear.py", line 20, in forward
    fused_output = super(FusedQuantizedLinear, self).forward(input)
  File "/home/knut/New Folder/quip-sharp/lib/linear/quantized_linear.py", line 86, in forward
    raise Exception(
Exception: Saved weights version (0) does not match the codebook version (1). Please download the latest weights from https://huggingface.co/relaxml
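The final frame shows the weights carry a version tag that the linear layer checks against its codebook at forward time. A minimal sketch of this kind of guard (hypothetical names; the real check lives in quip-sharp's quantized_linear.py at the frame above):

```python
# Hypothetical version guard matching the exception above: quantized weights
# record the codebook version they were packed with, and the forward pass
# refuses to run them against a codebook with a different layout.
CODEBOOK_VERSION = 1  # version 1 packs the codebook with a new indexing scheme

class VersionedQuantizedLinear:
    def __init__(self, saved_version):
        self.saved_version = saved_version  # read from the checkpoint

    def check_version(self):
        if self.saved_version != CODEBOOK_VERSION:
            raise Exception(
                f"Saved weights version ({self.saved_version}) does not match "
                f"the codebook version ({CODEBOOK_VERSION}). Please download "
                "the latest weights from https://huggingface.co/relaxml")
```

Weights quantized with the old library carry version 0, so the check fires; requantizing (reusing the saved Hessians) produces version-1 weights that pass it.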