Upgrade BMInf to version 2.0.1 and try the examples in example/huggingface.
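For reference, a minimal sketch of what that looks like. The `bminf.wrapper` call follows the BMInf 2.x examples; the `quantization` keyword is inferred from this thread, so treat the exact argument names as assumptions that may differ in your version:

```python
# pip install --upgrade "bminf>=2.0.1"
import bminf
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

# Wrap the HF model so BMInf manages its parameters on the GPU;
# quantization=False avoids the int8 GEMM path discussed below
# (keyword name is an assumption based on this thread).
model = bminf.wrapper(model, quantization=False)
model = model.cuda()
```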
Wonderful!
Quantization is not supported, right? Some check breaks in HF. I cannot load GPT-J on Colab, but gpt2-large seems to load into 8 GB of VRAM. Is that right?
Yes, and I can fit it into much less, even 4 GB. Nice!
How much more would the quantization help? And is it feasible?
Quantization reduces memory usage by half and roughly doubles inference speed when the model is very large. BMInf automates quantization by replacing each linear layer with a quantized linear layer, although this can affect model quality.
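To make that concrete, here is an illustrative sketch of weight-only int8 quantization for a linear layer (not BMInf's actual implementation): weights are stored as int8 with a per-output-row scale, halving memory relative to fp16:

```python
import torch

# Illustrative sketch only, not BMInf's kernel.
class QuantizedLinear(torch.nn.Module):
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        w = linear.weight.data  # shape: (out_features, in_features)
        # One scale per output row, mapping the max magnitude to 127.
        scale = w.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8) / 127.0
        self.register_buffer("w_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize on the fly; a real int8 kernel multiplies in int8
        # and rescales afterwards, which is where the speedup comes from.
        w = self.w_int8.float() * self.scale
        return torch.nn.functional.linear(x, w, self.bias)
```

Dequantizing on the fly as above only saves memory; the doubled speed comes from doing the matrix multiply itself in int8.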
Doubling is a lot!
The problem is that setting quantization to True on the GPT-2 HF model causes an exception during text generation. Is there a way to prevent that?
```
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-d73609974428> in <module>
      1 prompt = "To be or not to be, that"
      2 input_ids = tokenizer(prompt, return_tensors="pt").input_ids
----> 3 gen_tokens = model.generate(
      4     input_ids.cuda(),
      5     do_sample=True,

8 frames
/usr/local/lib/python3.8/dist-packages/cpm_kernels/kernels/gemm.py in gemm_int8(m, k, n, batchA, batchB, aT, bT, A, B, out, stream)
    137     device.use()
    138
--> 139     assert m % 4 == 0 and n % 4 == 0 and k % 4 == 0
    140     assert batchA == batchB or batchA == 1 or batchB == 1
    141

AssertionError:
```
The problem you are encountering is probably because the int8 GEMM kernel requires the input matrix dimensions to be multiples of 4.
Ok, so this is about the internal dimensions of the model? If yes, then there isn't much to do about this, right?
No, it is about the batch size.

Right, it seems to work now.
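For anyone hitting this later, one plausible workaround (a sketch, not a confirmed fix from this thread) is to pad the tokenized input so its length is a multiple of 4, since batch size and sequence length together determine the GEMM dimensions; passing the attention mask explicitly also silences the warning above. The `pad_to_multiple_of` argument is a standard HF tokenizer option:

```python
# Sketch: pad the prompt so the sequence length is a multiple of 4,
# which the int8 GEMM kernel asserts on, and pass attention_mask explicitly.
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

inputs = tokenizer(
    "To be or not to be, that",
    return_tensors="pt",
    padding=True,
    pad_to_multiple_of=4,
)
gen_tokens = model.generate(
    inputs.input_ids.cuda(),
    attention_mask=inputs.attention_mask.cuda(),
    do_sample=True,
)
```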
Oh! I am hitting the same problem now. How did you solve it?
Is your feature request related to a problem? Please describe. For example, I cannot get HF BERT working, so I don't know when I can use your project.
Describe the solution you'd like Can you provide full examples with some known models from HF in a Colab notebook?
Describe alternatives you've considered