OpenBMB / BMInf

Efficient Inference for Big Models
Apache License 2.0

[FEATURE] full examples with some known models from HF in a Colab Notebook #63

Closed vackosar closed 1 year ago

vackosar commented 1 year ago

Is your feature request related to a problem? Please describe. For example, I cannot get HF BERT working. I don't know when I can use your project.

import bminf
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Replace me with any input text."

encoded_input_cpu = tokenizer(text, return_tensors='pt').to('cpu')
model = BertModel.from_pretrained("bert-base-uncased").to('cpu')
# apply wrapper
with torch.cuda.device(0):
    model = bminf.wrapper(model.to('cpu'))
    with print_time_delta('generate'):  # print_time_delta: my own timing helper
        output = model(**encoded_input_cpu)

Describe the solution you'd like: Can you provide full examples with some known models from HF in a Colab Notebook?

Describe alternatives you've considered

a710128 commented 1 year ago

Upgrade BMInf to version 2.0.1 and try the examples in example/huggingface.
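
A minimal sketch of what such an example might look like with a Hugging Face GPT-2 model (the generation arguments here are illustrative assumptions; the scripts in example/huggingface are the authoritative reference):

import bminf
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

# wrap the model so BMInf manages GPU memory during inference
with torch.cuda.device(0):
    model = bminf.wrapper(model)

prompt = "To be or not to be, that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids.cuda(), do_sample=True, max_length=32)
print(tokenizer.decode(gen_tokens[0]))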

vackosar commented 1 year ago

Wonderful!

Quantization is not supported, right? Some check breaks in HF. I cannot load GPT-J on Colab, but gpt2-large seems to load into 8 GB of VRAM. Is that right?

vackosar commented 1 year ago

Yes, and I can fit it into much less as well, even 4 GB. Nice!

vackosar commented 1 year ago

How much more would quantization help? And is it feasible?

a710128 commented 1 year ago

Quantization reduces memory usage by half and doubles inference speed when the model is very large. BMInf provides a way to automate quantization (by replacing linear layers with quantized linear layers), although this approach can affect model performance.
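
A minimal sketch of enabling this, assuming the flag is the quantization keyword mentioned below (check the example/huggingface scripts for the exact call):

import bminf
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-large")
with torch.cuda.device(0):
    # assumed flag name: replaces linear layers with int8 quantized
    # linear layers, roughly halving the memory used by the weights
    model = bminf.wrapper(model, quantization=True)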

vackosar commented 1 year ago

Doubling is a lot!

The problem is that setting quantization to True on the GPT-2 HF model causes an exception during text generation. Is there a way to prevent that?

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-d73609974428> in <module>
      1 prompt = "To be or not to be, that"
      2 input_ids = tokenizer(prompt, return_tensors="pt").input_ids
----> 3 gen_tokens = model.generate(
      4     input_ids.cuda(),
      5     do_sample=True,

8 frames
/usr/local/lib/python3.8/dist-packages/cpm_kernels/kernels/gemm.py in gemm_int8(m, k, n, batchA, batchB, aT, bT, A, B, out, stream)
    137     device.use()
    138 
--> 139     assert m % 4 == 0 and n % 4 == 0 and k % 4 == 0
    140     assert batchA == batchB or batchA == 1 or batchB == 1
    141 

AssertionError:

a710128 commented 1 year ago

The problem you are encountering is probably because the int8 GEMM kernel requires the input matrix dimensions to be multiples of 4.
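
For illustration only, and assuming the failing dimension comes from the tokenized sequence length rather than the model's internals, one workaround is to left-pad the prompt to a multiple of 4 and pass the attention mask explicitly (continuing the GPT-2 setup sketched above):

# assumption: pad the prompt so its token length is a multiple of 4;
# GPT-2 has no pad token, so reuse the EOS token and left-pad,
# which is the usual setup for decoder-only generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

ids = tokenizer(prompt, return_tensors="pt").input_ids
target_len = ((ids.shape[1] + 3) // 4) * 4   # round up to a multiple of 4
inputs = tokenizer(prompt, return_tensors="pt",
                   padding="max_length", max_length=target_len)
gen_tokens = model.generate(inputs.input_ids.cuda(),
                            attention_mask=inputs.attention_mask.cuda(),
                            do_sample=True)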

vackosar commented 1 year ago

OK, so is this about the internal dimensions of the model? If so, then there isn't much to do about it, right?

vackosar commented 1 year ago

No, it is about the batch size. Right, it seems to work now.

TodayWei commented 6 months ago

Oh! I am running into the same problem now. How did you solve it?