BlackSamorez / tensor_parallel

Automatically split your PyTorch models on multiple GPUs for training & inference
MIT License
629 stars 39 forks

Does not work with 4-bit quant #79

Closed laoda513 closed 1 year ago

laoda513 commented 1 year ago

Starting from the demo in the README (👍 🚀 Try new 20B LLMs demo in Kaggle), I switched to 4-bit:

```python
import accelerate
import torch
import transformers
import tensor_parallel as tp
from transformers import BitsAndBytesConfig
from transformers.utils.bitsandbytes import replace_with_bnb_linear

with accelerate.init_empty_weights():
    model = transformers.AutoModelForCausalLM.from_config(
        transformers.AutoConfig.from_pretrained(".../hf-LLaMA/13B")
    ).half()

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_use_double_quant=True,
    # bnb_4bit_quant_type='nf4',
)

model = tp.TensorParallelPreTrainedModel(
    model,
    device_ids=["cuda:0", "cuda:1", "cuda:2"],
)

model = replace_with_bnb_linear(model, ["lm_head"], None, nf4_config)
model.is_loaded_in_8bit = 1
model.is_loaded_in_4bit = True
```

For inference: it does not work without `.cuda()`: `RuntimeError: All tensors must be on devices[0]: 0`

but it works with `inputs = tokenizer("cat:", return_tensors="pt")["input_ids"].to("cuda:0")`

For training with peft LoRA it does not work. With `.to("cuda:0")`:

```
File "/home/user/miniconda3/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (10010x5120 and 1x4587520)
```

It also does not work with `.cuda()`: `RuntimeError: All tensors must be on devices[0]: 0`

laoda513 commented 1 year ago

BTW, is there any plan to support GPTQ?

BlackSamorez commented 1 year ago

@laoda513 Hi! I'm not familiar with GPTQ, so I'm not sure about adding support for it. About bitsandbytes 4-bit: is it even released yet? I thought it was still in closed beta.

laoda513 commented 1 year ago

> @laoda513 Hi! I'm not familiar with GPTQ, so I'm not sure about adding support for it. About bitsandbytes 4-bit: is it even released yet? I thought it was still in closed beta.

Yes, it's not released yet... Maybe I am too impatient 😂😂😂

laoda513 commented 1 year ago

Oh, my apologies... After upgrading peft to the main branch, it works~

BlackSamorez commented 1 year ago

@laoda513 To fix `RuntimeError: All tensors must be on devices[0]: 0`, simply put your input tensors on `cuda:0`. I don't know why it's suddenly necessary: any CUDA device worked in the past. I'll look into it.
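The fix above can be sketched as a tiny helper (hypothetical name `to_first_device`; `tensor_parallel` itself does not ship such a function) that moves every tensor in a tokenizer output onto the model's first device:

```python
import torch

def to_first_device(inputs: dict, device_ids) -> dict:
    # tensor_parallel expects all input tensors on devices[0];
    # move each tensor in the batch there before calling the model.
    first = torch.device(device_ids[0])
    return {k: v.to(first) for k, v in inputs.items()}

# Example with CPU tensors (use ["cuda:0", "cuda:1", "cuda:2"] on GPUs):
batch = {"input_ids": torch.tensor([[1, 2, 3]])}
moved = to_first_device(batch, ["cpu"])
```

This is just the device-placement convention spelled out; with real GPUs the call would be `to_first_device(tokenizer("cat:", return_tensors="pt"), model.devices)` or an explicit `.to("cuda:0")` as in the snippet earlier in the thread.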

BlackSamorez commented 1 year ago

I'll close this issue since #80 is very similar to it. Please continue the discussion there.