Closed: tj-cycyota closed this issue 1 year ago
Isn't load_in_8bit=True already doing that? It's passed to BitsAndBytesConfig and applied to the model, no?
Not in my testing. The only way it works (i.e. actually loads the model on a smaller GPU) is by setting model_kwargs={'load_in_8bit': True}.
So change it to this, right?
# Note: if you use dolly 12B or smaller model but a GPU with less than 24GB RAM, use 8bit. This requires %pip install bitsandbytes
# instruct_pipeline = pipeline(model=model_name, load_in_8bit=True, trust_remote_code=True, device_map="auto", model_kwargs={'load_in_8bit': True})
You have load_in_8bit=True in there twice. It should be this, with that param in model_kwargs:
# Note: if you use dolly 12B or smaller model but a GPU with less than 24GB RAM, use 8bit. This requires %pip install bitsandbytes
# instruct_pipeline = pipeline(model=model_name, trust_remote_code=True, device_map="auto", model_kwargs={'load_in_8bit': True})
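To see why only the model_kwargs placement takes effect, here is a minimal pure-Python mock of the forwarding behavior described in this thread. This is not transformers' actual implementation; mock_pipeline and mock_from_pretrained are hypothetical stand-ins for pipeline() and AutoModelForCausalLM.from_pretrained(), used only to illustrate that keys inside model_kwargs reach the model loader while an unrecognized top-level kwarg does not.

```python
def mock_from_pretrained(name, **kwargs):
    # Stand-in for the model loader; records which kwargs actually reached it.
    return {"model": name, "loader_kwargs": kwargs}

def mock_pipeline(model, model_kwargs=None, **pipeline_kwargs):
    # Stand-in for pipeline(): only the contents of model_kwargs are
    # forwarded to the loader; other unknown kwargs never reach it.
    loaded = mock_from_pretrained(model, **(model_kwargs or {}))
    return {"loaded": loaded, "pipeline_kwargs": pipeline_kwargs}

# Passed top-level: the flag never reaches the loader, so nothing changes.
p1 = mock_pipeline("databricks/dolly-v2-3b", load_in_8bit=True)
assert "load_in_8bit" not in p1["loaded"]["loader_kwargs"]

# Passed via model_kwargs: the flag reaches the loader, enabling 8-bit loading.
p2 = mock_pipeline("databricks/dolly-v2-3b", model_kwargs={"load_in_8bit": True})
assert p2["loaded"]["loader_kwargs"]["load_in_8bit"] is True
```

This matches the behavior reported above: the top-level flag is silently dropped, so only the model_kwargs form actually loads the model in 8-bit.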
Thanks, I'm adding it in the next release.
In the Dolly demo notebook 03-Q&A-prompt-engineering-for-dolly, there is sample code provided (originally commented out) that looks like:
# Note: if you use dolly 12B or smaller model but a GPU with less than 24GB RAM, use 8bit. This requires %pip install bitsandbytes
# instruct_pipeline = pipeline(model=model_name, load_in_8bit=True, trust_remote_code=True, device_map="auto")
However, the correct way to pass the load_in_8bit param, according to the Databricks Dolly docs, is:
instruct_pipeline = pipeline(model=model_name, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto", return_full_text=True, max_new_tokens=256, top_p=0.95, top_k=50, model_kwargs={'load_in_8bit': True})