artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Open_Llama compatibility #59

Closed jav-ed closed 9 months ago

jav-ed commented 1 year ago

Open_Llama is licensed under Apache, so I prefer it over Meta's Llama. I can load and use the 3B version of Open_Llama to some extent. However, it does not give a single answer but several, and then starts having a chat with itself. Does anybody know how to get rid of this problem?

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "openlm-research/open_llama_3b_600bt_preview"

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0})

prompt = 'Q: What is the capital of Pakistan?\nA:'
# Move the input ids to the same device as the model (GPU 0)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32)

print(tokenizer.decode(generation_output[0]))

My second question is: how is it possible that I can load gpt-neox, which is around 40 GB, but get an error (not enough RAM) when trying to load the Open_Llama 7B version? I receive the same error when trying to load the new Falcon LLM.

hemangjoshi37a commented 1 year ago

Hello @jav-ed,

I appreciate your preference for the Apache-licensed Open_Llama over Meta's Llama. Your problem is intriguing, and I believe it comes down to how the generate function works. The generate function by default continues generating text until it reaches a specified limit or comes across a stop sequence, which in the case of conversational models is often set to a token representing the end of a turn in conversation.

One potential way to address this is to use the eos_token_id parameter in the generate function, which would make the function stop generating when it encounters the end-of-sentence token. Here is an example:

# Stop generating as soon as the tokenizer's end-of-sequence token is produced
generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32, eos_token_id=tokenizer.eos_token_id)

Please note that this is a simplified solution and might require additional adjustments to work best with Open_Llama.

For your second question about the RAM issue with Open_Llama 7B and Falcon LLM, it seems like these models might have larger memory footprints due to their architecture or parameter configurations. It might be worthwhile to look into memory optimization strategies, such as gradient checkpointing, or using smaller batch sizes. If your current environment can't accommodate these models, consider using cloud-based solutions which offer flexible memory allocations.
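
One concrete way to compare checkpoints is to print the model's footprint right after loading. This is just a diagnostic sketch, assuming a recent transformers release where models expose get_memory_footprint(); it helps explain differences rather than fixing the out-of-memory error:

# Diagnostic: print how much memory the quantized model actually occupies,
# e.g. to compare gpt-neox against the Open_Llama 7B checkpoint.
print(f"model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")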

Best, @hemangjoshi37a

jav-ed commented 1 year ago

Thank you @hemangjoshi37a for your fast reply. I also asked a similar question at https://github.com/openlm-research/open_llama. Their answer to why multiple responses are generated is that fine-tuning is required (https://github.com/openlm-research/open_llama/issues/29):

This is expected. Since OpenLLaMA is a base model, you'll need to finetune it yourself to make it a chatbot that answers your questions. This is called instruction finetuning and is exactly what recent works like Alpaca, Vicuna and Koala did.

Anyhow, I tested your suggestion:

We can see that it improves the output quality: twice the model responds with Islamabad, and the last time it might also be about to say Islamabad, which is correct (Pakistan and Karachi are not the capital of Pakistan). Still, it would be great to get only one answer, so maybe no fine-tuning is required and only some settings need to be modified.
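
If the goal is just a single short answer from the base model without fine-tuning, one workaround is to post-process the generated text and cut it off at the first self-generated follow-up question. This is only a rough sketch on top of the snippet above (prompt, tokenizer, and model are the ones defined there; the truncation rule is my own assumption, not something from the OpenLLaMA docs):

generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32, eos_token_id=tokenizer.eos_token_id)

# Decode, drop the prompt, and keep only the text before the next "Q:" turn
# that the base model tends to append on its own.
full_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
answer = full_text[len(prompt):].split("Q:")[0].strip()
print(answer)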


Also, thank you for answering the second question. Unfortunately, I don't see any benefit in scaling up GPU RAM or regular RAM. I believe a T4 with around 15 GB of GPU RAM should be more than enough; if a model requires more than that, I consider it not to be a viable option.

jav-ed commented 1 year ago

There is now also OpenAlpaca, and it does work with qlora. Note that while Open_Llama is indeed open, OpenAlpaca might not be considered fully open: https://github.com/yxuansu/OpenAlpaca

The data, i.e. openalpaca.json, we use to fine-tune the model contains ~15k instances and is constructed from the databricks-dolly-15k dataset by removing samples that are too long. Following the original databricks-dolly-15k dataset, our data is also licensed under the CC BY-SA 3.0 license which allows it to be used in any academic and commercial purposes.

Here is example code showing how to run OpenAlpaca: https://github.com/yxuansu/OpenAlpaca/issues/3#issuecomment-1566783813
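
For reference, loading OpenAlpaca with the same 4-bit setup as above mainly changes the checkpoint name and the prompt template. This is a rough sketch only: the model id and the Alpaca-style prompt below are taken from memory of the OpenAlpaca README, so verify both against https://github.com/yxuansu/OpenAlpaca before relying on them.

# Sketch: load OpenAlpaca in 4-bit, reusing the BitsAndBytesConfig from above.
alpaca_id = "openllmplayground/openalpaca_3b_600bt_preview"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(alpaca_id)
model = AutoModelForCausalLM.from_pretrained(
    alpaca_id, quantization_config=bnb_config, device_map={"": 0})

# Alpaca-style instruction prompt (single instruction, no input field).
prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\nWhat is the capital of Pakistan?\n\n### Response:")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids=input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))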