BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNN and transformer: great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

How to finetune RWKV? #94

Open kxzxvbk opened 1 year ago

kxzxvbk commented 1 year ago

Hi, thanks for your work :) I'm wondering how I can fine-tune RWKV given a pretrained model. I know there is one repo (https://github.com/Blealtan/RWKV-LM-LoRA) that uses LoRA for fine-tuning, but I suppose that repo is not good enough, for the following reasons:

zeroplum commented 1 year ago

https://github.com/BlinkDL/RWKV-v2-RNN-Pile

kxzxvbk commented 1 year ago

https://github.com/BlinkDL/RWKV-v2-RNN-Pile

What kind of fine-tuning method does this use? I think it tunes all the parameters in the model?

kxzxvbk commented 1 year ago

I found a good solution to this problem. Since the latest version of transformers supports RWKV, I can now use peft to fine-tune RWKV. Here is the demo code:

from transformers import AutoTokenizer, RwkvForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Apply LoRA only to the channel-mixing value projection.
target_modules = ["feed_forward.value"]
config = LoraConfig(
    r=4, lora_alpha=16, target_modules=target_modules, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM"
)

# Replace URL_OF_HUGGINGFACE with the Hub id of the RWKV checkpoint you want to tune.
tokenizer = AutoTokenizer.from_pretrained("URL_OF_HUGGINGFACE", trust_remote_code=True)
model = RwkvForCausalLM.from_pretrained("URL_OF_HUGGINGFACE", trust_remote_code=True)
model = prepare_model_for_int8_training(model)  # freeze base weights and prepare for low-precision training
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

# Quick sanity check: one forward pass with labels to get a loss.
lora_model.train()
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = lora_model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs.loss, outputs.logits
print(loss, logits)
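
A rough sketch of wrapping this into a small training loop (the text samples, learning rate, and epoch count below are placeholders, adapt them to your own data):

import torch

# Placeholder data; in practice, load your own corpus here.
texts = ["Hello, my dog is cute.", "RWKV is an RNN with transformer-level performance."]
batches = [tokenizer(t, return_tensors="pt") for t in texts]

# Only the LoRA parameters require gradients after get_peft_model.
optimizer = torch.optim.AdamW(
    (p for p in lora_model.parameters() if p.requires_grad), lr=1e-4
)

lora_model.train()
for epoch in range(3):
    for batch in batches:
        outputs = lora_model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
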
muhammed-saeed commented 1 year ago

Assume that I have training data (JSON or TSV) in the format {"instruction": "THE INSTRUCTION", "input": "THE INPUT", "output": "DESIRED OUTPUT"}. How can I modify your peft code to work with this data?

kxzxvbk commented 1 year ago

Assume that I have training data (JSON or TSV) in the format {"instruction": "THE INSTRUCTION", "input": "THE INPUT", "output": "DESIRED OUTPUT"}. How can I modify your peft code to work with this data?

I hope this repo can help you: https://github.com/tatsu-lab/stanford_alpaca
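
If it helps, here is a rough sketch of turning data in that {"instruction", "input", "output"} format into plain training text using an Alpaca-style prompt template (the template wording and the train.json filename are assumptions, not something fixed by RWKV):

import json

# Alpaca-style prompt templates (assumed, adapt as needed).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def build_text(example):
    # Use the "with input" template only when the input field is non-empty.
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)

with open("train.json") as f:  # placeholder file name
    records = json.load(f)

texts = [build_text(r) for r in records]
# Each text can then be tokenized and passed to the peft model as in the demo above:
#   inputs = tokenizer(texts[0], return_tensors="pt")
#   outputs = lora_model(**inputs, labels=inputs["input_ids"])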

muhammed-saeed commented 1 year ago

Assume that I have training data (JSON or TSV) in the format {"instruction": "THE INSTRUCTION", "input": "THE INPUT", "output": "DESIRED OUTPUT"}. How can I modify your peft code to work with this data?

I hope this repo can help you: https://github.com/tatsu-lab/stanford_alpaca

Thanks for your response. I have a question: can I use the same training code there, but instead of passing a LLaMA model, pass an RWKV model?

SetoKaiba commented 1 year ago

I found a good solution to this problem. Since the latest version of transformers supports RWKV, I can now use peft to fine-tune RWKV. Here is the demo code:

Can this code snippet be used to fine-tune the World model? It seems that the World model uses a different tokenizer and vocab list.

winglian commented 10 months ago

I found a good solution to this problem. Since the latest version of transformers supports RWKV, I can now use peft to fine-tune RWKV. Here is the demo code:

I assume this is only for RWKV-4? @BlinkDL, is there any timeline for getting RWKV-5 into transformers?

EasonXiao-888 commented 5 months ago

Hello, I want to fine-tune RWKV with a 4096 context length, but it raises an error here:

if seq_len > rwkv_cuda_kernel.max_seq_length:
    raise ValueError(
        f"Cannot process a batch with {seq_len} tokens at the same time, use a maximum of "
        f"{rwkv_cuda_kernel.max_seq_length} with this model."
    )

Have you encountered this, or do you know how to solve it?
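
No fix is given in this thread. One possible workaround, sketched below under the assumption that the Hugging Face RwkvForCausalLM forward pass accepts a state argument and returns an updated state, is to feed the long sequence in chunks that stay under the kernel limit and carry the recurrent state across chunks. Loading the model with a larger context_length, so the CUDA kernel is compiled for longer sequences, might be another option, but that is also an assumption.

from transformers import AutoTokenizer, RwkvForCausalLM

# Sketch only (assumption, not confirmed in this thread): chunk a long sequence and
# carry the recurrent state between chunks so that no single forward pass exceeds
# the CUDA kernel's max_seq_length.
tokenizer = AutoTokenizer.from_pretrained("URL_OF_HUGGINGFACE", trust_remote_code=True)
model = RwkvForCausalLM.from_pretrained("URL_OF_HUGGINGFACE", trust_remote_code=True)

long_text = "..." * 2048  # placeholder for a long training sample
input_ids = tokenizer(long_text, return_tensors="pt")["input_ids"]

chunk_len = 1024          # keep this at or below the kernel limit from the error message
state = None
total_loss = 0.0

model.train()
for start in range(0, input_ids.shape[1], chunk_len):
    chunk = input_ids[:, start:start + chunk_len]
    outputs = model(input_ids=chunk, labels=chunk, state=state, use_cache=True)
    state = outputs.state  # recurrent state carried over to the next chunk
    total_loss = total_loss + outputs.loss

total_loss.backward()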