RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Unexpected Outputs #64

Closed · xdevfaheem closed this issue 1 year ago

xdevfaheem commented 1 year ago

I'm using the https://huggingface.co/xzuyn/RWKV-4-Raven-7B-v11x-Eng99-Other1-20230429-ctx8192-ggml-q5_1 ggml weights with a modified rwkv.cpp inference script, which is:

import time
from typing import Optional, List

import fire
import tokenizers

from rwkv_utils import sampling
from rwkv_utils import rwkv_cpp_model
from rwkv_utils import rwkv_cpp_shared_library

class RWKV_LLM:

    rwkv_model: Optional[rwkv_cpp_model.RWKVModel] = None

    def __init__(
        self,
        model_path: Optional[str],
        temperature: float = 0.8,
        top_p: float = 0.5,
        max_tokens: int = 100,
        tokenizer_path: Optional[str] = "../utils/20B_tokenizer.json"
    ):
        super().__init__()
        self.model_path = model_path
        self.temperature = temperature
        self.top_p = top_p
        self.max_tokens = max_tokens
        self.tokenizer_path = tokenizer_path

        assert self.model_path, "Please Provide The Path of the Model"
        assert self.tokenizer_path, "Please Provide The Path of the RWKV Tokenizer"
        assert self.temperature, "Please Provide The Temperature"
        assert self.top_p, "Please Provide The Top Probability for Sampling"
        assert self.max_tokens, "Please Provide Max Token to Generate"

    def generate_prompt(self, instruction: str, input_ctxt: Optional[str] = None) -> str:
        if input_ctxt:
            return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_ctxt}

### Response:"""
        else:
            return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

    def initialize_model(self):
        # load tokenizer
        self.tokenizer = tokenizers.Tokenizer.from_file(str(self.tokenizer_path))

        # load RWKV Model
        library = rwkv_cpp_shared_library.load_rwkv_shared_library()
        print(f'System info: {library.rwkv_get_system_info_string()}')
        print('Loading RWKV model....')
        # thread_count=2 is conservative; matching the number of physical CPU cores is usually faster
        self.rwkv_model = rwkv_cpp_model.RWKVModel(library, self.model_path, thread_count=2)
        print('Loaded Successfully....')

    def ask(self, prompt: str, stop: Optional[List[str]] = None) -> None:

        # `stop` sequences are accepted for API compatibility but not handled yet
        if stop is not None:
            pass

        if self.rwkv_model is None:
            self.initialize_model()

        # Generate a completion from the RWKV model based on the prompt
        prompt = self.generate_prompt(prompt)
        prompt_tokens = self.tokenizer.encode(prompt).ids
        print(f'{len(prompt_tokens)} tokens in prompt')

        init_logits, init_state = None, None

        # Feed the prompt token by token to build up the initial RWKV state
        for token in prompt_tokens:
            init_logits, init_state = self.rwkv_model.eval(token, init_state, init_state, init_logits)

        start = time.time()

        # Clone so the prompt state and logits could be reused for another generation
        logits, state = init_logits.clone(), init_state.clone()

        for i in range(self.max_tokens):

            token = sampling.sample_logits(logits, self.temperature, self.top_p)
            print(self.tokenizer.decode([token]), end='')
            logits, state = self.rwkv_model.eval(token, state, state, logits)

        delay = time.time() - start
        print(']\n\nTook %.3f sec, %d ms per token' % (delay, delay / self.max_tokens * 1000))

def main(model_path):
    llm = RWKV_LLM(model_path)
    llm.ask(input("> "))

if __name__ == '__main__':
    fire.Fire(main)

I expected it to generate good outputs, but I got the following:

> Write a Poem About AI
System info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
Loading RWKV model....
Loaded Successfully....
35 tokens in prompt
/content/Intellique/llms/rwkv_utils/rwkv_cpp_model.py:100: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  state_out.storage().data_ptr(),

Ai, the machine that thinks and dreams,
A powerful force that cannot be stopped,
A machine that knows no bounds,
A machine that dreams and thinks and dreams.
Ai, the machine that dreams,
A machine that dreams,
Ai, the machine that dreams,
A machine that dreams,
Ai, the machine that dreams,
A machine that dreams,
Ai, the machine that dreams,
A machine that dreams,
A]

I tried multiple times with multiple prompts, but got no luck.

So, is there anything I can do to get good outputs, or what's the problem here?

saharNooby commented 1 year ago

Looks like typical LLM behavior to me. If something was broken, you would see complete gibberish of random tokens, or a single token would repeat infinitely.

I suggest implementing a repetition penalty (presence penalty) to decrease repetitions; there is an example in chat_with_bot.py.

Other suggestions:

  • for Raven models, use the prompt format recommended by BlinkDL -- here are example prompts that should work
  • use 14B instead of 7B
  • use less confusing (for the model) input formatting: Write a Poem About AI -> Write a poem about AI
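For illustration, a minimal sketch of such a penalty in the generation loop from your script (the names come from that script; the penalty values are assumptions to tune, in the spirit of chat_with_bot.py):

# Sketch only: penalize tokens that were already generated before sampling the next one.
# presence_penalty / frequency_penalty values are assumptions; tune to taste.
presence_penalty = 0.2
frequency_penalty = 0.2
token_counts = {}  # token id -> times generated so far

for i in range(self.max_tokens):
    # Lower the logits of every token seen so far
    for tok, count in token_counts.items():
        logits[tok] -= presence_penalty + count * frequency_penalty

    token = sampling.sample_logits(logits, self.temperature, self.top_p)
    token_counts[token] = token_counts.get(token, 0) + 1

    print(self.tokenizer.decode([token]), end='')
    logits, state = self.rwkv_model.eval(token, state, state, logits)

This keeps already-emitted tokens from dominating the distribution without changing anything else in the loop.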

xdevfaheem commented 1 year ago

> Looks like typical LLM behavior to me. If something was broken, you would see complete gibberish of random tokens, or a single token would repeat infinitely.
>
> I suggest implementing a repetition penalty (presence penalty) to decrease repetitions; there is an example in chat_with_bot.py.
>
> Other suggestions:
>
> • for Raven models, use the prompt format recommended by BlinkDL -- here are example prompts that should work
> • use 14B instead of 7B
> • use less confusing (for the model) input formatting: Write a Poem About AI -> Write a poem about AI

Oh... so it doesn't seem like gibberish to you, mate.

I was a bit excited to run RWKV 7B on rwkv.cpp when I found out about it, but after hours of setting it up, I got this output. I expected RWKV to give good results.

So how should I get good results with faster inference? I'll be sure to check your advice above.

xdevfaheem commented 1 year ago

BTW, I'm using the prompt format from the official RWKV Hugging Face Space:


def generate_prompt(instruction, input=None):
    instruction = instruction.strip().replace('\r\n','\n').replace('\n\n','\n')
    # Guard against input=None (the original calls .strip() on it unconditionally)
    input = input.strip().replace('\r\n','\n').replace('\n\n','\n') if input else None
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# Instruction:
{instruction}

# Input:
{input}

# Response:
"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

# Instruction:
{instruction}

# Response:
"""