Closed hhjin closed 11 months ago
Looks like you are using the wrong tokenizer... or the wrong model. Try with Raven. Also use a bigger one, maybe 3B.
Hi!

You are using an RWKV World model, which uses the `world` tokenizer. By default, `generate_completions.py` / `chat_with_bot.py` uses the `20B` tokenizer, which will give garbage output when used with an RWKV World model. You need to explicitly specify the `world` tokenizer when running the script:

```shell
python rwkv/chat_with_bot.py rwkv-cpp-world-1.5B-q8_0.bin world
```
> Or wrong model. Try with raven. Also use bigger one, maybe 3B
From my experience, even 1B5 models are fluent and (when used with the correct tokenizer) generate okay texts.
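The mechanism behind the garbage output can be illustrated without RWKV at all: token IDs produced by one tokenizer's vocabulary are meaningless under another's. A minimal sketch with two made-up toy vocabularies (both hypothetical, not the real `world` or `20B` vocabs):

```python
# Toy illustration: decoding token IDs with the wrong vocabulary yields garbage.
# Both vocabularies here are invented for the example.
world_vocab = {0: "Hello", 1: ",", 2: " world", 3: "!"}
other_vocab = {0: "foo", 1: "##bar", 2: "<unk>", 3: "qux"}

ids = [0, 1, 2, 3]  # token IDs as produced by the "world"-style tokenizer

print("".join(world_vocab[i] for i in ids))  # -> Hello, world!
print("".join(other_vocab[i] for i in ids))  # -> foo##bar<unk>qux
```

The model itself emits sensible token IDs either way; only the decoding step differs, which is why the fix is purely a matter of passing the right tokenizer name.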
Thanks. I get the desired output after specifying the tokenizer type. The 7B World `q8_0` model runs at about 10 tokens/sec on my 16 GB M1 MacBook.

```
python rwkv/generate_completions.py rwkv-cpp-readflow-7B-ctx32k-q8_0.bin world
Loading world tokenizer
System info: AVX=0 AVX2=0 AVX512=0 FMA=0 NEON=1 ARM_FMA=1 F16C=0 FP16_VA=1 WASM_SIMD=0 BLAS=1 SSE3=0 VSX=0
Loading RWKV model
91 tokens in prompt
--- Generation 0 ---
```
# rwkv.cpp
This is a port of [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) to [ggerganov/ggml](https://github.com/ggerganov/ggml).
Besides usual **FP32**, it supports **FP16** and **quantized INT4** inference on CPU. This project is **CPU only**.[
# Example
```cpp
#include "ggml.h"
#include "common.h"
#include "logger.h"
int main()
{
// Load model
model::Ptr model = model::load("E:/workspace/ggml/examples/yolo/yolo.onnx");
// Convert input to float
real input_data = 1.0;
real input_data_f = model->input_to_float]
```
Took 9.785 sec, 97 ms per token
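As a sanity check, the per-token latency in the log is consistent with the throughput reported above: 97 ms per token works out to roughly 10 tokens/sec.

```python
# Convert the logged per-token latency into throughput.
ms_per_token = 97  # from the log line "97 ms per token"
tokens_per_sec = 1000 / ms_per_token

print(round(tokens_per_sec, 1))  # -> 10.3
```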
I successfully compiled and converted the quantized model on my Mac M1, but when I use `chat_with_bot.py`, the output is meaningless words or long runs of repeated characters. The same thing happens on a Windows machine. Is anything wrong? Is it related to the model? I downloaded the latest RWKV World 4, and several different models have this problem.