MiuLab / Taiwan-LLM

Traditional Mandarin LLMs for Taiwan
https://twllm.com
Apache License 2.0

Inference with Taiwan-LLaMa-13b-1.0.Q8_0.gguf produces blank answers #32

Closed · gymeee0715 closed this 1 year ago

gymeee0715 commented 1 year ago

Hi, I'm using Taiwan-LLaMa-13b-1.0.Q8_0.gguf from Hugging Face, and I run inference with the following code:

from ctransformers import AutoModelForCausalLM

# Load the GGUF weights through ctransformers
llm = AutoModelForCausalLM.from_pretrained(
    "audreyt/Taiwan-LLaMa-v1.0-GGUF",
    model_file="Taiwan-LLaMa-13b-1.0.Q8_0.gguf",
    model_type="llama",
    gpu_layers=1,
)

while True:
    print("\n")
    a = "你好"
    prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT: """
    prompt = prompt_template.format(a)
    print(llm(prompt))

The output is always blank. My ctransformers version is 0.2.27, torch version is 1.13.1, and I'm on macOS 13.5.

penut85420 commented 1 year ago

My guess is that this is a ctransformers issue: with the same model weights, running this prompt directly through llama.cpp works fine. You can override llm.is_eos_token to bypass that mechanism, then tweak the prompt template slightly and end generation with a stop token instead:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "audreyt/Taiwan-LLaMa-v1.0-GGUF",
    model_file="Taiwan-LLaMa-13b-1.0.Q8_0.gguf",
    model_type="llama",
    gpu_layers=100,
)

# Disable the built-in EOS token check
llm.is_eos_token = lambda x: False

while True:
    print("\n")
    a = "你好"
    # Switch to the "### USER: {}\n### ASSISTANT: " format
    prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. ### USER: {}\n### ASSISTANT: """
    prompt = prompt_template.format(a)

    # End generation when "###" appears in the output
    print(llm(prompt, stop=["###"]))
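
If you want to double-check that an early EOS really is what makes the high-level call come back empty (rather than, say, the prompt being mangled), you could peek at the raw token stream with ctransformers' lower-level tokenize/generate/detokenize methods. This is just a hypothetical debugging sketch of mine, reusing the llm loaded above; run it against a freshly loaded model, before the is_eos_token override, so the default EOS handling is still active:

# Hypothetical debugging sketch (not from the original reply): print the first
# few raw tokens generated for the prompt. If the loop prints nothing, or only
# a token or two, generation is stopping on EOS almost immediately, which is
# why the high-level llm(prompt) call returns an empty string.
prompt = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: 你好 ASSISTANT: "
tokens = llm.tokenize(prompt)
for i, token in enumerate(llm.generate(tokens)):
    print(i, token, repr(llm.detokenize(token)))
    if i >= 10:  # only inspect the first few tokens
        break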
gymeee0715 commented 1 year ago

@penut85420 Thanks for the reply. The model now produces responses, but it also generates a lot of unrelated text, like the following.

Input: 早安
Output: 早安!我今天可以如何協助您? **這是一段對話** **輸入對話內容** *回答問題* (C) 早安!我可以如何協助您?

---

**提示:** 使用上面的對話來自動化與使用者的對話。

**版本:** v0.3

**擁有者:** Simeon Williams

**跨平台社交媒體ID:** Sim2K @ [Twitter](https://Twitter.com/Sim2K),[Instagram](https://Instagram.com/Sim2K),[LinkedIn](https://linkedin.com/in/sim2k),[Facebook](https://www.facebook.com/Sim2K/)。

**聯絡電報:** [@Sim2K on Telegram](https://t.me/Sim2K)

**目的:** 
penut85420 commented 1 year ago

@gymeee0715 Right. Because the EOS check has been bypassed, when the model emits EOS and nothing stops it, it just keeps generating text more or less at random.

So you need the prompt template to define an additional stopping point for generation, as in my reply above: prefix USER and ASSISTANT with ### and pass "###" in the stop parameter.

Alternatively, consider switching to LangChain, which also supports the llama.cpp (ggml) framework under the hood. See the following example:

"""
Environment:

conda create -n tw-llama python=3.11
conda activate tw-llama
conda install -c "nvidia/label/cuda-11.8.0" cuda
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
pip install langchain
"""

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {user_message} ASSISTANT: "
)
prompt = prompt.format(user_message="你好")

llm = LlamaCpp(
    model_path="Taiwan-LLaMa-13b-1.0.Q8_0.gguf",
    n_gpu_layers=100,
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    verbose=True,
)

resp = llm(
    prompt,
    callbacks=[StreamingStdOutCallbackHandler()],
    stop=["\n\n", "\n", "###"],
)

print(f"resp: {resp}")

For more details, see the official LangChain documentation.
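
If you build further on LangChain, the PromptTemplate and LlamaCpp objects can also be wrapped in a chain so that formatting and generation happen in one call. A minimal sketch, assuming the classic LLMChain API of the LangChain version used here and reusing the llm instance defined above:

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Minimal sketch (my assumption, not from the original reply): keep the
# template as a PromptTemplate and let the chain handle the formatting.
chat_prompt = PromptTemplate.from_template(
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {user_message} ASSISTANT: "
)
chain = LLMChain(llm=llm, prompt=chat_prompt)
print(chain.run(user_message="你好"))

If you still need the "###" stop strings here, they would presumably have to be configured on the LlamaCpp object itself rather than per call; check the LangChain docs for the exact parameter.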

gymeee0715 commented 1 year ago

@penut85420 Thanks for the reply. Following your example and switching to LangChain, I can now get responses:

from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

prompt = PromptTemplate.from_template(
    "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {user_message} ASSISTANT: "
)

llm = LlamaCpp(
    model_path="model/Taiwan-LLaMa-v1.0-GGUF/Taiwan-LLaMa-13b-1.0.Q8_0.gguf",
    n_gpu_layers=1,
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    verbose=True,
)
while True:

    prompt = prompt.format(user_message=input("Q:"))

    resp = llm(
        prompt,
        callbacks=[StreamingStdOutCallbackHandler()],
        stop=["\n\n", "\n", "###"],
    )

    print(f"resp: {resp}")

However, I run into the Llama.generate: prefix-match hit message and keep getting the same repeated sentence. I see the same problem with other gguf models as well.

Q:你好
你好!有什麼我可以幫助你的嗎?resp: 你好!有什麼我可以幫助你的嗎?
Q:我要怎麼去高雄
Llama.generate: prefix-match hit
哈囉!我今天可以如何協助你?resp: 哈囉!我今天可以如何協助你?
Q:我想知道唐鳳是誰
Llama.generate: prefix-match hit
哈囉!我今天可以幫你什麼忙嗎?resp: 哈囉!我今天可以幫你什麼忙嗎?
penut85420 commented 1 year ago

@gymeee0715 There is a small bug in this code: inside the loop, prompt is replaced by the formatted string on the first iteration, so the later format calls no longer substitute anything and every request sends the same prompt. That is also why llama.cpp prints Llama.generate: prefix-match hit (the new request shares its prefix with the previous, cached one) and why the model keeps answering the original question instead of the new ones.

while True:
    full_prompt = prompt.format(user_message=input("Q:"))
    resp = llm(
        full_prompt,
        callbacks=[StreamingStdOutCallbackHandler()],
        stop=["\n\n", "\n", "###"],
    )
    print(f"resp: {resp}")

Changing it like this should fix it.

gymeee0715 commented 1 year ago

@penut85420 Thanks for your help, the code runs correctly now. Much appreciated!