0781532 commented 1 year ago

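I used the function below to download and deploy the model, but the LLM's output is garbled (see screenshot). What is causing this, and how can I fix it? Thanks! (I know that the approach recommended on GitHub works without problems.)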

def load_full_model(model_id, model_basename, device_type, logging):
    """
    Load a full model using either LlamaTokenizer or AutoModelForCausalLM.

    This function loads a full model based on the specified device type.
    If the device type is 'mps' or 'cpu', it uses LlamaTokenizer and LlamaForCausalLM.
    Otherwise, it uses AutoModelForCausalLM.

    Parameters:
    - model_id (str): The identifier for the model on HuggingFace Hub.
    - model_basename (str): The base name of the model file.
    - device_type (str): The type of device where the model will run.
    - logging (logging.Logger): Logger instance for logging messages.

    Returns:
    - model (Union[LlamaForCausalLM, AutoModelForCausalLM]): The loaded model.
    - tokenizer (Union[LlamaTokenizer, AutoTokenizer]): The tokenizer associated with the model.

    Notes:
    - The function uses the `from_pretrained` method to load both the model and the tokenizer.
    - Additional settings are provided for NVIDIA GPUs, such as loading in 4-bit and setting the compute dtype.
    """

    if device_type.lower() in ["mps", "cpu"]:
        logging.info("Using LlamaTokenizer")
        tokenizer = LlamaTokenizer.from_pretrained(model_id, cache_dir="./models/")
        model = LlamaForCausalLM.from_pretrained(model_id, cache_dir="./models/")
    else:
        logging.info("Using AutoModelForCausalLM for full models")
        tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="./models/")
        logging.info("Tokenizer loaded")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            cache_dir=MODELS_PATH,
            # trust_remote_code=True, # set these if you are using NVIDIA GPU
            # load_in_4bit=True,
            # bnb_4bit_quant_type="nf4",
            # bnb_4bit_compute_dtype=torch.float16,
            # max_memory={0: "15GB"} # Uncomment this line with you encounter CUDA out of memory errors
        )
        model.tie_weights()
    return model, tokenizer

PenutChen commented 1 year ago

Only the weight-loading code is posted here. What code do you actually use for inference? The way inference and output are handled also affects what gets displayed.

0781532 commented 1 year ago

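1. This is the inference code, please take a look: https://github.com/PromtEngineer/localGPT/blob/main/run_localGPT.py I ran Taiwan-LLaMa-v1.0 with the same program as above and the output is garbled.

Embedding model: EMBEDDING_MODEL_NAME = "intfloat/multilingual-e5-large" # Uses 2.5 GB of VRAM

2. I also ran into another problem: running Taiwan-LLaMa-13b-1.0.Q4_0.gguf or Taiwan-LLaMa-13b-1.0.Q8_0.gguf with the same program gives normal output, but inference runs on the CPU and does not use the GPU at all, so responses take a long time (over 300 s per query).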

My machine has an Intel CPU, 64 GB of RAM, and two NVIDIA 3090 Ti GPUs.

If possible, could you help me figure out what is wrong with my program?

Thanks!

PenutChen commented 1 year ago

I can answer 2 first. CPU inference is bound to be very slow. For a 13B model, 0.7 to 1.0 s per token is entirely possible, so a whole query taking nearly five minutes is quite reasonable.
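The estimate above is simple arithmetic: total generation time is roughly the number of generated tokens times the per-token latency. A minimal sketch, where the function name is hypothetical and the 300-token answer length is an assumption chosen for illustration:

```python
def estimate_generation_seconds(num_tokens: int, seconds_per_token: float) -> float:
    """Rough generation time, ignoring prompt processing and model loading."""
    return num_tokens * seconds_per_token


# Assumed answer length of 300 tokens (illustrative only).
for spt in (0.7, 1.0):
    total = estimate_generation_seconds(300, spt)
    print(f"{spt:.1f} s/token -> {total:.0f} s (~{total / 60:.1f} min)")
```

At these rates, an answer of a few hundred tokens lands in the range of several minutes, which matches the timings reported here.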

0781532 commented 1 year ago

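Regarding question 2:

I set --device_type cuda for inference, but the GGUF model still runs on the CPU instead of the GPU. Is this something inherent to the GGUF model, or is it caused by the llama-cpp-python configuration? I don't know how to fix it.

Taiwan-LLaMa-v1.0 is not a GGUF or GGML model, so it immediately uses both NVIDIA 3090 Ti GPUs for inference.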

PenutChen commented 1 year ago

My guess is that --device_type cuda only affects HF inference. To use CUDA with gguf, llama-cpp-python must also be compiled with CUDA enabled when it is installed. Please see the llama-cpp-python documentation.

PenutChen commented 1 year ago

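I tested the CUDA build of llama-cpp-python and the speed is quite normal. In my opinion, the garbled output from HF Pipeline inference is probably a localGPT issue; the model weights themselves should be fine.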

0781532 commented 1 year ago

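Thank you for the quick answers.

I will look into where the problem in localGPT lies, or whether extra parameters need to be added to localGPT for it to run Taiwan-LLaMa-v1.0 correctly.

Thanks again!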

0781532 commented 1 year ago

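I have another question:

Is there any way to make inference faster with the Taiwan-LLaMa-13b-1.0.Q4_0.gguf or Taiwan-LLaMa-13b-1.0.Q8_0.gguf models?

I set up a machine on GCP with a 32-core CPU, 208 GB of RAM, and 2x NVIDIA L4 GPUs to run it, but responses still take quite a long time.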

# Create a pipeline for text generation

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=MAX_NEW_TOKENS,
    temperature=0.2,
    # top_p=0.95,
    repetition_penalty=1.15,
    generation_config=generation_config,
)

Thanks!

PenutChen commented 1 year ago

First, confirm whether llama-cpp-python has CUDA enabled. If it does, messages like these appear at startup:

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  128.29 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 13256 MB

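Single-batch llama.cpp is already quite fast. Alternatively, you could switch to another inference framework such as TGI or vLLM.

For long outputs, 20 seconds to two minutes is quite common on ordinary hardware. We would probably need a concrete time in seconds and the corresponding output length to tell whether there is room for improvement.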
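To report the figures asked for here, one option is to time a generation call and divide the number of generated tokens by the elapsed time. A minimal sketch: `measure_throughput` and `fake_generate` are hypothetical names, and the stand-in generator should be replaced with a real call to your model.

```python
import time
from typing import Callable, List, Tuple


def measure_throughput(
    generate: Callable[[str], List[str]], prompt: str
) -> Tuple[int, float, float]:
    """Return (token_count, elapsed_seconds, tokens_per_second) for one call."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    rate = len(tokens) / elapsed if elapsed > 0 else float("inf")
    return len(tokens), elapsed, rate


def fake_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for a model: sleeps briefly per token."""
    tokens = []
    for word in prompt.split():
        time.sleep(0.01)
        tokens.append(word)
    return tokens


count, seconds, rate = measure_throughput(fake_generate, "one two three four five")
print(f"{count} tokens in {seconds:.2f} s ({rate:.1f} tokens/s)")
```

Reporting both the token count and the elapsed time makes runs on different machines and models comparable.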

0781532 commented 1 year ago

Please look at a simple query I just ran: it took 87 s to answer, and more complex ones take up to 5 minutes.

PenutChen commented 1 year ago

@0781532 To double-check: when the program initializes at startup, does it print any CUDA-related messages?

0781532 commented 1 year ago

From what I can see, the GPU is activated.

PenutChen commented 1 year ago

offloaded 40/43 layers to GPU

It looks like Llama-13B has 43 layers in total, but you only loaded 40 of them onto the GPU, leaving the remaining 3 to run on the CPU. I suspect this is the cause.
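This check can be automated by scanning the startup log for the `offloaded N/M layers to GPU` line. A minimal sketch: the helper names are hypothetical, and the pattern assumes the log format quoted in this thread, which may differ in other llama.cpp versions.

```python
import re
from typing import Optional, Tuple

OFFLOAD_PATTERN = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offload(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) from a llama.cpp startup log, or None if absent."""
    match = OFFLOAD_PATTERN.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def fully_offloaded(log_text: str) -> bool:
    """True only when every layer reported in the log was offloaded to the GPU."""
    result = parse_offload(log_text)
    return result is not None and result[0] == result[1]


print(parse_offload("llm_load_tensors: offloaded 40/43 layers to GPU"))
print(fully_offloaded("llm_load_tensors: offloaded 43/43 layers to GPU"))
```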

I suggest setting N_GPU_LAYERS in constants.py to 100 to be safe. Please try again.
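For reference, when llama-cpp-python is used directly, the number of offloaded layers is passed as `n_gpu_layers` to the `Llama` constructor. A minimal sketch, assuming a CUDA-enabled build of llama-cpp-python is installed. The helper names, the model path, and the `n_ctx` value are hypothetical, and how localGPT forwards N_GPU_LAYERS internally is not shown in this thread.

```python
def gguf_load_kwargs(model_path: str, n_gpu_layers: int = 100, n_ctx: int = 2048) -> dict:
    """Build keyword arguments for llama_cpp.Llama (values are illustrative)."""
    return {
        "model_path": model_path,
        # Set higher than the model's layer count, as suggested in this thread,
        # so that no layer is left on the CPU.
        "n_gpu_layers": n_gpu_layers,
        "n_ctx": n_ctx,
    }


def load_gguf(model_path: str):
    """Load a GGUF model; needs llama-cpp-python, imported lazily here."""
    from llama_cpp import Llama

    return Llama(**gguf_load_kwargs(model_path))


# Hypothetical local path, shown only to illustrate the resulting arguments.
print(gguf_load_kwargs("./models/Taiwan-LLaMa-13b-1.0.Q8_0.gguf"))
```

After loading, the startup log should report that all layers were offloaded if the setting took effect.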

0781532 commented 1 year ago

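3. Still just as slow! (The GPU isn't running.)

It may be because I set top_p = 0.95, or it may be related to the content of the documents.

When no matching answer can be found in the documents, the LLM spends a lot of time thinking.

And the answers it produces are very long.

1. I tested other non-GGUF models: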

MODEL_ID = "LinkSoul/Chinese-Llama-2-7b"

MODEL_ID = "FlagAlpha/Llama2-Chinese-13b-Chat"

They are faster (they use the GPU) and the output is not garbled.

PenutChen commented 1 year ago

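The sample time in llama_print_timing shows that top_p should, in theory, have little effect.

Also, I can see some errors in this screenshot. I did not get these errors when I tested it myself. Which versions of localGPT and llama-cpp-python are you using?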

PenutChen commented 1 year ago

I'd also suggest testing single-GPU inference: CUDA_VISIBLE_DEVICES=0 python run_localGPT.py
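The same restriction can be applied from inside Python by setting the environment variable before CUDA is initialized. A minimal sketch; the helper name is hypothetical:

```python
import os


def restrict_to_gpu(index: int = 0) -> str:
    """Expose only one GPU to CUDA libraries initialized after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call before CUDA is initialized (safest: before importing torch or llama_cpp).
print(restrict_to_gpu(0))
```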

0781532 commented 1 year ago

3. My localGPT is the latest version. I installed llama-cpp-python as follows: CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir

So have you tried running Taiwan-LLaMa-13b-1.0.Q4_0.gguf or Taiwan-LLaMa-13b-1.0.Q8_0.gguf? How is the inference speed?

Regarding the suggestion to test single-GPU inference with CUDA_VISIBLE_DEVICES=0 python run_localGPT.py: I am testing it now.

1. Also, could I ask you to try running Taiwan-LLaMa-v1.0 with localGPT?

Thanks!

PenutChen commented 1 year ago

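I am using llama-cpp-python version 0.2.6; you could give that a try.

This screenshot is from a gguf run, though I used the q6_k version. Generating 300 tokens takes about seven or eight seconds; my GPU is a 3090.

Also, when I run Taiwan-LLaMa-v1.0 with HF, I get garbled output just as you do.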

0781532 commented 1 year ago

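I used the q6_k version with llama-cpp-python 0.2.6, but the program still does not use the GPU.

Taiwan-LLaMa-v1.0 is quite accurate on Traditional Chinese and Taiwanese culture Q&A, so it is a pity that it can't be used.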

PenutChen commented 1 year ago

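My current guess is that the Pascal-architecture GPU is too old. Would it be possible to switch to a newer GPU (probably Turing or later at minimum?) to run it?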

penut85420 commented 1 year ago

I don't think you need to stay fixed on localGPT. If you just want query-based QA, building it yourself with LangChain and TGI is not very difficult, and there are plenty of tutorials online.

0781532 commented 1 year ago

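@PenutChen I later succeeded in running the Taiwan-LLaMa-13b-1.0.Q8_0.gguf model on the GPU. My guesses as to why it failed before:

1) The Pascal-architecture GPU was too old (switching to a newer GPU fixed it). Your guess was right.
2) The llama-cpp-python library itself is not yet mature, and installing it with GPU support depends on a whole chain of pieces in the environment. Many people online have hit the same problem, and many issues have been opened and are still not closed.

Thank you very much for your quick and friendly answers last week.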

0781532 commented 1 year ago

@PenutChen I'd like to ask about your experience: for the same question, how do the answer quality and accuracy of Taiwan-LLaMa-v1.0 and Taiwan-LLaMa-13b-1.0.q8.gguf compare?

PenutChen commented 1 year ago

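With only 8-bit quantization there is usually not much difference; it feels almost unchanged. Quantizing straight to 4-bit, though, feels like a considerable drop. This is a fairly subjective impression, so judge by your actual application.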
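One way to build intuition for this is to compare the round-trip error of a toy quantizer at 8 and 4 bits. The sketch below uses plain symmetric uniform quantization on synthetic Gaussian values. It is a simplification and not the block-wise scheme used by GGUF formats such as Q4_0 or Q8_0, and weight error does not translate directly into answer quality.

```python
import random


def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
# Synthetic values for illustration, not real model weights.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit error: {err8:.5f}, 4-bit error: {err4:.5f}")
```

With far fewer levels available, the 4-bit round trip loses noticeably more precision than the 8-bit one, which is consistent with the impression described above.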

0781532 commented 1 year ago

@PenutChen Thank you!

MiuLab / Taiwan-LLM

yentinglin/Taiwan-LLaMa-v1.0 outputs garbled text #31
