MiuLab / Taiwan-LLM

Traditional Mandarin LLMs for Taiwan
https://twllm.com
Apache License 2.0

docker: Error response from daemon: failed to create task for container #18

Closed geoxpert0001 closed 8 months ago

geoxpert0001 commented 10 months ago

https://github.com/MiuLab/Taiwan-LLaMa/assets/141697815/3d4868db-fff0-416a-bdd1-a879ad016f67

After running it, I get this error: docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown. ERRO[0000] error waiting for container:

I'm not sure where the problem is. My Docker setup should be fine; other images run normally.

PenutChen commented 10 months ago

Do you have the NVIDIA driver installed? You can check the driver version with the nvidia-smi command. The current TGI image should need at least driver version 520.61.05 (CUDA 11.8) to run.
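
If you want to check this from a script, here is a minimal sketch of my own (not from the TGI docs) that parses the nvidia-smi output and compares it against the 520.61.05 minimum mentioned above; it assumes nvidia-smi is on PATH:

import subprocess

# Ask nvidia-smi for the driver version only, one line per GPU.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

required = (520, 61, 5)  # minimum driver for the CUDA 11.8 TGI image
for line in out.strip().splitlines():
    version = tuple(int(part) for part in line.strip().split("."))
    status = "OK" if version >= required else "too old for this TGI image"
    print(line.strip(), status)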

nigue3025 commented 10 months ago

I also hit this on an environment (Ubuntu) that had multiple CUDA versions installed (including 12.1). After switching to an environment with only CUDA 12.1 installed, the problem went away.

WenTingTseng commented 10 months ago

I have this problem too, on an Ubuntu environment with CUDA 11.8 and 8 GPUs in total. How can I resolve it?

(screenshot attached)
penut85420 commented 10 months ago

@WenTingTseng Do you have Nvidia Docker installed?

sudo apt install nvidia-docker2

WenTingTseng commented 10 months ago

@penut85420 Thanks for the reply; after installing it, it runs now. Along the way I hit an Unable to locate package nvidia-docker2 error when installing with apt-get, which I solved with the method in https://github.com/bryanbocao/quick-cheatsheets/issues/3#issuecomment-1092131051. But now my memory is not quite enough and I get the following error: RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens

2023-09-01T00:48:51.665437Z ERROR warmup{max_input_length=100 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens

I'm not sure which parameter I should adjust; could everyone share what values you use? Thanks.

penut85420 commented 10 months ago

@WenTingTseng You have eight 49 GB GPUs and that's still not enough? OAO

On a TITAN RTX (24GB) I run it like this:

docker run --gpus 'device=0' -p 8085:80 \
    -v ./Models:/Models \
    ghcr.io/huggingface/text-generation-inference:sha-5485c14 \
    --model-id /Models/TaiwanLlama-13B \
    --quantize "bitsandbytes" \
    --max-input-length 1500 \
    --max-total-tokens 2000 \
    --max-batch-prefill-tokens 4500 \
    --max-batch-total-tokens 6000 \
    --max-best-of 1 \
    --max-concurrent-requests 128

The parameters are roughly determined by input length, output length, and batch size. Although the Taiwan Llama model supports a context length of up to 4k, my GPU fills up at around 2k, so I allow at most 1500 tokens of input and at most 500 tokens of output. Therefore --max-input-length is 1500 and --max-total-tokens is 1500 + 500 = 2000.

Finally, for batch size, I set it to 3, so --max-batch-prefill-tokens is 1500 × 3 = 4500 and --max-batch-total-tokens is (1500 + 500) × 3 = 6000.

--max-concurrent-requests is set to 128; because of the batch token limits above, TGI will only run inference on three requests at a time and queue the rest.

You can tune the parameters to fit your hardware and actual application needs. PS. I have quantization enabled, which is the only reason this fits in memory.
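
The arithmetic above can be restated as a small sketch; the variable names are just for illustration and are not TGI options themselves:

# Worked example of the parameter calculation described above.
max_input_length = 1500  # longest prompt, in tokens
max_new_tokens = 500     # longest generation, in tokens
batch_size = 3           # how many requests should fit in one batch

max_total_tokens = max_input_length + max_new_tokens      # 1500 + 500 = 2000
max_batch_prefill_tokens = max_input_length * batch_size  # 1500 * 3 = 4500
max_batch_total_tokens = max_total_tokens * batch_size    # 2000 * 3 = 6000

print(f"--max-input-length {max_input_length}")
print(f"--max-total-tokens {max_total_tokens}")
print(f"--max-batch-prefill-tokens {max_batch_prefill_tokens}")
print(f"--max-batch-total-tokens {max_batch_total_tokens}")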

BTW, the URL of this link seems to be set incorrectly:

solved it with the method in https://github.com/bryanbocao/quick-cheatsheets/issues/3#issuecomment-1092131051.

If you just leave the original link as plain text, GitHub will auto-link it to that issue, e.g. https://github.com/bryanbocao/quick-cheatsheets/issues/3#issuecomment-1092131051

WenTingTseng commented 9 months ago
import torch

from torch.nn.parallel import DistributedDataParallel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
model_path = "yentinglin/Taiwan-LLaMa-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_path).cuda()
model = DistributedDataParallel(model)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = ""

generated_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=10, streamer=streamer)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

I'd like to ask about this: I'm using the code above to load the Taiwan-LLaMa-v1.0 released on Hugging Face. In theory, setting os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7" should make all 8 GPUs get used, but watching nvidia-smi I see only the default GPU 0 being used, and it eventually fails with GPU out of memory. Are there other environment variables I need to set, or is there a problem with how the code is written? Thanks for the help.

PenutChen commented 9 months ago

@WenTingTseng Sorry, I don't have experience with multiple GPUs, but I can offer a few suggestions:

  1. Put os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7" before all of the imports:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
import torch
from torch.nn.parallel import DistributedDataParallel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
  2. Load the model with device_map="auto", for example:
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

Loading the model first and then calling .cuda() reads the model into CPU memory before moving it into GPU memory, which is generally slower. With device_map="auto", HF loads the model directly into GPU memory and automatically splits it across all of the GPUs.
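
If you want to confirm where the shards ended up, one option (a sketch of mine, assuming a transformers/accelerate version that records the placement) is to print the model's hf_device_map after loading:

from transformers import AutoModelForCausalLM

model_path = "yentinglin/Taiwan-LLaMa-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# hf_device_map records which device (GPU index, "cpu", or "disk") each block was placed on.
for module_name, device in model.hf_device_map.items():
    print(module_name, "->", device)

If every module lands on device 0, the other cards were either not visible to the process or not needed.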

As far as I know, if you're only doing inference, it seems HF cannot split the model across cards. But since I don't have a multi-GPU machine, I can't verify that claim, so please look into it yourself.

But if you're doing inference on a single GPU, you can consider initializing with 8-bit quantization:

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", load_in_8bit=True)
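
For reference, on newer transformers releases the same 8-bit load can also be expressed with a BitsAndBytesConfig; a rough sketch, assuming bitsandbytes is installed:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_path = "yentinglin/Taiwan-LLaMa-v1.0"

# Equivalent of load_in_8bit=True, expressed through a quantization config object.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", quantization_config=quant_config
)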

Edit: I borrowed a multi-GPU machine and tested this; multi-GPU inference does indeed work, but there wasn't enough memory to run the 13B model, so here is an example of me running a 7B model:

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from transformers import LlamaForCausalLM as ModelCls
from transformers import LlamaTokenizer as TkCls
from transformers import TextStreamer

model: ModelCls = ModelCls.from_pretrained("Models/Vicuna-7B", device_map="auto")
tk: TkCls = TkCls.from_pretrained("Models/Vicuna-7B")
ts = TextStreamer(tk)

prompt = "Hello, "
input_ids = tk(prompt, return_tensors="pt")["input_ids"].to("cuda")
model.generate(input_ids, max_new_tokens=16, streamer=ts)

Please take a look and see if it helps.

WenTingTseng commented 9 months ago

@penut85420 @PenutChen Thanks everyone for the assistance; I can now use the Taiwan-Llama model successfully. Thank you.