WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO: Model = '/home/tesh/models/kunoichi-7b.Q4_K_M.gguf'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 8192
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Converting GGUF tensors to PyTorch... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 291/291 0:00:00
INFO: Model weights loaded. Memory usage: 4.12 GiB x 1 = 4.12 GiB
client_loop: send disconnect: Connection reset
Anything you want to discuss about Aphrodite.
Good question as to why it crashes; whatever goes wrong cuts the whole SSH session and ends the program along with it.
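
Since the log dies right after "Model weights loaded", a first step is to reproduce the load outside the API server. Below is a minimal sketch, assuming Aphrodite exposes the vLLM-style offline entry point (from aphrodite import LLM, SamplingParams, as its README shows) and that the keyword names map onto the config fields printed above; quantization, dtype, and max_model_len are assumptions, not verified against v0.5.1.

from aphrodite import LLM, SamplingParams

# Same GGUF file and settings as the config dump above; the keyword
# names are assumed to mirror the engine's EngineArgs/CLI fields.
llm = LLM(
    model="/home/tesh/models/kunoichi-7b.Q4_K_M.gguf",
    quantization="gguf",     # Quantization Format = gguf
    dtype="float16",         # DataType = torch.float16
    max_model_len=8192,      # Context Length = 8192
)

# If this generate() returns, weight loading and a forward pass both
# work, and the crash is somewhere in the serving path instead.
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)

If the offline run also dies, the server is off the hook and the problem is in the engine or the host itself.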
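
Separately, "client_loop: send disconnect: Connection reset" comes from the local ssh client, not from Aphrodite, so whatever kills the program is resetting the connection at the host level. One common culprit when a process death also drops SSH is the kernel OOM killer (check dmesg after a crash, and run the server under tmux or nohup so it outlives the session). A quick pre-flight sketch, assuming psutil is installed alongside torch, to see how much headroom the box actually has:

import psutil
import torch

# Host RAM headroom: the GGUF-to-PyTorch conversion step can spike
# transiently above the final 4.12 GiB the log reports for the weights.
vm = psutil.virtual_memory()
print(f"host RAM: {vm.available / 2**30:.1f} GiB free of {vm.total / 2**30:.1f} GiB")

# VRAM headroom on the single CUDA device the config selects.
free, total = torch.cuda.mem_get_info()
print(f"GPU VRAM: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")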