WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO: Model = '/home/tesh/models/kunoichi-7b.Q4_K_M.gguf'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 8192
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Converting GGUF tensors to PyTorch... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 291/291 0:00:00
INFO: Model weights loaded. Memory usage: 4.12 GiB x 1 = 4.12 GiB
client_loop: send disconnect: Connection reset
Anything you want to discuss about Aphrodite.
Good question as to why it crashes; whatever goes wrong cuts the whole SSH session and ends the program along with it.
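
Since the log dies right after "Model weights loaded", a first step is to reproduce the load outside the API server. Below is a minimal sketch, assuming Aphrodite exposes the vLLM-style offline entry point (from aphrodite import LLM, SamplingParams, as its README shows) and that the keyword names map onto the config fields printed above; quantization, dtype, and max_model_len are assumptions, not verified against v0.5.1.

from aphrodite import LLM, SamplingParams

# Same GGUF file and settings as the config dump above; the keyword
# names are assumed to mirror the engine's EngineArgs/CLI fields.
llm = LLM(
    model="/home/tesh/models/kunoichi-7b.Q4_K_M.gguf",
    quantization="gguf",     # Quantization Format = gguf
    dtype="float16",         # DataType = torch.float16
    max_model_len=8192,      # Context Length = 8192
)

# If this generate() returns, weight loading and a forward pass both
# work, and the crash is somewhere in the serving path instead.
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)

If the offline run also dies, the server is off the hook and the problem is in the engine or the host itself.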
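
Separately, "client_loop: send disconnect: Connection reset" comes from the local ssh client, not from Aphrodite, so whatever kills the program is resetting the connection at the host level. One common culprit when a process death also drops SSH is the kernel OOM killer (check dmesg after a crash, and run the server under tmux or nohup so it outlives the session). A quick pre-flight sketch, assuming psutil is installed alongside torch, to see how much headroom the box actually has:

import psutil
import torch

# Host RAM headroom: the GGUF-to-PyTorch conversion step can spike
# transiently above the final 4.12 GiB the log reports for the weights.
vm = psutil.virtual_memory()
print(f"host RAM: {vm.available / 2**30:.1f} GiB free of {vm.total / 2**30:.1f} GiB")

# VRAM headroom on the single CUDA device the config selects.
free, total = torch.cuda.mem_get_info()
print(f"GPU VRAM: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")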