c0sogi / llama-api

An OpenAI-like LLaMA inference API
MIT License

Generation stops at 251 tokens - works fine on oobabooga #14

Closed Dougie777 closed 12 months ago

Dougie777 commented 1 year ago

I hate to be a pain. You have been so helpful already, but I am stuck.

My generations are ending prematurely with "finish_reason": "length", as seen below:

{ "id": "chatcmpl-4f6ac32a-287f-41ba-a4ec-8768e70ad2c3", "object": "chat.completion", "created": 1694531345, "model": "llama-2-70b-chat.Q5_K_M", "choices": [ { "message": { "role": "assistant", "content": " Despite AI argue that AI advancements in technology, humans will always be required i, some professions.\nSTERRT Artificial intelligence (AI) has made significant advancementsin the recent years, it's impact on various industries, including restaurants and bars. While AI cannot replace bartenders, therelatively few tasks, AI argue that humans will always be ne needed these establishments.\nSTILL be required in ssociated with sERvices sector. Here are r several reasons whythat AI explainBelow:\nFirstly, AI cannot" }, "index": 0, "finish_reason": "length" } ], "usage": { "prompt_tokens": 123, "completion_tokens": 128, "total_tokens": 251 } }

My definition is:

```python
llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=16384,
    use_mlock=False,
)
```

When I load I get:

```
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 82684.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
```

From the server start screen I get:

```
llama2_70b_q5_gguf
model_path: llama-2-70b-chat.Q5_K_M.gguf / max_total_tokens: 16384 / auto_truncate: True / n_parts: -1 / n_gpu_layers: 30 / seed: -1 / f16_kv: True / logits_all: False / vocab_only: False / use_mlock: False / n_batch: 512 / last_n_tokens_size: 64 / use_mmap: True / cache: False / verbose: True / echo: True / cache_type: ram / cache_size: 2147483648 / low_vram: False / embedding: False / rope_freq_base: 82684.0 / rope_freq_scale: 0.25
```

I have tried:
1) Starting the server specifying the max tokens: `python3 main.py --max-tokens-limit 4096`
2) Setting my ulimit to unlimited
3) Setting `max_total_tokens: 16384`
4) Setting the rope settings to be the same as oobabooga: `rope_freq_base=10000, rope_freq_scale=1`, BUT THESE SETTINGS WERE IGNORED.

The same model works perfectly on oobabooga.

I am not sure what else to try.

Thanks so so much, Doug

c0sogi commented 1 year ago

You don't have to set max-tokens-limit; it doesn't determine the max output tokens. Instead, pass 'max_tokens' when requesting a chat completion, just as with the OpenAI API. It defaults to 128, which is why your generations stop after 128 completion tokens.
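For example, a request that sets max_tokens explicitly might look like the sketch below. The host, port, and endpoint path are assumptions based on the server being OpenAI-compatible; adjust them to your setup.

```python
# Minimal sketch: ask for more than the default 128 tokens per completion.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed OpenAI-compatible endpoint
    json={
        "model": "llama-2-70b-chat.Q5_K_M",  # model name as reported in the response above
        "messages": [
            {"role": "user", "content": "Will AI replace bartenders?"}
        ],
        "max_tokens": 1024,  # without this, generation stops at the 128-token default
    },
)
print(response.json()["choices"][0]["message"]["content"])
```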

c0sogi commented 1 year ago

I've made some changes: if max_tokens is unset (None), it now defaults to the maximum number of available tokens. 749a93d8643b98354e66c6916af33bf698ad8c9b
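In effect, the defaulting behavior amounts to something like the following rough sketch (hypothetical names, not the actual code from that commit): when max_tokens is None, the remaining context is used instead of the fixed 128-token default.

```python
def resolve_max_tokens(max_tokens, max_total_tokens, prompt_tokens):
    """Hypothetical helper: fall back to the remaining context window
    when the client does not pass max_tokens."""
    if max_tokens is None:
        return max_total_tokens - prompt_tokens  # use whatever room is left
    return max_tokens

# e.g. with the 16384-token context above and a 123-token prompt:
# resolve_max_tokens(None, 16384, 123) -> 16261
```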

Dougie777 commented 12 months ago

Oh wow thanks!!