abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[Updated issue] Prompt + Generation is limited to n_ctx. #331

Open Priestru opened 1 year ago

Priestru commented 1 year ago

[Update] The issue below is fixed, with a new bug emerging from the fix. See https://github.com/abetlen/llama-cpp-python/issues/331#issuecomment-1585536186

Dampfinchen commented 1 year ago

Yeah, I am having the same issue.

Priestru commented 1 year ago

I tricked it into working by increasing n_ctx to 2400

Output generated in 59.23 seconds (0.20 tokens/s, 12 tokens, context 2049, seed 461475505)

Also, the log reports context 2049, which is itself above the 2048 limit.
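
For anyone hitting this through llama-cpp-python directly rather than through ooba, the workaround amounts to over-allocating the context window at load time. A minimal sketch (the model path is a placeholder):

from llama_cpp import Llama

# Workaround sketch: allocate a context larger than 2048 so that
# prompt tokens + requested new tokens stay inside n_ctx.
llm = Llama(
    model_path="./models/Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin",  # placeholder path
    n_ctx=2400,  # instead of the default 2048
)

out = llm("...a prompt close to 2,000 tokens...", max_tokens=200)
print(out["choices"][0]["text"])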

gjmulder commented 1 year ago

Happily running with an 8196 context size and can fit a 13B model onto my 11GB GTX 1080Ti:

llama.cpp: loading model from /usr/src/llama-cpp-telegram_bot/models/model.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 8196
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 2282.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
....................................................................................................
llama_init_from_file: kv self size  = 6403.12 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =   576.09 ms
llama_print_timings:      sample time =   573.46 ms /   200 runs   (    2.87 ms per token)
llama_print_timings: prompt eval time =   576.01 ms /    88 tokens (    6.55 ms per token)
llama_print_timings:        eval time = 13484.66 ms /   199 runs   (   67.76 ms per token)
llama_print_timings:       total time = 22735.88 ms
jmtatsch commented 1 year ago

Do context sizes beyond 2048 make any sense for llama-based models that have only been trained up to a context size of 2048?

gjmulder commented 1 year ago

I couldn't get the llama.cpp perplexity benchmark working for context sizes larger than 2048. 8196 was the default that I inherited via llama-cpp-telegram_bot, and there doesn't seem to be much of a performance hit :man_shrugging:
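
A typical llama.cpp perplexity invocation for a larger context looks roughly like this (paths, dataset, and exact flags are illustrative and may differ between llama.cpp versions):

./perplexity -m /data/llama/7B/ggml-model-f16.bin -f wiki.test.raw -c 4096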

Priestru commented 1 year ago

Don't get me wrong. I'm not trying to go beyond 2048; I'm trying to force the model to run within the 2048 context size. Its current self-imposed limit is around 1650 due to some bug. The workaround only tricks it into working "as intended".

Dampfinchen commented 1 year ago

Happily running with an 8196 context size and can fit a 13B model onto my 11GB GTX 1080Ti:

llama.cpp: loading model from /usr/src/llama-cpp-telegram_bot/models/model.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 8196
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 2282.48 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
....................................................................................................
llama_init_from_file: kv self size  = 6403.12 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =   576.09 ms
llama_print_timings:      sample time =   573.46 ms /   200 runs   (    2.87 ms per token)
llama_print_timings: prompt eval time =   576.01 ms /    88 tokens (    6.55 ms per token)
llama_print_timings:        eval time = 13484.66 ms /   199 runs   (   67.76 ms per token)
llama_print_timings:       total time = 22735.88 ms

Since your prompt processing was just 88 tokens, I'm not sure I'm getting your point here. This has nothing to do with the discussion.

Try sending a large first prompt (around 1800 tokens but below 2048 with n_ctx=2048). Then it will generate 0 tokens. Judging by your data you were just sending small prompts to the model which was not our point at all. Our point is that when you send a large prompt even below 2048 ctx, the AI will not generate anything. And when chatting with the model, the max ctx is around 1600 instead of 2048.
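
A minimal sketch of that repro directly against llama-cpp-python, stripped of ooba (the model path and prompt file are placeholders; on 0.1.57-0.1.59 this either came back with zero tokens or raised the "Requested tokens exceed context window" error shown later in this thread):

from llama_cpp import Llama

llm = Llama(
    model_path="./models/13B.ggmlv3.q5_1.bin",  # placeholder path
    n_ctx=2048,
)

# long_prompt.txt is a placeholder for ~1,800-1,900 tokens of real text,
# i.e. below n_ctx on its own but over it once max_tokens is added.
long_prompt = open("long_prompt.txt").read()

out = llm(long_prompt, max_tokens=200)
print(out["choices"][0]["text"])  # comes back empty / zero tokens generated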

gjmulder commented 1 year ago

Try sending a large first prompt (around 1800 tokens but below 2048 with n_ctx=2048). Then it will generate 0 tokens. Judging by your data you were just sending small prompts to the model which was not our point at all. Our point is that when you send a large prompt even below 2048 ctx, the AI will not generate anything. And when chatting with the model, the max ctx is around 1600 instead of 2048.

Can you send a reproducible example? This has not been my experience using long prompts and a context size of 8192. It is possible of course that you're hitting an edge case.

Priestru commented 1 year ago

I use ooba here (actually it's SillyTavern, with ooba acting as the API), but ooba itself isn't responsible for anything.

llama_print_timings:        load time = 15186.81 ms
llama_print_timings:      sample time =    15.96 ms /   104 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time = 34742.91 ms /  1531 tokens (   22.69 ms per token)
llama_print_timings:        eval time = 70428.48 ms /   103 runs   (  683.77 ms per token)
llama_print_timings:       total time = 106595.29 ms
Output generated in 106.89 seconds (0.96 tokens/s, 103 tokens, context 1978, seed 23583742)
Llama.generate: prefix-match hit

llama_print_timings:        load time = 15186.81 ms
llama_print_timings:      sample time =     7.74 ms /    50 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 33804.06 ms /    50 runs   (  676.08 ms per token)
llama_print_timings:       total time = 34525.68 ms
Output generated in 34.82 seconds (1.41 tokens/s, 49 tokens, context 1978, seed 1224656424)

Llama.generate: prefix-match hit

llama_print_timings:        load time = 15186.81 ms
llama_print_timings:      sample time =    13.67 ms /    85 runs   (    0.16 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 59195.54 ms /    85 runs   (  696.42 ms per token)
llama_print_timings:       total time = 60237.05 ms
Output generated in 60.55 seconds (1.39 tokens/s, 84 tokens, context 1978, seed 507786728)

These are the results with the workaround, where I load the model with n_ctx = 2400. As you can see, it generates smoothly at a context close to 2k.

Now I'll change nothing, literally the same prompt, but I'll reload the model with n_ctx = 2048.

INFO:Loading Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin...
INFO:llama.cpp weights detected: D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 25573.14 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
.
llama_init_from_file: kv self size  = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 2.25 seconds.

Here we go:

Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 1483144509)
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 457531329)
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 1258266880)
Output generated in 0.29 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 173099458)
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1978, seed 1011850065)

It fails as expected. Now I will load it back with the larger n_ctx.

INFO:Loading Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin...
INFO:llama.cpp weights detected: D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2400
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 25573.14 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 0 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 0 MB
.
llama_init_from_file: kv self size  = 3656.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 2.33 seconds.

and now it works again as expected:

llama_print_timings:        load time = 17918.28 ms
llama_print_timings:      sample time =    36.63 ms /   236 runs   (    0.16 ms per token)
llama_print_timings: prompt eval time = 53098.17 ms /  1978 tokens (   26.84 ms per token)
llama_print_timings:        eval time = 160671.30 ms /   235 runs   (  683.71 ms per token)
llama_print_timings:       total time = 215008.68 ms
Output generated in 215.29 seconds (1.09 tokens/s, 235 tokens, context 1978, seed 598884865)

Windows 11. CuBLAS, latest version of everything: https://github.com/abetlen/llama-cpp-python https://github.com/oobabooga/text-generation-webui https://github.com/SillyTavern/SillyTavern

gjmulder commented 1 year ago

Admittedly the perplexity isn't at all good, but as per @jmtatsch that's likely due to llama's designed context length of 2048:

$ pip list | grep llama
llama-cpp-python         0.1.57

$ python ./high_level_api_inference.py 
llama.cpp: loading model from /data/llama/7B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 8192
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 2292.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 12353 MB
...................................................................................................
llama_init_from_file: kv self size  = 4096.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =  1845.23 ms
llama_print_timings:      sample time =  1255.23 ms /  2048 runs   (    0.61 ms per token)
llama_print_timings: prompt eval time = 10921.96 ms /  2254 tokens (    4.85 ms per token)
llama_print_timings:        eval time = 301437.04 ms /  2047 runs   (  147.26 ms per token)
llama_print_timings:       total time = 366120.17 ms
{
  "id": "cmpl-70bf6f1a-1aa8-401d-ab2f-e9302aa61cd8",
  "object": "text_completion",
  "created": 1686154862,
  "model": "/data/llama/7B/ggml-model-f16.bin",
  "choices": [
    {
` "text": "In the beginning God created the heaven and the earth.\nAnd the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.\nAnd God said, Let there be light: and there was light.\nAnd God saw the light, that it was good: and God divided the light from the darkness.\nAnd God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.\nAnd God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.\nAnd God made the firmament, and divided the waters which were under the firmament from the waters which were above the firmament: and it was so.\nAnd God called the firmament Heaven. And the evening and the morning were the second day.\nAnd God said, Let the waters under the heaven be gathered together unto one place, and let the dry land appear: and it was so.\nAnd God called the dry land Earth; and the gathering together of the waters called he Seas: and God saw that it was good.\nAnd God said, Let the earth bring forth grass, the herb yielding seed, and the fruit tree yielding fruit after his kind, whose seed is in itself, upon the earth: and it was so.\nAnd the earth brought forth grass, and herb yielding seed after his kind, and the tree yielding fruit, whose seed was in itself, after his kind: and God saw that it was good.\nAnd the evening and the morning were the third day.\nAnd God said, Let there be lights in the firmament of the heaven to divide the day from the night; and let them be for signs, and for seasons, and for days, and years:\nAnd let them be for lights in the firmament of the heaven to give light upon the earth: and it was so.\nAnd God made two great lights; the greater light to rule the day, and the lesser light to rule the night: he made the stars also.\nAnd God set them in the firmament of the heaven to give light upon the earth,\nAnd to rule over the day and over the night, and to divide the light from the darkness: and God saw that it was good.\nAnd the evening and the morning were the fourth day.\nAnd God said, Let the waters bring forth abundantly the moving creature that hath life, and fowl that may fly above the earth in the open firmament of heaven.\nAnd God created great whales, and every living creature that moveth, which the waters brought forth abundantly, after their kind, and every winged fowl after his kind: and God saw that it was good.\nAnd God blessed them, saying, Be fruitful, and multiply, and fill the waters in the seas, and let fowl multiply in the earth.\nAnd the evening and the morning were the fifth day.\nAnd God said, Let the earth bring forth the living creature after his kind, cattle, and creeping thing, and beast of the earth after his kind: and it was so.\nAnd God made the beast of the earth after his kind, and cattle after their kind, and every thing that creepeth upon the earth after his kind: and God saw that it was good.\nAnd God said, Let us make man in our image, after our likeness: and let them have dominion over the fish of the sea, and over the fowl of the air, and over the cattle, and over all the earth, and over every creeping thing that creepeth upon the earth.\nSo God created man in his own image, in the image of God created he him; male and female created he them.\nAnd God blessed them, and God said unto them, Be fruitful, and multiply, and replenish the earth, and subdue it: and have dominion over the fish of the sea, and over the fowl of the air, 
and over every living thing that moveth upon the earth.\nAnd God said, Behold, I have given you every herb bearing seed, which is upon the face of all the earth, and every tree, in the which is the fruit of a tree yielding seed; to you it shall be for meat.\nAnd to every beast of the earth, and to every fowl of the air, and to every thing that creepeth upon the earth, wherein there is life, I have given every green herb for meat: and it was so.\nAnd God saw every thing that he had made, and, behold, it was very good. And the evening and the morning were the sixth day.\nGen.2\nThus the heavens and the earth were finished, and all the host of them.\nAnd on the seventh day God ended his work which he had made; and he rested on the seventh day from all his work which he had made.\nAnd God blessed the seventh day, and sanctified it: because that in it he had rested from all his work which God created and made.\nThese are the generations of the heavens and of the earth when they were created, in the day that the LORD God made the earth and the heavens,\nAnd every plant of the field before it was in the earth, and every herb of the field before it grew: for the LORD God had not caused it to rain upon the earth, and there was not a man to till the ground.\nBut there went up a mist from the earth, and watered the whole face of the ground.\nAnd the LORD God formed man of the dust of the ground, and breathed into his nostrils the breath of life; and man became a living soul.\nAnd the LORD God planted a garden eastward in Eden; and there he put the man whom he had formed.\nAnd out of the ground made the LORD God to grow every tree that is pleasant to the sight, and good for food; the tree of life also in the midst of the garden, and the tree of knowledge of good and evil.\nAnd a river went out of Eden to water the garden; and from thence it was parted, and became into four heads.\nThe name of the first is Pison: that is it which compasseth the whole land of Havilah, where there is gold;\nAnd the gold of that land is good: there is bdellium and the onyx stone.\nAnd the name of the second river is Gihon: the same is it that compasseth the whole land of Ethiopia.\nAnd the name of the third river is Hiddekel: that is it which goeth toward the east of Assyria. 
And the fourth river is Euphrates.\nAnd the LORD God took the man, and put him into the garden of Eden to dress it and to keep it.\nAnd the LORD God commanded the man, saying, Of every tree of the garden thou mayest freely eat:\nBut of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die.\nAnd the LORD God said, It is not good that the man should be alone; I will make him an help meet for him.\nAnd out of the ground the LORD God formed every beast of the field, and every fowl of the air; and brought them unto Adam to see what he would call them: and whatsoever Adam called every living creature, that was the name thereof.\nAnd Adam gave names to all cattle, and to the fowl of the air, and to every beast of the field; but for Adam there was not found an help meet for him.\nAnd the LORD God caused a deep sleep to fall upon Adam and he slept: and he took one of his ribs, and closed up the flesh instead thereof;\nAnd the rib, which the LORD God had taken from man, made he a woman, and brought her unto the man.\nAnd Adam said, This is now bone of my bones, and flesh of my flesh: she shall be called Woman, because she was taken out of Man.\nTherefore shall a man leave his father and his mother, and shall cleave unto his wife: and they shall be one flesh.\nAnd they were both naked, the man and his wife, and were not ashamed.\nGen.3\nNow the serpent was more subtil than any beast of the field which the LORD God had made. And he said unto the woman, Yea, hath God said, Ye shall not eat of every tree of the garden?\nAnd the woman said unto the serpent, We may eat of the fruit of the trees of the garden:\nBut of the fruit of the tree which is in the midst of the garden, God hath said, Ye shall not eat of it, neither shall ye touch it, lest ye die.\nAnd the serpent said unto the woman, Ye shall not surely die:\nFor God doth know that in the day ye eat thereof, then your eyes shall be opened, and ye shall be as gods, knowing good and evil.\nAnd when the woman saw that the tree was good for food, and that it was pleasant to the eyes, and a tree to be desired to make one wise, she took of the fruit thereof, and did eat, and gave also unto her husband with her; and he did eat.\nAnd the eyes of them both were opened, and they knew that they were naked; and they sewed fig leaves together, and made themselves aprons.\nAnd they heard the voice of the LORD God walking in the garden in the cool of the day: and Adam and his wife hid themselves from the presence of the LORD God amongst the trees of the garden.\nAnd the LORD God called unto Adam, and said unto him, Where art thou?\nAnd he said, I heard thy voice in the garden, and I was afraid, because I was naked; and I hid myself.\nAnd he said, Who told thee that thou wast naked? Hast thou eaten of the tree, whereof I commanded thee that thou shouldest not eat?\nAnd the man said, The woman whom thou gavest to be with me, she gave me of the tree, and I did eat.\nAnd the man called unto him, and he ate: And the woman were two of the tree, and I shall be one.\nIn Adam, He was naked: And he ate:\nAnd the woman and man gave, and he ate,\nAnd it was dry land and the deep, and the man became dark.\nThe deep and the light and He called, the depth, and the darkness and the light, and the second day, the earth, the light, [the light, the darkness, and the light, the deep. And the light, and the dark, and The earth was the light: and the light. In the light, the and the light. 
the darkness and light, the night the the earth , the of the earth: the darkness and darkness made the the and the darkness, the and, the evening and the day and the earth\nAnd the Earth and darkness and the first and the light the day, and the darkness Day, and the night day and earth the day, and the day the and day the day, day and darkness Day, the, And the Day, the day, the day and and the day and the earth. The earth, and day there the 4th day there, dark day, the earth:\nHe the earth. Day the light in the day the day the day the day of the day and, the and, the. and earth, the earth, and the and it day, the day , the day and the day the day the and the earth, the earth ,, the earth -the, \u00c2 the, the darkness the, the there in the there is the was the was the, the was, Wearness,the was made there is, \u00c2 and the of, the waters, ..., the ground, the, the and the, and the earth the,\nIn [was the...in the unto, \u2026. The (In , (The the (cab -ch in the part the is the day, theness (the (of the, [. .. -the in, the Wether \u2013 of the [( (,[ of the, [...] of the*\t and (\n, [ ( (.\u00c2 [,\nThe,..., \u2022 of the, [W, ((Wails\nFewness ((Wather, ([,[((,, ,,, ([ is -(, ([, [[, [,. (., (. [. [\nof of the of the of the of. of the of the,[,, of the\u00c2 in it, [-the \u2014 [ (, \u2013, [\u2014(,, (,,.;, ([ (was, is (\u00ad[ (1[ of the of the of the of the ([\u0097 ( [ight of that of Wound\n[ (\ufffd [, ([,*, was,, in,, [,, ,.,, ([,, of the on,, the, as [\u200b, in,,, (> -\u00e2.\nM of the ,* ,-the of the of the of,, , , of the\u00c2 of the of the of of the of the [, [,, [,, ,, \u2014, [[,\u00c2, (W, was, is, of the, in of,,, of ( of [ of the of the [ of [ of the, of, of\nIn of the, of,, - of the of the of the of, of the,, (, of the, of ( of the,, of W of the of the, of , of of IN of, of the of of, [\u00ad, of, of [[ of the, ,,,, (, in of,, came,, ,,,, the,, (,,,, the, of the, of the, of the of the of, of, of the, of [, , of. of, of the of, of of of of of, of of of of (., of of, of of w of, of the of of of of of of of of of of of of of, . of of of of of water of of, --,,,. of, (,,,,,,,,, of [ ,,, (,, [,, [ ( [ ,,,,, of [ (, of [,,, [,, of of, of,,, ,, of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of, of of of of (, (,,, of [,,,,, [,,, (, [ of, ,, the, of of of of of of of of of of of of of of of of of of of, of of of of of of of of of the, ,,,,,,,,,, in of, ab,,,, prov,, [, of,,,, ( of of ( of [ [, of [ , of (\u00c2, [, [ [, [[ ,,, ab, of , of, of, of ,, d ,, of (, , of of the of a, of, of of of of of of of, of of of of of of of of of, of of of of of of of of of of of of of of [ of of of of of of of of of of of of the of of h of, (([ [ [(\u00ad, af [[, of of [, of [ [ ( [ [ [ [ [[ [ ( ( [ (, [ , the w [ [ ( \u2014 ( the ( [ of ( of of the of [[,. [{ [\u00ad [ ,, W. [, of , of of of of [\u00c2. [ of a ab, d over re ad, [,, ( ( ,,, [, [,, to- ., [ and,\t, c [ , [ [ \u2014, ( , ( , of, of of [ of the m. (\n or, of [ w de en [ , of of of of w of of [ of [ of of of h of of of [ of the of of of of of of of, . of of of of of of [ of of of [, [ [ and of of [ ( [, [, [. , [( . [. the [, [., [. [[ the the the the the d t,, .\u00c2 of the the [ [, of [ of of [\t of of of the of of of\u00ad of of of of of\n,. ( in the of, the of the ,. ay, \u2014 [ (, [ [, (, (,, [\u2009- . [, ,, , , ,, the of [ of , [\u00c2 ,,. , (,,,,,, [ [, [,.\u00ad, . [, the,, ,,,. the, the, . [ [ [ , of ( [\u00c2 of [ [, [, . 
of of of of of of of of of of of of of of of of of of of of of of of of of of of of, [,\n. a.\u2009. the [\u00ad [\u00ad( . ,, ,, d( the the the the the the [[ [[- [of the of of of of of of\u200b of [ of of . of of of of of of,, ( (\u00ad, the of, the , of the of , . of of af of of the the the the w,\u00c2 the. ( . [\u00c2\u00ad, (\u00c2 , - d , of w., al . (m ., ,, ,(\u00ad\u00c2, . . , . (, ,\u00ad . r , . ,. the , the the ., , , , [ ,, the [ [ [\u00ad in [ [, [ , . , \u2014. , [\u200b: , [ . ,{ ,, ,\u00ad ,,, , [, ,, of of of of of of [ ( . [( ( , , [, [. [, , [ [\u00c2 h to w, [, [\u00e2[[ , in (, [ [[ [\u00ad [? [ [\u009d [ ( [\u2009 , [ ( [ .. [ r, [ [., ,\u00c2 ( , [ [ , b . [ , , d [ , [ n, of , , . of of ( of of of , (\u00ad of of . , [\u00c2, , , [, , , of , , , , , , , ,\u200b\u00ad , ,[ , h [ [[ ab, , ( the, n, \u2014 ,, . ,: , ,, ., , and, [, , , [, -\u009d , , . , , in [ . [ [[ of [ of the the the the , . , [\ufffd the of the of [\u00ad of -- [( [\u00c2 of [\u00ad of . , . , , . . a to ,[ ,{ of \u201a of [\u0097\u00c2 of , ., [[ , ,, e [ , of\u00ad [ , [\u200d [ [\u00ad [\u00e2 [ [ (\u00c2 of n-of of of of . of of of of of of [ of of of of of, of [ of , . , [ of ,: of{\ufffd,\u00c2 . . . , . the\u00c2 [\u00ad , . ,\u00c2 \u00c2 [ , ,\n. [c, , , , , , [ . ,\u00c2 d ,",`
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 2254,
    "completion_tokens": 2048,
    "total_tokens": 4302
  }
}

Attachments: high_level_api_inference.py.txt, genesis70.txt

jmtatsch commented 1 year ago

And god saw it wasn't good so he gave us https://huggingface.co/epfml/landmark-attention-llama7b-wdiff

gjmulder commented 1 year ago

And god saw it wasn't good so he gave us https://huggingface.co/epfml/landmark-attention-llama7b-wdiff

Now let's pray to the llama gods of Apache 2.0 that they also attend to our pleas for large contexts.

Dampfinchen commented 1 year ago

Try sending a large first prompt (around 1800 tokens but below 2048 with n_ctx=2048). Then it will generate 0 tokens. Judging by your data you were just sending small prompts to the model which was not our point at all. Our point is that when you send a large prompt even below 2048 ctx, the AI will not generate anything. And when chatting with the model, the max ctx is around 1600 instead of 2048.

Can you send a reproducible example? This has not been my experience using long prompts and a context size of 8192. It is possible of course that you're hitting an edge case.

Sure. I'm using Ooba, but according to Priestru it's a general issue with llama-cpp-python.

I've loaded a ggml q5_1 13B model in Ooba with a max context of 2048 and max new tokens of 200 (this is the default and it's important: when trying to reproduce the issue, please do not use an 8K context).

Then I send this prompt to the model which is just a bit over 1900 tokens:

Please complete the following text:

"In most reciprocating piston engines, the steam reverses its direction of flow at each stroke (counterflow), entering and exhausting from the same end of the cylinder. The complete engine cycle occupies one rotation of the crank and two piston strokes; the cycle also comprises four events – admission, expansion, exhaust, compression. These events are controlled by valves often working inside a steam chest adjacent to the cylinder; the valves distribute the steam by opening and closing steam ports communicating with the cylinder end(s) and are driven by valve gear, of which there are many types.[citation needed] The simplest valve gears give events of fixed length during the engine cycle and often make the engine rotate in only one direction. Many however have a reversing mechanism which additionally can provide means for saving steam as speed and momentum are gained by gradually kshortening the cutoffk or rather, shortening the admission event; this in turn proportionately lengthens the expansion period. However, as one and the same valve usually controls both steam flows, a short cutoff at admission adversely affects the exhaust and compression periods which should ideally always be kept fairly constant; if the exhaust event is too brief, the totality of the exhaust steam cannot evacuate the cylinder, choking it and giving excessive compression (kkick backk).[60] In the 1840s and 1850s, there were attempts to overcome this problem by means of various patent valve gears with a separate, variable cutoff expansion valve riding on the back of the main slide valve; the latter usually had fixed or limited cutoff. The combined setup gave a fair approximation of the ideal events, at the expense of increased friction and wear, and the mechanism tended to be complicated. The usual compromise solution has been to provide lap by lengthening rubbing surfaces of the valve in such a way as to overlap the port on the admission side, with the effect that the exhaust side remains open for a longer period after cut-off on the admission side has occurred. This expedient has since been generally considered satisfactory for most purposes and makes possible the use of the simpler Stephenson, Joy and Walschaerts motions. Corliss, and later, poppet valve gears had separate admission and exhaust valves driven by trip mechanisms or cams profiled so as to give ideal events; most of these gears never succeeded outside of the stationary marketplace due to various other issues including leakage and more delicate mechanisms.[58][61] Compression Before the exhaust phase is quite complete, the exhaust side of the valve closes, shutting a portion of the exhaust steam inside the cylinder. 
This determines the compression phase where a cushion of steam is formed against which the piston does work whilst its velocity is rapidly decreasing; it moreover obviates the pressure and temperature shock, which would otherwise be caused by the sudden admission of the high-pressure steam at the beginning of the following cycle.[citation needed] Lead in the valve timing The above effects are further enhanced by providing lead: as was later discovered with the internal combustion engine, it has been found advantageous since the late 1830s to advance the admission phase, giving the valve lead so that admission occurs a little before the end of the exhaust stroke in order to fill the clearance volume comprising the ports and the cylinder ends (not part of the piston-swept volume) before the steam begins to exert effort on the piston.[62] Uniflow (or unaflow) engine Main article: Uniflow steam engine Animation of a uniflow steam engine. The poppet valves are controlled by the rotating camshaft at the top. High-pressure steam enters, red, and exhausts, yellow. Uniflow engines attempt to remedy the difficulties arising from the usual counterflow cycle where, during each stroke, the port and the cylinder walls will be cooled by the passing exhaust steam, whilst the hotter incoming admission steam will waste some of its energy in restoring the working temperature. The aim of the uniflow is to remedy this defect and improve efficiency by providing an additional port uncovered by the piston at the end of each stroke making the steam flow only in one direction. By this means, the simple-expansion uniflow engine gives efficiency equivalent to that of classic compound systems with the added advantage of superior part-load performance, and comparable efficiency to turbines for smaller engines below one thousand horsepower. However, the thermal expansion gradient uniflow engines produce along the cylinder wall gives practical difficulties.[citation needed]. Turbine engines Main article: Steam turbine A rotor of a modern steam turbine, used in a power plant A steam turbine consists of one or more rotors (rotating discs) mounted on a drive shaft, alternating with a series of stators (static discs) fixed to the turbine casing. The rotors have a propeller-like arrangement of blades at the outer edge. Steam acts upon these blades, producing rotary motion. The stator consists of a similar, but fixed, series of blades that serve to redirect the steam flow onto the next rotor stage. A steam turbine often exhausts into a surface condenser that provides a vacuum. The stages of a steam turbine are typically arranged to extract the maximum potential work from a specific velocity and pressure of steam, giving rise to a series of variably sized high- and low-pressure stages. Turbines are only efficient if they rotate at relatively high speed, therefore they are usually connected to reduction gearing to drive lower speed applications, such as a ship's propeller. In the vast majority of large electric generating stations, turbines are directly connected to generators with no reduction gearing. Typical speeds are 3600 revolutions per minute (RPM) in the United States with 60 Hertz power, and 3000 RPM in Europe and other countries with 50 Hertz electric power systems. In nuclear power applications, the turbines typically run at half these speeds, 1800 RPM and 1500 RPM. A turbine rotor is also only capable of providing power when rotating in one direction. 
Therefore, a reversing stage or gearbox is usually required where power is required in the opposite direction.[citation needed] Steam turbines provide direct rotational force and therefore do not require a linkage mechanism to convert reciprocating to rotary motion. Thus, they produce smoother rotational forces on the output shaft. This contributes to a lower maintenance requirement and less wear on the machinery they power than a comparable reciprocating engine.[citation needed] Turbinia – the first steam turbine-powered ship The main use for steam turbines is in electricity generation (in the 1990s about 90% of the world's electric production was by use of steam turbines)[3] however the recent widespread application of large gas turbine units and typical combined cycle power plants has resulted in reduction of this percentage to the 80% regime for steam turbines. In electricity production, the high speed of turbine rotation matches well with the speed of modern electric generators, which are typically direct connected to their driving turbines. In marine service, (pioneered on the Turbinia), steam turbines with reduction gearing (although the Turbinia has direct turbines to propellers with no reduction gearbox) dominated large ship propulsion throughout the late 20th century, being more efficient (and requiring far less maintenance) than reciprocating steam engines. In recent decades, reciprocating Diesel engines, and gas turbines, have almost entirely supplanted steam propulsion for marine applications.[citation needed] Virtually all nuclear power plants generate electricity by heating water to provide steam that drives a turbine connected to an electrical generator. Nuclear-powered ships and submarines either use a steam turbine directly for main propulsion, with generators providing auxiliary power, or else employ turbo-electric transmission, where the steam"

The generation immediately stops; this is how it looks in the WebUI:

assistant

In the command line: "Output generated in 0.31 seconds (0.00 tokens/s, 0 tokens, context 1933, seed 1692940365)"

Please refer to #307; this is the exact same issue. We don't want a longer context than 2048 (at least right now), we want to send long prompts within the 2048-token window without the generation stopping entirely.

gjmulder commented 1 year ago

Sure. I'm using Ooba, but according to Priestru it's a general issue with llama-cpp-python.

Again, I don't see the problem with llama-cpp-python. I even went to the effort to try and reproduce it from your description, but couldn't.

Dampfinchen commented 1 year ago

Sure. I'm using Ooba, but according to Priestru it's a general issue with llama-cpp-python.

Again, I don't see the problem with llama-cpp-python. I even went to the effort to try and reproduce it from your description, but couldn't.

Which OS are you running? I've noticed Priestru and I are using the same OS (Windows 11).

BTW, just because you can't reproduce it doesn't mean the issue is invalid.

agronholm commented 1 year ago

I encountered this issue on Ubuntu 22.04 (GeForce 1080 ti, if that matters).

gjmulder commented 1 year ago

I encountered this issue on Ubuntu 22.04 (GeForce 1080 ti, if that matters).

Can you post your code, please?

agronholm commented 1 year ago

This happened with text-generation-webui. Sorry for not mentioning that.

gjmulder commented 1 year ago

This happened with text-generation-webui. Sorry for not mentioning that.

No problem. I'm sure any text-generation-webui developer reading this issue will jump in and fix it immediately.

agronholm commented 1 year ago

Sarcasm aside, text-generation-webui uses this library for text generation with llama-based models, which is why the OP opened this issue in the first place.

Mikkolehtimaki commented 1 year ago

I get short responses that are cut off when I use stream completions in server mode; is this related?

agronholm commented 1 year ago

Are you sure you're not just hitting the generation limit? That's usually the case when that happens to me.

gjmulder commented 1 year ago

I get short responses that are cut off when I use stream completions in server mode; is this related?

It could well be. Do you have a curl request to easily reproduce the problem?

I spent an hour trying to reproduce the problem from the OP's limited description, but without the specifics of exactly how llama-cpp-python is being called the issue is likely not going to get identified and fixed.

Mikkolehtimaki commented 1 year ago

I get very long answers to the same query with curl when I don't stream. Can I even stream with curl, though? I'm streaming with the openai Python API.
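
Streaming with curl should be possible against the llama-cpp-python server, since it speaks server-sent events on the OpenAI-style endpoints. A sketch, assuming the server is running on localhost:8000:

curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short poem about llamas.", "max_tokens": 200, "stream": true}'

The -N flag disables curl's output buffering so the data: chunks appear as they are generated.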

gjmulder commented 1 year ago

Knowing that it is an issue with the streaming API helps, thanks. It explains why I couldn't reproduce it with the high level API example.

Mikkolehtimaki commented 1 year ago

When this happens, the finish_reason is "length", by the way. It happens with both stream and no-stream Python clients, so maybe it's just me. With curl the response is nice and finish_reason is "stop".
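
For comparison, a sketch of how finish_reason can be surfaced from the Python side (this assumes the legacy openai 0.x client pointed at a local llama-cpp-python server on port 8000; the model name and prompt are placeholders):

import openai

openai.api_key = "sk-dummy"                   # the local server does not check the key
openai.api_base = "http://localhost:8000/v1"  # llama-cpp-python server (assumed address)

finish_reason = None
for chunk in openai.Completion.create(
    model="local-model",                       # placeholder; the server uses whatever it loaded
    prompt="Write a short poem about llamas.",
    max_tokens=200,
    stream=True,
):
    choice = chunk["choices"][0]
    print(choice.get("text", ""), end="", flush=True)
    finish_reason = choice.get("finish_reason") or finish_reason

print("\nfinish_reason:", finish_reason)  # "length" = hit max_tokens, "stop" = model ended naturally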

gjmulder commented 1 year ago

Earlier today I asked @abetlen to look into this more.

Can I please confirm which versions of text-generation-webui and llama-cpp-python people are using?

agronholm commented 1 year ago

I used text-generation-webui v1.3.1, and llama-cpp-python v0.1.57.

Mikkolehtimaki commented 1 year ago

llama-cpp-python version 0.1.59 here.

Dampfinchen commented 1 year ago

Using 0.1.59 as well. But I don't know how to check the version for Textgen.

Priestru commented 1 year ago

0.1.59 for llama-cpp-python, but the bug has been present in previous versions too. As for ooba, I can only say that I use the latest one.

Also, in ooba there is another issue of a somewhat similar kind that seems like it is going to be dismissed as a llama-cpp-python problem; I feel somewhat hesitant to create new issues at this point.

https://github.com/oobabooga/text-generation-webui/issues/2576#issuecomment-1583339709

agronholm commented 1 year ago

I hit this again, and I disabled text streaming in text-generation-webui to understand what's happening. This is what I got on the console:

Traceback (most recent call last):
  File "/home/alex/ai/text-generation-webui/modules/text_generation.py", line 301, in generate_reply_custom
    reply = shared.model.generate(context=question, **generate_params)
  File "/home/alex/ai/text-generation-webui/modules/llamacpp_model.py", line 78, in generate
    for completion_chunk in completion_chunks:
  File "/home/alex/ai/text-generation-webui/venv310/lib/python3.10/site-packages/llama_cpp/llama.py", line 725, in _create_completion
    raise ValueError(f"Requested tokens exceed context window of {self._n_ctx}")
ValueError: Requested tokens exceed context window of 2048
gjmulder commented 1 year ago

I hit this again, and I disabled text streaming in text-generation-webui to understand what's happening. This is what I got on the console:

Valuable info. Thx.

gjmulder commented 1 year ago

Also, in ooba there is another issue of a somewhat similar kind that seems like it is going to be dismissed as a llama-cpp-python problem; I feel somewhat hesitant to create new issues at this point.

The issue you linked to contains a stack trace that points directly to a llama-cpp-python problem. Please log an issue, or I can do it for you.

abetlen commented 1 year ago

I hit this again, and I disabled text streaming in text-generation-webui to understand what's happening. This is what I got on the console:

Traceback (most recent call last):
  File "/home/alex/ai/text-generation-webui/modules/text_generation.py", line 301, in generate_reply_custom
    reply = shared.model.generate(context=question, **generate_params)
  File "/home/alex/ai/text-generation-webui/modules/llamacpp_model.py", line 78, in generate
    for completion_chunk in completion_chunks:
  File "/home/alex/ai/text-generation-webui/venv310/lib/python3.10/site-packages/llama_cpp/llama.py", line 725, in _create_completion
    raise ValueError(f"Requested tokens exceed context window of {self._n_ctx}")
ValueError: Requested tokens exceed context window of 2048

I think this narrows it down: create_completion would throw an error when len(prompt_tokens) + max_tokens > n_ctx. I've changed this to just truncate max_tokens instead. I'll publish an updated version.
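
In other words, roughly this kind of clamp in place of the old ValueError (a standalone sketch of the logic, not the actual patch):

def clamp_max_tokens(prompt_tokens: list, max_tokens: int, n_ctx: int) -> int:
    # Old behaviour: raise ValueError("Requested tokens exceed context window of ...")
    # New behaviour: shrink max_tokens so prompt + completion fits inside n_ctx.
    if len(prompt_tokens) + max_tokens > n_ctx:
        max_tokens = n_ctx - len(prompt_tokens)
    return max_tokens

# e.g. a 1,978-token prompt with max_tokens=200 and n_ctx=2048 leaves room for 70 tokens:
print(clamp_max_tokens([0] * 1978, 200, 2048))  # 70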

abetlen commented 1 year ago

@Priestru this is related to #183 actually, but thanks for reporting, I'll try to implement a fix that works outside of the server too.

The issue is that ooba is likely using a single Llama object in memory; when you click regenerate, the previous request is still running but a new one comes in as well, and this causes inconsistencies in the underlying library. The best workaround at the moment is what the llama-cpp-python server does, wrapping it in a lock, but this is not a good solution as it doesn't allow for easy interruption of generation.
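
For reference, the lock-based workaround amounts to something like this (a hypothetical wrapper sketch, not ooba's or the server's actual code; the model path is a placeholder):

import threading

from llama_cpp import Llama

llm = Llama(model_path="./models/model.bin", n_ctx=2048)  # placeholder path
llm_lock = threading.Lock()

def generate(prompt: str, **kwargs):
    # Serialize access to the single shared Llama object so a new request
    # (e.g. clicking "regenerate") cannot interleave with one still running.
    with llm_lock:
        return llm(prompt, **kwargs)

The obvious downside is the one noted above: a request already in flight cannot be interrupted, the new one just queues behind it.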

gjmulder commented 1 year ago

Also, in ooba there is another issue of a somewhat similar kind that seems like it is going to be dismissed as a llama-cpp-python problem; I feel somewhat hesitant to create new issues at this point.

@Priestru I've created an oobabooga label to track issues that are being reported indirectly via text-generation-webui. Ideally, they should include a stack trace such as the one @agronholm kindly provided, which in turn enabled @abetlen to understand the issue.

Priestru commented 1 year ago

Zero-token generation for larger prompts has been fixed in the newest update, but now we have a new bug.

n_ctx is the default (2048).

llama_print_timings:        load time = 17090.64 ms
llama_print_timings:      sample time =     2.11 ms /    14 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time = 48184.79 ms /  1989 tokens (   24.23 ms per token)
llama_print_timings:        eval time =  8830.68 ms /    13 runs   (  679.28 ms per token)
llama_print_timings:       total time = 57064.15 ms
Output generated in 57.34 seconds (0.24 tokens/s, 14 tokens, context 2034, seed 740796631)
Llama.generate: prefix-match hit
127.0.0.1 - - [10/Jun/2023 12:37:13] "GET /api/v1/model HTTP/1.1" 200 -

llama_print_timings:        load time = 17090.64 ms
llama_print_timings:      sample time =     2.05 ms /    14 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  9368.79 ms /    14 runs   (  669.20 ms per token)
llama_print_timings:       total time =  9412.53 ms
Output generated in 9.68 seconds (1.45 tokens/s, 14 tokens, context 2034, seed 516672141)
Llama.generate: prefix-match hit

llama_print_timings:        load time = 17090.64 ms
llama_print_timings:      sample time =     2.03 ms /    14 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  9426.21 ms /    14 runs   (  673.30 ms per token)
llama_print_timings:       total time =  9475.47 ms
Output generated in 9.75 seconds (1.44 tokens/s, 14 tokens, context 2034, seed 1241799934)

It only generates until it hits a total of 2048 tokens, i.e. the sum of the initial prompt + output.

The previously discovered workaround saves the day once again because it allows generation to proceed normally.

I set n_ctx to 2500. It results in:

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from D:\Models\Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2500
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 25573.14 MB (+ 3124.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
.
llama_init_from_file: kv self size  = 3808.59 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 9.01 seconds.

127.0.0.1 - - [10/Jun/2023 12:41:46] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Jun/2023 12:43:17] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Jun/2023 12:44:47] "GET /api/v1/model HTTP/1.1" 200 -

llama_print_timings:        load time = 17040.96 ms
llama_print_timings:      sample time =    42.47 ms /   289 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time = 58792.80 ms /  2034 tokens (   28.91 ms per token)
llama_print_timings:        eval time = 197917.31 ms /   288 runs   (  687.21 ms per token)
llama_print_timings:       total time = 258369.00 ms
Output generated in 258.64 seconds (1.11 tokens/s, 288 tokens, context 2034, seed 802723436)
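
The arithmetic behind the two runs above, spelled out with the numbers from the logs (this is just the prompt + completion budget that the new truncation enforces):

prompt_tokens = 2034           # "context 2034" in both runs above

for n_ctx in (2048, 2500):
    budget = n_ctx - prompt_tokens
    print(n_ctx, budget)
# 2048 ->  14   (matches the 14-token replies)
# 2500 -> 466   (enough room for the 288-token reply)
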
Priestru commented 1 year ago

0.1.62 (or thereabouts) doesn't fix it.

llama_print_timings:        load time = 15599.80 ms
llama_print_timings:      sample time =     5.32 ms /    35 runs   (    0.15 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 23978.73 ms /    35 runs   (  685.11 ms per token)
llama_print_timings:       total time = 24616.61 ms
Output generated in 24.91 seconds (1.40 tokens/s, 35 tokens, context 2013, seed 971044292)

Should I make a new issue to add visibility?

deepdatalive commented 1 year ago

The llama 7B model is giving me very short responses; input and output are below:

Endpoint: http://localhost:PORT/v1/chat/completions

Request body:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI Assistant named MR AI"
    },
    {
      "role": "user",
      "content": "Write a big poem for me"
    }
  ]
}

Response body:

{
    "id": "chatcmpl-<id>",
    "object": "chat.completion",
    "created": <timestamp>,
    "model": "llama.cpp/models/7B/ggml-model-q4_0.bin",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Are you sure? "
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 33,
        "completion_tokens": 8,
        "total_tokens": 41
    }
}