deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

Stopping short: Very few output tokens returned, even though max tokens set very high. #2057

Closed yaronr closed 3 weeks ago

yaronr commented 3 weeks ago

Description

The curl query below should return many output tokens, but only one is returned. The query is equivalent to the queries generated by llmperf, which work perfectly against vLLM inference endpoints.

Expected Behavior

Something on the order of 152 output tokens should be returned.

Error Message

No error message, only an INFO log line (even though the request did not set this parameter): 'INFO PyProcess W-109-model-stdout: The following parameters are not supported by neuron with rolling batch: {'frequency_penalty'}. '

How to Reproduce?

 curl http://Llama-3-8b-inst.inferentia.myurl.io/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
   "model": "meta-llama/Meta-Llama-3-8B-Instruct", 
   "messages": [
     {
     "role": "user",
     "content": "Randomly stream lines from the following text with 152 output tokens. Don'\''t generate eos tokens:\\n\\nAnd make Time'\''s spoils despised every where.\\nSince, seldom coming, in the long year set,\\nThat use is not forbidden usury,\\nAnd gives thy pen both skill and argument.\\nWhere art thou, Muse, that thou forget'\''st so long\\nSometime too hot the eye of heaven shines,\\nAs tender nurse her babe from faring ill.\\nFor blunting the fine point of seldom pleasure.\\nIn thee thy summer, ere thou be distill'\''d:\\nTherefore are feasts so solemn and so rare,\\nSince mind at first in character was done!\\nSo is the time that keeps you as my chest,\\nWhich in thy breast doth live, as thine in me:\\nAnd summer'\''s lease hath all too short a date:\\nThen look I death my days should expiate.\\nBut thy eternal summer shall not fade\\nHath been before, how are our brains beguiled,\\nOr whether revolution be the same.\\nWith beauty'\''s treasure, ere it be self-kill'\''d.\\nSo long as youth and thou are of one date;\\nIf there be nothing new, but that which is\\nPresume not on thy heart when mine is slain;\\nReturn, forgetful Muse, and straight redeem\\nBeing had, to triumph, being lack'\''d, to hope.\\nTo be death'\''s conquest and make worms thine heir.\\nBut when in thee time'\''s furrows I behold,\\nTen times thyself were happier than thou art,\\nBearing thy heart, which I will keep so chary\\nAnd every fair from fair sometime declines,\\nShall I compare thee to a summer'\''s day?\\nThe which he will not every hour survey,\\nNor lose possession of that fair thou owest;\\nEven of five hundred courses of the sun,\\nAnd often is his gold complexion dimm'\''d;\\nIn gentle numbers time so idly spent;\\nTo speak of that which gives thee all thy might?\\nBy chance or nature'\''s changing course untrimm'\''d;\\nThou art "
    }
   ],
   "max_tokens": 100000 
  }'
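
To see quickly how many tokens came back and why generation stopped, the response can be piped through a small helper. This is only a sketch: it assumes the endpoint returns an OpenAI-style body with usage.completion_tokens and choices[0].finish_reason (not confirmed in this thread), that jq is installed, and summarize_completion and request.json are hypothetical names.

```shell
# Hypothetical helper: read a chat completions response on stdin and print
# "<completion_tokens> <finish_reason>". Assumes an OpenAI-style response
# body; fields that are missing are printed as "n/a". Requires jq.
summarize_completion() {
  jq -r '"\(.usage.completion_tokens // "n/a") \(.choices[0].finish_reason // "n/a")"'
}

# Usage sketch (request.json is a hypothetical file holding the JSON body above):
#   curl -s http://Llama-3-8b-inst.inferentia.myurl.io/v1/chat/completions \
#     -H 'Content-Type: application/json' -d @request.json | summarize_completion
```

If the endpoint reports it, a finish reason of "length" alongside a tiny token count would point at a length cap rather than at the prompt.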

Steps to reproduce

Run the djl-serving Docker image deepjavalibrary/djl-serving:0.28.0-pytorch-inf2 with the following env vars:

AWS_NEURON_VISIBLE_DEVICES=ALL 
OPTION_TENSOR_PARALLEL_DEGREE=max
HF_HOME=/tmp/.cache/huggingface 
OPTION_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct 
OPTION_ENTRYPOINT=djl_python.transformers_neuronx 
OPTION_TRUST_REMOTE_CODE=true 
SERVING_LOAD_MODELS=test::Python=/opt/ml/model 
OPTION_ROLLING_BATCH=auto
OPTION_MAX_ROLLING_BATCH_SIZE=32 
DJL_CACHE_DIR=/tmp/.cache/

What have you tried to solve it?

  1. I've tried a very short query asking for 150 output tokens with no stop tokens.

yaronr commented 3 weeks ago

It turns out this happens because max_tokens is capped by n_positions, which defaults to a very small value. Setting OPTION_N_POSITIONS=8192 fixed it. I recommend improving the documentation and setting more reasonable defaults (30 is too low). Also, I wonder why there should be a default max_tokens at all. Finally, the info printout is weird. I assume no one will ever read this. If you do read this, and reach this line - congratulations, you've reached the end of this comment.
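
For anyone landing here with the same symptom, the change amounts to one extra line in the container environment from the reproduction steps. A minimal sketch, where 8192 is simply the value used in this thread and not a recommended setting:

```shell
# Rolling-batch settings carried over unchanged from the reproduction steps
OPTION_ROLLING_BATCH=auto
OPTION_MAX_ROLLING_BATCH_SIZE=32
# Added line: raise n_positions so max_tokens is no longer capped by the
# small default (8192 is the value from this thread; adjust as needed)
OPTION_N_POSITIONS=8192
```

The other env vars listed in the reproduction steps stay as they were.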