AmineDiro / cria

OpenAI compatible API for serving LLAMA-2 model
MIT License

Response cutting off at around 256 tokens #17

Closed sbmkvp closed 1 year ago

sbmkvp commented 1 year ago

The response cuts off at around 256 tokens.

# .env
CRIA_MODEL_PATH=/home/bala/Models/llama-2-13b-chat.ggmlv3.q8_0.bin

# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=false
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

Input

{
  "prompt":"[INST]<<SYS>>.<</SYS>>How do I get from UNSW to Central Station?[/INST]",
  "temperature":0.1
}

Output

console.log(response.choices[0].text)

There are several ways to get from the University of New South Wales (UNSW) to Central Station in Sydney. Here are some options: 1. Train: The easiest and most convenient way to get to Central Station from UNSW is by train. The UNSW campus is located near the Kensington station, which is on the Airport & South Line. You can take a train from Kensington station to Central Station. The journey takes around 20 minutes. 2. Bus: You can also take a bus from UNSW to Central Station. The UNSW campus is served by several bus routes, including the 395, 397, and 398. These buses run frequently throughout the day and the journey takes around 45-60 minutes, depending on traffic. 3. Light Rail: Another option is to take the light rail from UNSW to Central Station. The light rail runs along Anzac Parade and stops at Central Station. The journey takes around 30-40 minutes. 4. Taxi or Ride-sharing: You can also take a taxi or ride-sharing service such as Uber or Ly

sbmkvp commented 1 year ago

Is there a way to increase this?

AmineDiro commented 1 year ago

cria replicates the OpenAI API, so you can set the max_tokens value in your POST request. Here is an example:

curl http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Say this is a test",
    "max_tokens": 512,
    "temperature": 0
  }'
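
For a JavaScript client like the one in the original report, the same request might look roughly like the sketch below. It assumes the server is reachable at localhost:3000 (matching CRIA_PORT above) and accepts the OpenAI-style body from the curl example. The helper names buildCompletionBody, wasTruncated, and complete are hypothetical, and whether cria populates finish_reason the way the OpenAI API does is an assumption, not something confirmed in this thread.

```javascript
// Assumed endpoint: host and port taken from the curl example above.
const CRIA_URL = "http://localhost:3000/v1/completions";

// Build the JSON body. The default of 512 mirrors the curl example;
// raise it if the answer is still cut short.
function buildCompletionBody(prompt, maxTokens = 512, temperature = 0.1) {
  return { prompt, max_tokens: maxTokens, temperature };
}

// In the OpenAI completions format, finish_reason === "length" means the
// token limit stopped generation. Assumption: cria fills in this field.
function wasTruncated(response) {
  const choice = response.choices && response.choices[0];
  return Boolean(choice) && choice.finish_reason === "length";
}

// POST the request and return the parsed JSON response.
// Uses the global fetch available in Node 18+ and browsers.
async function complete(prompt, maxTokens) {
  const res = await fetch(CRIA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildCompletionBody(prompt, maxTokens)),
  });
  if (!res.ok) throw new Error(`cria returned HTTP ${res.status}`);
  return res.json();
}
```

Calling complete(prompt, 1024) and then checking wasTruncated on the result is one way to tell whether the limit needs to be raised further, provided the server reports finish_reason.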