c0sogi / llama-api

An OpenAI-like LLaMA inference API
MIT License

Dev update (23.9.3.) #7

Closed · c0sogi closed 1 year ago

c0sogi commented 1 year ago

Pull Request Summary

This pull request introduces the following changes:

  1. Added Instruction Templates: Instruction templates can now be attached to a model definition. One can explicitly provide an instruction_template in LlamaCppModel or ExllamaModel, which helps the chat completion endpoint generate more accurate prompts (see the model definition sketch at the end of this summary).

  2. Streaming Response Timeout: Streaming responses now time out automatically if the next chunk is not received within 30 seconds (a per-chunk timeout sketch appears at the end of this summary).

  3. Semaphore Bug Fix: Fixed a bug where a semaphore would not be properly released after being acquired.

  4. Auto Truncate in Model Definition: If auto_truncate is set to True in the model definition, past prompts are automatically truncated to fit within the context window, preventing context-overflow errors. The default is True (see the sketch at the end of this summary).

  5. Automatic RoPE Parameter Adjustment: If rope_freq_base and rope_freq_scale (llama.cpp) or alpha_value and compress_pos_emb (exllama) are not set explicitly, the RoPE frequency and scaling factor are adjusted automatically. The defaults assume a Llama 2 style model trained with a 4096-token context.

  6. Dynamic Model Definition Parsing: Model definitions are primarily configured in model_definitions.py. However, parsing is now also attempted for Python script files in the root directory whose names contain both 'model' and 'def', and for environment variables whose names contain 'model' and 'def'. This applies to openai_replacement_models as well. For example, you can set environment variables as shown below:

#!/bin/bash

export MODEL_DEFINITIONS='{
  "gptq": {
    "type": "exllama",
    "model_path": "TheBloke/MythoMax-L2-13B-GPTQ",
    "max_total_tokens": 4096
  },
  "ggml": {
    "type": "llama.cpp",
    "model_path": "TheBloke/Airoboros-L2-13B-2.1-GGUF",
    "max_total_tokens": 8192,
    "rope_freq_base": 26000,
    "rope_freq_scale": 0.5,
    "n_gpu_layers": 50
  }
}'

export OPENAI_REPLACEMENT_MODELS='{
  "gpt-3.5-turbo": "ggml",
  "gpt-3.5-turbo-0613": "ggml",
  "gpt-3.5-turbo-16k": "ggml",
  "gpt-3.5-turbo-16k-0613": "ggml",
  "gpt-3.5-turbo-0301": "ggml",
  "gpt-4": "gptq",
  "gpt-4-32k": "gptq",
  "gpt-4-0613": "gptq",
  "gpt-4-32k-0613": "gptq",
  "gpt-4-0301": "gptq"
}'

echo "MODEL_DEFINITIONS: $MODEL_DEFINITIONS"
echo "OPENAI_REPLACEMENT_MODELS: $OPENAI_REPLACEMENT_MODELS"