This PR introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:
1. Exllama Module - LoRA Integration
By placing `adapter_config.json` and `adapter_model.bin` in the `./models/gptq/YOUR_MODEL` directory, the system will now seamlessly initialize LoRA.
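The expected layout would look roughly like this (with `YOUR_MODEL` as a placeholder for your model's directory name):

```
models/
└── gptq/
    └── YOUR_MODEL/
        ├── adapter_config.json   # LoRA adapter configuration
        ├── adapter_model.bin     # LoRA adapter weights
        └── ...                   # GPTQ model weights, tokenizer files, etc.
```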
2. OpenAI Logit Bias Support
For API queries to models specified within the `openai_replacement_models` dictionary, OpenAI token IDs are automatically converted to Llama token IDs, courtesy of the Tiktoken tokenizer.
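A minimal sketch of this conversion: decode each OpenAI token ID back to text, then re-encode that text with the Llama tokenizer. The function name and the toy vocabularies below are illustrative, not the PR's actual implementation; in practice `openai_decode` would wrap tiktoken and `llama_encode` the model's own tokenizer.

```python
from typing import Callable, Dict, List

def convert_logit_bias(
    bias: Dict[int, float],
    openai_decode: Callable[[int], str],
    llama_encode: Callable[[str], List[int]],
) -> Dict[int, float]:
    """Map OpenAI logit_bias token IDs to Llama token IDs via round-trip text.

    openai_decode: OpenAI token ID -> text (e.g. backed by tiktoken)
    llama_encode:  text -> Llama token IDs (the model's own tokenizer)
    """
    converted: Dict[int, float] = {}
    for openai_id, weight in bias.items():
        text = openai_decode(openai_id)
        # One OpenAI token may map to several Llama tokens.
        for llama_id in llama_encode(text):
            converted[llama_id] = weight
    return converted

# Toy vocabularies standing in for tiktoken / the Llama tokenizer:
openai_vocab = {100: "hello", 101: " world"}
llama_vocab = {"hello": [7], " world": [8, 9]}
result = convert_logit_bias(
    {100: 5.0, 101: -2.0},
    openai_decode=openai_vocab.__getitem__,
    llama_encode=llama_vocab.__getitem__,
)
# result == {7: 5.0, 8: -2.0, 9: -2.0}
```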
3. Optimized Worker Load Balancing
Workers within the process pool use a revamped load-balancing algorithm: clients are now allocated based on the computed `worker_rank`, and in scenarios where ranks tie, a random worker is selected.
4. Enhanced Logging Mechanism
Log messages are now crisper. Additionally, both user prompts and responses from Chat Completion and Text Completion operations are archived in `logs/chat.log`.
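A minimal sketch of such file-based archiving with the standard `logging` module (the function name and log format are illustrative, not necessarily what the PR uses):

```python
import logging
import os

def build_chat_logger(path: str = "logs/chat.log") -> logging.Logger:
    """Create a logger that appends prompts and responses to a file."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    logger = logging.getLogger("chat")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger

logger = build_chat_logger()
logger.info("prompt: %s", "Hello!")
logger.info("response: %s", "Hi there!")
```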
5. Docker Image Upgrades
The previous Docker image relied on the CPU version of llama.cpp, which cannot use CUDA acceleration. However, given the constraints on using the CUDA compiler during the build phase, JIT compilation comes to the rescue to ensure automatic compilation at runtime.