This PR introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:
1. Exllama Module - LoRA Integration
By placing `adapter_config.json` and `adapter_model.bin` in the `./models/gptq/YOUR_MODEL` directory, the system will now seamlessly initialize LoRA.
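The expected layout would look roughly like this (with `YOUR_MODEL` as a placeholder for your model's directory name):

```
models/
└── gptq/
    └── YOUR_MODEL/
        ├── adapter_config.json   # LoRA adapter configuration
        ├── adapter_model.bin     # LoRA adapter weights
        └── ...                   # GPTQ model weights, tokenizer files, etc.
```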
2. OpenAI Logit Bias Support
For API queries to models specified within the `openai_replacement_models` dictionary, OpenAI token IDs are automatically converted to Llama token IDs, courtesy of the Tiktoken tokenizer.
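A minimal sketch of this conversion: decode each OpenAI token ID back to text, then re-encode that text with the Llama tokenizer. The function name and the toy vocabularies below are illustrative, not the PR's actual implementation; in practice `openai_decode` would wrap tiktoken and `llama_encode` the model's own tokenizer.

```python
from typing import Callable, Dict, List

def convert_logit_bias(
    bias: Dict[int, float],
    openai_decode: Callable[[int], str],
    llama_encode: Callable[[str], List[int]],
) -> Dict[int, float]:
    """Map OpenAI logit_bias token IDs to Llama token IDs via round-trip text.

    openai_decode: OpenAI token ID -> text (e.g. backed by tiktoken)
    llama_encode:  text -> Llama token IDs (the model's own tokenizer)
    """
    converted: Dict[int, float] = {}
    for openai_id, weight in bias.items():
        text = openai_decode(openai_id)
        # One OpenAI token may map to several Llama tokens.
        for llama_id in llama_encode(text):
            converted[llama_id] = weight
    return converted

# Toy vocabularies standing in for tiktoken / the Llama tokenizer:
openai_vocab = {100: "hello", 101: " world"}
llama_vocab = {"hello": [7], " world": [8, 9]}
result = convert_logit_bias(
    {100: 5.0, 101: -2.0},
    openai_decode=openai_vocab.__getitem__,
    llama_encode=llama_vocab.__getitem__,
)
# result == {7: 5.0, 8: -2.0, 9: -2.0}
```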
3. Optimized Worker Load Balancing
Workers within the process pool use a revamped load-balancing algorithm: clients are now allocated based on the computed `worker_rank`, and in scenarios where ranks tie, a random worker is selected.
4. Enhanced Logging Mechanism
Log messages are now crisper. Additionally, both user prompts and responses from Chat Completion and Text Completion operations are archived in `logs/chat.log`.
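A minimal sketch of such file-based archiving with the standard `logging` module (the function name and log format are illustrative, not necessarily what the PR uses):

```python
import logging
import os

def build_chat_logger(path: str = "logs/chat.log") -> logging.Logger:
    """Create a logger that appends prompts and responses to a file."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    logger = logging.getLogger("chat")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger

logger = build_chat_logger()
logger.info("prompt: %s", "Hello!")
logger.info("response: %s", "Hi there!")
```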
5. Docker Image Upgrades
The previous Docker image relied on the CPU version of llama.cpp, which cannot use CUDA acceleration. However, given the constraints on using the CUDA compiler during the build phase, JIT compilation comes to the rescue to ensure automatic compilation at runtime.