b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

API Server #47

Closed DifferentialityDevelopment closed 1 month ago

DifferentialityDevelopment commented 1 month ago

This pull request introduces API functionality to the distributed-llama project. The main addition is an implementation of the chat completion endpoint, following the specification outlined by OpenAI for chat completions.

Key features of this implementation include streaming support, the capability to terminate generation upon detecting a stop word or end-of-sequence token (EOS), and the ability to dynamically adjust parameters such as temperature, seed, and top probability (top-p) for each request.
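For reference, a request in the OpenAI chat-completions format would look roughly like the payload below. This is an illustrative sketch, not taken from the PR: the exact fields this endpoint honors, and the stop word shown, should be confirmed against the code.

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "seed": 42,
  "stream": true,
  "stop": ["</s>"]
}
```

With `"stream": true` the server would send tokens as they are generated, and generation ends early when a listed stop word or the EOS token is produced.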

The code has undergone significant refactoring to enhance clarity and maintainability. I have tested the changes locally to ensure functionality. Your feedback on the implementation is highly appreciated.

DifferentialityDevelopment commented 1 month ago

I noticed I accidentally renamed ProgramArgs to ServerArgs in main.cpp. This wasn't intentional and will be reverted.

DifferentialityDevelopment commented 1 month ago

@b4rtaz I've made the change to use SocketServer instead, and I've added a function on Socket that is meant solely for reading a full HTTP request. Using read caused it to never finish, since I didn't know ahead of time how much data was being sent. I don't want to change the current read function, as it's used by the workers and I don't want to break anything there.
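For reviewers, here is a rough sketch of how such a "read a full HTTP request" helper can work; this is not the code in this PR, and the names are illustrative. The idea is to read until the header terminator is seen, then use Content-Length to know how many body bytes are still expected, which avoids touching the existing read path used by the workers.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/socket.h>

// Illustrative helper: read one complete HTTP request from a socket fd.
std::string readHttpRequest(int fd) {
    std::string request;
    char buffer[4096];
    size_t bodyStart = std::string::npos;

    // Read until the end of the header block ("\r\n\r\n") is seen.
    while (bodyStart == std::string::npos) {
        ssize_t n = recv(fd, buffer, sizeof(buffer), 0);
        if (n <= 0) return request; // connection closed or error
        request.append(buffer, n);
        bodyStart = request.find("\r\n\r\n");
    }
    bodyStart += 4;

    // Parse Content-Length (treated as 0 if absent, e.g. for GET requests).
    size_t contentLength = 0;
    size_t pos = request.find("Content-Length:");
    if (pos != std::string::npos)
        contentLength = std::stoul(request.substr(pos + 15));

    // Read the remaining body bytes, if any.
    while (request.size() - bodyStart < contentLength) {
        ssize_t n = recv(fd, buffer, sizeof(buffer), 0);
        if (n <= 0) break;
        request.append(buffer, n);
    }
    return request;
}
```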

b4rtaz commented 1 month ago

@DifferentialityDevelopment I need a bit of time to test it. After I release the refactored multi-head layers I'll switch to this (maybe 2-5 days).