b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Fleshing out API mode #39

Closed. DifferentialityDevelopment closed this issue 1 month ago

DifferentialityDevelopment commented 2 months ago

Hi there

Amazing project, by the way. It has given me hope of being able to run really big models; specifically, I'm very excited about the upcoming 400B Llama model coming in the following months, and I've been planning how to run it locally. Anyway, I'm going off topic here. I'd really like to help out where I can, and while it's undocumented in the readme, I see there is a simple server capability built into the distributed-llama main program.

I was wondering if I can assist in fleshing it out. There are a few things off the top of my head that could be needed in an API for distributed-llama, in no particular order:

- ability to stop on encountering specific strings
- chat template integration, and for that matter an OpenAI chat completion API endpoint

I'm fairly new to C++ but I'd be happy to work with you on this

b4rtaz commented 2 months ago

Hello @DifferentialityDevelopment,

yeah, more hands are welcome.

> ability to stop on encountering specific strings

This could be easy to implement. Maybe you could add an extra CLI parameter like --eos-id 123 that overrides the value from the tokenizer file.
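A minimal sketch of that kind of override, assuming a hypothetical argument loop rather than distributed-llama's actual CLI handling:

```cpp
// Hypothetical sketch only: let --eos-id override the EOS id read from the
// tokenizer file. distributed-llama's real argument handling looks different.
#include <cstdlib>
#include <cstring>

int resolveEosId(int argc, char **argv, int eosIdFromTokenizer) {
    int eosId = eosIdFromTokenizer;              // default: value from the tokenizer file
    for (int i = 1; i + 1 < argc; i++) {
        if (std::strcmp(argv[i], "--eos-id") == 0)
            eosId = std::atoi(argv[i + 1]);      // CLI value takes precedence
    }
    return eosId;
}
```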

> chat template integration, and for that matter an OpenAI chat completion API endpoint

This would be very good. But Distributed Llama still doesn't support KV cache recalculation after the inference has reached seqLen tokens. For now that cannot be implemented, because there are some more important changes to do first. I think in the next few days I'll add the roadmap for this project.

DifferentialityDevelopment commented 1 month ago

I've already made some good progress on an API endpoint. I was thinking it might be a good idea, instead of building it into main, to have it separate as server.cpp. Sometime in the next week I'm hoping to have it ready; I'm aiming for the chat completions endpoint to work in the format described at https://platform.openai.com/docs/api-reference/chat/create. I'm almost certain I'll need your help on refactoring it though, as I've still got tons to learn about C++; I mostly program in C#. The only external dependency I'm looking at using for it right now is https://github.com/nlohmann/json.
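As a rough illustration of that format (a sketch under assumptions, not the code from the fork), a request body could be parsed with nlohmann/json along these lines:

```cpp
// Sketch only: parse an OpenAI-style /v1/chat/completions request body with the
// header-only nlohmann/json library. Field names follow the OpenAI API reference;
// the surrounding types are hypothetical, not taken from the fork.
#include <string>
#include <vector>
#include "json.hpp"

using json = nlohmann::json;

struct ChatMessage { std::string role; std::string content; };

struct ChatRequest {
    std::vector<ChatMessage> messages;
    float temperature;
    bool stream;
};

ChatRequest parseChatRequest(const std::string &body) {
    json j = json::parse(body);
    ChatRequest req;
    req.temperature = j.value("temperature", 1.0f);   // optional field
    req.stream = j.value("stream", false);            // optional field
    for (const auto &m : j.at("messages")) {
        req.messages.push_back({ m.at("role").get<std::string>(),
                                 m.at("content").get<std::string>() });
    }
    return req;
}
```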

b4rtaz commented 1 month ago

Great! I would prefer not to use external dependencies; you can attach the JSON library to the source code as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile (make xyz).

Also, I think this server should be kept out of the main.cpp file. You could put the main code into the src/apps/api-server.cpp file.

DifferentialityDevelopment commented 1 month ago

> Great! I would prefer not to use external dependencies; you can attach the JSON library to the source code as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile (make xyz).

It's only a header file, i.e. json.hpp, which makes it super easy to integrate and allows working with JSON data without bloating the source code with custom parsing code. It was necessary for the API endpoint, as most communication to/from the server is via JSON.
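For the response direction, a minimal sketch of building a non-streaming chat completion body with the same header-only library (the id and model values are placeholders, not what the fork actually returns):

```cpp
// Sketch: build a non-streaming chat completion response with nlohmann/json.
// The id and model values are placeholders.
#include <string>
#include "json.hpp"

using json = nlohmann::json;

std::string buildChatResponse(const std::string &generatedText,
                              const std::string &finishReason) {
    json choice = {
        {"index", 0},
        {"message", { {"role", "assistant"}, {"content", generatedText} }},
        {"finish_reason", finishReason}          // e.g. "stop" or "length"
    };
    json response = {
        {"id", "chatcmpl-local"},                // placeholder id
        {"object", "chat.completion"},
        {"model", "distributed-llama"},          // placeholder model name
        {"choices", json::array({ choice })}
    };
    return response.dump();
}
```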

> Also, I think this server should be kept out of the main.cpp file. You could put the main code into the src/apps/api-server.cpp file.

Yes, this is what I've done, though it's at src/server.cpp.

I've already got it mostly working. The non-streaming endpoint is basically done; I just have to finish the streaming endpoint and it should be good to go. I've only made a few small changes to the other files, for instance to the sampler, so that the temperature or seed can be set dynamically based on the API request being processed.
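A hedged sketch of that kind of change, assuming a stand-in Sampler type rather than the actual class in the repository:

```cpp
// Hypothetical sketch: let the sampler's temperature and seed be set per request.
// This Sampler is a stand-in, not the actual distributed-llama class.
#include <cstdint>

class Sampler {
public:
    Sampler(float temperature, uint64_t seed)
        : temperature(temperature), rngState(seed) {}

    // Added setters so an API request can reconfigure sampling before generation.
    void setTemperature(float t) { temperature = t; }
    void setSeed(uint64_t seed) { rngState = seed; }

private:
    float temperature;
    uint64_t rngState;
};
```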

DifferentialityDevelopment commented 1 month ago

By the way, I've published a distributed version of Llama 3 8B Instruct to Hugging Face: https://huggingface.co/Azamorn/Meta-Llama-3-8B-Instruct-Distributed

That model is ideal for testing the chat completion endpoint.

DifferentialityDevelopment commented 1 month ago

I've got it responding now; super stoked I finally got it working!

[screenshot]

And another one: I've got stop strings integrated too.

[screenshot]

You can check out my changes here https://github.com/DifferentialityDevelopment/distributed-llama

Still busy with it though

DifferentialityDevelopment commented 1 month ago

The API is basically done; streaming and non-streaming modes work too. I've tested both and they seem to work fine.

[screenshot]

Do you think it's ready for a pull request?
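For context on the streaming mode mentioned above: OpenAI-style streaming is usually delivered as server-sent events, one JSON chunk per piece of generated text. A rough sketch, where sendToClient stands in for however the server writes to the socket:

```cpp
// Sketch: emit one streaming chunk in the server-sent-events style used by the
// OpenAI chat completions API. sendToClient is a placeholder for the socket write.
#include <functional>
#include <string>
#include "json.hpp"

using json = nlohmann::json;

void emitChunk(const std::function<void(const std::string &)> &sendToClient,
               const std::string &deltaText, bool done) {
    if (done) {
        sendToClient("data: [DONE]\n\n");        // terminates the stream
        return;
    }
    json choice = { {"index", 0}, {"delta", { {"content", deltaText} }} };
    json chunk = {
        {"object", "chat.completion.chunk"},
        {"choices", json::array({ choice })}
    };
    sendToClient("data: " + chunk.dump() + "\n\n");
}
```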

b4rtaz commented 1 month ago

Nice job! I'll review your changes after you create a new PR.

b4rtaz commented 1 month ago

Btw: what is the behaviour when the generation position reaches the KV cache max position?

DifferentialityDevelopment commented 1 month ago

> Btw: what is the behaviour when the generation position reaches the KV cache max position?

To be honest, I have not tested that. Do you mean when it reaches the max sequence length? From what I can tell, it should just force-end the chat completion. Normally the chat completion ends when a stop word is detected or an EOS token is generated; if not, it runs until it reaches the maximum sequence length specified in TransformerSpec->seqLen.
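That termination logic can be sketched roughly as below; the token callbacks are hypothetical, but the three exit conditions (EOS token, stop string, seqLen) follow the description above:

```cpp
// Sketch of the three exit conditions: EOS token, a stop string, or reaching the
// maximum sequence length (seqLen). The token callbacks are hypothetical.
#include <functional>
#include <string>
#include <vector>

enum class FinishReason { Eos, Stop, Length };

FinishReason runCompletion(int pos, int seqLen, int eosTokenId,
                           const std::vector<std::string> &stopStrings,
                           const std::function<int()> &sampleNextToken,
                           const std::function<std::string(int)> &decodeToken) {
    std::string generated;
    while (pos < seqLen) {
        int token = sampleNextToken();
        if (token == eosTokenId)
            return FinishReason::Eos;            // EOS token generated
        generated += decodeToken(token);
        for (const std::string &stop : stopStrings) {
            if (!stop.empty() && generated.size() >= stop.size() &&
                generated.compare(generated.size() - stop.size(), stop.size(), stop) == 0)
                return FinishReason::Stop;       // stop string detected
        }
        pos++;
    }
    return FinishReason::Length;                 // reached TransformerSpec->seqLen
}
```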

b4rtaz commented 1 month ago

I mean that if the chat completion supported keeping a session (id), the KV cache limit could become a problem very quickly, because after a short conversation the limit would be reached.

Does the current version support keeping a session, or does it only generate answers for new prompts? If not, I still think this is a good step forward.

A rolling KV cache is a planned feature.

DifferentialityDevelopment commented 1 month ago

I haven't implemented session keeping yet, so each request recomputes the prompt for the whole conversation. But I think I get where you're coming from: before any tokens are returned, it processes every token in the prompt, and only then begins generating new tokens.

Ideally, after each new chat message it wouldn't need to reprocess all the chat messages from previous requests, assuming the chat history is not altered.
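One common way to get that behaviour (not something the fork implements; just a sketch of the idea) is to compare the new request's prompt tokens against the tokens already in the KV cache and only evaluate the new suffix:

```cpp
// Sketch of prompt-prefix reuse: if the cached tokens are a prefix of the new
// prompt, only the remaining suffix needs to be evaluated. Purely illustrative.
#include <cstddef>
#include <vector>

// Returns the index of the first new-prompt token that still has to be processed.
std::size_t reusablePrefixLength(const std::vector<int> &cachedTokens,
                                 const std::vector<int> &promptTokens) {
    std::size_t n = 0;
    while (n < cachedTokens.size() && n < promptTokens.size() &&
           cachedTokens[n] == promptTokens[n]) {
        n++;
    }
    return n; // tokens [0, n) are already in the KV cache
}
```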

b4rtaz commented 1 month ago

Version 0.6.0 is released with the API. I'm closing this issue now. Any further improvement/fix should be discussed in a new thread.