Closed DifferentialityDevelopment closed 1 month ago
Hello @DifferentialityDevelopment,
yeah, more hands are welcome.
ability to stop on encountering specific strings
This could be easy to implement. Maybe you could add an extra CLI parameter, like --eos-id 123, that overrides the value from the tokenizer file.
chat template integration, and for that matter an openai chat completion api endpoint
This would be very good. But Distributed Llama still doesn't support KV cache recalculation after inference reaches seqLen tokens. That can't be implemented right now, because there are some more important changes to make first. I think in the next few days I'll add a roadmap for this project.
I've already made some good progress on an API endpoint. I was thinking it might be a good idea, instead of building it into main, to rather have it separate as server.cpp. Sometime in the next week I'm hoping to have it ready; I'm aiming for the chat completions endpoint to work in this format: https://platform.openai.com/docs/api-reference/chat/create I'm almost certain I'll need your help on refactoring it though, as I've still got tons to learn about C++; I mostly program in C#. The only external dependency I'm looking at using for it right now is this: https://github.com/nlohmann/json
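For reference, a request body for that endpoint looks roughly like this (a trimmed-down sketch of the OpenAI chat completions format; the model name is illustrative, and fields beyond model and messages are optional):

```json
{
  "model": "llama3-8b-instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "stream": false
}
```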
Great! I would prefer to not use external dependencies; you can attach the JSON library to the source code, as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile (make xyz).
Also I think this server should be excluded from the main.cpp file. You could put the main code in the src/apps/api-server.cpp file.
Great! I would prefer to not use external dependencies; you can attach the JSON library to the source code, as is done in llama.cpp. The goal is to keep Distributed Llama simple to compile (make xyz).
It's only a header file, i.e. json.hpp, which makes it super easy to integrate and allows working with JSON data without bloating the source code with custom parsing code. It was necessary for the API endpoint, as most communication to/from the server is via JSON.
Also I think this server should be excluded from the main.cpp file. You could put the main code in the src/apps/api-server.cpp file.
Yes, this is what I've done, though it's at src/server.cpp.
I've already got it mostly working. The non-streaming endpoint is basically done; I just have to finish the streaming endpoint and it should be good to go. I've only made a few small changes to the other files, for instance to the sampler, so that the temperature or seed can be set dynamically based on the API request being processed.
By the way, I've published a distributed version of Llama 3 8B Instruct to Hugging Face: https://huggingface.co/Azamorn/Meta-Llama-3-8B-Instruct-Distributed
That model is ideal for testing the chat completion endpoint.
I've got it responding back now, super stoked I finally got it working!
And another one: I've got stop strings integrated too.
You can check out my changes here https://github.com/DifferentialityDevelopment/distributed-llama
Still busy with it though
The API is basically done; streaming and non-streaming modes work too. I tested both and they seem to work fine.
Do you think it's ready for a pull request?
Nice job! I'll review your changes after you create a new PR.
Btw: what is the behaviour when the generating position reaches the KV cache max position?
Btw: what is the behaviour when the generating position reaches the KV cache max position?
To be honest, I have not tested that. Do you mean when it reaches the max sequence length? From what I can tell, it should just force-end the chat completion. Normally the chat completion can end when a stop word is detected or an EOS token is generated, but if not, it will run until it reaches the maximum sequence length specified in TransformerSpec->seqLen.
I mean, if the chat completion supported keeping a session (id), then the KV cache limit could become a problem very quickly, because after a short conversation the limit would be reached.
Does the current version support keeping a session, or does it only generate answers for new prompts? If not, I still think this is a good step forward.
The rolling kv cache is a planned feature.
I haven't implemented it to keep a session yet, so each request recomputes the prompt for the whole conversation. But I think I get where you're coming from: before any tokens are returned, it processes each token in the prompt before it begins generating new tokens.
Ideally, after each new chat message it wouldn't need to reprocess all the chat messages from previous requests, assuming the chat history is not altered.
Hi there
Amazing project, by the way! It has given me hope of being able to run really big models; specifically, I'm very excited about the upcoming 400B Llama model coming in the following months, and I've been planning how to run it locally. Anyway, I'm going off topic here. I'd really like to help out where I can, and while it's undocumented in the readme, I see there is a simple server capability built into the distributed-llama main program.
I was wondering if I could assist in fleshing it out. There are a few things off the top of my head that could be needed in an API for distributed-llama, in no particular order:
I'm fairly new to C++, but I'd be happy to work with you on this.