anon998 / simple-proxy-for-tavern

GNU Affero General Public License v3.0

Support for llama.cpp API Server #8

Closed digiwombat closed 1 year ago

digiwombat commented 1 year ago

Firstly, thanks for your hard work on this project so far! llama.cpp has added an API that can be built into an .exe. Support for it would be much appreciated.

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints

Endpoints are pretty simple. It has some nodejs examples. Seems pretty cozy. I'd slap together a PR, but I am swamped at the moment.

anon998 commented 1 year ago

I added it with this commit: https://github.com/anon998/simple-proxy-for-tavern/commit/2661b7e24866642b24babc614b9c280a2fc7e816

But it seems buggy to me. I get a lot of segmentation faults on Linux. And regeneration/swipes seem to continue the previous output. Same thing with pressing stop. It's like the state is not reset properly.

digiwombat commented 1 year ago

Thanks for slapping this together. You're a champion. I did some testing and wiggled some settings around. There are a number of undocumented params you can pass on generate (all the gpt_params settings work), and for anyone who comes around to do testing, the problem child for basic gens was repeat_last_n, which was set to 64 tokens.

Setting repeat_last_n: 1028 in the config clears the problem up entirely, as far as I can tell, though I was only testing against the default preset.

Here are the undocumented parameters for anyone tinkering:
https://github.com/ggerganov/llama.cpp/blob/master/examples/common.h#L23
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/server.cpp#L138

As to the broader problem, the server is definitely running it like a console app and just appending context instead of reprocessing or doing anything clever. A few questions and responses quickly run into the "Context too long" error.

I added a disgusting hack to the server.cpp file to make it work consistently, though it does force reprocessing the entire prompt every time. Basically at the top of bool LoadPrompt() I added:

// Keep only the first element (presumably the BOS token) so the next request
// starts from a clean slate instead of continuing the previous context.
if(processed_tokens.size() != 0)
{
    processed_tokens.erase(processed_tokens.begin() + 1, processed_tokens.end());
}
if(embd_inp.size() != 0)
{
    embd_inp.erase(embd_inp.begin() + 1, embd_inp.end());
}

// Reset the counters so the whole prompt is reprocessed from scratch.
n_remain = 0;
n_past = 0;
n_consumed = 0;

That'll get it working for any testing, but it is a cudgel instead of a scalpel. I will maybe spend a little time seeing what I can do to cut down on reprocessing. I don't think the proxy will require any more changes, so I'll just close this and comment if I come across anything good.

Edit: I think the problem is that it's adding the calculated tokens onto the context internally rather than just taking the context it's been given. I haven't investigated the code enough yet, but stopping that could be the solution.

Also, the parameters aren't being set, but they can easily be added to the parse_options_completion function.
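To illustrate what I mean (just a sketch of the idea, assuming the nlohmann/json body parsing the server already uses; the function signature and llama.params names are approximate, not the exact code from the gist):

static void parse_options_completion(const nlohmann::json &body, llama_server_context &llama)
{
    // Copy a few of the undocumented gpt_params fields out of the /completion JSON body.
    // The field names match gpt_params in examples/common.h; the rest is illustrative.
    if (body.count("top_k")) {
        llama.params.top_k = body["top_k"].get<int32_t>();
    }
    if (body.count("repeat_last_n")) {
        llama.params.repeat_last_n = body["repeat_last_n"].get<int32_t>();
    }
    if (body.count("repeat_penalty")) {
        llama.params.repeat_penalty = body["repeat_penalty"].get<float>();
    }
}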

Edit 2: Here's a gist with the changes I mentioned. Drop it in and build and proxy will work properly and have access to the expanded set of parameters. https://gist.github.com/digiwombat/57c43ac76d01f6735f686678804c070a

digiwombat commented 1 year ago

Apologies, Anon. Just going to re-open this for information's sake since it's still pretty snappy even with the reprocessing. Feel free to close it if you want.

digiwombat commented 1 year ago

I cleaned this up a little and made a PR for llama.cpp over here:

https://github.com/ggerganov/llama.cpp/pull/1570

That PR will let you add reload_ctx: true to your llamaCppSettings config line to force it to regenerate the context at will. It also includes the extra parameters. Should basically be true at all times for proxy usage, I think. If that gets merged, I'll close this.

anon998 commented 1 year ago

I've added them now, since it doesn't really throw an error if a parameter isn't used. Some of the kobold names are translated to these; the others could be added to the generation presets or to llamaCppSettings.

In the code, some of the types don't match the ones in the struct; for example, penalize_nl is a bool but is parsed as a float. I'm not sure if it matters.

digiwombat commented 1 year ago

Whoops, good catch. Fixed that. Being a lazy copy-paste goblin got the better of me.

anon998 commented 1 year ago

I have a couple of comments.

Top_k is not used; I think a llama_sample_top_k(ctx, &candidates_p, top_k, 1) call is missing around this line: https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L210
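Roughly where I'd expect it to slot in, going by how the other samplers are chained (illustrative only, the surrounding calls are from memory and not the exact server code):

// add a top_k pass before the other sampling calls so the parameter takes effect
llama_sample_top_k(ctx, &candidates_p, top_k, 1);
llama_sample_tail_free(ctx, &candidates_p, tfs_z, 1);
llama_sample_typical(ctx, &candidates_p, typical_p, 1);
llama_sample_top_p(ctx, &candidates_p, top_p, 1);
llama_sample_temperature(ctx, &candidates_p, temp);
const llama_token id = llama_sample_token(ctx, &candidates_p);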

The part about replacing the eos with a newline and the first stopping string is kind of weird. For the proxy it will replace it with "\n\n##"; I would prefer if it was an option instead, although it doesn't make a lot of sense to me: https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L222-L231 An option to set "params.logit_bias[llama_token_eos()] = -INFINITY;" would be cool, so it doesn't generate the eos token at all.
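Something like this is what I have in mind (the "ignore_eos" field name is made up here, just to show the shape of the option):

// Opt-in switch in the /completion body to ban the eos token entirely.
// "ignore_eos" is only an example name, not an existing server parameter.
if (body.count("ignore_eos") && body["ignore_eos"].get<bool>()) {
    llama.params.logit_bias[llama_token_eos()] = -INFINITY; // sampling can then never pick eos
}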

I'm not sure why it sets result = id in a loop; I would replace the whole thing with just "result = id;". https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L235-L238

The llama_token_to_str function can return nullptr and it will crash the server when it does: https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L2957-L2960 I think these need a null check: https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L353 https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L266 Although I'm not sure why it returns null sometimes.
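The kind of check I mean is just something like this (variable names are from memory, not the exact server code):

// Skip the token instead of crashing if llama_token_to_str returns nothing.
const char *token_str = llama_token_to_str(ctx, id);
if (token_str != nullptr) {
    generated_text += token_str;
}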

This sets has_next_token to false when it ends with the eos, but then afterwards it overwrites the value of that variable anyway, so it has no effect. I would replace "has_next_token = n_remain != 0;" with "if (has_next_token) { has_next_token = n_remain != 0; }". https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L285-L293

This resets n_remain and makes it generate forever; I think the check was supposed to be == -1 instead of !=: https://github.com/ggerganov/llama.cpp/blob/0df7d63e5ba0ab8856476e121a03b985d6f15c9d/examples/server/server.cpp#L289-L292

I would also call "generated_text.reserve(params.n_ctx);" somewhere so that the memory is reserved up front.

There's another problem: with a prompt that ends with "ASSOCIATE:" and a stopping string with the same value, the stop seems to be applied immediately, unlike most other backends. There should be a check so it only applies after the first character has been generated.
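Something along these lines would do it (a standalone sketch of the idea, not the server's actual stop-string handling):

#include <string>
#include <vector>

// Only treat a stopping string as a match once at least one character has been
// generated, so a prompt ending in "ASSOCIATE:" with that same stop string
// doesn't end the completion before it starts.
static bool should_stop(const std::string &generated_text,
                        const std::vector<std::string> &stop_strings) {
    if (generated_text.empty()) {
        return false; // nothing generated yet, too early to stop
    }
    for (const auto &stop : stop_strings) {
        if (generated_text.size() >= stop.size() &&
            generated_text.compare(generated_text.size() - stop.size(),
                                   stop.size(), stop) == 0) {
            return true;
        }
    }
    return false;
}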

Later, in the proxy, n_predict should be set to genParams.max_length instead of what's there now.

digiwombat commented 1 year ago

I'm about to eat dinner, but I'll take a look at these afterwards and test them out. Thanks for actually reading through the code. I also forgot to change reload_ctx to a bool, so it's sort of impressive that worked. I guess the get<> stuff doesn't care too much.

digiwombat commented 1 year ago

Okay, looks like they're doing a refactor on the server that will hopefully take care of all these concerns. And they added the server to the build actions earlier as well, so it will be easy to point people at it once the improved version is in. 👍🏻

FSSRepo commented 1 year ago

Thank you very much for improving the code. I apologize for my bad practices in C++. I'm not very good at that language.

digiwombat commented 1 year ago

No worries and thanks to @anon998 for looking through the code more thoroughly. Thanks for putting the server together and getting the ball rolling in the first place. There's a lot of value in that. And since it's in the default build actions now, I think it will help a lot of people get native speeds.

LiliumSancta commented 1 year ago

I'm testing the generation and it seems to have a problem with the generated size, and I believe the best place to fix it would be here. There is something strange with the size sent to llama.cpp in n_predict: the value being sent in my test is "n_predict":1014 even though I set maxNewTokens: 40. I know it must be due to today's changes. stoppingStrings don't seem to be working either, but I still need to check that better; maybe it's something in llama.cpp.

digiwombat commented 1 year ago

He is setting it like this in backend.mjs:

n_predict: Math.min(
  config.promptTokenCount + 1 + genParams.max_length,
  config.maxContextLength
),

Edit: On the latest commit that line is: n_predict: genParams.max_length,

So I'm pretty sure n_predict/maxTokens isn't being used at all. And yeah, I'm not sure about the stopping strings; that could be a llama.cpp issue. There's certainly code in the server to catch llama_token_eos(), it just doesn't seem to want to generate one. I don't know if that's a generation settings issue or what.

LiliumSancta commented 1 year ago

Yes, if you manually set llama.params.n_predict = body["n_predict"].get<int>(); it will respect the maximum size. As for stoppingStrings, I'm going to test with another model/prompt, because I'm using guanaco with a template I created and there may be something wrong with the stop word. I'll try some things later; right now I've had a few drinks and I'm prone to making some mistakes 😅. Anyway, good job, this makes using frontends with llama.cpp a lot easier without relying on python.

Edit: I'll test on the latest version later; I realized I was still using the old one...

digiwombat commented 1 year ago

Deleted what I was typing. Looks like anon might have caught the problem. We'll see.

Edit: Confirmed that's what the problem was. It's all working nicely now.

Edit Edit: Also, that new per-character write-out is luxurious. I feel so fancy.

Edit Edit Edit: Regens are broken again. Weird outputs.

digiwombat commented 1 year ago

The llama.cpp server seems to be working as expected now, so I guess we can close this. ( ͡° ͜ʖ ͡°)

Thanks @anon998 for all the contributions. Apologies for dragging you into what turned out to be a much larger project than expected. Haha.