ggerganov / llama.cpp

LLM inference in C/C++
MIT License

any interest in the openchatkit on a power book? #96

Closed ninjamoba closed 1 year ago

ninjamoba commented 1 year ago

https://www.together.xyz/blog/openchatkit - this new repository might also be a good candidate, as any local deployment currently needs a strong GPU: the GPT-NeoX focus is on GPU deployments.

v3ss0n commented 1 year ago

GPT-NeoX is not as good as LLaMA.

sburnicki commented 1 year ago

@v3ss0n I'm not sure it's as simple as that.

The fine-tuned GPT-NeoXT-Chat-Base-20B model that is part of OpenChatKit is focused on conversational interactions, and they are working on fine-tuning it for other use cases.

Testing their model on Hugging Face looks quite promising.

Also, it's Apache 2.0 licensed, and there is a lot to the project beyond just running inference on the model, such as moderation and concepts for a retrieval system.

So being able to run inference with their model(s) using ggml on commodity hardware would be interesting, in my opinion.
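
For a sense of what that involves, here is a minimal sketch of running their chat model through Hugging Face transformers on a machine with enough memory; the model ID, the <human>/<bot> prompt format, and the generation settings are assumptions on my part, and a ggml port would need its own conversion and quantization steps:

# Rough sketch (not ggml): running OpenChatKit's chat model with Hugging Face transformers.
# The model ID and the <human>/<bot> prompt format are assumptions based on their announcement.
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "togethercomputer/GPT-NeoXT-Chat-Base-20B"  # assumed HF model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # 20B parameters, so this needs a lot of RAM/VRAM
prompt = "<human>: What is GPT-NeoX?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))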

v3ss0n commented 1 year ago

I see, I can see their benefit. So far the LLaMA version is quite bad at code generation, otherwise quite good. Gonna try their HF.

gjmulder commented 1 year ago

So far the LLaMA version is quite bad at code generation, otherwise quite good.

You might want to read the original paper, LLaMA: Open and Efficient Foundation Language Models, if you want to understand why this is. LLaMA was trained similarly to GPT-3 and not fine-tuned on specific tasks such as code generation or chat.

v3ss0n commented 1 year ago

Would mixing this with code-tuned models like CodeGen give it ChatGPT-like performance? Also, we need a CodeGen.cpp.

This repo needs a discussion tab opened.

Ayushk4 commented 1 year ago

OpenChatKit and Open-Assistant models are supported in Cformers, along with 10 other models.

You can now interface with the models in just 3 lines of Python.

from interface import AutoInference as AI
ai = AI('OpenAssistant/oasst-sft-1-pythia-12b')
x = ai.generate("<|prompter|>What's the Earth total population<|endoftext|><|assistant|>", num_tokens_to_generate=100)  # token budget; the value here is illustrative
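
The exact fields of the returned value may differ by version, so treat this last line as illustrative (the key name is an assumption; check the Cformers README for the actual output format):

print(x['token_str'])  # assumed key holding the generated text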

Generation speed is competitive with llama.cpp (75 ms/token for a 12B model on my MacBook Pro).

v3ss0n commented 1 year ago

I hope to see these two frameworks collaborate instead of compete, because together we are the only fighting chance against big tech.

Ayushk4 commented 1 year ago

Cformers and llama.cpp are efforts on two orthogonal fronts to enable fast inference of SoTA AI models on CPU. We are not competing. Both are necessary, and we would love to collaborate - as we have already indicated multiple times. If you go through our Cformers README, we identified three fronts:

  1. Fast C/C++ LLM inference kernels: this is what llama.cpp and GGML are for -- we use these libraries as the backend. So far we use a copy-pasted version of the files; we plan to switch over to llama.cpp or GGML as a git-submodule dependency.

In order to avoid redundant effort, we switched away from our original int4 LLaMA implementation in C++ to llama.cpp, even though ours had similar speed on CPU and was released before the first commit to llama.cpp was even made!

  2. Easy-to-use API for fast AI inference in a dynamically typed language like Python - this is what Cformers is for.

  3. Machine Learning Research & Exploration front - this is what our other efforts at nolano.org are for, like int3 quantization, sparsifying models, and training an open-source LLaMA-equivalent model - the latter two will soon be shared.

Eventually, we will move away from the LLaMA/Alpaca models because of their restrictive licensing, though we will keep support for them in Cformers.

v3ss0n commented 1 year ago

Would be better if the two communities were able to merge.