ggerganov / llama.cpp

LLM inference in C/C++
MIT License

How to use ggml for Flan-T5 #247

Closed i-am-neo closed 6 months ago

i-am-neo commented 1 year ago

@ggerganov Thanks for sharing llama.cpp. As usual, great work.

Question rather than issue. How difficult would it be to make ggml.c work for a Flan checkpoint, like T5-xl/UL2, then quantized?

Would love to be able to have those models run on a browser, much like what you did with whisper.cpp wasm.

Thanks again. (I can move this post somewhere else if you prefer since it's not technically about Llama. Just let me know where.)

alexconstant9108 commented 1 year ago

@i-am-neo just curious, have you found T5-xl/UL2 to be in any way superior to FB's llama models (even the smallest 7b model) for certain tasks? IIRC T5-xxl used to be the best open source model before llama, but now...?

michaelbogdan commented 1 year ago

Since LLaMA wasn't published under a FOSS license and explicitly does not allow commercial use, it shouldn't be considered an open source model.

i-am-neo commented 1 year ago

> @i-am-neo just curious, have you found T5-xl/UL2 to be in any way superior to FB's llama models (even the smallest 7b model) for certain tasks? IIRC T5-xxl used to be the best open source model before llama, but now...?

@alexconstant9108 I have found Flan-T5 performant when one needs accurate answers to questions (no inventions allowed). This is from real-life data, details disguised for privacy. Flan-T5 was used in its recipe.

Flan-UL2 looks to be more "fluent"/expressive than Flan-T5, but I've just started to look.

I won't comment on text generation nor "general knowledge queries" since I don't use these models to write novels, nor do I care if they know about Batman and Robin.

What I found beautiful about Stanford Alpaca was that a 7B autoregressive model can be coaxed to follow instructions. (The Llama paper mentions "Llama-I", which compared favorably to other instruction-tuned models at around 65B parameters.) As to how well Alpaca does at 7B, we'll have to see (I've asked the Stanford team to run stats and hopefully they're willing to respond).

What I find beautiful about the idea of llama.cpp or the possibility of a Flan-T5.cpp is running a model locally, hopefully at some point on the browser. If one could shrink Flan-UL2 down to xB parameters and it still performs well... (Leche Flan anyone?).

I'm of the thinking that a sequence-to-sequence model and an autoregressive model are better harnessed for the best capabilities of each, depending on what one needs to accomplish, rather than pitted against each other in a one-on-one comparison.

In the end, as our field evolves, we all need to ask the question touched on in the Llama paper, italics mine - "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by ..."

alexconstant9108 commented 1 year ago

@i-am-neo Very interesting. So in that setup, did you fine-tune Flan-T5 on your own dataset so that it can answer those questions accurately? Also, can you give any example question/answer pairs (with any sensitive data masked for privacy reasons, of course)?

There is an alpaca quantized version available for download somewhere on github if you look close enough :)

BTW, regarding running flan-t5 (and llama) locally (not in the browser, though), you will be very excited to try https://bellard.org/ts_server/ . It supports both CPU and GPU inference.

MarkSchmidty commented 1 year ago

Since people in this thread are interested in Instruct models, I recommend checking out chatGLM-6B.

I believe it is more capable than Flan-UL2 in just 6B parameters. I have a one-click web demo of the 4bit GPU version here: Launch In Colab

It's not getting much attention due to being the product of a Chinese University. But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

Full repo is here: https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md

alexconstant9108 commented 1 year ago

@MarkSchmidty useful reference, thanks. From what I have observed, GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it... If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular instruct model.

MarkSchmidty commented 1 year ago

> @MarkSchmidty useful reference, thanks. From what I have observed, GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...

I have not had this experience. It seems excellent for English.
It is mostly ignored because the team behind it does very little to market it outside of China or to non-Chinese speakers.

> If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular instruct model
It takes 30 seconds to launch and test my free web demo linked above.

alexconstant9108 commented 1 year ago

> But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.

BTW, the readme says:

| Quantization Level | GPU Memory |
| --- | --- |
| FP16 (no quantization) | 13 GB |
| INT8 | 10 GB |
| INT4 | 6 GB |

How did you manage to shrink it to 4GB? Also, is it using GPTQ or RTN quantization for 4-bit? Edit: I saw that in another place the authors mention 5.2GB of CPU memory and GPTQ quantization. It would help if their docs had more English examples. Right now, it seems that they are targeting mostly the Chinese audience, which is in line with what you said regarding how they "market" it...

MarkSchmidty commented 1 year ago

That is the GPU memory required to run inference, not the model size.

The official int4 model is 4.06GB on HuggingFace before any pruning.

> It would help if their docs had more English examples. Right now, it seems that they are targeting mostly the Chinese audience, which is in line with what you said regarding how they "market" it...

Yes, that's why I'm trying to "market" it to English speakers who want instruct models right now. If more people dig into it, then we can collectively work out answers to questions like these. I actually have an all-English fork in the works which I will publish soon.

alexconstant9108 commented 1 year ago

> I actually have an all-English fork in the works which I will publish soon.

Great! You seem to be moving really fast! They only published the 4-bit model 2 days ago :) Hopefully, we'll also see support for GLM-6B within llama.cpp, or end up with a glm.cpp fork :) One more thing - please also include more info about what datasets the model has been trained on (and RLHF fine-tuned with) - was it similar to LLaMA, e.g. the Pile, etc.? Usually, with software / models coming out of China, people are wary of any potential (ab)use of party propaganda materials, censorship and so on.

i-am-neo commented 1 year ago

> @i-am-neo Very interesting. So in that setup, did you fine-tune Flan-T5 on your own dataset so that it can answer those questions accurately? Also, can you give any example question/answer pairs (with any sensitive data masked for privacy reasons, of course)?
>
> There is an alpaca quantized version available for download somewhere on github if you look close enough :)
>
> BTW, regarding running flan-t5 (and llama) locally (not in the browser, though), you will be very excited to try https://bellard.org/ts_server/ . It supports both CPU and GPU inference.

@alexconstant9108 Thanks for sharing about bellard.org. Very good to know!

Re Flan-T5, no special fine-tuning was done in this case. Unfortunately, I can't expose question-answer pairs, as that would violate user privacy (plus it takes a lot of work to disguise the details). I think you would feel similarly if it were your data. You could, if so inclined, try it for yourself - though hopefully because you have a real use case rather than just wanting to kick the tires (limited resources). (https://github.com/ggerganov/whisper.cpp/discussions/625)

And yes, I saw those Alpaca weights floating around, thanks. :)

i-am-neo commented 1 year ago

> @MarkSchmidty useful reference, thanks. From what I have observed, GLM is mostly ignored due to it being weaker with English prompts. But it may turn out to be better when conversing in Chinese for those who can speak it...
>
> I have not had this experience. It seems excellent for English. It is mostly ignored because the team behind it does very little to market it outside of China or to non-Chinese speakers.
>
> If you could provide a small comparison table of English prompts when using Alpaca, Flan-UL2 and GLM-6B, that would open the eyes of many people to that particular instruct model. It takes 30 seconds to launch and test my free web demo linked above.

@MarkSchmidty Thanks for sharing. I was excited enough to try it right away (both using your notebook and another gradio demo off HF). The model doesn't seem as adept at answering questions drawn from context I give it. Maybe it's the way I gave prompts? In your tests, how did you ask it to answer from specific text?

i-am-neo commented 1 year ago

@MarkSchmidty But! chatglm seems quite fluent. Great to see. ps. You probably also know this - it can't be used for commercial purposes.

Prompt:

Write a paragraph about the most beautiful sunset you have ever seen.

Response:

The most beautiful sunset I have ever seen was in a remote corner of the world.� It was a moment of pure magic and wonder, a vision of beauty and tranquility. The sky above was a rich, vibrant pink and orange, a vibrant shade that could be seen even from a great distance. The sun was setting over the horizon, casting long shadows across the landscape, and the sound of the wind rustling through the leaves could be heard clearly.逄

The setting sun was the perfect contrast to the rest of the day.

TokenBender commented 1 year ago

> Since people in this thread are interested in Instruct models, I recommend checking out chatGLM-6B.
>
> I believe it is more capable than Flan-UL2 in just 6B parameters. I have a one-click web demo of the 4bit GPU version here: Launch In Colab
>
> It's not getting much attention due to being the product of a Chinese University. But in my testing it outperforms every other open source instruct model while coming in at just 4GB, even smaller than Alpaca-7B.
>
> Full repo is here: https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md

Hi,

I tried to run your Colab demo locally on an M2 Pro and it failed with the error below:

      File "/opt/anaconda3/lib/python3.9/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
        nvrtc = Lib("nvrtc")
      File "/opt/anaconda3/lib/python3.9/site-packages/cpm_kernels/library/base.py", line 59, in __init__
        raise RuntimeError("Unknown platform: %s" % sys.platform)
    RuntimeError: Unknown platform: darwin

Considering the RAM requirements are so low, I supposed it could work on the M2 directly, but it looks like there is still something that needs to be changed for it?

MarkSchmidty commented 1 year ago

The Colab demo is meant to run on a free Google Colab GPU, not on a local runtime (and definitely not on CPU).

If you want to run chatGLM on a local CPU you should follow the instructions in the official chatGLM repository for running on CPU. It is very fast on GPU but very slow on CPU currently.
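
For reference, the CPU route in the official repo boils down to loading the model in float32 instead of a quantized/half-precision variant - roughly the following sketch, based on the pattern in the README (the `chat()` helper comes from the model's bundled remote code, and CPU inference reportedly needs on the order of 32 GB of RAM; check the repo for the current instructions):

```python
from transformers import AutoTokenizer, AutoModel

# Load ChatGLM-6B for CPU inference (float32); the int4/int8 variants are separate checkpoints.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
model = model.eval()

# chat() is provided by the model's custom code pulled in via trust_remote_code.
response, history = model.chat(tokenizer, "Write a short haiku about gears.", history=[])
print(response)
```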

iliemihai92 commented 1 year ago

To be clear: Flan-UL2 is the best open source model trained on an instruction dataset. Other models such as Llama or ChatGLM are only open for research. ChatGLM may be worse at instruction following due to its Chinese-heavy vocabulary.

MarkSchmidty commented 1 year ago

I get awful results with Flan-UL2. Its responses tend to be extremely short and it hallucinates more than most models when it doesn't know something. I have had no issues with chatGLM's English abilities.

But for fully open source models, I have had good results with the newer OpenAssistant SFT-1: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b (Note: the HuggingFace inference API version is broken and returns results nothing like what the actual model does in practice.)

i-am-neo commented 1 year ago

Maybe we can all share the prompts we're talking about so our baselines for comparison aren't moving about. I have a feeling some folks here are trying models for text generation vs. question answering, which are different use cases...

MarkSchmidty commented 1 year ago

Here's an example of a question to Flan-UL2 where it is both wrong and characteristically short, even when asked to explain. (Gears 1 and 6 spin in opposite directions, as they are odd- and even-numbered gears respectively.) This shortness is highly typical of my experiences with UL2. (screenshot)

ghost commented 1 year ago

IME text-generating models (even ones that explicitly say they don't, like Salesforce's CodeGen models) work fine as question/answer models if you use few-shot learning (prepend your real question with a couple of rounds of fake conversation). According to OpenAI's InstructGPT paper, the major improvement from their RLHF was that this wasn't necessary.
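
Roughly, the few-shot prompt just looks like this (illustrative; the exact format and stop string depend on the model):

```python
# A couple of made-up Q/A rounds teach a plain completion model the format
# before the real question; stop generation at the next "Q:".
few_shot_prompt = """Q: What is the capital of France?
A: Paris.

Q: How many legs does a spider have?
A: Eight.

Q: What gas do plants absorb from the atmosphere?
A:"""
```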

i-am-neo commented 1 year ago

@MarkSchmidty Since I don't know much about gears... Also I believe the Flan models activate their "explain" reasoning via "Let's think step by step."

(screenshot)
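
For reference, that prompt pattern with the transformers API looks roughly like this (a sketch; the model ID and wording are just examples):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# Appending "Let's think step by step." nudges Flan models to spell out
# their reasoning instead of answering in a couple of words.
prompt = (
    "Gears 1 through 6 are meshed in a row. If gear 1 spins clockwise, "
    "which way does gear 6 spin? Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```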

av commented 1 year ago

Hey, @i-am-neo 👋🏻 Sorry to bother you, but I'm curious whether you've had a chance to discover anything interesting in relation to converting the T5 family of models to ggml since the last activity in this thread. Thank you!

i-am-neo commented 1 year ago

Hi @av! I haven't yet. Want to share something you're working on?

av commented 1 year ago

> Hi @av! I haven't yet. Want to share something you're working on?

Nothing too specific, but I'm looking for a way to reduce runtime costs for the T5 family, with the same context as your message above:

> I have found Flan-T5 performant when one needs accurate answers to questions (no inventions allowed). This is from real-life data, details disguised for privacy. Flan-T5 was used in its recipe.

Tried to follow the ggml examples, only to later discover that there's an open branch for T5 integration. Unfortunately, finishing it is out of reach for my level of C++ (and understanding of inference mechanics in LLMs, to be fair).

Not sure if the following is of any interest to you, but I also tried to run it with ONNX (only a minor boost for small batches, worse than PyTorch), TensorRT (the setup process is quite demanding and, unfortunately, completely excludes CPU inference), and CTranslate2 (a better boost than ONNX for CPU inference). Currently leaning towards the Flan-T5 Alpaca robo-camelid. There's also an Alpaca fine-tuned UL2 version, but I haven't tried it yet.
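
In case it's useful, the CTranslate2 route looked roughly like this for me (a sketch; the model name, output directory and int8 option are just examples - the converter CLI ships with the ctranslate2 package):

```python
# One-time conversion on the command line:
#   ct2-transformers-converter --model google/flan-t5-xl --output_dir flan-t5-xl-ct2 --quantization int8

import ctranslate2
import transformers

translator = ctranslate2.Translator("flan-t5-xl-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xl")

prompt = "Answer the question based on the context. Context: ... Question: ..."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```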

i-am-neo commented 1 year ago

@av Thanks for sharing.

> Nothing too specific, but I'm looking for a way to reduce runtime costs for the T5 family, with the same context as your message above:

It may be some time before we can run a good LLM in the browser - even whisper.wasm can only run its small model (244M parameters) at 1 GB of memory. Or is your goal to reduce inference time?

I am, however, able to run Flan-T5 inference without a GPU. I'm not using quantization, as I've seen odd results with LLM.int8().
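
(For context, "without a GPU" here just means plain float32 inference through transformers - roughly the following sketch, with the xl checkpoint as an example.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Plain float32 CPU inference, no quantization.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

prompt = (
    "Answer based only on the context.\n"
    "Context: The meeting was moved from Tuesday to Thursday at 3pm.\n"
    "Question: When is the meeting?"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```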

I'm not sold on the results of Flan-T5 finetuned on Alpaca data. With declare-lab/flan-alpaca-xl:

(screenshot: declare-lab/flan-alpaca-xl output)

With quantized 8-bit https://huggingface.co/spaces/joaogante/transformers_streaming:

(screenshot: 8-bit quantized output)

Plain vanilla Flan-T5-xl:

(screenshot: plain Flan-T5-xl output)

ghost commented 1 year ago

I am also restricted by the license, and prefer truly open sourced models like flan-t5.

In my case, I'm particularly interested in question answering with context. I find flan-t5 gives extractive answers, probably because the prompt triggers its instruction tuning on the SQuAD dataset. I am looking for more abstractive answers.

@av @i-am-neo Do you guys have any suggestions? Thank you.

i-am-neo commented 1 year ago

@jasontian6666 say more about what you mean by "question answering with context." Context from where? your own data? from the web? Maybe share some examples.

ghost commented 1 year ago

> @jasontian6666 say more about what you mean by "question answering with context." Context from where? your own data? from the web? Maybe share some examples.

Just typical open-book QA. You provide a question and context, and the model generates an answer.

i-am-neo commented 1 year ago

I hear you @jasontian6666. I don't quite grasp what you meant by "with context" - just guessing, but maybe you want the answers paraphrased?

If so, try Flan-UL2. It's more fluent than the Flan-T5 series.
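
Something like this prompt shape might also help (illustrative only - wording matters a lot with the Flan models):

```python
# Asking for an answer "in your own words" tends to pull Flan-style models
# away from verbatim span extraction toward a paraphrased answer.
context = "The warranty covers manufacturing defects for 24 months from the date of purchase."
question = "How long is the warranty?"

extractive_prompt = (
    f"Answer the question based on the context.\nContext: {context}\nQuestion: {question}"
)
abstractive_prompt = (
    "Read the context and answer the question in your own words, in a full sentence.\n"
    f"Context: {context}\nQuestion: {question}"
)
```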

turian commented 10 months ago

I am also curious if there are plans to include flan-t5 in llama.cpp

ggerganov commented 10 months ago

Definitely, though it is not high prio atm, so will hope for a community contribution in the meantime

jorirsan commented 8 months ago

Not T5 specifically, but more support for encoder-decoder models in general in llama.cpp would be amazing. From what I've seen, Enc-Dec support in most optimized inference projects is almost non-existent. That's a real shame, since Enc-Dec models currently tend to offer the best performance on tasks such as Machine Translation and Automatic Speech Recognition.

maziyarpanahi commented 8 months ago

Speaking of T5 support, the Aya model is based on mT5 and it's trending now. It's 13B, so it does require quantization.

It would be great if we could support it, now that the T5/mT5 architectures are interesting again for text generation (they were before, but now they're trending next to Llama-2 and Mistral): https://huggingface.co/CohereForAI/aya-101/discussions/9
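
Until there's native support, the usual transformers route for it is roughly the following (a sketch; `load_in_8bit` needs bitsandbytes and a CUDA GPU, and the model ID should be double-checked against the Hub):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# aya-101 is an mT5-style seq2seq model (~13B parameters).
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-101")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "CohereForAI/aya-101",
    load_in_8bit=True,   # requires bitsandbytes; roughly halves memory vs fp16
    device_map="auto",
)

inputs = tokenizer("Translate to English: Bonjour, comment ça va ?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```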

ggerganov commented 8 months ago

In terms of API, how do these models work?

I imagine something like:

prompt_enc = llama_encode(prompt_text);

while (true) {
    token = llama_decode(prompt_enc);

    ? update prompt_enc with sampled token ?
}

Does that make sense? It would be helpful to get a short summary of how these models differ from decoder-only models.

lukestanley commented 8 months ago

Sorry if I misunderstood your request, but encoder-decoder models perform "translations", like Whisper or Google Translate. They can also do summarisation, Q&A, and even completion, if trained for it.

Encoder-decoder models might be exposed to an end user with a similar GUI, but rather than a completion task, it's all about one specific input and one specific output. That is a separate thing, not a continuation. It's "translate" vs "autocomplete" (even though they can do completion, it's not done as a continuation).

The code you sketched seems to be a completion, which is quite different. The API would probably need to be a bit different to support these sorts of models.

Some encoder-decoder models are inherently less at risk of instruction prompt injection, and their constrained task focus makes it easier to get them running really fast on mobile devices. Both are really important features for solving problems quickly and reliably.

I imagine it would be more like this:

// (Sketch only: llama_encode / llama_decode_step are illustrative names, not existing API.)
auto prompt_enc = llama_encode(prompt_text);

// Empty string to collect output tokens
std::string result_output = "";

for (int i = 0; i < max_output_length; ++i) {
    // Each step conditions on the fixed encoder output plus what has been generated so far.
    std::string token = llama_decode_step(prompt_enc, result_output);
    result_output += token;
    if (token == END_TOKEN) {
        break;
    }
}

Encoder-decoder models in llama.cpp could be amazing. I hope this comment is useful. Much respect - I'm using llama.cpp all the time, you're a hero! @ggerganov

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.