Faster llama - Githubissues

maiquangtuan commented 1 year ago

what about adding llama to the pre-trained models list guys?

janekb04 commented 1 year ago

I tried converting the LLAMA models to the CTranslate2 format, but I got stuck at creating a proper LLamaLoader:

@register_loader("LlamaConfig")
class LlamaLoader(ModelLoader):
    ...

converters/transformers.py has some examples, but I don't know enough to be able to convert the Hugging Face implementation of Llama (Config, Tokenizer, Model, Model with something added to it) to the CTranslate2 format. As far as I understood the code, the converter is basically, a kind of reimplementation of the original model that will be run by CTranslate2's "execution engine". Supposedly, for someone who knows LLMs well, this could be relatively straightforward, as all the other conversion scripts look rather short. Also, having read the LLama paper, I saw that they don't really talk a lot about its internals. The entire description of LLama's fits in a few paragraphs while its initial implementation is 238 lines.

2.2 Architecture

Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main difference with the original architecture, and where we were found the inspiration for this change (in bracket): Pre-normalization [GPT3]. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019). SwiGLU activation function [PaLM]. We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3 4d instead of 4d as in PaLM. Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network. The details of the hyper-parameters for our different models are given in Table 2.

2.4 Efficient Implementation

We make several optimizations to improve the training speed of our models. First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library, is inspired by Rabe and Staats (2021) and uses the backward from Dao et al. (2022). This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task. To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. To fully benefit from this optimization, we need to reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also overlap the computation of activations and the communication between GPUs over the network (due toall_reduce operations) as much as possible. When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

Quote from LLaMA: Open and Efficient Foundation Language Models Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

guillaumekln commented 1 year ago

Thank you for looking into this.

The LLaMa architecture is almost supported, but we need to implement the "Rotary Embeddings" in the core library. I could try to implement that and then provide an example script for conversion.

However, it seems there would be additional work to improve the usage experience with these large models. For example, the converter currently loads the full model in memory before quantization and saving. It would be more practical to convert the model in multiple parts to reduce the memory usage.

Also I'm not sure if the 8-bit quantization would work out of the box with this model, or if it requires rescaling the activations with techniques like SmoothQuant or gptq.

janekb04 commented 1 year ago

Hi @guillaumekln, thanks for the quick response.

Regarding the usage experience with the converter, it is true that it takes a lot of memory. On my desktop, I was able to run the conversion (until the point when the loader is used) only for the 7B model, as it used 30GB of memory. Still, I think that this could be handled later, as this conversion is performed only once and could be done, for example, in the cloud. Then, the CTranslate2 weights could be redistributed.

I don't know what about quantization. I'll try to have a look. Still, there is always the option of just trying to run it and see how it works.

I joined this thread because LLama seems to be a major step forward in terms of performance and utility. Recently, a team at Stanford showed that fine-tuning LLama on an instruction dataset generated with InstructGPT makes its performance be on par with it. They call their fine tuned version Alpaca. Here is an implementation of Alpaca running locally. The GIF reportedly shows a 4-bit quantized 7B model, running in real time on a laptop CPU.

That implementation is based on the exact same framework that underlies whisper.cpp. Given that faster-whisper is 5-6x faster than whisper.cpp, running Alpaca under CTranslate2 could theoretically make it possible to:

run a GPT-3.5-like model in real-time on 1 or 2 CPU cores
run multiple instances of the model at once that could be processing some large text
run a larger model, given enough memory, that could have performance closer to PaLM

And all this in a a consumer grade notebook. This could be a major thing, as more and more consumer applications are using LLMs. It would be great for both privacy and cost-effectiveness if such a model could be run locally, offline.

I think that it would be interesting to see if inference-time memory usage could somehow be cut. For example, could weights be streamed from disk (as is done in assets in modern games)? Or maybe some sparse representation could be used? I don't know, these are just some of my ideas.

guillaumekln commented 1 year ago

The latest version (3.11.0) implemented the rotary embeddings, so we now support all components that are used by LLaMa!

I'm sharing an experimental conversion code and example in this Gist:

https://gist.github.com/guillaumekln/7ef5db5ef2e84ebaf9b005ebecf4a85a

It seems to work great for 7B even with 8-bit quantization, but the checkpoints with multiple shards produce gibberish after conversion. Probably there is something wrong in the converter but I'm not sure what at the moment. I'm sharing this now so other people can look at it and run first experiments.

guillaumekln commented 1 year ago

the checkpoints with multiple shards produce gibberish after conversion

I updated the converter to fix this issue. Some sharded weights were not correctly concatenated.

janekb04 commented 1 year ago

I'm afraid that the new script doesn't work for me. This might be caused by the fact that I´m not using the original LLama parameters, but (in theory) they should be compatible. I'm trying to convert the Vicuna 13B fine-tunning of LLama, as it has far superior performance for chat applications.

Traceback (most recent call last):
  File "llama-converter.py", line 129, in <module>
    main()
  File "lama-converter.py", line 26, in main
    converter.convert_from_args(args)
  File "ctranslate2\converters\converter.py", line 50, in convert_from_args
    return self.convert(
  File "ctranslate2\converters\converter.py", line 89, in convert
    model_spec = self._load()
  File "llama-converter.py", line 67, in _load
    shape = linear_spec.weight.shape
AttributeError: 'NoneType' object has no attribute 'shape'

That model doesn't include params.json, but I found that it has these contents for LLama 13B and just copied it: {"dim": 5120, "multiple_of": 256, "n_heads": 40, "n_layers": 40, "norm_eps": 1e-06, "vocab_size": -1}

guillaumekln commented 1 year ago

The script works with the original checkpoint format. It should be adapted if the checkpoint name or structure are different.

janekb04 commented 1 year ago

Ah, ok. I just realized that this loads the Llama in Meta's format rather than in the HuggingFace format. So it expects something like:

    LLAMA
    │   tokenizer.model
    │   tokenizer_checklist.chk
    │
    ├───13B
    │       checklist.chk
    │       consolidated.00.pth
    │       consolidated.01.pth
    │       params.json
    │
    └───7B
            checklist.chk
            consolidated.00.pth
            params.json

av commented 1 year ago

On my desktop, I was able to run the conversion (until the point when the loader is used) only for the 7B model, as it used 30GB of memory.

@janekb04, I'm currently trying to convert 11B Flan-Alpaca and it uses ~40GB or real RAM + 20GB of swap and then crashes.

By any chance, do you have any tips to share on how to approach this (apart from increasing swap and getting a large cup of hot drink 😄)

janekb04 commented 1 year ago

@av My recommendation is to use a cloud. It is surprisingly cheap right now. Could you send over your conversion code? I think it might work for Vicuna as well, as they seem to have the same format.

av commented 1 year ago

@janekb04, thanks for returning back to me! Wanted to keep cloud as a last resort option, hoping to have a fully-fledged local dev env for the project I'm working on :)

Just for the future context, ~100G of RAM+Swap wasn't enough for the process to finish by default, but it did work successfully based on the recommendation from Guillaume using ~32G of RAM for XXL model.

Could you send over your conversion code?

Sure, but most likely it won't be applicable to actual Alpaca model directly, and I'm literally using plain CT2 APIs. Following version is based on a comment from @guillaumekln:

import ctranslate2

model_id = "declare-lab/flan-alpaca-xxl"
output_dir = "ct_flan_alpaca_xxl_v1"

class FlanAlpacaConverter(ctranslate2.converters.TransformersConverter):
    def load_model(self, model_class, model_name_or_path, **kwargs):
        # Helps to reduce overal RAM usage during conversion
        # You'll still need:
        # ~32GB of RAM for the xxl (11B params) model
        # ~8GB of RAM for the xl (3B params) model
        kwargs["low_cpu_mem_usage"] = True
        return super().load_model(model_class, model_name_or_path, **kwargs)

    def load_tokenizer(self, tokenizer_class, model_name_or_path, **kwargs):
        # FlanAlpaca ships without a fast version of the tokenizer
        # This is a workaround to avoid the error about it missing in the model files
        del kwargs["use_fast"]
        return super().load_tokenizer(tokenizer_class, model_name_or_path, **kwargs)

# Use extended Converter class as a workaround for the missing fast tokenizer
# and to reduce RAM usage during conversion
ct = FlanAlpacaConverter(model_name_or_path=model_id, load_as_float16=True)
ct.convert(output_dir=output_dir, force=True, quantization="int8_float16")

Older conversion code, before the comment from Guillaume

```python # ! ctranslate2 was patched to turn off the "use_fast" flag in the tokenizer # ctranslate2/converters/transformers.py#_load (line 97) - comment out the line that sets use_fast=True, this is specific to derivatives of T5 model from ctranslate2.converters import TransformersConverter model_id = "declare-lab/flan-alpaca-xxl" output_dir = "ct_flan_alpaca_xxl_v1" ct = TransformersConverter(model_name_or_path=model_id, load_as_float16=True) ct.convert(output_dir=output_dir, force=True, quantization="int8_float16") ```

This comment was updated based on the below feedback.

guillaumekln commented 1 year ago

@janekb04 I can look to also provide a conversion example for the HF format, but the logic will be similar to the Meta format. The main difference is that each HF checkpoint contains a subset of all layers, while each Meta checkpoint contains all layers but with partial weights that should then be concatenated with the other shards. The weights name in the checkpoint are also different.

@av, since you already patched the converter code, you could add the argument low_cpu_mem_usage=True when loading the Transformers model in:

https://github.com/OpenNMT/CTranslate2/blob/v3.11.0/python/ctranslate2/converters/transformers.py#L118

Combined with load_as_float16=True, this should further reduce the memory usage and prevent from_pretrained from duplicating the model weights in memory.

comment out the line that sets use_fast=True, this is specific to derivatives of T5 model

Did you mean "the line that sets use_fast=False"?

Note that you could also change the arguments by extending the converter class instead of patching the code:

class MyTransformersConverter(ctranslate2.converters.TransformersConverter):
    def load_model(self, model_class, model_name_or_path, **kwargs):
        kwargs["low_cpu_mem_usage"] = True
        return super().load_model(model_class, model_name_or_path, **kwargs)

    def load_tokenizer(self, tokenizer_class, model_name_or_path, **kwargs):
        del kwargs["use_fast"]
        return super().load_tokenizer(model_name_or_path, **kwargs)

(I have not tested this code but I expect it to work.)

av commented 1 year ago

you could add the argument low_cpu_mem_usage=True when loading the Transformers mode

@guillaumekln thank you so much for taking your time to look at this!

Your suggestion helped like a charm! Now the conversion process for 11B model consumes ~36G of RAM and finishes successfully in ~4m, I was also able to succesfully run inference faster on the CPU compared with base Transformers version of the model.

Did you mean "the line that sets use_fast=False"?

Yes, you're precicely right, I've added that comment based on already patched version.

Note that you could also change the arguments by extending the converter class

Yes, definitely a much better version, thank you! Was just quickly hacking things together to see what would work.

Will update my initial comment above to avoid possible confusion for whomever will see this thread in the future 👍🏻

janekb04 commented 1 year ago

@guillaumekln Thanks for your quick replies and willingness to help. I am currently learning a lot about AIs and LLMs. I have a background in realtime rendering (programming GPUs, low-level optimizations, linear algebra...), so I thought that I might get into this topic. There's just so much happening and so many papers and projects appearing that I might be spreading too thin. I'm writing this to say that you don't have to bother writing the Vicuna converter, as running it is just one of these curiosities of mine. And at this pace, I wouldn't be surprised if next week an even better fine-tuned variant will be published.

hobodrifterdavid commented 1 year ago

I did a little test, I'll try share what I know. On a 3090 (24GB) with the 7B model, I was getting 50+ tokens/s output with a batch of 1. This seems to be a bit higher that this other project (https://github.com/qwopqwop200/GPTQ-for-LLaMa). There are 'cuda' and 'triton' branches, from the github issues they report about 40-45 token/s. I don't know the throughput when running bigger batches.

With CTranslate2, when running batches of 32, I am seeing speeds of about 1300 tokens/s output (about 40 token/s for each generation). I'm going OOM with batch size 64, 48 seems ok but not really higher throughput. Specifying a data type (fp16, int8) didn't seem to increase throughput. Some batch sizes seemed to give unexpected speeds, but I didn't test systematically. I did email Fabrice to ask him what speeds he was getting on his implementation (https://bellard.org/ts_server/), he said about 1400 tokens/s on a 3090, and that input tokens (context) don't change the processing time very much. This also seems to be the case with CTranslate2, I tried with a longer context and throughput fell to about 1200 only.

Fabrice made quite a nice model performance table on that page using https://github.com/EleutherAI/lm-evaluation-harness , checking the output scores could be a nice way of ensuring the model is being run correctly. There's also some token/s numbers. There is a performance jump (~50%) when going from Q8 to Q4, presumably due to reduced demands on ram bandwidth, and throughput probably scales linearly with num. of parameters.

As Janek mentioned, some of the most interesting models now are the llama derivatives that can follow instructions, such as vicuna, koala, gpt4xAlpaca. These are llama models that are find tuned on chatGPT transcripts. The koala model also adds other open datasets, but their page (https://bair.berkeley.edu/blog/2023/04/03/koala/) notes that including those didn't improve the quality of the responses.

Again, as Janek notes, these models seem to be distributed in a different format from the original meta weights, they look like this:

Anyways, hopefully some info was of use. :)

janekb04 commented 1 year ago

@hobodrifterdavid Thanks for the valuable insights. It's interesting to see, on the linked Fabric0e's benchmark, that the difference between the Q4 and Q8 models is basically unnoticeable. Currently, CTranslate2 seems to support only 8bit quantization, so it would be interesting to see how the performance improves in practice when using 4bits. It's the first time for me hearing about ts_server, and it's impressive to see that CTranslate2 seems to be almost on par with it in terms of performance, while being open-source. After having looked through a few fine-tuned versions of LLama, I noticed that although they are distributed in a different format than Meta used, this format seems to be rather consistent among them. So maybe a single loader would be able to handle them all, after all, and maybe even account for new future variants.

guillaumekln commented 1 year ago

I did email Fabrice to ask him what speeds he was getting on his implementation (https://bellard.org/ts_server/), he said about 1400 tokens/s on a 3090

Do you know if this result is for the 8-bit or 4-bit model?

I added a conversion script for the Hugging Face format. See the script llama_hf_converter.py in https://gist.github.com/guillaumekln/7ef5db5ef2e84ebaf9b005ebecf4a85a. If it works well for you, I will probably include it in a future version.

Palmik commented 1 year ago

I did not manage to get the HF->CT2 conversion working. The HF model is a result of LLaMA weights + LoRa weights merge, using this script. The resulting merged model (which should be equivalent to a plain 7B LlamaForCausalLM) works well, but the converted CT2 model produces mostly garbage. But it does not produce complete gabrage -- a prefix that appears a lot in fine tuning is often generated well, followed by stuff like ?|||111|2|2||||2|2|2222222222222|||||222222222||||||||||||||||||||||.

On the CT2 side I tried both int8_float16 and float16 quantization. On the HF side I tried both float16 and float32, and exporting as a single pytorch_model.bin shard as well as multiple.

I checked the token ids generated from the SP tokenizer, and it's equivalent to what I get from the "native" HF one (modulo the <s> from your script, I had to remove that -- does not change the result).

Let me know if you have some suggestions for debugging this further.

guillaumekln commented 1 year ago

Thanks for testing.

I forgot that some weights need additional transformation. I updated the converter accordingly in the Gist. Can you try again?

Palmik commented 1 year ago

That fixed it, awesome!

janekb04 commented 1 year ago

@guillaumekln Thanks for the converter. I can also confirm that it works. I am yet to conduct a benchmark on my GPU (4090), but I will share my results, when I'll have them. For now, I checked that it can work as a local AI-assistant, ~~though #1161 would make the experience more similar to other solutions~~.

In case someone's interested, here's a simple chat in the Vicuna format, adapted from generate.py.

import os
import ctranslate2
import sentencepiece as spm
import time

model_dir = "llama_ct2/"
generator = ctranslate2.Generator(model_dir, device="auto")
sp = spm.SentencePieceProcessor(os.path.join(model_dir, "tokenizer.model"))

prompt = "A chat between a curious user and an artificial intelligence assistant.\n"\
"The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"

try:
    while True:
        query = input(f"\n\n       USER: ")
        prompt += f"USER: {query.strip()}\nASSISTANT:" # no space after ASSISTANT:
        tokens = sp.encode(prompt, out_type=str)
        print(f"\n\n  ASSISTANT:", end='')

        start_time = time.time_ns()

        tokens_gen = generator.generate_tokens(
            tokens,
            sampling_temperature=0.8,
            max_length=2048,
        )

        prev = ''
        line = 0
        line_len = 0
        output = ''
        tokens = 0
        for token in tokens_gen:
            tokens += 1
            if token.token.startswith("▁"):
                output += ' '
                if line_len > 60:
                    print("\n             ", end="", flush=True)
                    line += 1
                    line_len = 0
                else:
                    print(' ', end="", flush=True)
                    line_len += 1

            chars = sp.decode([token.token_id])
            output += chars
            for char in chars:
                if char == '\n':
                    line += 1
                    line_len = 0
                    if prev == '\n':
                        continue
                    print("\n\n            ", end="", flush=True)
                elif char != ' ':   
                    line_len += 1       
                    print(char, end="", flush=True)
                prev = char
                print("", end="", flush=True)
        print("", flush=True)

        end_time = time.time_ns()
        secs = (end_time - start_time) / 1e9

        print(f"\n             ({tokens / secs:.2f} toks/s; {tokens} tokens; {secs:.2f} s)")

        prompt += f" {output}\n"
except KeyboardInterrupt:
    print("exit()")


       USER: List all the people who walked the moon.

  ASSISTANT: No one has ever walked on the moon. The first humans to walk
             on the moon were Neil Armstrong and Buzz Aldrin, who stepped off
             the lunar module Eagle and onto the moon's surface on July 20,
             1969, during the Apollo 11 mission.

             (25.34 toks/s; 65 tokens; 2.56 s)

       USER: Your answer is self-contradictory.

  ASSISTANT: I apologize for the confusion. My previous response was incorrect.
             The first humans to walk on the moon were Neil Armstrong and Buzz
             Aldrin, who stepped off the lunar module Eagle and onto the moon's
             surface on July 20, 1969, during the Apollo 11 mission.

             (35.62 toks/s; 69 tokens; 1.94 s)

       USER:

guillaumekln commented 1 year ago

though https://github.com/OpenNMT/CTranslate2/issues/1161 would make the experience more similar to other solutions.

This feature is now released in version 3.12. Can you give it a try?

I updated the example script generate.py to use this feature and output the generation word by word.

I also updated the converters to enable a more efficient implementation of the rotary embeddings (especially for CPU execution). It is not required, but I still suggest to reconvert the model with this updated converter.

qZhang88 commented 1 year ago

I am running on agx orin. I have to compile ctranslate2 by myself.

When I run @janekb04 generate.py, I encounter a problem

    results = generator.generate_batch(
RuntimeError: No SGEMM backend on CPU

CTranslate2 is compile with:

cmake .. -DWITH_CUDA=ON -DWITH_MKL=OFF -DOPENMP_RUNTIME=COMP -DWITH_RUY=ON

@guillaumekln runing your generate.py has no error, but never returned a word for long long time.

Any suggestion?

guillaumekln commented 1 year ago

The rotary embeddings are initialized on CPU and run 1 matrix multiplication on CPU. So you should enable a backend that provides that. You could install OpenBLAS and then enable it with -DWITH_OPENBLAS=ON.

qZhang88 commented 1 year ago

The rotary embeddings are initialized on CPU and run 1 matrix multiplication on CPU. So you should enable a backend that provides that. You could install OpenBLAS and then enable it with -DWITH_OPENBLAS=ON.

great, it works. I have test the speed on AGX ORIN with vicuna-13b-1.1 and int8_float16 quantization.

generator.generate_batch is slow, only 0.08 token / s. generator.generate_token is much faster, around 5.68 token / s. PyTorch Transformer from HuggingFace is around 4.8 token / s.

CTranslate2 is about 18.3% speed up.

guillaumekln commented 1 year ago

How did you measure the speed? You should get the same performance with generate_tokens and generate_batch (generate_tokens actually calls generate_batch).

For reference I'm getting 28 tokens/s on a Tesla V100S with both methods. I used this code for the comparison:

import os
import time
import ctranslate2
import sentencepiece as spm

ctranslate2.set_random_seed(42)

model_dir = "/tmp/llama_ct2/"
generator = ctranslate2.Generator(model_dir, device="cuda")
sp = spm.SentencePieceProcessor(os.path.join(model_dir, "tokenizer.model"))

prompt = "What is the meaning of life?"
prompt_tokens = sp.encode(prompt, out_type=str)

start = time.time()
num_tokens = 0
use_generate_batch = False

if use_generate_batch:
    results = generator.generate_batch(
        [prompt_tokens],
        sampling_temperature=0.8,
        sampling_topk=20,
        max_length=2048,
        include_prompt_in_result=False,
    )
    print(results[0].sequences[0])
    num_tokens = len(results[0].sequences[0])

else:
    results = generator.generate_tokens(
        prompt_tokens,
        sampling_temperature=0.8,
        sampling_topk=20,
        max_length=2048,
    )
    for result in results:
        print(result.token)
        num_tokens += 1

end = time.time()

print("Tokens per second:", num_tokens / (end - start))

qZhang88 commented 1 year ago

@guillaumekln O, I see, I count the tokens wrong for batch result. I used len(results) not len(results[0].sequences[0]). Many thanks.

janekb04 commented 1 year ago

@guillaumekln Thanks, I confirm that token streaming works. It's impressive to see the answers being written so fast. I updated my comment above. Using that simple test, I get about 38 tokens per second, on my RTX 4090 (interestingly, this seems to be about equal to the speed measured on the RTX 3090).

I am curious about parallel generation. This simple test already uses 100% GPU. So how can parallel generation achieve those 1300 tokens per second? Is using no batches so inefficient? I mean, if the GPU can generate about 1300 tokens per second, then the usage for 38 tokens per second should be only about 3%, not 100%.

Edit: I noticed that after crossing some length threshold (of the entire conversation) the speed drops down to about 8 tokens per second. Is that to be expected as it is some kind of limitation of the model or CTranslate2?

Edit 2: Would it be possible to have a batch version of generate_tokens? Ie. have tokens be streamed for multiple prompts at once?

guillaumekln commented 1 year ago

I am curious about parallel generation. This simple test already uses 100% GPU. So how can parallel generation achieve those 1300 tokens per second? Is using no batches so inefficient? I mean, if the GPU can generate about 1300 tokens per second, then the usage for 38 tokens per second should be only about 3%, not 100%.

GPU are designed for batch processing, so using a single entry in the batch is usually inefficient. However, you can't infer the GPU usage like this as there are other details to consider (kernel occupancy, etc.)

I noticed that after crossing some length threshold (of the entire conversation) the speed drops down to about 8 tokens per second. Is that to be expected as it is some kind of limitation of the model or CTranslate2?

Does this length threshold happen to be 2048? That's default size of the rotary embeddings. Currently their size is increased by 1 for each step above 2048 which is not very efficient. We should increase the size by a larger value when the overflow happens (e.g. double the size).

Would it be possible to have a batch version of generate_tokens? Ie. have tokens be streamed for multiple prompts at once?

Currently you can do something like that using the hidden _callback argument in generate_batch. That's what generate_tokens actually use:

https://github.com/OpenNMT/CTranslate2/blob/v3.12.0/python/ctranslate2/extensions.py#L301-L330

The callback is called for every tokens in every batches.

Just be careful when using include_prompt_in_result=False with variable-length prompts. Given the current implementation, some prompt tokens will be returned in the callback. I can improve that in the next version.

janekb04 commented 1 year ago

I'm aware that GPUs are made for batch processing. I actually come from a background in realtime rendering. I was simply under the impression that in machine learning, the computations distill down to tensor multiplications and that the input tensor size is proportional to the number of elements in a batch. And hence, the time complexity would be linear with respect to the batch size.

The Task Manager GPU percentage is indeed a bad measure. I tried to look more in-depth at Nsigth Compute, but I haven't set it up correctly on my computer yet. So far, I've always been using Nsight Graphics. Still, it looks to have some sort of "CUDA function call log" that, at first glance, seems to suggest that something is (presumably) unnecessarily constantly querying for the device and setting it, and doing memory allocations. I have no experience with CUDA, but it caught my attention as the general performance rules are probably the same whether it is graphics or compute. So, constantly interacting with the GPU (or driver-level) state machine is relatively inefficient.

Nsight Compute screenshot

This is only a tiny part of the snapshot. All these calls are only for generating one token - "The".

By the way, it's weird that Nsight Compute claims not to have access to the GPU performance counters, despite them working perfectly fine in Nsight Graphics.

janekb04 commented 1 year ago

Does this length threshold happen to be 2048? That's the default size of the rotary embeddings. Currently, their size is increased by 1 for each step above 2048 which is not very efficient. We should increase the size by a larger value when the overflow happens (e.g. double the size).

It seems to be around 2048. I haven't checked if it is exactly 2048 (as I only roughly estimated the token count based on my chat example above), but it is something along those lines.

Currently you can do something like that using the hidden _callback argument in generate_batch. That's what generate_tokens actually uses.

Thanks, I'll have a try.

janekb04 commented 1 year ago

I see that cudaSetDevice is used in two places (allocator.cc, devices.cc). I propose to change this to some checked_set_device that will first check if new_device == old_device. Or better, ~~to avoid having to query the device~~ (it can be cached in a variable, so we don't have to query it - as long as all calls to cudaSetDevice use the wrapper function), just have a special case, where if there is only a single CUDA device at all, just set the device at the start of the program and never touch it again. Maybe add a compilation flag for this? Although this would be a well-predictable branch, it's a wasted entry in the BTB. I think that most people (at least on consumer hardware) don't expect to run on more than one CUDA-enabled GPU.

Getting of these calls alone would lessen the overhead of the API calls.

Maybe this is actually not important. All I know is that in real-time games, we tend to limit the number of API calls to a minimum. This also often involves not calling the error functions. Basically, in release mode CUDA_CHECK would be just #define CUDA_CHECK(x) // noop.

Having looked at that allocator.cc, it seems that most of these API calls might be coming from allocate.

Even if this doesn't improve performance significantly, it will still ease debugging by decluttering the log and the timeline from all these context switches, leaving only the actual kernel launches.

guillaumekln commented 1 year ago

I'm pretty sure these calls have no impact on performance, especially when running large models (see also https://github.com/pytorch/pytorch/issues/18048). We could add a compilation flag but this would make the code more complex so I'm not sure it is worth it. Maybe you can just filter out these functions in your profiling tool?

We are getting off topic here. Feel free to open other issues on these topics!

hobodrifterdavid commented 1 year ago

I ran a few tests with 7B and 13B models (eachadea/vicuna-13b-1.1 and vicuna-7b-1.1), with different input and output (generated) token lengths, on a 3090 (24C epyc 7443p cpu, lots of ram).

Config:

generator = ctranslate2.Generator(model_dir, device="cuda", device_index=[0]) 

generator.generate_batch(
    tokens,
    beam_size=1,
    sampling_temperature=0.8,
    sampling_topk=10,
    num_hypotheses=1,
    max_length=80,
    min_length=80,
    include_prompt_in_result=False,
    max_batch_size=100,
    return_end_token=True
)

// 7B - 24 Input - 80 Output
// batch size / time to generate (s) / tokens per s total
32: 1.63 - 1570/s
64: 2.10 - 2438/s
100: 2.94 - 2720

// 7B - 642 Input - 80 Output
1: 1.38 - 58/s
2: 1.67 - 96/s
4: 2.04 - 157/s
6: 2.54 - 189/s
7: 2.75 - 203/s
8: 3.07 - 208/s
9: 3.89
10: 4.29 - 186/s
12: 4.58
16: 6.19
20: 5.19 - 308/s
24: 5.99
32: 7.61 - 336/s
48: OOM

// 13B - 24 Input - 80 Output
1: 1.97 - 40.6/s
2: 2.10 - 76.2/s
3:
4: 2.13 - 150.2/s
5:
6:
7:
8: 2.64 - 242/s
9: 4.00
10: 4.25
11: 4.32

16: 4.74
20: 2.87 << Strange?, run time seems to go down here..
40: 2.78
60: 3.25
80: 4.16
100: 4.86 - 1646/s

// Here one input prompt was long, others short:
// 13B - 24 Input + one 963 - 80 Output
2: 2.09
4: 2.14
8: 2.63
20: 2.23

// 13B - 963 Input - 80 Output
1: 2.44 - 32.8/s - 100%
2: 2.99 - 53.5/s - 163%
3: 3.49 - 68.8/s
4: 3.93 - 81.4/s - 248%
5: 4.45 - 89.9/s
6: 4.97 - 96.6/s
7: 5.52 - 101.4/s
8: 6.14 - 104.2/s - 317%
9: 7.99 - 90.1/s

// 13B - 963 Input - 160 Output
4: 7.07 - 90.5/s
8: 11.36 - 112.7/s

// 13B - 1163 Input - 80 Output
1: 3.22 - 24.8/s - 100%
2: 3.28 - 48.8/s - 196%
3: 3.94 - 60.9/s
4: 5.12 - 62.5/s - 252%
5: 5.19 - 77.0/s
6: 5.89 - 81.5/s
7: 6.61 - 84.7/s
8: 7.34 - 87.2/s - 351%
9: OOM

There is what looks like a performance anomaly I marked above.

If you are making generations for users with a 3090 and a 13B model, latency gets too high (6s+) at larger batch sizes before you run out of ram.

I'll edit this comment later with some more observations.

BrightXiaoHan commented 1 year ago

The latest version (3.11.0) implemented the rotary embeddings, so we now support all components that are used by LLaMa!

I'm sharing an experimental conversion code and example in this Gist:

https://gist.github.com/guillaumekln/7ef5db5ef2e84ebaf9b005ebecf4a85a

It seems to work great for 7B even with 8-bit quantization, but the checkpoints with multiple shards produce gibberish after conversion. Probably there is something wrong in the converter but I'm not sure what at the moment. I'm sharing this now so other people can look at it and run first experiments.

May I ask if the code for converting the llama model will be merged into the ctranslate2 codebase?

OpenNMT / CTranslate2

Faster llama #1127

2.2 Architecture

2.4 Efficient Implementation