OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Support for "mistralai/Mistral-7B-Instruct-v0.1" model #1501

Closed: Matthieu-Tinycoaching closed this issue 12 months ago

Matthieu-Tinycoaching commented 1 year ago

Hi,

Would it be possible to add support for "mistralai/Mistral-7B-Instruct-v0.1" model?

vince62s commented 1 year ago

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. With much longer inputs it may break, though.

winstxnhdw commented 1 year ago

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

vince62s commented 1 year ago

RAG ?

BBC-Esq commented 1 year ago

Retrieval-augmented generation, as in creating a vector database and querying it for results, then appending those results to a user's query before both are sent to an LLM for an answer. It lets one get an answer from an LLM about specific information that is after a model's knowledge cutoff date, for example. Very powerful.
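A minimal sketch of the pattern, assuming an off-the-shelf sentence-transformers model and an illustrative prompt format (none of this is prescribed by CTranslate2):

# Illustrative RAG sketch: embed documents, retrieve the most relevant ones for a
# query, and prepend them to the prompt that is sent to the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "CTranslate2 is a fast inference engine for Transformer models.",
    "Mistral-7B uses sliding window attention with a 4096-token window.",
    "RAG retrieves relevant documents and adds them to the model's prompt.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "How does Mistral handle long contexts?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# Vectors are normalized, so a dot product gives the cosine similarity.
scores = doc_vectors @ query_vector
top_docs = [documents[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {query}\nAnswer:"
# `prompt` would then be tokenized and passed to the LLM,
# e.g. a CTranslate2 Generator running Mistral-7B-Instruct.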

vince62s commented 1 year ago

And what is the common usage of this with a sequence length higher than 4096?

winstxnhdw commented 1 year ago

You can certainly do RAG decently under 4096, but typically the point of RAG is to make use of as much context as possible.

vince62s commented 1 year ago

But again, the sliding window is only for the attention mask. It does not mean that it will "break". If something breaks, it's just because the sequence length might be way too long and it will OOM by itself; it does not mean results will be bad. Anyway, I am implementing the sliding mask in OpenNMT-py and will check how easy it is to replicate in CT2.
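To illustrate what "only for the attention mask" means, here is a minimal sketch (not CTranslate2 internals) of a causal mask restricted to a sliding window:

import torch

# Illustrative only: with a sliding window of size W, query position i may
# attend to key positions j with i - W < j <= i.
def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    return (j <= i) & (j > i - window)      # True = attention allowed

print(sliding_window_causal_mask(6, window=3).int())
# Each row has at most `window` ones, which bounds the attention span; without
# the window the model still runs, it just attends further back than it was trained to.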

winstxnhdw commented 1 year ago

You are right, I misunderstood their article. My apologies.

MrigankRaman commented 1 year ago

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. With much longer inputs it may break, though.

What would be the command to use the llama converter for Mistral?

winstxnhdw commented 1 year ago

I've uploaded the converted model to Hugging Face. See here.

vince62s commented 1 year ago

https://opennmt.net/CTranslate2/guides/transformers.html#llama-2

NeonBohdan commented 1 year ago

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. With much longer inputs it may break, though.

When I do this

ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ./models/ctranslate2 --low_cpu_mem_usage

It outputs

ValueError: No conversion is registered for the model configuration MistralConfig

Maybe I need to change the model type too, or something else?

vince62s commented 1 year ago

Did you try changing https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197 to MistralConfig? If this is not enough, we'll need to add the config; otherwise you can download the converted file directly from @winstxnhdw.

manishiitg commented 1 year ago

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

MrigankRaman commented 1 year ago

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197, copied llama_loader, created a new function, and registered MistralConfig with the new function. Basically: copy the llama loader and register MistralConfig (see the sketch below).

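A minimal sketch of that registration, assuming it sits in python/ctranslate2/converters/transformers.py next to the existing LlamaLoader (the full loader posted further below is the more complete route):

# Hypothetical minimal version: reuse the existing LlamaLoader for Mistral
# checkpoints by registering it under MistralConfig (sliding window attention
# is still ignored with this approach).
@register_loader("MistralConfig")
class MistralLoader(LlamaLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"
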
vince62s commented 1 year ago

Just a nice reminder: this will behave 100% like Mistral as long as the sequence length is <= 4096 tokens. It would be interesting to see how it behaves with longer sequences.

MrigankRaman commented 1 year ago

Just a nice reminder: this will behave 100% like Mistral as long as the sequence length is <= 4096 tokens. It would be interesting to see how it behaves with longer sequences.

When will CTranslate2 support SWA (sliding window attention)?

BBC-Esq commented 1 year ago

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197, copied llama_loader, created a new function, and registered MistralConfig with the new function. Basically: copy the llama loader and register MistralConfig.

Can you please post your code for me instead of a picture of it?

wsxiaoys commented 1 year ago

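# Note: this snippet presumably lives inside python/ctranslate2/converters/transformers.py,
# where gc, torch, transformer_spec, common_spec, register_loader, and ModelLoader
# are already in scope.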
@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()

Here's a snippet with which I successfully performed the conversion. Not sure if it's worth sending out a PR, given that sliding window support is not there yet.

BBC-Esq commented 1 year ago

[MistralLoader snippet quoted above]

Here's a snippet with which I successfully performed the conversion. Not sure if it's worth sending out a PR, given that sliding window support is not there yet.

Awesome. Any chance we can get a bfloat16 CTranslate2 edition, since the model is originally in bfloat16? That way we could use quantizations at run time other than int8.

BBC-Esq commented 1 year ago

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

Speaking of RAG: my other posts have been inquiring about getting CTranslate2 to work with the "instructor" class of embedding models, like instructor-xl, for example. I'm being serious here: since you successfully converted Mistral by modifying the CTranslate2 scripts, I will actually pay you (or anyone else) to either modify the CTranslate2 codebase or customize the scripts for me personally. This is very important to me, so hit me up if you want to discuss. I'd be happy to share my credentials, law firm website, or whatever it takes so we can do this and make payment remotely. Thanks.

silvacarl2 commented 1 year ago

I will actually pay you (or anyone else) to either modify the CTranslate2 codebase or customize the scripts for me personally.

We second this, although we are focused on healthcare, i.e. the pay part. CTranslate2 is awesome.

BBC-Esq commented 1 year ago

I will actually pay you (or anyone else) to either modify the CTranslate2 codebase or customize the scripts for me personally.

We second this, although we are focused on healthcare, i.e. the pay part. CTranslate2 is awesome.

Let's do this; we'll split the cost 50/50 for whatever freelance programmer actually does it. We'll need to discuss the amount of time and cost first, of course. ;-)

silvacarl2 commented 1 year ago

Confirmed. We are also looking into fine-tuning of this model, although it does not need very much.

From our tests, this model works the best out of the box (vanilla) with the variety of tests we have for our use case.

BBC-Esq commented 1 year ago

Confirmed. We are also looking into fine-tuning of this model, although it does not need very much.

From our tests, this model works the best out of the box (vanilla) with the variety of tests we have for our use case.

I agree, and even though it's a resource hog (relative to other embedding models) it's worth it IMHO.

winstxnhdw commented 1 year ago

Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl

Can I ask why you've been so insistent on using instructor-xl over bge-large-en, when bge-large-en has been shown to be more performant and efficient than instructor-xl embeddings in every metric on the leaderboards?

BBC-Esq commented 1 year ago

I've just noticed that it performs significantly better when I use it. Not sure why exactly; I know that different models perform differently depending on the type of text being fed to them, but that's just what I've noticed. Any interest?

silvacarl2 commented 1 year ago

Will check out the leaderboard and run some tests, thanks.

winstxnhdw commented 1 year ago

I've just noticed that it performs significantly better when I use it.

Are you certain that you've prefixed your queries with the following instruction when using bge-large-en-v1.5?

Represent this sentence for searching relevant passages:
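A minimal usage sketch, assuming the BAAI/bge-large-en-v1.5 checkpoint and an illustrative query/passage pair:

from sentence_transformers import SentenceTransformer

# BGE retrieval models expect the instruction to be prepended to the query only;
# passages are encoded as-is (see the FlagEmbedding issue linked below).
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
instruction = "Represent this sentence for searching relevant passages: "

query_emb = model.encode(instruction + "how do I convert Mistral to CTranslate2?",
                         normalize_embeddings=True)
passage_emb = model.encode("CTranslate2 can convert Llama-style checkpoints.",
                           normalize_embeddings=True)
score = query_emb @ passage_emb  # cosine similarity, since embeddings are normalized
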
BBC-Esq commented 1 year ago

I'm sorry, are you saying that bge-large-en-v1.5 allows you to enter instructions like instructor-xl does?

winstxnhdw commented 1 year ago

https://github.com/FlagOpen/FlagEmbedding/issues/148

vince62s commented 1 year ago

@winstxnhdw do you have a use case to test #1528? It would require passing a very long prompt (> 4096 tokens, maybe double that) and seeing whether it outputs a consistent completion.
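Roughly something like this would do (a sketch; the model directory, prompt, and repetition count are placeholders):

import ctranslate2
import transformers

# Feed a prompt well beyond 4096 tokens and check whether the completion is coherent.
generator = ctranslate2.Generator("mistral-7b-instruct-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

long_context = "CTranslate2 is a fast inference engine for Transformer models. " * 800
prompt = f"[INST] {long_context}\nSummarize the text above in one sentence. [/INST]"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens], max_length=128, include_prompt_in_result=False
)
print(tokenizer.decode(results[0].sequences_ids[0]))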

winstxnhdw commented 1 year ago

Yeah, easily, but I am really busy this week. I can maybe test something this weekend. Will update.

muhtalhakhan commented 1 year ago

Hey guys,

I am facing a problem as I am migrating one of my scripts from GPT to Mistral.

def get_embedding(text, model="sentence-transformers/all-MiniLM-L6-v2"):
  text = text.replace("\n", " ")
  if not text: 
    text = "this is blank"
  return openai.Embedding.create(
          input=[text], model=model)['data'][0]['embedding']

if __name__ == '__main__':
#   gpt_parameter = {"engine": "text-davinci-003", "max_tokens": 50, 
#                    "temperature": 0, "top_p": 1, "stream": False,
#                    "frequency_penalty": 0, "presence_penalty": 0, 
#                    "stop": ['"']}
  gpt_parameter = {"max_tokens": 50, 
                   "temperature": 0, "top_p": 1, "stream": False,
                   "frequency_penalty": 0, "presence_penalty": 0, 
                   "stop": ['"']}

  curr_input = ["driving to a friend's house"]
  prompt_lib_file = "prompt_template/test_prompt_July5.txt"
  prompt = generate_prompt(curr_input, prompt_lib_file)

  def __func_validate(gpt_response): 
    if len(gpt_response.strip()) <= 1:
      return False
    if len(gpt_response.strip().split(" ")) > 1: 
      return False
    return True
  def __func_clean_up(gpt_response):
    cleaned_response = gpt_response.strip()
    return cleaned_response

I wanted to know which "Engine" and "Embedding Model" should be used for Mistral.

Looking forward to your help 🙂

winstxnhdw commented 1 year ago

That's not remotely how you should be using any open-source model, and let's not pollute this issue any further with irrelevant topics. You can create a new issue for this. Also, it might be useful for you to learn what an API client library is first.

Ideally, there should be a discussion tab for such matters. Maybe @guillaumekln can help enable the tab?

muhtalhakhan commented 1 year ago

Alright.

vince62s commented 1 year ago

I closed #1528 and worked with @minhthuc2502 on #1524.

still WIP, not good so far.

vince62s commented 12 months ago

We just merged #1524, great teamwork with @minhthuc2502. Mistral should now run fine with very long inputs. I just recommend using int8_float16 when converting; plain float16 may go OOM quite easily on a 24GB GPU.
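For example, a hedged sketch using the Python converter API (the output directory name is arbitrary):

import ctranslate2.converters

# Convert Mistral with int8_float16 as recommended above.
converter = ctranslate2.converters.TransformersConverter(
    "mistralai/Mistral-7B-Instruct-v0.1",
    low_cpu_mem_usage=True,
)
converter.convert("mistral-7b-instruct-v0.1-ct2", quantization="int8_float16")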

vince62s commented 11 months ago

Hey Mistral users, what kind of throughput are you getting with CT2? Replicating this blog post: https://modal.com/docs/examples/vllm_inference, I am getting close to 2500 tok/sec with a 4-bit quantized model on OpenNMT-py, batch size 60. Here it is: https://huggingface.co/OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemm

muhtalhakhan commented 11 months ago

I'm getting a 'Limit exceeded' error and have tried everything, but I didn't get any output from it.

kdcyberdude commented 9 months ago

Hi @vince62s, just wanted to confirm:

  1. Can we convert the Mistral Hugging Face AWQ model or the onmt-awq model to CT2?
  2. Is it possible to do 4-bit quantization with CT2?

Hey Mistral users, what kind of throughput are you getting with CT2? Replicating this blog post: https://modal.com/docs/examples/vllm_inference, I am getting close to 2500 tok/sec with a 4-bit quantized model on OpenNMT-py, batch size 60. Here it is: https://huggingface.co/OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemm

silvacarl2 commented 9 months ago

This looks really impressive. What are you running this on? An AWS EC2 A10, or something else?

winstxnhdw commented 9 months ago

Hey Mistral users, what kind of throughput are you getting with CT2?

96-181 tok/s on an RTX 3090 with CT2 (so obviously 8-bit quantised). 8 tok/s on an Intel i7-8700.

vince62s commented 9 months ago

Hi @vince62s, just wanted to confirm:

  1. Can we convert the Mistral Hugging Face AWQ model or the onmt-awq model to CT2?
  2. Is it possible to do 4-bit quantization with CT2?

No, not at the moment; only int8 quantization.

96-181 tok/s on an RTX 3090 with CT2 (so obviously 8-bit quantised). 8 tok/s on an Intel i7-8700.

But batch_size 1, right?

winstxnhdw commented 9 months ago

Yeap, just 1.

kdcyberdude commented 9 months ago

This is the speed I am getting with the berkeley-nest/Starling-LM-7B-alpha model quantized with int8_bfloat16 on a 4090:

batch_size   max_seq_len   overall_time (s)   token_speed (tok/s)
60           256           6.8                2145
60           512           18.1               1436
40           1024          34.6               867

This is the script I am using:

import ctranslate2
import transformers
import time
generator = ctranslate2.Generator("../c2/starling-lm-7b-alpha/", device='cuda')
tokenizer = transformers.AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")

with open('prompts100.txt', 'r') as file:
    lines = file.readlines()
prompts = [line.strip() for line in lines]
prompt_tokens = [tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)) for prompt in prompts]

def execute_prompt(prompt_tokens):
    batch_size = 60
    pt= prompt_tokens[:batch_size]
    results = generator.generate_batch(pt, max_length=512, sampling_topk=1, include_prompt_in_result=False, sampling_temperature=0)

    counts = [len(result.sequences_ids[0]) for result in results]
    outputs = [tokenizer.decode(result.sequences_ids[0]) for result in results]
    return outputs, counts

start_time = time.time() 
outputs, counts = execute_prompt(prompt_tokens)
end_time = time.time()  

time_taken = end_time - start_time
print(f"Time taken for 60 prompts in 4090: {time_taken} seconds")
print(f"Token gen per second: {sum(counts)/time_taken}")
print(counts)

And I am not able to convert the same model to CT2 with 8-bit quantization; I am getting the following error:

File "/home/kd/anaconda3/envs/hf2/lib/python3.12/site-packages/ctranslate2/converters/transformers.py", line 1470, in set_decoder
    print(layer.self_attn.q_proj.qweight.shape)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kd/anaconda3/envs/hf2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Linear' object has no attribute 'qweight'. Did you mean: 'weight'?

BBC-Esq commented 9 months ago

Which version of pytorch are you using?

kdcyberdude commented 9 months ago

Which version of pytorch are you using?

@BBC-Esq torch - 2.3.0.dev20240119+cu121

BBC-Esq commented 9 months ago

I see in the traceback that you're using Python 3.12? PyTorch didn't support Python 3.12 last time I checked...

BBC-Esq commented 9 months ago

Hi @vince62s, just wanted to confirm:

1. Can we convert the Mistral Hugging Face AWQ model or the onmt-awq model to CT2?

2. Is it possible to do 4-bit quantization with CT2?

Hey Mistral users, what kind of throughput are you getting with CT2? Replicating this blog post: https://modal.com/docs/examples/vllm_inference, I am getting close to 2500 tok/sec with a 4-bit quantized model on OpenNMT-py, batch size 60. Here it is: https://huggingface.co/OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemm

If I remember correctly, @guillaumekln said a while ago that it'd require "cutlass" to do 4-bit...