Closed: Matthieu-Tinycoaching closed this issue 12 months ago
Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.
Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.
RAG ?
Retrieval augmented generation, as in creating a vector database and querying it for results, then appending those results to a user's query before both are sent to an LLM for an answer. It lets you ask an LLM about specific information that is, for example, past the model's knowledge cutoff date. Very powerful.
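For anyone following along, here is a minimal sketch of that flow; the embedding model, toy documents, and prompt template below are illustrative placeholders rather than anything prescribed in this thread.

# Minimal RAG sketch (illustrative; model name, documents and prompt template are placeholders).
from sentence_transformers import SentenceTransformer, util

documents = [
    "CTranslate2 is a fast inference engine for Transformer models.",
    "Mistral 7B uses sliding window attention with a 4096-token window.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def build_rag_prompt(question: str, top_k: int = 1) -> str:
    # 1. Embed the query and retrieve the most similar documents from the "vector database".
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=top_k)[0]
    context = "\n".join(documents[hit["corpus_id"]] for hit in hits)
    # 2. Append the retrieved results to the user's query; both go to the LLM together.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("What attention mechanism does Mistral 7B use?"))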
And is it common to use this with sequence lengths higher than 4096?
You can certainly do RAG decently under 4096 but typically, the point of RAG is to make use of as much context as possible.
But again, the sliding window only affects the attention mask; it does not mean that it will "break". If something breaks, it's just because the sequence length is way too long and it will OOM by itself; it does not mean the results will be bad. Anyway, I am implementing the sliding mask in OpenNMT-py and will check how easy it is to replicate in CTranslate2.
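To make the point concrete, here is a small illustration of what a sliding-window causal mask looks like; this is my own sketch, not CTranslate2 or OpenNMT-py code, and the default window size is just Mistral's 4096.

# Sketch of a sliding-window causal attention mask (illustration only).
# Query position i may attend to key positions max(0, i - window + 1) .. i.
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
    causal = j <= i                         # never attend to future tokens
    in_window = (i - j) < window            # never attend further back than the window
    return causal & in_window               # True where attention is allowed

print(sliding_window_causal_mask(6, window=3).int())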
You are right, I misunderstood their article. My apologies.
Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.
What would be the command to use the llama converter for Mistral?
I've uploaded the converted model to Hugging Face. See here.
Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.
When I do this
ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ./models/ctranslate2 --low_cpu_mem_usage
It outputs
ValueError: No conversion is registered for the model configuration MistralConfig
Do I maybe need to change the model type too, or something else?
Did you try changing this line: https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197 to MistralConfig? If this is not enough, we'll need to add the config; otherwise you can directly download the converted file from @winstxnhdw.
@winstxnhdw possible to share how you did the conversion? I am getting the same error:
ValueError: No conversion is registered for the model configuration MistralConfig
I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197, copied llama_loader, created a new function from it, and registered MistralConfig with that function. Basically: copy the llama loader and register MistralConfig.
Just a nice reminder: this will behave 100% like Mistral as long as the sequence length is <= 4096 tokens. It would be interesting to see how it behaves with longer sequences.
When will ctranslate2 support SWA?
@winstxnhdw possible to share how you did the conversion? I am getting the same error:
ValueError: No conversion is registered for the model configuration MistralConfig
I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197, copied llama_loader, created a new function from it, and registered MistralConfig with that function. Basically: copy the llama loader and register MistralConfig.
Can you please post your code for me instead of a picture of it??
# Goes into python/ctranslate2/converters/transformers.py, next to the existing loaders
# (torch, gc, common_spec and transformer_spec are already imported in that file).
@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()
Here's the snippet with which I successfully did the conversion. Not sure if it's worth sending out a PR, given the sliding window support is not there yet.
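In case it helps anyone reproduce this end to end, here is a rough sketch of converting and then running the model once the loader above is registered; the output directory, device, and prompt are arbitrary choices, not anything prescribed in this thread.

# Rough usage sketch (paths, device and prompt are arbitrary).
# Convert first, e.g.:
#   ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ./models/ctranslate2
import ctranslate2
import transformers

generator = ctranslate2.Generator("./models/ctranslate2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("The capital of France is"))
results = generator.generate_batch(
    [tokens], max_length=32, sampling_topk=1, include_prompt_in_result=False
)
print(tokenizer.decode(results[0].sequences_ids[0]))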
Awesome, any chance we can get a bfloat16 CTranslate2 edition, since the model is originally in bfloat16? That way we can use quantizations at runtime other than int8.
Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.
Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl, for example. I'm being serious here: since you successfully converted Mistral by modifying the ctranslate2 scripts, I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally. This is very important to me, so hit me up if you want to discuss. I'd be happy to share my credentials, law firm website, or whatever it takes so we can do this and make payment remotely... Thanks.
I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally.
We second this, i.e. the pay part, although we are focused on healthcare. ctranslate2 is awesome.
Let's do this, we'll split the cost 50/50 for whatever freelance programmer actually does it. We'll need to discuss the amount of time involved first, of course. ;-)
Confirmed. We are also looking into fine-tuning this model, although it does not need very much.
From our tests, this model works the best out of the box, vanilla, with the variety of tests we have for our use case.
I agree, and even though it's a resource hog (relative to other embedding models) it's worth it IMHO.
Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl
Can I ask why you've been so insistent on using instructor-xl over bge-large-en, when bge-large-en has been shown to be more performant and efficient than instructor-xl embeddings in every metric, as shown on the leaderboards?
I've just noticed that it performs significantly better when I use it. Not sure why exactly; I know that different models perform differently depending on the type of text being fed to them, but that's just what I've noticed. Any interest?
Will check out the leaderboard and run some tests, thx.
I've just noticed that it performs significantly better when I use it.
Are you certain that you've prefixed your queries with the following instruction when using bge-en-large-1.5?
Represent this sentence for searching relevant passages:
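For clarity, here is roughly what that looks like, assuming sentence-transformers is used for the embeddings; only queries get the prefix, passages are encoded as-is, and the model id below is my best guess at the checkpoint being discussed.

# Sketch: BGE-style retrieval with the query instruction prepended (assumes sentence-transformers).
from sentence_transformers import SentenceTransformer

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
passage_embeddings = model.encode(["Mistral 7B uses sliding window attention."])
query_embedding = model.encode(QUERY_PREFIX + "What attention does Mistral use?")
# Note: instructor-style models differ in that a free-form instruction is passed per input,
# e.g. INSTRUCTOR(...).encode([["Represent the legal passage for retrieval:", passage]])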
I'm sorry, are you saying that bge-en-large-1.5 allows you to enter instructions like instructor-xl does?
@winstxnhdw do you have a use case to test #1528? It would require passing a very long prompt (> 4096 tokens, maybe double that) and seeing if it outputs a consistent completion.
Yeah, easily but I am really busy this week. I can maybe test something this weekend. Will update.
Hey guys,
I am facing a problem as I am shifting one of my code bases from GPT to Mistral.
def get_embedding(text, model="sentence-transformers/all-MiniLM-L6-v2"):
    text = text.replace("\n", " ")
    if not text:
        text = "this is blank"
    return openai.Embedding.create(
        input=[text], model=model)['data'][0]['embedding']

if __name__ == '__main__':
    # gpt_parameter = {"engine": "text-davinci-003", "max_tokens": 50,
    #                  "temperature": 0, "top_p": 1, "stream": False,
    #                  "frequency_penalty": 0, "presence_penalty": 0,
    #                  "stop": ['"']}
    gpt_parameter = {"max_tokens": 50,
                     "temperature": 0, "top_p": 1, "stream": False,
                     "frequency_penalty": 0, "presence_penalty": 0,
                     "stop": ['"']}
    curr_input = ["driving to a friend's house"]
    prompt_lib_file = "prompt_template/test_prompt_July5.txt"
    prompt = generate_prompt(curr_input, prompt_lib_file)

    def __func_validate(gpt_response):
        if len(gpt_response.strip()) <= 1:
            return False
        if len(gpt_response.strip().split(" ")) > 1:
            return False
        return True

    def __func_clean_up(gpt_response):
        cleaned_response = gpt_response.strip()
        return cleaned_response
I wanted to know which "engine" and embedding model should be used for Mistral.
Looking forward to your help 🙂
That's not remotely how you should be using any open-source model, and let's not pollute this issue any further with irrelevant topics. You can create a new issue for this. Also, it might be useful for you to learn what an API client library is first.
Ideally, there should be a discussion tab for such matters. Maybe @guillaumekln can help enable the tab?
Alright.
I closed #1528 and worked with @minhthuc2502 on #1524.
still WIP, not good so far.
We just merged #1524 (great teamwork with @minhthuc2502). Mistral should now run fine with very long inputs. I just recommend using int8_float16 when converting; plain float16 may go OOM quite easily on a 24GB GPU.
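For anyone wanting to follow that recommendation, a conversion along those lines can be done from Python as well as from the CLI; the model id and output directory below are just examples, not anything prescribed here.

# Example int8_float16 conversion (equivalent to the ct2-transformers-converter CLI).
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("mistralai/Mistral-7B-Instruct-v0.1", low_cpu_mem_usage=True)
converter.convert("mistral-7b-instruct-ct2", quantization="int8_float16")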
Hey Mistral users, what kind of throughput are you getting with CT2? Replicating this blog post: https://modal.com/docs/examples/vllm_inference I am getting close to 2500 tok/sec with a 4-bit quantized model on OpenNMT-py, batch_size 60. Here it is: https://huggingface.co/OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemm
I'm getting a 'Limit exceeded' error and have tried everything, but I am not getting any output from it.
Hi @vince62s, Just wanted to confirm -
Hey Mistral users, what kind of throughput are you getting with CT2? Replicating this blog post: https://modal.com/docs/examples/vllm_inference I am getting close to 2500 tok/sec with a 4-bit quantized model on OpenNMT-py, batch_size 60. Here it is: https://huggingface.co/OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemm
this looks really impressive, what are you running this on? AWS EC2 A10? or?
Hey Mistral users, what kind of throughput are you getting with CT2?
96-181 toks/s on an RTX 3090 with CT2 (so obviously 8-bit quantised). 8 toks/s on Intel i7-8700
Hi @vince62s, Just wanted to confirm -
Can we convert the Mistral huggingface awq model or onmt-awq model to CT2? Is it possible to do 4-bit quantization with CT2?
no, not at the moment, only int8 quantization.
96-181 toks/s on an RTX 3090 with CT2 (so obviously 8-bit quantised). 8 toks/s on Intel i7-8700
but batch_size 1, right?
Yeap, just 1.
This is the speed that I am getting with the berkeley-nest/Starling-LM-7B-alpha model quantized with int8_bfloat16 on a 4090:
batch_size | max_seq_len | overall_time (s) | token_speed (tok/s)
---|---|---|---
60 | 256 | 6.8 | 2145
60 | 512 | 18.1 | 1436
40 | 1024 | 34.6 | 867
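(Reading the table, token_speed appears to be total generated tokens across the batch divided by wall-clock time: for the first row, roughly 2145 tok/s x 6.8 s is about 14,600 tokens over 60 prompts, i.e. around 240 generated tokens per prompt, consistent with max_seq_len 256.)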
This is the script that I am using -
import ctranslate2
import transformers
import time

generator = ctranslate2.Generator("../c2/starling-lm-7b-alpha/", device='cuda')
tokenizer = transformers.AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")

with open('prompts100.txt', 'r') as file:
    lines = file.readlines()
prompts = [line.strip() for line in lines]
prompt_tokens = [tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)) for prompt in prompts]

def execute_prompt(prompt_tokens):
    batch_size = 60
    pt = prompt_tokens[:batch_size]
    results = generator.generate_batch(pt, max_length=512, sampling_topk=1, include_prompt_in_result=False, sampling_temperature=0)
    counts = [len(result.sequences_ids[0]) for result in results]
    outputs = [tokenizer.decode(result.sequences_ids[0]) for result in results]
    return outputs, counts

start_time = time.time()
outputs, counts = execute_prompt(prompt_tokens)
end_time = time.time()
time_taken = end_time - start_time

print(f"Time taken for 60 prompts in 4090: {time_taken} seconds")
print(f"Token gen per second: {sum(counts)/time_taken}")
print(counts)
And I am not able to convert the same model to CT2 with 8bit quantization, getting the following error -
File "/home/kd/anaconda3/envs/hf2/lib/python3.12/site-packages/ctranslate2/converters/transformers.py", line 1470, in set_decoder
print(layer.self_attn.q_proj.qweight.shape)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kd/anaconda3/envs/hf2/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Linear' object has no attribute 'qweight'. Did you mean: 'weight'?
Which version of pytorch are you using?
@BBC-Esq torch - 2.3.0.dev20240119+cu121
I see in the traceback that you're using Python 3.12? PyTorch doesn't support Python 3.12, last time I checked...
Hi @vince62s, Just wanted to confirm -
1. Can we convert the Mistral huggingface awq model or onmt-awq model to CT2? 2. Is it possible to do 4-bit quantization with CT2?
Hey Mistral users, what kind of throughput are you getting with CT2? Replicating this blog post: https://modal.com/docs/examples/vllm_inference I am getting close to 2500 tok/sec with a 4-bit quantized model on OpenNMT-py, batch_size 60. Here it is: https://huggingface.co/OpenNMT/Mistral-7B-v0.2-instruct-onmt-awq-gemm
If I remember correctly, @guillaumekln said a while ago that it'd require "cutlass" to do 4-bit...
Hi,
Would it be possible to add support for the "mistralai/Mistral-7B-Instruct-v0.1" model?