ggerganov / llama.cpp

LLM inference in C/C++
MIT License

MiniCPM 2b model support? #5276

Open KnutJaegersberg opened 6 months ago

KnutJaegersberg commented 6 months ago

Feature Description

Like Phi, which is already supported, it would be great to have this Mistral-level 2B model convertible to GGUF.

Motivation

A SOTA 2B model, a piece of art; read how they made it:

https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20

ggerganov commented 6 months ago

Seems like the only unusual thing about this architecture is some modification related to mixing the input/output embeddings of the layers:

image

From this description, I'm not 100% sure what it means, but I suppose it would not be too difficult to implement.

raymond-infinitecode commented 6 months ago

Most impressive 2B model I've ever seen.

ShengdingHu commented 6 months ago

Thank you for your interest in MiniCPM. I am one of the authors. In MiniCPM, we implement tie_word_embedding, which uses the same matrix for both the input embedding and the output projection (lm_head). To adapt this from a standard architecture like Llama, you would need to make adjustments such as replacing lm_head.projection with something like input_embedding.projection.

Additionally, our model incorporates $\mu$P (https://arxiv.org/abs/2203.03466), a technique that applies numeric scaling to various model parameters and forward hidden states. Here are the specific inference-time (not training) modifications we've made:

| Modification Name | Specific Operation |
| --- | --- |
| Embedding Output Scaling | We multiply the output of the embedding by scale_emb = 12. |
| Residual Connection Scaling | The increment at each residual connection in every layer is scaled by scale_depth/√(num_layers), which equals 1.4/√40. |
| lm_head Scaling | The output logits are scaled to 1/(dim_model/256) = 1/9 of their original value. |

These modifications can also be seen in our Hugging Face transformers code.
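For concreteness, here is a minimal PyTorch-style sketch (not the official implementation) of where these three scalings would enter a Llama-like forward pass. The constants come from the table above (dim_model = 2304 follows from dim_model/256 = 9); embed, layers, attn, mlp, and the norms are illustrative placeholders.

```python
import math

# Constants from the table above (MiniCPM-2B); dim_model is inferred from
# dim_model / 256 = 9.
SCALE_EMB = 12.0
SCALE_DEPTH = 1.4
NUM_LAYERS = 40
DIM_MODEL = 2304

def minicpm_like_forward(tokens, embed, layers, final_norm):
    # 1) Embedding output scaling
    h = embed(tokens) * SCALE_EMB

    # 2) Residual connection scaling: each residual increment is multiplied
    #    by scale_depth / sqrt(num_layers)
    res_scale = SCALE_DEPTH / math.sqrt(NUM_LAYERS)
    for layer in layers:
        h = h + layer.attn(layer.attn_norm(h)) * res_scale
        h = h + layer.mlp(layer.mlp_norm(h)) * res_scale

    h = final_norm(h)

    # 3) Tied lm_head with logit scaling: project with the input embedding
    #    matrix, then scale the logits by 1 / (dim_model / 256)
    logits = h @ embed.weight.T / (DIM_MODEL / 256)
    return logits
```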

Llama.cpp is an exceptional framework for running edge-side LLMs. Please feel free to contact me at shengdinghu@gmail.com if you have any questions or need further clarification.

sweetcard commented 6 months ago

There is a branch of llama.cpp that supports MiniCPM in your official repository. Could you push it back here?

@ShengdingHu

lzs0603 commented 6 months ago

That branch appears to have an issue. I am using the openbmb/MiniCPM-2B-dpo-fp16 model; when converting it with python3 convert.py, the converted model lacks output.weight (see screenshot), resulting in an error. Additionally, the convert-hf-to-gguf.py script has not yet implemented support for MiniCPM, leading to the error: "Architecture 'MiniCPMForCausalLM' not supported!"

sweetcard commented 6 months ago

Some changes were applied to llama.cpp in this project:

https://github.com/zkh2016/llmfarm_core.swift/commit/c7de12db67a12b3c22367721d70f1c3228830116

lzs0603 commented 6 months ago

Thanks, problem solved! But the output makes no sense (see screenshot).

sweetcard commented 6 months ago

Is the prompt template correct? Check this file:

https://github.com/OpenBMB/LLMFarm-MiniCPM/blob/main/LLMFarm/model_setting_templates/llama%20chat%202%207B%20iphone%2012.json

lin-calvin commented 6 months ago

Maybe an issue with the tokenizer?

runfuture commented 6 months ago

@ShengdingHu The information you summarized from the paper is very helpful. Thank you for your work.

ShengdingHu commented 6 months ago

There is a branch of llama.cpp that supports MiniCPM in your official repository. Could you push it back here?

Sure, we will look into it today!

ShengdingHu commented 6 months ago

Thanks, the output might not be due to the template. Could you point me to the code that's generating the nonsensical output? I'm a bit lost in all the information. Is it directly produced by our zkh2016/llmfarm_core.swift@c7de12d?

lzs0603 commented 6 months ago

Not really. I actually referenced this repo and integrated it with zkh2016/llmfarm_core.swift@c7de12d.

ShengdingHu commented 6 months ago

Good news: we have converted the original checkpoints into Llama format. Specifically,

  1. we absorb the $\mu$P scaling factors into the model checkpoints.
  2. we untie the heads and absorb the scaling factors into the embedding and lm_head (although this takes more memory).

This produces a checkpoint that can be loaded directly by Llama code. The Hugging Face repo is openbmb/MiniCPM-2B-dpo-bf16-llama-format:

import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM

# Load the Llama-format MiniCPM checkpoint with the standard Llama classes
model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format"
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

# MiniCPM chat format: wrap the user message as <用户>{message}<AI>
prompt = "Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`"
input_ids = tokenizer.encode("<用户>{}<AI>".format(prompt), return_tensors='pt', add_special_tokens=True).cuda()
responds = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024)
responds = tokenizer.decode(responds[0], skip_special_tokens=True)
print(responds)

example output:

image

This might be a lot easier to use in llama.cpp.

Could you help us add it to the supported models?

As for the MiniCPM code without converting to Llama format, we think the cause might be that lm_head is not loaded from the input embeddings. Could you help us check it?

Thanks a lot!
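For reference, here is a hypothetical sketch of how such an absorption could be done on a Hugging Face Llama-style state dict, based on the description above. The tensor names assume the standard HF Llama layout and the constants come from the scaling table earlier in the thread; this is not the authors' actual conversion script.

```python
import math

# Assumed constants (see the scaling table earlier in the thread).
SCALE_EMB = 12.0
SCALE_DEPTH = 1.4
NUM_LAYERS = 40
DIM_MODEL = 2304

def absorb_mup_scales(state_dict):
    """Fold the muP scalings into the weights and untie lm_head, so the
    checkpoint can be run with plain Llama forward code."""
    sd = dict(state_dict)
    embed = state_dict["model.embed_tokens.weight"]

    # 1) Fold the embedding output scaling into the input embedding matrix.
    sd["model.embed_tokens.weight"] = embed * SCALE_EMB

    # 2) Fold the residual scaling into each block's final projections, so the
    #    o_proj/down_proj outputs already carry the scale_depth/sqrt(L) factor.
    res_scale = SCALE_DEPTH / math.sqrt(NUM_LAYERS)
    for i in range(NUM_LAYERS):
        for name in (f"model.layers.{i}.self_attn.o_proj.weight",
                     f"model.layers.{i}.mlp.down_proj.weight"):
            sd[name] = sd[name] * res_scale

    # 3) Untie the head: lm_head uses the *original* (unscaled) embedding
    #    matrix, with the 1 / (dim_model / 256) logit scaling folded in.
    sd["lm_head.weight"] = embed / (DIM_MODEL / 256)
    return sd
```

Folding the factors this way keeps the forward math identical, which is also why the untied checkpoint uses a bit more memory: the embedding matrix is effectively stored twice.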

sweetcard commented 6 months ago

There is a PR to support MiniCPM 2B. Please help check whether it works correctly:

https://github.com/ggerganov/llama.cpp/pull/5346

ShengdingHu commented 6 months ago

It seems that the model behaves strangely in the PR; I am checking it.

runfuture commented 6 months ago

Please check here: I've fixed the bug; more testing is welcome.

runfuture commented 6 months ago

@ShengdingHu Could you please help test the latest release of llama.cpp, which includes support for converting and running inference with the original MiniCPM model? If everything works well, there are at least three benefits compared to providing a "llama-ized" MiniCPM model on Hugging Face:

  1. It simplifies your model publishing work, as there is no need to convert various kinds of models.
  2. It reduces confusion for users who encounter different types of models.
  3. It saves memory for both model storage and inference.

Thank you and I look forward to hearing your feedback.

ShengdingHu commented 6 months ago

That's definitely better than the Llama-format checkpoint, thanks very much. I am testing the latest release.

gardner commented 6 months ago

Still getting:

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found
./main --version
version: 2252 (525213d2)
runfuture commented 6 months ago

I've just tested it, and it works. Please make sure you have converted the model using the latest version of convert-hf-to-gguf.py.

flatsiedatsie commented 4 months ago

What's the status of this? They just released three new models, and it's as if they were reading my mind by creating a 3B model with a huge context (which could be great for summarization).

https://www.reddit.com/r/LocalLLaMA/comments/1c3badu/three_new_minicpm_models_moe_vision_128k/

gardner commented 4 months ago

I am still getting this on Apple Silicon:

$ make clean ; git pull origin ; make -j $(nproc)
$ conda activate llama
$ python3 -m pip install -U -r requirements.txt

$ python3 convert-hf-to-gguf.py models/openbmb/MiniCPM-2B-128k/
Loading model: MiniCPM-2B-128k
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type unk to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
gguf: Setting chat_template to {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Exporting model to 'models/openbmb/MiniCPM-2B-128k/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/Users/gardner/miniconda3/envs/llama/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
output_norm.weight, n_dims = 1, torch.bfloat16 --> float32
Can not map tensor 'lm_head.weight'

It creates a zero-length file at models/openbmb/MiniCPM-2B-128k/ggml-model-f16.gguf

$ ls -lah models/openbmb/MiniCPM-2B-128k
total 11779952
drwxr-xr-x@ 14 gardner  staff   448B 14 Apr 23:05 .
drwxr-xr-x@  3 gardner  staff    96B 14 Apr 22:59 ..
-rw-r--r--@  1 gardner  staff   1.5K 14 Apr 22:57 .gitattributes
-rw-r--r--@  1 gardner  staff   7.2K 14 Apr 22:57 README.md
-rw-r--r--@  1 gardner  staff   168B 14 Apr 22:57 added_tokens.json
-rw-r--r--@  1 gardner  staff   1.1K 14 Apr 22:57 config.json
-rw-r--r--@  1 gardner  staff   9.7K 14 Apr 22:57 configuration_minicpm.py
-rw-r--r--@  1 gardner  staff     0B 14 Apr 23:05 ggml-model-f16.gguf
-rw-r--r--@  1 gardner  staff    66K 14 Apr 22:57 modeling_minicpm.py
-rw-r--r--@  1 gardner  staff   5.6G 14 Apr 22:59 pytorch_model.bin
-rw-r--r--@  1 gardner  staff   574B 14 Apr 22:59 special_tokens_map.json
-rw-r--r--@  1 gardner  staff   5.9M 14 Apr 22:59 tokenizer.json
-rw-r--r--@  1 gardner  staff   1.9M 14 Apr 22:59 tokenizer.model
-rw-r--r--@  1 gardner  staff   2.6K 14 Apr 22:59 tokenizer_config.json
runfuture commented 4 months ago

@gardner @flatsiedatsie The latest long-context model isn't supported yet because it "removed tie_embedding and expanded the vocabulary to 127660". It could be solved by adding some lines to handle MODEL_TENSOR.OUTPUT. However, it seems quite difficult to distinguish the new model from the older ones. @ShengdingHu any suggestions?
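To illustrate the idea (this is not the actual convert-hf-to-gguf.py code), the converter would need to choose a source tensor for the GGUF output.weight (MODEL_TENSOR.OUTPUT) depending on whether the checkpoint still ties the embeddings; a hypothetical helper along these lines:

```python
def pick_output_weight_source(hf_tensor_names: set[str]) -> str:
    """Decide which HF tensor should back the GGUF 'output.weight'."""
    if "lm_head.weight" in hf_tensor_names:
        # Newer checkpoints (e.g. MiniCPM-2B-128k): embeddings are untied and
        # an explicit lm_head is stored, so map it to output.weight.
        return "lm_head.weight"
    # Original MiniCPM-2B: tie_word_embeddings is used, so reuse the input
    # embedding matrix as the output projection.
    return "model.embed_tokens.weight"

# Example:
# pick_output_weight_source({"model.embed_tokens.weight", "lm_head.weight"})
# -> "lm_head.weight"
```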

foldl commented 4 months ago

Hey, you can give ChatLLM.cpp a try; the 2B, 1B, and MoE models are all supported. 😊

flatsiedatsie commented 4 months ago

@foldl thanks for the suggestion, but unfortunately I'm relying on llama-cpp-wasm / wllama to run these models 100% in the browser. ChatLLM.cpp does not seem to support that use case at the moment?