Lightning-AI / litgpt


Mistral Nemo 12B Checkpoints #1597

Open · rasbt opened this issue 1 month ago

rasbt commented 1 month ago

According to some initial reports, this new model works great. If we have the time, it would be a nice model to add, as it would fill the "multilingual" niche. (Some people have been asking about models for various non-English languages.) I'm not sure whether Gemma-2 already covers that niche, though.

Andrei-Aksionov commented 1 month ago

There is no modeling_*.py file in the repo, and the config.json looks pretty standard.

It might just be a matter of adding a config; a rough sketch of what that entry could look like is below.

Update: there is a custom tokenizer, tekken. So, yeah, it might not be so easy 🙃.
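
For reference, if the architecture really is plain Mistral apart from the tokenizer, the config-only route would roughly mean appending an entry like the one below to the Mistral section of `litgpt/config.py`. This is only a sketch: the field names follow the existing Mistral entries, and the hyperparameter values are taken from the model's Hugging Face config.json as I recall them, so everything (especially `head_size` and `block_size`) should be verified before adding it.

```python
# Hypothetical litgpt config entry for Mistral Nemo 12B -- all values need verification.
dict(
    name="Mistral-Nemo-Base-2407",
    hf_config=dict(org="mistralai", name="Mistral-Nemo-Base-2407"),
    padded_vocab_size=131072,   # tekken tokenizer vocabulary
    block_size=1024000,         # max_position_embeddings in config.json; advertised context is 128k
    n_layer=40,
    n_head=32,
    n_embd=5120,
    head_size=128,              # must be explicit: 32 heads x 128 = 4096 for Wq, not 5120
    n_query_groups=8,           # grouped-query attention
    rotary_percentage=1.0,
    parallel_residual=False,
    bias=False,
    norm_class_name="RMSNorm",
    norm_eps=1e-05,
    mlp_class_name="LLaMAMLP",
    intermediate_size=14336,
    rope_base=1000000,
),
```

The tekken tokenizer would still need a separate look: litgpt can load Hugging Face `tokenizer.json` files, so it may just work, but that is untested here.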

rasbt commented 1 month ago

In case we want to pursue this, here are some findings from Daniel Han:

My findings for Mistral NeMo 12b:

  1. EOS token is untrained in base - a bug?
  2. EOS token is auto appended
  3. 4096, not 5120 for Wq
  4. Not Llama Tokenizer
  5. Tools, FIM
  6. Pad_token=10
  7. 1M max RoPE pos: new dynamic RoPE in 🦥 @UnslothAI saves 20GB VRAM

Longer notes:

  1. EOS token is untrained in the base model but trained in instruct - confirming with @MistralAI whether this is a feature or a bug - could make finetunes break with NaNs and infinities. Mistral 7b does not have this issue. Only the embed_tokens, not the lm_head, has this issue. (A quick check is sketched at the end of this comment.)

  2. EOS token is auto appended. This can break finetuning and inference - collabed with @xenovacom to fix this quickly :)

  3. Not 5120 for Wq but 4096 - the HF transformers main branch already has a fix for this - please update transformers! Unsloth auto patches, so no need to update!

  4. Not a Llama Tokenizer - was GPT2 Tokenizer, now a generic PreTrainedTokenizer? Very interesting! The tokenizer compresses other languages more efficiently.

  5. Support for tools & FIM (fill-in-the-middle tasks). Function calling, code completion, etc.

  6. Pad_token=10. A dedicated pad token - yay! Finetuning can break less with fewer infinite outputs :)

  7. 1 million possible position embeddings - had to support dynamic sizing of the cos & sin cached matrices to not go OOM (used 20GB!)
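
For context on that last point, the dynamic-sizing trick boils down to building the RoPE cos/sin cache lazily for the longest sequence actually seen, instead of preallocating it for all ~1M positions. A minimal sketch of the idea (not Unsloth's actual implementation; the class and method names here are made up):

```python
import torch


class DynamicRoPECache:
    """Grow the RoPE cos/sin cache on demand instead of preallocating ~1M positions."""

    def __init__(self, head_dim: int, base: float = 1_000_000.0, device=None):
        # Standard RoPE inverse frequencies, one per pair of head dimensions.
        self.inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
        self.cos = torch.empty(0)
        self.sin = torch.empty(0)

    def get(self, seq_len: int):
        # Rebuild only when a longer sequence than anything seen so far is requested.
        if seq_len > self.cos.shape[0]:
            t = torch.arange(seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq)      # (seq_len, head_dim // 2)
            emb = torch.cat((freqs, freqs), dim=-1)    # (seq_len, head_dim)
            self.cos, self.sin = emb.cos(), emb.sin()
        return self.cos[:seq_len], self.sin[:seq_len]


# Example: with head_dim=128 and rope_theta=1e6, a 4096-token batch only
# materializes 4096 rows instead of the full ~1M-position cache.
cache = DynamicRoPECache(head_dim=128)
cos, sin = cache.get(4096)
```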

More details in our blog: https://unsloth.ai/blog/mistral-nemo

Our free Colab notebook can finetune the 12B model on a free 16GB Tesla T4 GPU (it fits exactly), 2x faster and with 60% less VRAM than HF+FA2! https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing

We also have a Kaggle notebook that makes finetuning 2x faster: https://kaggle.com/code/danielhanchen/kaggle-mistral-nemo-12b-unsloth-notebook
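
Coming back to point 1 above (the untrained EOS embedding in the base model), a quick local probe is to compare the EOS row of `embed_tokens` against the average row norm. This is just an illustrative check with transformers (the repo id is assumed to be `mistralai/Mistral-Nemo-Base-2407`), not a definitive test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"  # assumed HF repo id -- double-check
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

embed = model.get_input_embeddings().weight          # embed_tokens matrix
eos_row_norm = embed[tokenizer.eos_token_id].float().norm()
mean_row_norm = embed.float().norm(dim=-1).mean()

# A (near-)untrained row usually has a norm far from the average embedding norm,
# which is what can blow up into NaNs/infs when finetuning on it.
print(f"EOS row norm: {eos_row_norm:.4f} vs. mean row norm: {mean_row_norm:.4f}")
```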