“A Stochastic Parrot, flat design, vector art” — Stable Diffusion XL
This repository contains Stability AI's ongoing development of the StableLM series of language models and will be continuously updated with new checkpoints. The following provides an overview of all currently available models. More coming soon.
September 29, 2023: Released StableLM-3B-4E1T, a 3B parameter base model trained for 4 epochs on 1T tokens (see the technical report below).
August 5, 2023: Released StableLM-Alpha v2 models with 3B and 7B parameters.
April 28, 2023: Released StableVicuna-13B, an RLHF fine-tune of Vicuna-13B v0, which is itself a fine-tune of LLaMA-13B.
April 20, 2023: Released the initial set of StableLM-Alpha models, with 3B and 7B parameters. Base models are released under CC BY-SA-4.0. Try chatting with our 7B model, StableLM-Tuned-Alpha-7B, on Hugging Face Spaces.
Technical Report: StableLM-3B-4E1T
StableLM-3B-4E1T is a 3 billion (3B) parameter language model pre-trained under the multi-epoch regime to study the impact of repeated tokens on downstream performance. Given prior success in this area (Tay et al., 2023 and Taylor et al., 2022), we train on 1 trillion (1T) tokens for 4 epochs following the observations of Muennighoff et al. (2023) in "Scaling Data-Constrained Language Models" in which they find "training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data." Further inspiration for the token count is taken from "Go smol or go home" (De Vries, 2023), which suggests a 2.96B model trained for 2.85 trillion tokens achieves a similar loss to a Chinchilla compute-optimal 9.87B language model ($k_n = 0.3$).
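For reference, the model name encodes the training budget described above: four epochs over the 1T-token dataset account for the 4T training tokens reported in the table below.

$$4 \text{ epochs} \times 1\,\text{T unique tokens} = 4\,\text{T training tokens}$$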
Size | StableLM-3B-4E1T | Training Tokens | Parameters |
---|---|---|---|
3B | checkpoint | 4T | 2,795,443,200 |
The model is a decoder-only transformer similar to the LLaMA (Touvron et al., 2023) architecture with the following modifications:
Parameters | Hidden Size | Layers | Heads | Sequence Length |
---|---|---|---|---|
2,795,443,200 | 2560 | 32 | 32 | 4096 |
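The shape parameters above can be read back from the released checkpoint's configuration. The sketch below assumes the config exposes the standard Hugging Face attribute names and that your transformers version can load the model (older versions may need trust_remote_code=True):

```python
# Minimal sketch: read the model shapes from the published config. Attribute names assume
# the standard Hugging Face causal-LM config fields and may vary across transformers versions.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("stabilityai/stablelm-3b-4e1t", trust_remote_code=True)
print(config.hidden_size)              # expected: 2560
print(config.num_hidden_layers)        # expected: 32
print(config.num_attention_heads)      # expected: 32
print(config.max_position_embeddings)  # expected: 4096
```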
The dataset comprises a filtered mixture of open-source large-scale datasets available on the Hugging Face Hub: the Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer, 2023) and The Pile (Gao et al., 2020), both with Books3 and certain other subsets removed, and StarCoder (Li et al., 2023).
Given the large amount of web data, we recommend fine-tuning the base StableLM-3B-4E1T for your downstream tasks.
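As one possible starting point, a plain causal-LM fine-tune with the Hugging Face Trainer might look like the sketch below. This is not an official recipe: the dataset file, sequence length, and hyperparameters are placeholders to adapt to your task, and the pad-token assignment is only needed if the tokenizer ships without one.

```python
# Hypothetical fine-tuning sketch (not an official recipe): dataset path and
# hyperparameters are placeholders to adapt to your downstream task.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "stabilityai/stablelm-3b-4e1t"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: tokenizer may ship without a pad token

# Placeholder text dataset; replace with your own task data.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="stablelm-3b-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```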
Please refer to the provided YAML configuration file stablelm-3b-4e1t.yml for complete hyperparameter settings, and to the technical report for further details.
The following zero-shot evaluations are performed with the lm-evaluation-harness using the lm-bench branch of Stability AI's fork. Full lm-eval JSONs can be found in the evals directory.
Pre-Trained Model | Average | ARC Challenge | ARC Easy | BoolQ | HellaSwag (✱) | LAMBADA OpenAI | OpenBookQA | PIQA | SciQ | Winogrande |
---|---|---|---|---|---|---|---|---|---|---|
meta-llama/Llama-2-13b-hf | 71.77 | 48.63 | 79.50 | 80.52 | 79.36 | 76.77 | 35.40 | 79.05 | 94.50 | 72.22 |
huggyllama/llama-7b | 68.84 | 41.89 | 75.25 | 75.05 | 76.22 | 73.55 | 34.40 | 78.67 | 94.60 | 69.93 |
meta-llama/Llama-2-7b-hf | 68.75 | 43.00 | 76.26 | 77.74 | 75.94 | 73.47 | 31.40 | 77.75 | 93.60 | 69.61 |
Qwen/Qwen-7B | 67.91 | 45.39 | 67.38 | 74.56 | 88.85 (?) | 69.67 | 32.20 | 73.99 | 93.20 | 65.98 |
tiiuae/falcon-7b | 67.83 | 40.27 | 74.41 | 73.55 | 76.35 | 74.56 | 30.60 | 79.49 | 94.00 | 67.25 |
mosaicml/mpt-7b | 67.36 | 40.53 | 74.92 | 73.94 | 76.17 | 68.64 | 31.40 | 78.89 | 93.70 | 68.03 |
stabilityai/stablelm-3b-4e1t | 66.93 | 37.80 | 72.47 | 75.63 | 73.90 | 70.64 | 31.40 | 79.22 | 94.80 | 66.54 |
baichuan-inc/Baichuan2-7B-Base | 66.93 | 42.24 | 75.00 | 73.09 | 72.29 | 70.99 | 30.40 | 76.17 | 94.60 | 67.56 |
stabilityai/stablelm-base-alpha-7b-v2 | 66.89 | 38.48 | 73.19 | 70.31 | 74.27 | 74.19 | 30.40 | 78.45 | 93.90 | 68.82 |
openlm-research/open_llama_7b_v2 | 66.32 | 38.82 | 71.93 | 71.41 | 74.65 | 71.05 | 30.20 | 79.16 | 93.80 | 65.82 |
microsoft/phi-1_5 | 65.57 | 44.45 | 76.14 | 74.53 | 62.62 | 52.75 | 37.60 | 76.33 | 93.20 | 72.53 |
EleutherAI/gpt-neox-20B | 65.57 | 37.88 | 72.90 | 69.48 | 71.43 | 71.98 | 29.80 | 77.42 | 93.10 | 66.14 |
togethercomputer/RedPajama-INCITE-7B-Base | 65.07 | 37.71 | 72.35 | 70.76 | 70.33 | 71.34 | 29.00 | 77.15 | 92.70 | 64.33 |
cerebras/btlm-3b-8k-base (§) | 63.59 | 34.90 | 70.45 | 69.63 | 69.78 | 66.23 | 27.60 | 75.84 | 92.90 | 64.96 |
EleutherAI/pythia-12b | 62.69 | 31.83 | 70.20 | 67.31 | 67.38 | 70.64 | 26.40 | 76.28 | 90.20 | 64.01 |
openlm-research/open_llama_3b_v2 | 62.43 | 33.87 | 67.59 | 65.69 | 69.99 | 66.74 | 26.00 | 76.66 | 92.40 | 62.90 |
EleutherAI/gpt-j-6B | 62.34 | 33.96 | 66.96 | 65.44 | 66.24 | 68.23 | 29.00 | 75.57 | 91.50 | 64.17 |
stabilityai/stablelm-base-alpha-3b-v2 | 62.19 | 32.42 | 67.26 | 64.56 | 68.58 | 70.25 | 26.40 | 76.01 | 92.10 | 62.12 |
facebook/opt-6.7b | 61.85 | 30.72 | 65.66 | 66.02 | 67.20 | 67.65 | 27.60 | 76.33 | 90.10 | 65.35 |
EleutherAI/pythia-6.9b | 60.58 | 31.83 | 67.21 | 64.01 | 63.88 | 67.01 | 25.80 | 75.08 | 89.80 | 60.62 |
EleutherAI/pythia-2.8b-deduped | 58.52 | 30.12 | 63.47 | 64.13 | 59.44 | 65.15 | 23.80 | 74.10 | 88.20 | 58.25 |
§: Previous 3B pre-trained SOTA. ?: Outlier result. ✱: Byte-length normalized accuracy.
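The table above can be reproduced approximately through the harness's Python API. The call below follows the upstream lm-evaluation-harness interface; the lm-bench branch of the fork may expose different model types or extra arguments, so treat this as a sketch rather than the exact command used for these numbers.

```python
# Approximate reproduction sketch using the upstream lm-evaluation-harness Python API;
# the lm-bench fork may expose slightly different model types or arguments.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=stabilityai/stablelm-3b-4e1t",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "lambada_openai", "openbookqa", "piqa", "sciq", "winogrande"],
    num_fewshot=0,
)
print(results["results"])
```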
StableLM-3B-4E1T achieves state-of-the-art performance (September 2023) at the 3B parameter scale for open-source models and is competitive with many of the popular contemporary 7B models, even outperforming our most recent 7B StableLM-Base-Alpha-v2.
StableLM-Alpha v2 models significantly improve on the initial Alpha models by incorporating architectural improvements such as SwiGLU (Shazeer, 2020) and using higher-quality data sources, as discussed below. The context length for these models is 4096 tokens.
Size | StableLM-Base-Alpha-v2 | Training Tokens | Parameters |
---|---|---|---|
3B | checkpoint | 1.1T | 2,796,431,360 |
7B | checkpoint | 1.1T | 6,890,209,280 |
Please refer to the provided YAML configuration files for hyperparameter details. For example, for the extended StableLM-Alpha-3B-v2 model, see stablelm-base-alpha-3b-v2-4k-extension.yml.
Following similar work, we use a multi-stage approach to context length extension (Nijkamp et al., 2023), scheduling 1 trillion tokens at context length 2048 followed by 100 billion tokens at 4096. We found that sequence length warmup (Li et al., 2022) helped stabilize early spikes during the first ~80 billion tokens of pre-training. However, it was not applied to the final runs due to significant throughput penalties as length shapes grew across the curriculum.
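Purely as an illustration of the schedule described above (warmup followed by the 2048 -> 4096 two-stage extension), a token-based length schedule could look like the function below. The warmup shape and minimum length are assumptions, the ~80B-token window and stage lengths come from the text, and note that warmup was not applied to the final runs.

```python
# Illustrative schedule only; not the training code. Stage lengths (2048 -> 4096) and the
# ~80B-token warmup window come from the text above; the warmup shape is an assumption.
def context_length_schedule(tokens_seen: int,
                            warmup_tokens: int = 80_000_000_000,
                            stage1_tokens: int = 1_000_000_000_000,
                            min_length: int = 64,
                            stage1_length: int = 2048,
                            stage2_length: int = 4096) -> int:
    """Return the training context length to use after `tokens_seen` tokens."""
    if tokens_seen < warmup_tokens:
        # Sequence length warmup: ramp linearly from min_length up to stage1_length.
        fraction = tokens_seen / warmup_tokens
        return max(min_length, int(stage1_length * fraction))
    if tokens_seen < stage1_tokens:
        return stage1_length   # first stage: ~1T tokens at 2048
    return stage2_length       # second stage: remaining ~100B tokens at 4096
```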
The most impactful changes for StableLM-Alpha-v2 downstream performance were the use of higher-quality data sources and mixtures; specifically, the use of RefinedWeb and C4 in place of The Pile v2 Common Crawl scrape, as well as sampling web text at a much higher rate (35% -> 71%).
The first pre-training stage relies on 1 trillion tokens sourced from a mix of the public Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer, 2023), The Pile (Gao et al., 2020), and internal datasets, with web text sampled at a rate of 71%.
In the second stage, we include the StarCoder (Li et al., 2023) dataset and downsample web text to 55% while increasing the sampling proportions of naturally long text examples in the aforementioned sources.
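As a toy illustration of the mixture change, the sketch below samples data sources in proportion to a weight table; only the 71% and 55% web-text rates come from the text above, the remaining weights are placeholders.

```python
# Illustrative source-sampling sketch; only the 71% / 55% web-text rates come from the
# text above, the remaining weights are made-up placeholders.
import random

stage1_weights = {"web_text": 0.71, "other_sources": 0.29}
stage2_weights = {"web_text": 0.55, "starcoder_and_long_text": 0.45}

def sample_source(weights: dict) -> str:
    """Pick a data source with probability proportional to its weight."""
    sources = list(weights)
    return random.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

print(sample_source(stage1_weights))
```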
The following zero-shot evaluations are performed with the lm-evaluation-harness at commit df3da98c5405deafd519c2ddca52bb7c3fe36bef, with the exception of SIQA, which uses the add-siqa branch with the prompt format:
{doc['context']}\nQuestion: {doc['question']}\nAnswer:
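Applied to a single SIQA example, that format string produces a prompt like the one below (the example fields are invented for illustration):

```python
# The SIQA prompt format quoted above, applied to an illustrative (made-up) example dict.
doc = {
    "context": "Tracy organized a surprise party for her roommate.",
    "question": "How would the roommate feel afterwards?",
}
prompt = f"{doc['context']}\nQuestion: {doc['question']}\nAnswer:"
print(prompt)
```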
Model | ARC Challenge✱ | ARC Easy✱ | BoolQ | HellaSwag✱ | LAMBADA OpenAI | OpenBookQA | PIQA | SIQA | TruthfulQA▲ | Winogrande | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
StableLM-Alpha-7B-v2 | 40.53 | 69.11 | 70.31 | 74.27 | 74.19 | 30.40 | 78.45 | 42.43 | 36.46 | 68.82 | 58.50 |
LLaMA-2-7B | 46.16 | 74.54 | 77.74 | 75.94 | 73.47 | 31.40 | 77.75 | 43.50 | 38.97 | 69.61 | 60.91 |
MPT-7B | 41.89 | 70.03 | 73.94 | 76.17 | 68.64 | 31.40 | 78.89 | 45.14 | 33.49 | 68.03 | 58.76 |
OpenLLaMA-7B-v2 | 42.41 | 69.65 | 71.41 | 74.65 | 71.05 | 30.20 | 79.16 | 41.97 | 34.57 | 65.82 | 58.09 |
RedPajama-INCITE-7B-Base | 39.42 | 69.19 | 70.76 | 70.33 | 71.34 | 29.00 | 77.15 | 42.58 | 33.01 | 64.33 | 56.71 |
StableLM-Alpha-3B-v2 | 35.07 | 63.26 | 64.56 | 68.58 | 70.25 | 26.40 | 76.01 | 42.48 | 35.87 | 62.12 | 54.46 |
BTLM-3B-8K | 37.63 | 67.09 | 69.63 | 69.78 | 66.23 | 27.60 | 75.84 | 42.78 | 36.00 | 64.96 | 55.75 |
OpenLLaMA-3B-v2 | 36.09 | 63.51 | 65.69 | 69.99 | 66.74 | 26.00 | 76.66 | 41.20 | 34.59 | 62.90 | 54.34 |
Pythia-2.8B (deduped) | 32.94 | 59.09 | 64.13 | 59.44 | 65.15 | 23.80 | 74.10 | 40.94 | 35.56 | 58.25 | 51.34 |
StableLM-Alpha-7B | 27.05 | 44.87 | 60.06 | 41.22 | 55.11 | 21.40 | 66.76 | 39.46 | 39.96 | 50.12 | 44.60 |
StableLM-Alpha-3B | 25.77 | 42.05 | 57.65 | 38.31 | 41.72 | 17.00 | 63.82 | 35.62 | 40.53 | 52.64 | 41.51 |
✱: Denotes byte-length normalized accuracy (acc_norm) as described in Gao, 2021.
▲: We score TruthfulQA using the normalized total probability assigned to the set of true answers (mc2).
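For concreteness, the two scoring rules reduce to the arithmetic sketched below: acc_norm picks the candidate with the highest log-likelihood per byte of answer text, and mc2 is the probability mass on the true answers normalized over all answers. The harness is the authoritative implementation; this only illustrates the idea.

```python
# Sketch of the two scoring rules referenced above (the harness is the authoritative
# implementation; this only illustrates the arithmetic).

def acc_norm(choices, logprobs, gold_index):
    """Byte-length normalized accuracy: score each choice by log-likelihood per byte."""
    scores = [lp / len(choice.encode("utf-8")) for choice, lp in zip(choices, logprobs)]
    prediction = max(range(len(scores)), key=scores.__getitem__)
    return float(prediction == gold_index)

def truthfulqa_mc2(true_probs, false_probs):
    """mc2: total probability assigned to true answers, normalized over all answers."""
    total = sum(true_probs) + sum(false_probs)
    return sum(true_probs) / total
```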
StableLM-Alpha models are trained on a new dataset that builds on The Pile; this new dataset contains 1.5 trillion tokens, roughly 3x the size of The Pile. The context length for these models is 4096 tokens.
As a proof-of-concept, we also fine-tuned the model with Stanford Alpaca's procedure using a combination of five recent datasets for conversational agents: Stanford's Alpaca, Nomic-AI's gpt4all, RyokoAI's ShareGPT52K, Databricks Labs' Dolly, and Anthropic's HH. We will be releasing these models as StableLM-Tuned-Alpha.
Size | StableLM-Base-Alpha | StableLM-Tuned-Alpha | Training Tokens | Parameters | Web Demo |
---|---|---|---|---|---|
3B | checkpoint | checkpoint | 800B | 3,638,525,952 | |
7B | checkpoint | checkpoint | 800B | 7,869,358,080 | Hugging Face |
StableVicuna is an RLHF fine-tune of Vicuna-13B v0, which itself is a fine-tune of LLaMA-13B. It is our attempt at creating an open-source RLHF LLM Chatbot. This model is developed by StabilityAI's CarperAI team, with Duy V. Phung leading the training effort.
Due to the original non-commercial license of LLaMA, we can only release the weights of our model as deltas over the original model's weights. StableVicuna's delta weights are released under CC BY-NC-SA-4.0.
Please visit the Hugging Face checkpoint for more information about how to combine our delta weights with the original model.
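Conceptually, recovering the full model is element-wise addition of the released delta tensors onto the original LLaMA-13B weights, as sketched below. This is not the official procedure: follow the instructions on the Hugging Face checkpoint page, which also handle details (such as tokenizer and vocabulary changes) that this sketch ignores; the local paths are placeholders.

```python
# Conceptual sketch only: add each delta tensor to the matching base tensor. The official
# instructions handle details (tokenizer/vocab changes, memory) that this ignores.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b", torch_dtype=torch.float16)                  # placeholder path
delta = AutoModelForCausalLM.from_pretrained("path/to/stable-vicuna-13b-delta", torch_dtype=torch.float16)   # placeholder path

state = base.state_dict()
for name, delta_tensor in delta.state_dict().items():
    state[name] = state[name] + delta_tensor  # recover the fine-tuned weight

base.load_state_dict(state)
base.save_pretrained("stable-vicuna-13b")
```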
Model | Download | Web Demo | Cite |
---|---|---|---|
StableVicuna-13B | checkpoint | Hugging Face | |
All StableLM models are hosted on the Hugging Face hub. Check out this notebook to run inference with limited GPU capabilities.
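One common way to run these models with limited GPU memory, similar in spirit to the linked notebook, is to load the weights in 8-bit with automatic device placement. This sketch assumes the bitsandbytes and accelerate packages are installed; the notebook remains the reference.

```python
# Low-memory loading sketch (assumes bitsandbytes + accelerate are installed); see the
# linked notebook for the reference approach.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-tuned-alpha-7b",
    device_map="auto",    # let accelerate place layers on the available devices
    load_in_8bit=True,    # quantize weights to 8-bit to fit smaller GPUs
)
```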
Get started chatting with StableLM-Tuned-Alpha by using the following code snippet:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
model.half().cuda()
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the last generated token is one of the model's stop/special token ids.
        # Compare as a Python int: a raw tensor is never a member of a set of ints.
        stop_ids = [50278, 50279, 50277, 1, 0]
        return int(input_ids[0][-1]) in stop_ids
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
prompt = f"{system_prompt}<|USER|>What's your mood today?<|ASSISTANT|>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
tokens = model.generate(
**inputs,
max_new_tokens=64,
temperature=0.7,
do_sample=True,
stopping_criteria=StoppingCriteriaList([StopOnTokens()])
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
StableLM Tuned should be used with prompts formatted to <|SYSTEM|>...<|USER|>...<|ASSISTANT|>...
The system prompt is
<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
This section contains a collection of fun cherry-picked examples of what you can do with stablelm-tuned-alpha.
This section contains a collection of selected failure modes of stablelm-tuned-alpha.
Want to get involved?
As is typical for any pretrained Large Language Model without additional finetuning and reinforcement learning, the responses a user gets might be of varying quality and might potentially include offensive language and views. This is expected to be improved with scale, better data, community feedback, and optimisation.
StableLM-Tuned-Alpha would not have been possible without the helpful hand of Dakota Mahan @dmayhem93.
Base model checkpoints (StableLM-Base-Alpha) are licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Stability AI, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Stability AI endorses you or your use.
Fine-tuned checkpoints (StableLM-Tuned-Alpha) are licensed under the Non-Commercial Creative Commons license (CC BY-NC-SA-4.0), in line with the original non-commercial license specified by Stanford Alpaca.
All code in this repository is licensed under the Apache License 2.0.