GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
Please refer to our paper for a detailed description of GLM:
GLM: General Language Model Pretraining with Autoregressive Blank Infilling (ACL 2022)
Zhengxiao Du*, Yujie Qian*, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang (*: equal contribution)
News: We have released ChatGLM-6B, an open pretrained language model with 6 billion parameters, based on the GLM framework and optimized for Chinese QA and dialogue.
You can download the pretrained models used in the paper from OneDrive or Tsinghua-Cloud.
Name | Params | Language | Corpus | Objective | File | Config |
---|---|---|---|---|---|---|
GLM-Base | 110M | English | Wiki+Book | Token | glm-base-blank.tar.bz2 | model_blocklm_base.sh |
GLM-Large | 335M | English | Wiki+Book | Token | glm-large-blank.tar.bz2 | model_blocklm_large.sh |
GLM-Large-Chinese | 335M | Chinese | WuDaoCorpora | Token+Sent+Doc | glm-large-chinese.tar.bz2 | model_blocklm_large_chinese.sh |
GLM-Doc | 335M | English | Wiki+Book | Token+Doc | glm-large-generation.tar.bz2 | model_blocklm_large_generation.sh |
GLM-410M | 410M | English | Wiki+Book | Token+Doc | glm-1.25-generation.tar.bz2 | model_blocklm_1.25_generation.sh |
GLM-515M | 515M | English | Wiki+Book | Token+Doc | glm-1.5-generation.tar.bz2 | model_blocklm_1.5_generation.sh |
GLM-RoBERTa | 335M | English | RoBERTa | Token | glm-roberta-large-blank.tar.bz2 | model_blocklm_roberta_large.sh |
GLM-2B | 2B | English | Pile | Token+Sent+Doc | glm-2b.tar.bz2 | model_blocklm_2B.sh |
GLM-10B | 10B | English | Pile | Token+Sent+Doc | Download | model_blocklm_10B.sh |
GLM-10B-Chinese | 10B | Chinese | WuDaoCorpora | Token+Sent+Doc | Download | model_blocklm_10B_chinese.sh |
Unzip the downloaded file into a local folder and set CHECKPOINT_PATH in the corresponding scripts to the folder path.
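The extraction step can be sketched in Python; `extract_checkpoint` is a hypothetical helper name, and the archive is whichever checkpoint you downloaded from the table above:

```python
import tarfile
from pathlib import Path

def extract_checkpoint(archive: str, target: str) -> Path:
    """Unpack a downloaded GLM checkpoint archive (.tar.bz2) into `target`."""
    dest = Path(target)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest)  # the archive contains the checkpoint folder
    return dest

# e.g. extract_checkpoint("glm-large-blank.tar.bz2", "checkpoints"),
# then set CHECKPOINT_PATH=/abs/path/to/checkpoints in the config script.
```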
SuperGLUE (dev set, single model, single-task finetuning)
Model | COPA | WSC | RTE | WiC | CB | MultiRC | BoolQ | ReCoRD |
---|---|---|---|---|---|---|---|---|
GLM-10B | 98.0 | 95.2 | 93.1 | 75.7 | 98.7/98.2 | 88.1/63.3 | 88.7 | 94.4/94.0 |
DeBERTa-XXLarge-v2 | 97.0 | - | 93.5 | - | - | 87.8/63.6 | 88.3 | 94.1/93.7 |
CNN/Daily Mail (test set, no additional data used)
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
GLM-10B | 44.7 | 21.4 | 41.4 |
T5-11B | 43.5 | 21.6 | 40.7 |
PEGASUS-Large | 44.2 | 21.5 | 41.4 |
BART-Large | 44.2 | 21.3 | 40.9 |
XSum (test set, no additional data used)
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
GLM-10B | 48.9 | 25.7 | 40.4 |
PEGASUS-Large | 47.2 | 24.6 | 39.3 |
BART-Large | 45.1 | 22.3 | 37.3 |
Zero-shot language modeling (test set)
Model | LAMBADA (accuracy) | Wikitext103 (perplexity) |
---|---|---|
GLM-10B (bi) | 72.35 | 11.33 |
GLM-10B (uni) | 67.18 | 12.22 |
GPT-2 | 52.66 | 17.48 |
Megatron-LM (8.3B) | 66.51 | 10.81 |
Turing-NLG | 67.98 | 10.21 |
You can access GLM models via the Hugging Face Hub. Please install transformers>=4.23.1 and find all the available models here.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = model.half().cuda()
model.eval()

# Inference
inputs = tokenizer("Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.", return_tensors="pt")
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)
inputs = inputs.to('cuda')
outputs = model.generate(**inputs, max_length=512, eos_token_id=tokenizer.eop_token_id)
print(tokenizer.decode(outputs[0].tolist()))

# Training
inputs = tokenizer(
    ["Tsinghua University is located in [MASK].", "One minus one equals zero, is it correct? Answer: [MASK]"],
    return_tensors="pt", padding=True)
inputs = tokenizer.build_inputs_for_generation(inputs, targets=["Beijing", "No"], max_gen_length=8, padding=False)
inputs = inputs.to('cuda')
outputs = model(**inputs)
loss = outputs.loss
logits = outputs.logits
```
```python
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForMultipleChoice.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = model.half().cuda()
model.eval()

inputs = tokenizer(["Tsinghua University is located in [MASK].",
                    "One minus one equals zero, is it correct? Answer: [MASK]"], return_tensors="pt", padding=True)
choices = [["Beijing", "Shanghai"], ["Yes", "No"]]
inputs = tokenizer.build_inputs_for_multiple_choice(inputs, choices)
inputs = inputs.to('cuda')
outputs = model(**inputs)
logits = outputs.logits
```
You can also convert finetuned checkpoints with scripts/convert_glm_checkpoint_to_transformers.py.
We provide two Docker images based on CUDA 10.2 and CUDA 11.2. You can pull the pre-built images from Docker Hub and run them with Docker v19.03+:

```shell
docker run --gpus all --rm -it --ipc=host zxdu20/glm-cuda102
```

or replace glm-cuda102 with glm-cuda112.
You can also modify the image according to your requirements in docker/cuda102.dockerfile and build it yourself:

```shell
docker build -f cuda102.dockerfile . -t glm-cuda102
```
Please first install PyTorch (we use 1.7.0) and apex, then install the remaining dependencies:

```shell
git clone https://github.com/THUDM/GLM
cd GLM
pip install -r requirements.txt
```
If you encounter a CUDA out of memory error, your GPU memory is limited; you can try model parallelism to split the parameters across multiple GPUs. Taking two-way model parallelism as an example, first run change_mp.py to split the checkpoint:

```shell
python change_mp.py path_to_the_checkpoint 2
```

Then update the checkpoint path in the model config file (such as config_tasks/model_blocklm_10B.sh) and change MP_SIZE in the script (such as scripts/ds_finetune_superglue.sh) to 2.
We provide scripts for finetuning GLM on some downstream tasks.

Change CHECKPOINT_PATH to your local path, then run the following script:

```shell
bash scripts/generate_block.sh \
     config_tasks/model_blocklm_10B_chinese.sh
```
Some models (GLM-2B, GLM-10B, and GLM-10B-Chinese) use three different mask tokens: [MASK] for short blank filling, [sMASK] for sentence filling, and [gMASK] for left-to-right generation. You can also include multiple [MASK] and [sMASK] tokens in a single example; the model fills the blanks one by one from left to right, and the answer to each blank always begins with the special token <|startofpiece|>.
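As a rough illustration of that output layout, the per-blank answers can be recovered by splitting the decoded string on the special tokens. The helper name, the `<|endofpiece|>` terminator, and the sample string below are assumptions for illustration, not the official API:

```python
def split_blank_answers(decoded: str) -> list[str]:
    """Recover the per-blank answers from a decoded GLM output string.

    Assumes each answer begins with <|startofpiece|> and ends at the
    next <|endofpiece|> marker (illustrative parsing only).
    """
    pieces = decoded.split("<|startofpiece|>")[1:]  # text before the first marker is the context
    return [p.split("<|endofpiece|>")[0].strip() for p in pieces]

sample = ("Tsinghua University is located in [MASK]."
          "<|startofpiece|> Beijing<|endofpiece|>")
print(split_blank_answers(sample))
```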
Download the SuperGLUE data and check the experiment setup in scripts/ds_finetune_superglue.sh. Note that DATA_ROOT, CHECKPOINT_PATH, and SAVE_PATH need to be changed to your local paths. You may also change batch-size and nproc_per_node according to your available hardware.
Run the following script (using the COPA dataset as an example):

```shell
bash scripts/ds_finetune_superglue.sh \
     config_tasks/model_blocklm_10B.sh \
     config_tasks/task_copa.sh
```

For prompt-based finetuning, run:

```shell
bash scripts/ds_finetune_superglue_prompt.sh \
     config_tasks/model_blocklm_10B.sh \
     config_tasks/task_copa.sh
```
To apply GLM to a new NLU dataset with cloze questions, implement a DataProcessor in tasks/superglue/dataset.py for data loading and add a PVP in tasks/superglue/pvp.py for the cloze question. More details can be found here.

Download the Gigaword, CNN/Daily Mail, or XSum dataset and check the experiment setup in scripts/ds_finetune_seq2seq.sh. Change DATA_ROOT, CHECKPOINT_PATH, and SAVE_PATH to your local paths.
Run the following script (using the CNN/Daily Mail dataset as an example):

```shell
bash scripts/ds_finetune_seq2seq.sh \
     config_tasks/model_blocklm_10B.sh \
     config_tasks/seq_cnndm_org.sh
```
The summaries are written into ./runs/experiment_name/test.jsonl.hyps, and the references into test.jsonl.refs in the same directory. To calculate ROUGE, install file2rouge and download Stanford CoreNLP from here, then run the following script:

```shell
bash scripts/evaluate_seq2seq.sh \
     ./runs/experiment_name/test.jsonl.hyps ./runs/experiment_name/test.jsonl.refs
```
Process your seq2seq data into {split}.source and {split}.target, with each line being the context or the target of a sample, and split being train, val, or test.
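The file layout above can be produced with a small script; `write_seq2seq_split` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path

def write_seq2seq_split(data_dir: str, split: str, pairs: list[tuple[str, str]]) -> None:
    """Write parallel {split}.source / {split}.target files, one example per line."""
    root = Path(data_dir)
    root.mkdir(parents=True, exist_ok=True)
    with open(root / f"{split}.source", "w") as src, open(root / f"{split}.target", "w") as tgt:
        for context, target in pairs:
            # Newlines inside an example would break the one-line-per-sample format.
            src.write(context.replace("\n", " ") + "\n")
            tgt.write(target.replace("\n", " ") + "\n")

# e.g. write_seq2seq_split("data/custom", "train", [("long article ...", "short summary")])
```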
Run the following script:

```shell
bash scripts/ds_finetune_seq2seq.sh \
     config_tasks/model_blocklm_10B.sh \
     config_tasks/seq_customization.sh
```

You can specify the hyperparameters in config_tasks/seq_customization.sh and config_tasks/config_blocklm_10B_cnndm.json.
Run the following script:

```shell
bash scripts/evaluate_multichoice.sh config_tasks/model_blocklm_10B.sh
```

Note that CHECKPOINT_PATH and DATA_PATH need to be changed to your local paths.

Each line of the data file should be in the following format:

```
{"inputs_pretokenized": "Context and question here", "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"], "label": int}
```
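A minimal sketch of producing such a file; the helper name is hypothetical, and only the JSON keys come from the format above:

```python
import json

def make_multichoice_line(inputs: str, choices: list[str], label: int) -> str:
    """Serialize one evaluation example in the JSON-lines format above."""
    assert 0 <= label < len(choices), "label is an index into choices"
    return json.dumps({
        "inputs_pretokenized": inputs,
        "choices_pretokenized": choices,
        "label": label,
    }, ensure_ascii=False)

# One example per line in the data file:
# with open("eval.jsonl", "w") as f:
#     f.write(make_multichoice_line("1 + 1 = 2, is it correct? Answer: [MASK]", ["Yes", "No"], 0) + "\n")
```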
Change DATA_ROOT and CHECKPOINT_PATH in scripts/evaluate_lm.sh, then run:

```shell
bash scripts/evaluate_lm.sh \
     config_tasks/model_blocklm_large_generation.sh \
     config_tasks/zero_lambada.sh
```
Change DATA_ROOT and CHECKPOINT_PATH in scripts/evaluate_lm.sh, then run:

```shell
bash scripts/evaluate_lm.sh \
     config_tasks/model_blocklm_large_generation.sh \
     config_tasks/zero_wikitext.sh
```
Download the Yahoo dataset and check the experiment setup in scripts/finetune_blank.sh. Change DATA_ROOT, CHECKPOINT_PATH, and SAVE_PATH to your local paths.

Run the following script:

```shell
bash scripts/finetune_blank.sh \
     config_tasks/model_blocklm_large.sh \
     config_tasks/seq_blank.sh
```
Run the following script to pretrain the GLM-Large model:

```shell
bash scripts/ds_pretrain_nvidia.sh config/ds_block_large.sh
```
The script scripts/ds_pretrain_nvidia.sh launches the training program with DeepSpeed. You should change NUM_WORKERS and NUM_GPUS_PER_WORKER to the number of workers and the number of GPUs per worker. Also change HOST_FILE_PATH to the path of an OpenMPI-style hostfile. More details about the DeepSpeed launcher can be found here.
The file config/ds_block_large.sh defines the hyperparameters for pretraining. Most of the arguments are fairly self-explanatory. Specifically, --train-data can be multiple keywords defined in NAMED_CORPORA in data_utils/corpora.py. The hyperparameters of the optimizer are defined in the corresponding JSON file under config; the semantics of that JSON file can be found here.
Part of the code is based on Megatron-LM and PET.
Please cite our paper if you find this code useful for your research:
```
@inproceedings{DBLP:conf/acl/DuQLDQY022,
  author    = {Zhengxiao Du and
               Yujie Qian and
               Xiao Liu and
               Ming Ding and
               Jiezhong Qiu and
               Zhilin Yang and
               Jie Tang},
  title     = {{GLM:} General Language Model Pretraining with Autoregressive Blank Infilling},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
               Linguistics (Volume 1: Long Papers), {ACL} 2022, Dublin, Ireland,
               May 22-27, 2022},
  pages     = {320--335},
  publisher = {Association for Computational Linguistics},
  year      = {2022},
}
```