# bigscience-workshop/multilingual-modeling

**BLOOM+1: Adapting the BLOOM model to support a new unseen language**

Paper: https://arxiv.org/abs/2212.09535 · License: Apache 2.0

## Notes

This repository is no longer actively maintained. It was created when the BLOOM+1 paper was written, when we had to engineer the adapter modules ourselves because of the then-new BLOOM architecture.

Now, however, adapters for BLOOM models are readily available (see peft), and language adaptation of these models (i.e., training of LLMs on monolingual corpora of a particular language) can be done by following official documentation such as the PEFT blog, using the same pretraining objective of next-token prediction.
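For example, a minimal sketch of such language adaptation with PEFT might look like the following (the LoRA hyperparameters, dataset slice, and training settings are illustrative assumptions, not the recipe used in this repository):

```python
# Minimal sketch: LoRA-based language adaptation of BLOOM with PEFT.
# Hyperparameters, dataset slicing, and training settings are illustrative only.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with LoRA adapters (BLOOM's fused attention projection is `query_key_value`).
lora_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                         lora_dropout=0.05, target_modules=["query_key_value"])
model = get_peft_model(model, lora_config)

# Monolingual corpus of the target language (here: a small German OSCAR sample).
raw = load_dataset("oscar", "unshuffled_deduplicated_de", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Standard next-token-prediction (causal LM) objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(output_dir="bloom-560m-lora-de", per_device_train_batch_size=2,
                         gradient_accumulation_steps=4, learning_rate=1e-3, max_steps=1000)
Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
```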


This repository contains code for performing language adaptation of the multilingual pretrained large language models BLOOM-{560m,1b1,1b7,3b,7b1} to new, unseen languages. Please refer to our ACL 2023 paper [BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting](https://arxiv.org/abs/2212.09535).

Our implementation supports the language adaptation strategies studied in BLOOM+1, including continued pretraining, BitFit, FISH Mask, MAD-X and Pfeiffer adapters, (IA)^3, LoRA, composable SFT, and adaptable adapters; see the sections below.

## Installation

  1. Install the packages from composable-sft. This is used for composable-SFT finetuning.
  2. Install the packages from rational_activations, following its [Other CUDA/PyTorch] installation section. This is used for adaptable adapters.
  3. Install the packages from this repo with `pip install -r requirements.txt`.

If you encounter an error when importing `transformers`, uninstall it with `pip uninstall transformers` and rerun step 3 to reinstall the `transformers` version supported by the adapter-transformers library.

## Experimental Setup (Language Adaptation)

### Tokenizer and Tokenization of Dataset

Run `tokenized4clm_sampled.py` to train the tokenizer on a subset of the OSCAR dataset.

```bash
cache_dir=...
output_dir=...
lang=...          # language
sample_size=...   # training sample size
vocab_size=...    # vocab size of tokenizer
tok_strategy=...  # extend, replace, overlap-replace
bigs_model="bigscience/bloom-1b3"

tokenizer_dir="${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}"

python ./scripts/lang_adapt/tokenized4clm_sampled.py \
  --lang $lang \
  --model $bigs_model \
  --tokenizer_dir $tokenizer_dir \
  --hf_cache_dir $cache_dir \
  --vocab_size $vocab_size \
  --sample_size $sample_size \
  --tok_strategy $tok_strategy
```
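For intuition, the `extend` strategy conceptually trains a new tokenizer on the target-language sample and appends its novel tokens to BLOOM's vocabulary (whereas `replace` swaps the vocabulary out entirely and `overlap-replace` reuses overlapping tokens). A rough sketch of the `extend` idea, not the script's exact implementation (sample size and vocab size below are placeholders):

```python
# Rough sketch of the "extend" tokenizer strategy: learn target-language tokens
# and append the ones BLOOM's tokenizer does not already have.
from datasets import load_dataset
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("bigscience/bloom-1b3")

# Small monolingual sample (German OSCAR here, purely illustrative).
sample = load_dataset("oscar", "unshuffled_deduplicated_de", split="train[:10000]")

# Train a new tokenizer of the desired vocab size on the sample.
new_tok = base_tok.train_new_from_iterator((ex["text"] for ex in sample), vocab_size=24000)

# Add the tokens that are new relative to the original vocabulary.
novel_tokens = [t for t in new_tok.get_vocab() if t not in base_tok.get_vocab()]
base_tok.add_tokens(novel_tokens)

# The model's embedding matrix must then be resized to match, e.g.:
#   model.resize_token_embeddings(len(base_tok))
base_tok.save_pretrained("tok_bloom-1b3_de_extended")
```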

### Language Adaptation

Run `madx_run_clm.py` to finetune the language model on a new language.

```bash
tokenizer_dir=...  # as above, e.g. "${output_dir}/tok_${BIGS_MODEL##*/}_${LANG}_oscar_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${TOK_STRATEGY}"
cache_dir=...      # as above

output_dir=...   # directory to save adapted model, e.g. "${output_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}"
logging_dir=...  # directory to log loss curves to tensorboard, e.g. "${logging_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}"

mkdir -p $output_dir
mkdir -p $logging_dir

MAX_STEPS=50000
EVAL_STEPS=5000
SAVE_STEPS=5000

python ./scripts/lang_adapt/madx_run_clm.py \
  --seed 0 \
  --fp16 \
  --model_name_or_path $BIGS_MODEL \
  --tokenizer_name $tokenizer_dir \
  --dataset_name oscar \
  --cache_dir $cache_dir \
  --dataset_config_name "unshuffled_deduplicated_${LANG}" \
  --logging_dir $logging_dir \
  --report_to "tensorboard" \
  --learning_rate 0.001 \
  --do_train \
  --do_eval \
  --output_dir $output_dir \
  --preprocessing_num_workers 8 \
  --overwrite_output_dir \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --per_device_eval_batch_size 2 \
  --eval_accumulation_steps 4 \
  --eval_steps $EVAL_STEPS \
  --evaluation_strategy "steps" \
  --max_eval_samples 5000 \
  --save_steps $SAVE_STEPS \
  --save_strategy "steps" \
  --max_train_samples $DATA_SAMPLES \
  --max_steps $MAX_STEPS \
  --logging_steps 1000 \
  --lang_adapt_strategies $ADPT_STRATEGY \
  --embedding_strategies $EMBD_SRATEGY \
  --load_best_model_at_end \
  --gradient_checkpointing
```


**BLOOM+1 Reproduction**: See `./scripts/lang_adapt/example_scripts/run_clm_ru_madx_560m.sh` to reproduce the language adaptation of the BLOOM-560m model to Russian in our [BLOOM+1 paper](https://arxiv.org/abs/2212.09535).
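Once adaptation finishes, the trained language adapter can be loaded back onto the base model for inference. A minimal sketch using the adapter-transformers API (assuming a version with BLOOM support, as in this repo's setup; the checkpoint paths are placeholders):

```python
# Minimal sketch: load a trained MAD-X-style language adapter for inference.
# Assumes an adapter-transformers install with BLOOM support; paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer_dir")  # adapted tokenizer from the step above
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Load and activate the language adapter saved by madx_run_clm.py (e.g. "oscar_de").
adapter_name = model.load_adapter("path/to/output_dir/oscar_de")
model.set_active_adapters(adapter_name)

inputs = tokenizer("Guten Morgen,", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```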

### Language Adaptation with DeepSpeed
1. Replace `python` with `deepspeed --num_gpus=8 --master_port 60000`, keeping `./scripts/lang_adapt/madx_run_clm.py` and its arguments unchanged.
2. Pass the DeepSpeed config file argument, e.g. `--deepspeed "./scripts/lang_adapt/ds_config_zero2.json"`.

See the example file `./scripts/lang_adapt/example_scripts/run_clm_ru_madx_7b1_deepspeed.sh`, which adapts the BLOOM-7b1 model on 8 A100 GPUs on Google Cloud.

## Experimental Setup (Evaluation)

### Zero-Shot Prompting

Prompt the adapted language model in a zero-shot fashion without any finetuning. You'll need to clone the `bigscience-lm-adapt` branch of https://github.com/yongzx/lm-evaluation-harness (e.g., `git clone -b bigscience-lm-adapt https://github.com/yongzx/lm-evaluation-harness.git`) to be able to run the experiments.

The commands below show the evaluation setup for XNLI zero-shot prompting; you can find them in `lm-evaluation-harness/examples/`.

For BLOOM+1, the tasks used are: 
- `xnli` ([XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053))
- `amnli` ([AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages](https://arxiv.org/abs/2104.08726))
- `pawsx` ([PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://arxiv.org/abs/1908.11828))
- `xcopa` ([XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning](https://arxiv.org/abs/2005.00333))
- `xstory` (Multilingual [Story Cloze Test and ROCStories Corpora](https://cs.rochester.edu/nlp/rocstories/))
- `xwino` ([Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution](https://aclanthology.org/2021.emnlp-main.670/))
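These tasks are scored as multiple-choice problems: each candidate label verbalization is appended to the prompt, and the continuation with the highest log-likelihood under the model is taken as the prediction. A toy illustration of that scoring idea (not the harness's actual code; the prompt template and verbalizers are simplified placeholders):

```python
# Toy illustration of zero-shot prompt scoring: rank candidate continuations
# by their log-likelihood. Template and verbalizers are simplified placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m").eval()

premise, hypothesis = "Er ging nach Hause.", "Er blieb im Büro."
context = f"{premise}, richtig?"  # simplified XNLI-style prompt
# Candidate continuations for entailment / neutral / contradiction.
candidates = [f" Ja, {hypothesis}", f" Auch, {hypothesis}", f" Nein, {hypothesis}"]

def loglikelihood(context: str, continuation: str) -> float:
    """Sum of log-probabilities of the continuation tokens given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # Position t-1 predicts token t, so score the continuation positions accordingly.
    scores = logprobs[0, ctx_ids.shape[1] - 1 : full_ids.shape[1] - 1]
    return scores.gather(-1, cont_ids[0].unsqueeze(-1)).sum().item()

prediction = max(candidates, key=lambda c: loglikelihood(context, c))
print(prediction)
```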

**Baseline or Model-Based (BitFit, FISH Mask, etc.)**

```bash
python3 lm-evaluation-harness/main.py \
  --model bigscience \
  --model_args tokenizer="bigscience/bloom-560m",pretrained="ZYONG2/saved_models/bloom-560m_de_bitfit_100000samples_-1vocab_original-frozen" \
  --tasks xnli_de
```


**Using Adapters (MAD-X, Pfeiffer, IA3, LoRA, etc.)**

```bash
python3 lm-evaluation-harness/main.py \
  --model bigscience \
  --model_args tokenizer="bigscience/bloom-560m",pretrained="bigscience/bloom-560m",adapter_ckpt_folder="ZYONG2/saved_models/bloom-560m_de_ia3_100000samples_-1vocab_original-frozen/oscar_ia3_de" \
  --tasks xnli_de
```


### Supervised Finetuning or Cross-Lingual Transfer (only used for preliminary experiments before BLOOM was released)

```bash
OUTPUT_DIR=...  # where you want to save checkpoints
LANG="de"
CACHE_DIR=...   # cache dir for saving/loading HF models and XNLI datasets
LR=1e-5
MODEL_NAME="ZYONG2/bigscience/tr5b-1B3-multilingual-alpha-checkpoints"  # previous pre-release version of BLOOM
TOKENIZER_NAME="ZYONG2/processed/011/oscar-de-tokenizer"

# language adapters checkpoint folder
MADX_LANG_ADAPTER_NAME=".../oscar_de"

# we finetune task adapters for XNLI
FT_STRATEGIES="task_adapters"

mkdir -p $OUTPUT_DIR
python adapters_xnli_de.py \
  $OUTPUT_DIR \
  --lang $LANG \
  --cache_dir $CACHE_DIR \
  --num_train_epochs 2 \
  --learning_rate $LR \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --pretrained_model $MODEL_NAME \
  --tokenizer $TOKENIZER_NAME \
  --do_train \
  --do_eval_after_train \
  --madx_lang_adapter $MADX_LANG_ADAPTER_NAME \
  --finetune_strategies $FT_STRATEGIES \
  --zero_shot
```


Remove `--zero_shot` for the supervised finetuning setting.
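For context, MAD-X-style cross-lingual transfer stacks the XNLI task adapter on top of the target language adapter at inference time. A hedged sketch of that composition with the adapter-transformers API (assuming BLOOM support as in this repo's setup; adapter names and paths are placeholders):

```python
# Sketch of MAD-X-style composition: stack a task adapter on top of a language adapter.
# Assumes an adapter-transformers install with BLOOM support; paths are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.adapters.composition import Stack

model = AutoModelForSequenceClassification.from_pretrained(
    "bigscience/bloom-560m", num_labels=3)  # XNLI: entailment / neutral / contradiction
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Language adapter trained with madx_run_clm.py and a task adapter finetuned on English XNLI.
lang_adapter = model.load_adapter("path/to/oscar_de")
task_adapter = model.load_adapter("path/to/xnli_task_adapter")

# Zero-shot transfer: activate the target-language adapter underneath the task adapter.
model.active_adapters = Stack(lang_adapter, task_adapter)

inputs = tokenizer("Er ging nach Hause.", "Er blieb im Büro.", return_tensors="pt")
print(model(**inputs).logits)
```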

See example scripts in `./scripts/eval/task_ftscripts_xnli/`. `train_xnli_zero_shot.sh` is the batch script for XNLI finetuning, and `run_eval_xnli_zero_shot.sh` is for evaluating trained XNLI task adapters.

## Citation

```bibtex
@inproceedings{yong-etal-2023-bloom,
    title = "{BLOOM}+1: Adding Language Support to {BLOOM} for Zero-Shot Prompting",
    author = "Yong, Zheng Xin and Schoelkopf, Hailey and Muennighoff, Niklas and Aji, Alham Fikri and Adelani, David Ifeoluwa and Almubarak, Khalid and Bari, M Saiful and Sutawika, Lintang and Kasai, Jungo and Baruwa, Ahmed and Winata, Genta and Biderman, Stella and Raff, Edward and Radev, Dragomir and Nikoulina, Vassilina",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.653",
    doi = "10.18653/v1/2023.acl-long.653",
    pages = "11682--11703",
}
```