This repository contains the source code for the following ACL 2022 paper: Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages. It is organized into five parts: environment setup, OBPE tokenizer creation, BERT preprocessing and pretraining, fine-tuning, and a transliteration utility.
First, create the conda environment from the obpe_env.yml file using:

```shell
conda env create -f obpe_env.yml
conda activate obpe_env
```
After that, set up the indic-trans library by following the instructions in its repository.
Also note that pretraining was done on Google Cloud TPUs, so some of the code is TPU-specific.
To create the OBPE tokenizer, run the following three scripts sequentially:

```shell
python3 get_vocab.py \
  --mono_files ./l1_monolingual_data.txt ./l2_monolingual_data.txt \
  --output_files vocabs/l1.pkl vocabs/l2.pkl
```
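Optionally, you can sanity-check the pickled vocabulary files before building the tokenizer. The sketch below is only a quick inspection and makes no assumption about their exact structure beyond being ordinary pickled Python objects:

```python
import pickle

# Quick sanity check of a vocabulary file produced by get_vocab.py.
# The exact structure of the pickled object is defined by get_vocab.py;
# here we only print its type and, if it is a mapping, a few entries.
with open("vocabs/l1.pkl", "rb") as f:
    vocab = pickle.load(f)

print(type(vocab))
if isinstance(vocab, dict):
    for i, (key, value) in enumerate(vocab.items()):
        print(key, value)
        if i == 4:
            break
```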
```shell
python3 tokenizer.py \
  --vocab_files vocabs/l1.pkl vocabs/l2.pkl \
  --use_vocab \
  --output_dir models/l1_l2_p_-3 \
  --vocab_size 30000 \
  --model bpe \
  --eow_suffix "</w>" \
  --n_HRL 1 \
  --max_tok \
  --alpha 0.5 \
  --overlap mean \
  --p -3
```
OR
```shell
python3 tokenizer.py \
  --vocab_files vocabs/l1.pkl vocabs/l2.pkl \
  --use_vocab \
  --output_dir models/l1_l2_min \
  --vocab_size 30000 \
  --model bpe \
  --eow_suffix "</w>" \
  --n_HRL 1 \
  --max_tok \
  --alpha 0.5 \
  --overlap min
```
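A note on the --overlap and --p flags: OBPE combines a candidate token's per-language statistics with a generalized (power) mean. With --overlap mean the exponent is given by --p (p = 1 is the arithmetic mean; more negative values weight the least-represented language more heavily), and --overlap min corresponds to the limit p → −∞. The snippet below only illustrates the generalized mean itself; it is not the repository's scoring code:

```python
def generalized_mean(values, p):
    """Generalized (power) mean of positive values with exponent p."""
    if p == 0:
        # The p -> 0 limit is the geometric mean.
        product = 1.0
        for v in values:
            product *= v
        return product ** (1.0 / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

# Per-language frequencies of a hypothetical candidate token.
freqs = [120.0, 3.0]

print(generalized_mean(freqs, 1))   # arithmetic mean: 61.5
print(generalized_mean(freqs, -3))  # roughly 3.8, pulled toward the smaller value
print(min(freqs))                   # 3.0, the --overlap min behaviour (p -> -inf)
```

With freqs = [120, 3], the arithmetic mean is 61.5, p = −3 gives roughly 3.8, and min gives 3, showing how negative exponents push the score toward the language where the token is rarest.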
```shell
python3 generate_json_from_model.py \
  --vocab models/l1_l2_min/vocab.json \
  --merges models/l1_l2_min/merges.txt \
  --model bpe \
  --outfile tokenizers/l1_l2_min_tokenizer.json
```
Use this JSON file as the input to create_pretraining_data_ENS_with_diff_tokenizer.py when creating the MLM pretraining data.
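Before moving on, you may want to verify the generated tokenizer. Assuming the JSON produced by generate_json_from_model.py follows the Hugging Face tokenizers serialization format (an assumption based on the vocab.json/merges.txt inputs), it can be loaded and tested like this:

```python
from tokenizers import Tokenizer

# Load the tokenizer produced by generate_json_from_model.py.
# This assumes the JSON is in the Hugging Face `tokenizers` format.
tokenizer = Tokenizer.from_file("tokenizers/l1_l2_min_tokenizer.json")

# Encode a sample sentence and inspect the resulting subword tokens.
encoding = tokenizer.encode("Example sentence in l1 or l2.")
print(encoding.tokens)
print(encoding.ids)
```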
We need to create 2 new conda environments for pretraining with BERT. We make use of code from the [Google BERT Repo](https://github.com/google-research/bert) along with our own code. Pretraining BERT has 2 components: preprocessing the data and running the pretraining itself.
(a) Create and activate the preprocessing environment:

```shell
conda create --name bert_preprocessing
conda activate bert_preprocessing
conda install tensorflow==2.3.0
```
(b) Run the following command from the directory "BERT Pretraining and Preprocessing/Preprocessing Code" to create the MLM pretraining data. Refer to the [Google BERT Repo](https://github.com/google-research/bert) for other information.
```shell
python3 create_pretraining_data_ENS_with_diff_tokenizer.py \
  --input_file=./monolingual_data.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --json_vocab_file=$BERT_BASE_DIR/tokenizer.json \
  --do_lower_case=False \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --do_whole_word_mask=False \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=2
```
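To confirm that the preprocessing step produced examples, you can count the records in the output file from the bert_preprocessing environment (a quick check, assuming TensorFlow 2.3 with eager execution):

```python
import tensorflow as tf

# Count the serialized training examples written by
# create_pretraining_data_ENS_with_diff_tokenizer.py.
dataset = tf.data.TFRecordDataset("/tmp/tf_examples.tfrecord")
num_examples = sum(1 for _ in dataset)
print(f"Wrote {num_examples} pretraining examples")
```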
(a) For the pretraining component, create and activate a second environment:

```shell
conda create --name bert_pretraining
conda activate bert_pretraining
conda install -c conda-forge tensorflow==1.14
```
(b) Clone the original [Google BERT Repo](https://github.com/google-research/bert) and copy our "BERT Pretraining and Preprocessing/Pretraining Diff Files/run_pretraining_without_NSP.py" (our modified version of run_pretraining.py) into the cloned repo. Note that to run the pretraining on TPUs, the init_checkpoint, input_file and output_dir need to be on a Google Cloud Bucket.
Run the following command for pretraining:
```shell
python run_pretraining_without_NSP.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_CONFIG_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5 \
  --save_checkpoints_steps=10 \
  --iterations_per_loop=5 \
  --use_tpu=True \
  --tpu_name=node-1 \
  --tpu_zone=zone-1 \
  --num_tpu_cores=8
```
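Once pretraining finishes, a quick way to confirm that checkpoints were written is to query the output directory (a small check; with --use_tpu=True the output_dir will be a gs:// path on your bucket rather than /tmp):

```python
import tensorflow as tf

# Locate the most recent checkpoint written by run_pretraining_without_NSP.py.
# Replace the path with your gs://<bucket>/... output_dir when running on TPUs.
latest = tf.train.latest_checkpoint("/tmp/pretraining_output")
print("Latest checkpoint:", latest)
```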
We fine-tune on 4 different tasks: NER, POS Tagging, Text Classification, and XNLI. The dataset procurement, data cleaning, and fine-tuning steps are as follows:
NER: The dataset is obtained from the XTREME Dataset (for en and hi) and WikiAnn NER (for pa, gu, bn, or, as). For preprocessing the WikiAnn NER dataset files, use "Fine Tuning/Utility Files/wikiann_preprocessor.py" as follows:

```shell
python3 wikiann_preprocessor.py --infile language/language-train.txt --outfile language/train-language.tsv
```
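If you want to eyeball the preprocessed NER files before fine-tuning, a small reader along these lines can help. It assumes a CoNLL-style layout (tab-separated token and tag per line, blank lines between sentences), which may differ from the exact output format of wikiann_preprocessor.py:

```python
def read_conll_tsv(path):
    """Read a CoNLL-style TSV file into (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line marks the end of a sentence.
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

# Example (hypothetical path): print the first parsed sentence.
print(read_conll_tsv("language/train-language.tsv")[0])
```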
Use the "Fine Tuning/NER_Fine_Tuning.ipynb" for NER evaluation.
POS Tagging and Text Classification: The datasets for POS Tagging and Text Classification have been obtained from the [Indian Language Technology Proliferation and Deployment Centre](http://tdil-dc.in/).
Preprocess the data using the preprocessing files from "Fine Tuning/Utility Files/POS/". The "file to language mapping" is included in "Fine Tuning/Utility Files/POS/Language to File Mapping.txt". Then combine the files using "Fine Tuning/Utility Files/POS/files_combiner.py" to create the train-test splits:

```shell
python3 pos_preprocessor.py --input_folder Language_Raw_Files/ --output_folder Language_POS_Data/
python3 files_combiner.py --input_folder Language_POS/ --output_folder datasets/ --l_code_actual language_code_as_per_ISO_639 --l_code_in_raw_data language_code_as_per_tdil_dataset
```
We use the BIS Tagset for the POS tags. The Indian languages are already tagged with the BIS Tagset, whereas the English dataset is labelled with the Penn Tagset. To convert Penn tags to BIS, run "Fine Tuning/Utility Files/convert_penn_to_bis.py" on the directory containing the preprocessed POS dataset files tagged with the Penn Tagset:

```shell
python3 convert_penn_to_bis.py --input_folder English_POS_Penn/ --output_folder English_POS_BIS/
```
Use the "Fine Tuning/POS_Fine_Tuning.ipynb" for POS evaluation.
Text Classification: Preprocess the data using the preprocessing files from "Fine Tuning/Utility Files/Text Classification/". The "file to language mapping" is included in "Fine Tuning/Utility Files/Doc Classification/Language to File Mapping.txt". Then run:

```shell
python3 doc_classification_preprocessor_for_chunked.py --input_folder Language_Raw_Files/ --output_folder Language_Doc_Classification_Data --l_code_actual language_code_as_per_ISO_639 --l_code_in_raw_data language_code_as_per_tdil_dataset --train_files_taken train_files_taken.txt --test_files_taken test_files_taken.txt --valid_files_taken val_files_taken.txt
```
Use the "Fine Tuning/Text_Classification_Fine_Tuning.ipynb" for Doc Classification evaluation.
Use the "Fine Tuning/XNLI_Fine_Tuning.ipynb" for Doc Classification evaluation.
transliterate_monolingual.py is used for transliterating monolingual data into another language's script. To use it, run:
```shell
python3 transliterate_monolingual.py \
  --mono path_to_monolingual_data \
  --outfile path_to_output_transliterated_data \
  --l1 source_lang \
  --l2 target_lang
```
- `--mono`: Path to the monolingual (text) data
- `--outfile`: Path to the output transliterated (text) file
- `--l1`: Code for the source language
- `--l2`: Code for the target language
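For reference, transliteration between Indic scripts can also be done directly with the indic-trans library set up earlier. The sketch below shows that library's API on its own; the 'hin'/'pan' codes and the assumption that transliterate_monolingual.py wraps indic-trans in a similar way are illustrative only:

```python
from indictrans import Transliterator

# Transliterate Hindi (Devanagari) text into the Punjabi (Gurmukhi) script.
# The language codes here ('hin', 'pan') are illustrative; use the codes
# expected by transliterate_monolingual.py for --l1 and --l2.
trn = Transliterator(source='hin', target='pan', build_lookup=True)

with open("path_to_monolingual_data", encoding="utf-8") as fin, \
        open("path_to_output_transliterated_data", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(trn.transform(line.strip()) + "\n")
```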
If you use the code in this repo, please cite our paper:

```bibtex
@inproceedings{patil-etal-2022-overlap,
title = "Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages",
author = "Patil, Vaidehi and
Talukdar, Partha and
Sarawagi, Sunita",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.18",
pages = "219--233",
abstract = "Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRL). However, due to limited model capacity, the large difference in the sizes of available monolingual corpora between high web-resource languages (HRL) and LRLs does not provide enough scope of co-embedding the LRL with the HRL, thereby affecting the downstream task performance of LRLs. In this paper, we argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This results in improved zero-shot transfer from related HRLs to LRLs without reducing HRL representation and accuracy. Unlike previous studies that dismissed the importance of token-overlap, we show that in the low-resource related language setting, token overlap matters. Synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.",
}
```