The code is for our ACL-IJCNLP 2021 paper: Automated Concatenation of Embeddings for Structured Prediction
ACE is a framework for automatically searching a good embedding concatenation for structured prediction tasks and achieving state-of-the-art accuracy. The code is based on flair version 0.4.3 with a lot of modifications.
Task | Language | Dataset | ACE | Previous best |
---|---|---|---|---|
Named Entity Recognition | English | CoNLL 03 (document-level) | 94.6 (F1) | 94.3 (Yamada et al., 2020) |
Named Entity Recognition | German | CoNLL 03 (document-level) | 88.3 (F1) | 86.4 (Yu et al., 2020) |
Named Entity Recognition | German | CoNLL 03 (06 Revision) (document-level) | 91.7 (F1) | 90.3 (Yu et al., 2020) |
Named Entity Recognition | Dutch | CoNLL 02 (document-level) | 95.7 (F1) | 93.7 (Yu et al., 2020) |
Named Entity Recognition | Spanish | CoNLL 02 (document-level) | 90.4 (F1) | 90.3 (Yu et al., 2020) |
Named Entity Recognition | English | CoNLL 03 (sentence-level) | 93.6 (F1) | 93.5 (Baevski et al., 2019) |
Named Entity Recognition | German | CoNLL 03 (sentence-level) | 87.0 (F1) | 86.4 (Yu et al., 2020) |
Named Entity Recognition | German | CoNLL 03 (06 Revision) (sentence-level) | 90.5 (F1) | 90.3 (Yu et al., 2020) |
Named Entity Recognition | Dutch | CoNLL 02 (sentence-level) | 94.6 (F1) | 93.7 (Yu et al., 2020) |
Named Entity Recognition | Spanish | CoNLL 02 (sentence-level) | 89.1 (F1) | 90.3 (Yu et al., 2020) |
POS Tagging | English | Ritter's | 93.4 (Acc) | 90.1 (Nguyen et al., 2020) |
POS Tagging | English | Ark | 94.4 (Acc) | 94.1 (Nguyen et al., 2020) |
POS Tagging | English | TweeBank v2 | 95.8 (Acc) | 95.2 (Nguyen et al., 2020) |
Aspect Extraction | English | SemEval 2014 Laptop | 87.4 (F1) | 84.3 (Xu et al., 2019) |
Aspect Extraction | English | SemEval 2014 Restaurant | 92.0 (F1) | 87.1 (Wei et al., 2020) |
Aspect Extraction | English | SemEval 2015 Restaurant | 80.3 (F1) | 72.7 (Wei et al., 2020) |
Aspect Extraction | English | SemEval 2016 Restaurant | 81.3 (F1) | 78.0 (Xu et al., 2019) |
Dependency Parsing | English | PTB | 95.7 (LAS) | 95.3 (Wang et al., 2020) |
Semantic Dependency Parsing | English | DM ID | 95.6 (LF1) | 94.4 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | DM OOD | 92.6 (LF1) | 91.0 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PAS ID | 95.8 (LF1) | 95.1 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PAS OOD | 94.6 (LF1) | 93.4 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PSD ID | 83.8 (LF1) | 82.6 (Fernández-González and Gómez-Rodríguez, 2020) |
Semantic Dependency Parsing | English | PSD OOD | 83.4 (LF1) | 82.0 (Fernández-González and Gómez-Rodríguez, 2020) |
The project is based on PyTorch 1.1+ and Python 3.6+. To run our code, install:
pip install -r requirements.txt
The following requirements should be satisfied:
In our code, most of the embeddings can be downloaded automatically (except ELMo for non-English languages). You can also download the embeddings manually. The embeddings we used in the paper can be downloaded here:
Name | Link |
---|---|
GloVe | nlp.stanford.edu/projects/glove |
fastText | github.com/facebookresearch/fastText |
ELMo | github.com/allenai/allennlp |
ELMo (Other languages) | github.com/TalSchuster/CrossLingualContextualEmb |
BERT | huggingface.co/bert-base-cased |
M-BERT | huggingface.co/bert-base-multilingual-cased |
BERT (Dutch) | huggingface.co/wietsedv/bert-base-dutch-cased |
BERT (German) | huggingface.co/bert-base-german-dbmdz-cased |
BERT (Spanish) | huggingface.co/dccuchile/bert-base-spanish-wwm-cased |
BERT (Turkish) | huggingface.co/dbmdz/bert-base-turkish-cased |
XLM-R | huggingface.co/xlm-roberta-large |
XLM-R (CoNLL 02 Dutch) | huggingface.co/xlm-roberta-large-finetuned-conll02-dutch |
XLM-R (CoNLL 02 Spanish) | huggingface.co/xlm-roberta-large-finetuned-conll02-spanish |
XLM-R (CoNLL 03 English) | huggingface.co/xlm-roberta-large-finetuned-conll03-english |
XLM-R (CoNLL 03 German) | huggingface.co/xlm-roberta-large-finetuned-conll03-german |
XLNet | huggingface.co/xlnet-large-cased |
After the embeddings are downloaded, you need to set the path of embeddings in the config file manually. For example in config/conll_03_english.yaml
:
TransformerWordEmbeddings-1:
model: your/embedding/path
layers: -1,-2,-3,-4
pooling_operation: mean
We provide pretrained models for Named Entity Recognition (Sentence-/Document-Level) and Dependency Parsing (PTB) on OneDrive. You can find the corresponding config file in config/
. For the zip files named with doc*.zip
, you need to extract document-level embeddings at first. Please check (Optional) Extract Document Features for BERT Embeddings.
unzip
the zip fileresources/taggers/
To check the accuracy of the model, run:
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test
where --config $config_file
is setting the configureation file.
Here we take CoNLL 2003 English NER as an example. The $config_file
is config/conll_03_english.yaml
.
Currently, we give an instruction for reproducing the results of our NER models is in named_entity_recognition.md. Other tasks can simply follow the guide of named_entity_recognition.md to reproduce the results.
To train the model, run:
CUDA_VISIBLE_DEVICES=0 python train.py --config $config_file
To set the dataset manully, you can set the dataset in the $confile_file
by:
Sequence Labeling:
targets: ner
ner:
Corpus: ColumnCorpus-1
ColumnCorpus-1:
data_folder: datasets/conll_03_new
column_format:
0: text
1: pos
2: chunk
3: ner
tag_to_bioes: ner
tag_dictionary: resources/taggers/your_ner_tags.pkl
Parsing:
targets: dependency
dependency:
Corpus: UniversalDependenciesCorpus-1
UniversalDependenciesCorpus-1:
data_folder: datasets/ptb
add_root: True
tag_dictionary: resources/taggers/your_parsing_tags.pkl
The tag_dictionary
is a path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at the path automatically. The dataset format is: Corpus: $CorpusClassName-$id
, where $id
is the name of datasets (anything you like). You can train multiple datasets jointly. For example:
Corpus: ColumnCorpus-1:ColumnCorpus-2:ColumnCorpus-3
ColumnCorpus-1:
data_folder: ...
column_format: ...
tag_to_bioes: ...
ColumnCorpus-2:
data_folder: ...
column_format: ...
tag_to_bioes: ...
ColumnCorpus-3:
data_folder: ...
column_format: ...
tag_to_bioes: ...
Please refer to Config File for more details.
You need to modifiy the embedding paths in the $config_file
to change the embeddings for concatenation. For example, you need to add bert-large-cased
in the config/conll_03_english.yaml
embeddings:
TransformerWordEmbeddings-0:
layers: '-1'
pooling_operation: first
model: xlm-roberta-large-finetuned-conll03-english
TransformerWordEmbeddings-1:
model: bert-base-cased
layers: -1,-2,-3,-4
pooling_operation: mean
TransformerWordEmbeddings-2:
model: bert-base-multilingual-cased
layers: -1,-2,-3,-4
pooling_operation: mean
TransformerWordEmbeddings-3: # New embeddings
model: bert-large-cased
layers: -1,-2,-3,-4
pooling_operation: mean
...
To archieve state-of-the-art accuracy, one optional approach is fine-tuning the transformer-based embeddings over the task. We use fine-tuned embeddings in huggingface for NER tasks while embeddings in other tasks are fine-tuned by ourselves. Then take the embeddings as an embedding candidate of ACE. Taking fine-tuning BERT model on PTB parsing as an example, run:
CUDA_VISIBLE_DEVICES=0 python train.py --config config/en-bert-finetune-ptb.yaml
After the model is fine-tuned, you will find a tuned BERT model at
ls resources/taggers/en-bert_10epoch_0.5inter_2000batch_0.00005lr_20lrrate_ptb_monolingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_saving_nodev_dependency16/bert-base-cased
Then, replace bert-base-cased
with resources/taggers/en-bert_10epoch_0.5inter_2000batch_0.00005lr_20lrrate_ptb_monolingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_saving_nodev_dependency16/bert-base-cased
in the $config_file
of the ACE model (for example, config/ptb_parsing_model.yaml
).
The config config/en-bert-finetune-ptb.yaml
can be applied to fine-tuning other embeddings in parsing tasks. Here is an example config for fine-tuning NER (sequence labeling tasks): config/en-bert-finetune-ner.yaml
To archieve state-of-the-art accuracy of NER, one optional approach is extracting the document-level features from the BERT embeddings (for RoBERTa, XLM-R and XLNET, we feed the model with the whole document, if you are interested in this part, see embeddings.py). Then take the features as an embedding candidate of ACE. We follow the embedding extraction approach of Yu et al., 2020. We use the sentences with a single word -DOCSTART-
to split the documents. For CoNLL 2002 Spanish, there is not -DOCSTART-
sentences. Therefore, we add a -DOCSTART-
sentence for every 25 sentences. For CoNLL 2002 Dutch, the -DOCSTART-
is in the first sentence of the document, please split the -DOCSTART-
token into a single sentence. For example:
-DOCSTART- -DOCSTART- O
De Art O
tekst N O
van Prep O
het Art O
arrest N O
is V O
nog Adv O
niet Adv O
schriftelijk Adj O
beschikbaar Adj O
maar Conj O
het Art O
bericht N O
werd V O
alvast Adv O
bekendgemaakt V O
door Prep O
een Art O
communicatiebureau N O
dat Conj O
Floralux N B-ORG
inhuurde V O
. Punc O
...
Taking English BERT model on CoNLL English NER as an example, run:
CUDA_VISIBLE_DEVICES=0 python extract_features.py --config config/en-bert-extract.yaml --batch_size 32
If you want to parse a certain file, add train
in the file name and put the file in a certain $dir
(for example, parse_file_dir/train.your_file_name
). Run:
CUDA_VISIBLE_DEVICES=0 python train.py --config $config_file --parse --target_dir $dir --keep_order
The format of the file should be column_format={0: 'text', 1:'ner'}
for sequence labeling or you can modifiy line 337 in train.py
. The parsed results will be in outputs/
.
Note that you may need to preprocess your file with the dummy tags for prediction, please check this issue for more details.
The config files are based on yaml format.
targets
: The target task
ner
: named entity recognitionupos
: part-of-speech taggingchunk
: chunkingast
: abstract extractiondependency
: dependency parsingenhancedud
: semantic dependency parsing/enhanced universal dependency parsingner
: An example for the targets
. If targets: ner
, then the code will read the values with the key of ner
.
Corpus
: The training corpora for the model, use :
to split different corpora.tag_dictionary
: A path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at the path automatically.target_dir
: Save directory.model_name
: The trained models will be save in $target_dir/$model_name
.model
: The model to train, depending on the task.
FastSequenceTagger
: Sequence labeling model. The values are the parameters.SemanticDependencyParser
: Syntactic/semantic dependency parsing model. The values are the parameters.embeddings
: The embeddings for the model, each key is the class name of the embedding and the values of the key are the parameters, see flair/embeddings.py
for more details. For each embedding, use $classname-$id
to represent the class. For example, if you want to use BERT and M-BERT for a single model, you can name: TransformerWordEmbeddings-0
, TransformerWordEmbeddings-1
.trainer
: The trainer class.
ModelFinetuner
: The trainer for fine-tuning embeddings or simply train a task model without ACE.ReinforcementTrainer
: The trainer for training ACE.train
: the parameters for the train
function in trainer
(for example, ReinforcementTrainer.train()
).If you feel the code helpful, please cite:
@inproceedings{wang2020automated,
title = "{{Automated Concatenation of Embeddings for Structured Prediction}}",
author = "Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Zhongqiang and Huang, Fei and Tu, Kewei",
booktitle = "{the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (\textbf{ACL-IJCNLP 2021})}",
month = aug,
year = "2021",
publisher = "Association for Computational Linguistics",
}
Feel free to email your questions or comments to issues or to Xinyu Wang.