This repository is an extension of my BiLSTM-CNN-CRF implementation.
It integrates the ELMo representations from the publication Deep contextualized word representations (Peters et al., 2018) into the BiLSTM-CNN-CRF architecture and can improve the performance significantly for different sequence tagging tasks.
The system is easy to use, optimized for high performance, and highly configurable.
Requirements:
Note: This implementation might be incompatible with different (e.g. more recent) versions of the frameworks. See docker/requirements.txt for a full list of all Python package requirements.
For an IPython Notebook with a simple example of how to use ELMo representations for sentence classification, see Keras_ELMo_Tutorial.ipynb.
This code is an extension of the emnlp2017-bilstm-cnn-crf implementation. Most examples can be used with only slight adaptation. Also please see that repository for an explanation about the definition of the datasets, the configuration of the hyperparameters, how to use it for multi-task learning, or how to create custom features.
Most aspects from emnlp2017-bilstm-cnn-crf work the same in this implementation.
This repository contains experimental software and is under active development. If you find the implementation useful, please cite the following paper: Alternative Weighting Schemes for ELMo Embeddings
@article{Reimers:2019,
author = {Reimers, Nils and Gurevych, Iryna},
title = {{Alternative Weighting Schemes for ELMo Embeddings}},
journal = {CoRR},
volume = {abs/1904.02954},
year = {2019},
url = {https://arxiv.org/abs/1904.02954}
}
Contact person: Nils Reimers, reimers@ukp.informatik.tu-darmstadt.de
https://www.ukp.tu-darmstadt.de/ https://www.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.
In my publication Alternative Weighting Schemes for ELMo Embeddings, I show that it is often sufficient to use only the first two layers of ELMo. For various tasks, the third layer led to no significant improvement. Reducing the ELMo model from three to two layers increases the training speed by up to 50%.
You can download the reduced, pre-trained models from here:
This reduced ELMo model is also compatible with the models from AllenNLP: just replace the options_file / weight_file in your config with the provided URLs.
In order to run the code, Python 3.6 or higher is required. The code is based on Keras 2.2.0 and as backend I recommend Tensorflow 1.8.0. I cannot ensure that the code works with different versions for Keras / Tensorflow or with different backends for Keras.
To get the ELMo representations, AllenNLP is required. The AllenNLP installation instructions describe a convenient way to set up a virtual environment with the correct Python version.
Conda can be used to set up a virtual environment with the required version of Python (3.6).
Create a Conda environment with Python 3.6
conda create -n elmobilstm python=3.6
Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to run this code.
source activate elmobilstm
You can use pip to install the dependencies.
pip install allennlp==0.5.1 tensorflow==1.8.0 Keras==2.2.0
In docker/requirements.txt you find a full list of all used packages. You can install them via:
pip install -r docker/requirements.txt
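Since other framework versions may be incompatible, it can help to compare the installed versions with the pinned ones before training. The following is a minimal sketch, not part of this repository: the helper names find_mismatches and installed_versions are hypothetical, and the lookup assumes Python 3.8 or newer for importlib.metadata (on Python 3.6 the importlib_metadata backport offers the same functions).

```python
# Pinned versions taken from the pip command above.
PINNED = {"allennlp": "0.5.1", "tensorflow": "1.8.0", "Keras": "2.2.0"}


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (installed_version, pinned_version)} for each mismatch.

    `installed` maps package names to version strings. A package that is
    missing from `installed` is reported with an installed version of None.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions(packages):
    """Look up installed versions (assumption: Python 3.8+ for importlib.metadata)."""
    from importlib.metadata import PackageNotFoundError, version

    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            pass  # not installed: leave it out so it is reported as None
    return result


if __name__ == "__main__":
    report = find_mismatches(installed_versions(PINNED))
    for name, (found, wanted) in report.items():
        print(f"{name}: installed {found}, expected {wanted}")
```

An empty report means the three pinned packages match; it does not check the remaining entries in docker/requirements.txt.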
The docker-folder contains an example of how to create a Docker image that contains all required dependencies. It can be used to run your code within that container. See the docker-folder for more details.
If the installation was successful, you can test the code by running:
python Train_Chunking.py
This trains the ELMo-BiLSTM-CRF architecture on the CoNLL 2000 chunking dataset.
See Train_Chunking.py for an example of how to train and evaluate this implementation. The code assumes a CoNLL-formatted dataset like the CoNLL 2000 dataset for chunking.
For training, you specify the datasets you want to train on:
datasets = {
    'conll2000_chunking':  # Name of the dataset
        {'columns': {0: 'tokens', 1: 'POS', 2: 'chunk_BIO'},  # CoNLL format of the input data: column 0 contains tokens, column 1 POS tags, and column 2 chunk information in BIO encoding
         'label': 'chunk_BIO',  # Which column we want to predict
         'evaluate': True,  # Should we evaluate on this task? Always set to True for single-task setups
         'commentSymbol': None}  # Lines in the input data starting with this string are skipped. Can be used to skip comments
}
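To illustrate how the columns and commentSymbol entries relate to a CoNLL-formatted file, here is a minimal sketch of a reader. It is not the reader used by this repository: the function name read_conll and its return format are assumptions made for illustration, and the sample lines are made up in the style of the CoNLL 2000 data.

```python
def read_conll(lines, columns, comment_symbol=None):
    """Parse CoNLL-style lines into a list of sentences.

    Each sentence is a dict that maps a column name (e.g. 'tokens') to the
    list of values found in that column. Blank lines separate sentences.
    """
    sentences = []
    current = {name: [] for name in columns.values()}
    size = 0  # number of tokens collected for the current sentence

    for raw in lines:
        line = raw.strip()
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue  # skip comment lines
        if not line:
            if size:  # a blank line closes the current sentence
                sentences.append(current)
                current = {name: [] for name in columns.values()}
                size = 0
            continue
        fields = line.split()
        for index, name in columns.items():
            current[name].append(fields[index])
        size += 1

    if size:  # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


sample = """Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP

He PRP B-NP
reckons VBZ B-VP
""".splitlines()

columns = {0: 'tokens', 1: 'POS', 2: 'chunk_BIO'}
sentences = read_conll(sample, columns)
print(sentences[0]['tokens'])
print(sentences[1]['chunk_BIO'])
```

The label entry of the dataset definition would then select which of these per-sentence lists serves as the prediction target.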
For more details, see the emnlp2017-bilstm-cnn-crf implementation.
The ELMoWordEmbeddings class provides methods for the efficient computation of ELMo representations. It has the following parameters:
embeddings_file: The ELMo paper concatenates traditional word embeddings, like GloVe, with the context-dependent embeddings. With embeddings_file you can pass a path to a pre-trained word embeddings file. You can set it to none if you don't want to use traditional word embeddings.
elmo_options_file and elmo_weight_file: AllenNLP provides different pretrained ELMo models; these two parameters point to the options and weights of the model to use.
elmo_mode: Set to average if you want all 3 layers to be averaged. Set to last if you want to use only the final layer of the ELMo language model.
elmo_cuda_device: Can be set to the ID of the GPU which should compute the ELMo embeddings. Set to -1 to run ELMo on the CPU. Using a GPU drastically reduces the computation time.
The computation of ELMo representations is computationally expensive. A CNN is used to map the characters of a token to a dense vector. These dense vectors are then fed through two BiLSTMs. The representation of each token and the two outputs of the BiLSTMs are used to form the final context-dependent word embedding.
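The two elmo_mode settings described above can be sketched in a few lines. This is an illustration only, not the code of ELMoWordEmbeddings: the function name combine_layers is hypothetical, plain Python lists stand in for the real arrays, and the toy vectors have 2 dimensions instead of 1024.

```python
def combine_layers(layers, mode):
    """Combine per-layer token vectors into one vector per token.

    `layers` is a list of layers (CNN output first, last BiLSTM output last).
    Each layer is a list of token vectors, and all layers have the same shape.
    """
    if mode == "last":
        # Use only the output of the final layer.
        return [list(vector) for vector in layers[-1]]
    if mode == "average":
        n_layers = len(layers)
        # zip(*layers) groups the vectors of one token across all layers;
        # zip(*token_vectors) groups one dimension across all layers.
        return [
            [sum(values) / n_layers for values in zip(*token_vectors)]
            for token_vectors in zip(*layers)
        ]
    raise ValueError(f"unknown elmo_mode: {mode!r}")


# Three layers, two tokens, two dimensions per token.
layers = [
    [[1.0, 2.0], [0.0, 0.0]],
    [[3.0, 4.0], [3.0, 3.0]],
    [[5.0, 6.0], [6.0, 9.0]],
]
print(combine_layers(layers, "average"))  # [[3.0, 4.0], [3.0, 4.0]]
print(combine_layers(layers, "last"))     # [[5.0, 6.0], [6.0, 9.0]]
```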
In order to speed up training, the context-dependent word embeddings can be cached. Those embeddings then only need to be computed in the first epoch; in subsequent epochs, they are read from the cache.
To enable caching, set embLookup.cache_computed_elmo_embeddings to True:
embLookup = ELMoWordEmbeddings(embeddings_file, elmo_options_file, elmo_weight_file, elmo_mode, elmo_cuda_device)
#...
embLookup.cache_computed_elmo_embeddings = True
This method requires about 12 KB of memory per token. For large datasets, you will need a few gigabytes of RAM.
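The 12 KB figure can be reproduced with a short calculation, assuming that each token stores 3 layers of 1024 values as 4-byte floats (the float width is an assumption, not something stated in this README). The helper name elmo_cache_bytes and the 250,000-token dataset size are hypothetical.

```python
def elmo_cache_bytes(n_tokens, n_layers=3, dim=1024, bytes_per_float=4):
    """Estimate the memory needed to cache ELMo vectors for n_tokens tokens."""
    return n_tokens * n_layers * dim * bytes_per_float


# Per-token cost in KiB: 3 * 1024 * 4 bytes = 12288 bytes.
print(elmo_cache_bytes(1) / 1024)  # 12.0

# Rough estimate in GiB for a hypothetical dataset with 250,000 tokens.
print(round(elmo_cache_bytes(250_000) / 1024 ** 3, 2))
```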
The ELMoWordEmbeddings class implements a caching mechanism for a quick lookup that maps a sentence to the context-dependent word representations of all tokens in the sentence.
You can run Create_ELMo_Cache.py to iterate through all sentences in your dataset and create the ELMo embeddings for them. It stores these embeddings in the file embeddings/elmo_cache_[DatasetName].pkl.
Once you have created such a cache, you can load it in your experiments:
embLookup = ELMoWordEmbeddings(embeddings_file, elmo_options_file, elmo_weight_file, elmo_mode, elmo_cuda_device)
embLookup.loadCache('embeddings/elmo_cache_conll2000_chunking.pkl')
If a sentence is in the cache, the cached representations for all tokens in that sentence are used. This means the ELMo embeddings for a dataset only need to be computed once.
Note: The cache file can become rather large, as 3*1024 float numbers per token must be stored. The cache file requires about 3.7 GB for the CoNLL 2000 chunking dataset with about 13,000 sentences.
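The sentence-level lookup described above can be sketched as a dictionary keyed by the tuple of tokens and stored with pickle. This is a simplified illustration and not the implementation in ELMoWordEmbeddings: the class SentenceCache, its methods, and the stand-in function fake_elmo are all hypothetical, and the actual format of the .pkl cache files may differ.

```python
import os
import pickle
import tempfile


class SentenceCache:
    """Maps a sentence (tuple of tokens) to its precomputed embeddings."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn  # called only on a cache miss
        self.cache = {}

    def lookup(self, tokens):
        key = tuple(tokens)  # lists are not hashable, tuples are
        if key not in self.cache:
            self.cache[key] = self.compute_fn(tokens)
        return self.cache[key]

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.cache, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load(self, path):
        with open(path, "rb") as f:
            self.cache.update(pickle.load(f))


calls = []


def fake_elmo(tokens):
    """Stand-in for the expensive ELMo computation (one toy value per token)."""
    calls.append(list(tokens))
    return [[float(len(token))] for token in tokens]


cache = SentenceCache(fake_elmo)
sentence = ["He", "reckons"]
first = cache.lookup(sentence)
second = cache.lookup(sentence)  # cache hit: fake_elmo is not called again

path = os.path.join(tempfile.mkdtemp(), "elmo_cache_demo.pkl")
cache.save(path)

restored = SentenceCache(fake_elmo)
restored.load(path)
third = restored.lookup(sentence)  # served from the loaded file
print(len(calls))  # number of times fake_elmo actually ran
```

Keying on the whole sentence rather than on single tokens matters because the representations are context dependent: the same token gets different vectors in different sentences.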
This repository is under active development as I'm currently running several experiments that involve ELMo embeddings.
If you have questions, feedback or find bugs, please send an email to me: reimers@ukp.informatik.tu-darmstadt.de