Introduced in Ein-Dor et al. (2020), this is a framework for experimenting with text classification tasks. The focus is on low-resource scenarios, examining how active learning (AL) can be used in combination with classification models.
The framework includes a selection of labeled datasets, machine learning models and active learning strategies (see Built-in Implementations below), and can be easily adapted for additional setups and scenarios.
Table of contents
- Running active learning experiments
- Adapting to additional scenarios
Currently, the framework requires Python 3.7.

Clone the repository locally:

git clone https://github.com/IBM/low-resource-text-classification-framework

Install the project dependencies:

pip install -r lrtc_lib/requirements.txt

Windows users also need to download the latest Microsoft Visual C++ Redistributable for Visual Studio in order to support TensorFlow.
Run the dataset preparation script:

lrtc_lib/download_and_prepare_datasets.sh

This script downloads the datasets with built-in support.

The ExperimentRunner class enables running experiments in the vein of Ein-Dor et al. (2020), i.e. an experimental flow where an initial seed of labeled instances is used to train a model, and then several iterations of active learning are performed. In each active learning iteration, the set of labeled instances is expanded with the batch of examples selected by the active learning module, and a new model is trained on this larger set. Implementations of ExperimentRunner vary in how the initial seed of labeled instances is selected.
The three scenarios described in the paper are implemented by:
The experiment flow can be performed on a custom combination of datasets, model types and active learning strategies.
To run an experiment from a terminal, go to the repository directory (usually <path_to_python_projects>/low-resource-text-classification-framework) and run python -m path.to.module, for example:

python -m lrtc_lib.experiment_runners.experiment_runner_imbalanced_practical
Alternatively, an IDE such as PyCharm can be used.
The main function of each ExperimentRunner specifies all the experimental parameters. For information on all the dataset and category names available for running experiments, run loaded_datasets_info.py using python -m lrtc_lib.data_access.loaded_datasets_info.
These are the steps for integrating a new classification model:
Implement a new TrainAndInferAPI
Machine learning models are integrated by adding a new implementation of the TrainAndInferAPI. The main functions are train and infer:
Train a new model and return a unique model identifier that will be used for inference.
def train(self, train_data: Sequence[Mapping], dev_data: Sequence[Mapping], test_data: Sequence[Mapping],
train_params: dict) -> str
Infer a given sequence of elements and return the results.
def infer(self, model_id, items_to_infer: Sequence[Mapping], infer_params: dict, use_cache=True) -> dict:
Returns a dictionary with at least the "labels" key, where the value is a list of numeric labels for each element in items_to_infer. Additional keys (with list values of the same length) can be passed, e.g. {"labels": [1, 0], "gradients": [[0.24, -0.39, -0.66, 0.25], [0.14, 0.29, -0.26, 0.16]]}
Specify a new ModelType in ModelTypes
Return the newly implemented TrainAndInferAPI in TrainAndInferFactory
By default, the system assumes that active learning strategies requiring special inference outputs (e.g. text embeddings) are not supported by your new model. If your model does support such outputs, add it to the appropriate category in get_compatible_models in strategies.py.
Set your ModelType in one of the ExperimentRunners, and run
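The steps above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the MajorityClassModel name, the in-memory model store, and the stand-in TrainAndInferAPI base class are assumptions made for a self-contained example; only the train/infer signatures come from the documentation above.

```python
import uuid
from collections import Counter
from typing import Mapping, Sequence


class TrainAndInferAPI:  # stand-in for the framework's base class
    pass


class MajorityClassModel(TrainAndInferAPI):
    """Toy model that always predicts the most frequent training label."""

    def __init__(self):
        self._models = {}  # model_id -> majority label

    def train(self, train_data: Sequence[Mapping], dev_data: Sequence[Mapping],
              test_data: Sequence[Mapping], train_params: dict) -> str:
        # "Train" by memorizing the majority label, and return a unique model id
        majority_label = Counter(x["label"] for x in train_data).most_common(1)[0][0]
        model_id = str(uuid.uuid4())
        self._models[model_id] = majority_label
        return model_id

    def infer(self, model_id, items_to_infer: Sequence[Mapping],
              infer_params: dict, use_cache=True) -> dict:
        # Return at least the "labels" key: one numeric label per input element
        label = self._models[model_id]
        return {"labels": [label for _ in items_to_infer]}
```

In the real framework, the new class would then be registered under a new ModelType and returned from TrainAndInferFactory, as in steps 2-3 above.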
These are the steps for integrating a new active learning approach:
Implement a new ActiveLearner
Active learning modules inherit from the ActiveLearner API. The main function to implement is get_recommended_items_for_labeling:
def get_recommended_items_for_labeling(self, workspace_id: str, model_id: str, dataset_name: str,
category_name: str, sample_size: int = 1) -> Sequence[TextElement]:
This function returns a batch of sample_size elements suggested by the active learning module for a given dataset and category, based on the outputs of model model_id.
Optionally, the ActiveLearner can also implement the function get_per_element_score; in that case, the active learning module does not just return a batch of selected elements, but also assigns a score to each text element.
Specify a new ActiveLearningStrategy in ActiveLearningStrategies
Return your new ActiveLearner in ActiveLearningFactory
If the active learner requires particular outputs from the machine learning model, update get_compatible_models
accordingly. For instance, if the strategy relies on model embeddings, add it to the set of embedding-based strategies.
Set your ActiveLearningStrategy in one of the ExperimentRunners, and run
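As an illustration of the steps above, here is a minimal random-sampling learner. The stand-in ActiveLearner base class, the simplified TextElement type, and the idea of passing the unlabeled pool to the constructor are assumptions made for a self-contained sketch; only the get_recommended_items_for_labeling signature comes from the documentation above.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TextElement:  # simplified stand-in for the framework's TextElement
    uri: str
    text: str


class ActiveLearner:  # stand-in for the framework's base class
    pass


class RandomSamplingLearner(ActiveLearner):
    """Toy strategy: suggest a random batch of unlabeled elements."""

    def __init__(self, unlabeled_pool: Sequence[TextElement], seed: int = 0):
        self._pool = list(unlabeled_pool)
        self._rng = random.Random(seed)

    def get_recommended_items_for_labeling(self, workspace_id: str, model_id: str,
                                           dataset_name: str, category_name: str,
                                           sample_size: int = 1) -> Sequence[TextElement]:
        # Random sampling ignores model outputs; most strategies would instead
        # rank elements using the predictions of model_id
        return self._rng.sample(self._pool, min(sample_size, len(self._pool)))
```

The new class would then be registered as a new ActiveLearningStrategy and returned from ActiveLearningFactory, as described above.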
These are the steps for adding a new dataset:

Prepare three csv files: train.csv, dev.csv, and test.csv. Each file must contain the columns label and text, and may have additional columns. Place the files under lrtc_lib/data/available_datasets/<new_dataset_name>
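For instance, for a hypothetical dataset named my_dataset, the expected layout would be:

```
lrtc_lib/data/available_datasets/my_dataset/
├── train.csv
├── dev.csv
└── test.csv
```

where each csv file starts with a header row containing at least the text and label columns.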
Create a processor for the new dataset by extending CsvProcessor (which implements DataProcessorAPI) and place it under lrtc_lib/data_access/processors.

The CsvProcessor __init__ function looks like this:
def __init__(self, dataset_name: str, dataset_part: DatasetPart, text_col: str = 'text',
label_col: str = 'label', context_col: str = None,
doc_id_col: str = None,
encoding: str = 'utf-8'):
By default, the text column is text, the label column is label, and the file encoding is utf-8. For example, here is the processor for DBPedia (which uses the default values of CsvProcessor):
class DbpediaProcessor(CsvProcessor):
    def __init__(self, dataset_part: DatasetPart):
        super().__init__(dataset_name='dbpedia', dataset_part=dataset_part)
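For a dataset whose csv files use non-default column names, the corresponding constructor parameters can be overridden. The sketch below uses a hypothetical my_reviews dataset with columns named review and sentiment; the stand-in CsvProcessor base and DatasetPart enum are simplified versions of the framework's classes, included only to make the example self-contained.

```python
from enum import Enum


class DatasetPart(Enum):  # simplified stand-in for the framework's enum
    TRAIN = "train"
    DEV = "dev"
    TEST = "test"


class CsvProcessor:  # stand-in that only records the constructor arguments
    def __init__(self, dataset_name, dataset_part, text_col='text',
                 label_col='label', context_col=None, doc_id_col=None,
                 encoding='utf-8'):
        self.dataset_name = dataset_name
        self.dataset_part = dataset_part
        self.text_col = text_col
        self.label_col = label_col
        self.encoding = encoding


class MyReviewsProcessor(CsvProcessor):
    # Hypothetical dataset whose csv columns are 'review' and 'sentiment'
    def __init__(self, dataset_part: DatasetPart):
        super().__init__(dataset_name='my_reviews', dataset_part=dataset_part,
                         text_col='review', label_col='sentiment')
```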
If more flexibility is needed, implement DataProcessorAPI
directly.
Add the new processor to data_processor_factory. Note that in this step you define the name of the new dataset.

Run load_dataset with the new dataset name (as defined in data_processor_factory) to generate dump files under data/data_access_dumps (for the documents and text elements of the dataset) and data/oracle_access_dumps (for the gold labels of the text elements).

* Loading the ISEAR dataset requires installing additional dependencies before running the installation script, and is only supported on Mac/Linux. Specifically, you will need to install mdbtools on your machine and then pip install pandas_access.
Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz and Noam Slonim (2020). Active Learning for BERT: An Empirical Study. EMNLP 2020
Please cite:
@inproceedings{ein-dor-etal-2020-active,
title = "Active Learning for {BERT}: An Empirical Study",
author = "Ein-Dor, Liat and
Halfon, Alon and
Gera, Ariel and
Shnarch, Eyal and
Dankin, Lena and
Choshen, Leshem and
Danilevsky, Marina and
Aharonov, Ranit and
Katz, Yoav and
Slonim, Noam",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.638",
pages = "7949--7962",
}
This work is released under the Apache 2.0 license. The full text of the license can be found in LICENSE.