Scikit-learn style model finetuning for NLP
Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide variety of downstream tasks.
Finetune currently supports TensorFlow implementations of several base models, including BERT, RoBERTa, GPT, GPT2, TextCNN, and TCN (all importable from finetune.base_models).
Section | Description |
---|---|
API Tour | Base models, configurables, and more |
Installation | How to install using pip or directly from source |
Finetune with Docker | Finetune and inference within a Docker Container |
Documentation | Full API documentation |
Finetuning the base language model is as easy as calling Classifier.fit:
from finetune import Classifier
model = Classifier() # Load base model
model.fit(trainX, trainY) # Finetune base model on custom data
model.save(path) # Serialize the model to disk
...
model = Classifier.load(path) # Reload models from disk at any time
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
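The comment on the last line suggests that predictions come back as one dictionary of class probabilities per input. A minimal sketch, assuming that output shape, of reducing each dictionary to its most likely label. The helper `top_label` and the sample values are illustrative and not part of finetune:

```python
def top_label(probabilities):
    """Return the (label, probability) pair with the highest probability."""
    return max(probabilities.items(), key=lambda item: item[1])


# Hypothetical output in the shape shown in the comment above.
predictions = [
    {"class_1": 0.23, "class_2": 0.54, "class_3": 0.23},
    {"class_1": 0.70, "class_2": 0.20, "class_3": 0.10},
]

# Keep only the winning label for each input.
labels = [top_label(p)[0] for p in predictions]
print(labels)
```

If the model returns plain labels rather than probability dictionaries, this step is unnecessary.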
Choose your desired base model from finetune.base_models:
from finetune import Classifier
from finetune.base_models import BERT, RoBERTa, GPT, GPT2, TextCNN, TCN
model = Classifier(base_model=BERT)
Optimize your model with a variety of configurables. A detailed list of all config items can be found in the finetune docs.
model = Classifier(low_memory_mode=True, lr_schedule="warmup_linear", max_length=512, l2_reg=0.01, oversample=True, ...)
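Because configurables are passed as keyword arguments, they can be kept in one dictionary and unpacked when the model is created. The following is a sketch under that assumption: `DEFAULT_CONFIG`, `build_model`, and `StubModel` are hypothetical names, and the stub stands in for a finetune model class so the snippet runs without the library installed.

```python
# Values copied from the example above; adjust them to your task.
DEFAULT_CONFIG = {
    "low_memory_mode": True,
    "lr_schedule": "warmup_linear",
    "max_length": 512,
    "l2_reg": 0.01,
    "oversample": True,
}


def build_model(model_cls, **overrides):
    """Instantiate model_cls with DEFAULT_CONFIG, letting overrides win."""
    config = {**DEFAULT_CONFIG, **overrides}
    return model_cls(**config)


class StubModel:
    """Stand-in for a finetune model class; it only records its keyword arguments."""

    def __init__(self, **kwargs):
        self.config = kwargs


model = build_model(StubModel, max_length=256)
print(model.config["max_length"])
```

With finetune installed, passing `Classifier` in place of `StubModel` should be the equivalent call, assuming it accepts these keyword arguments as in the example above.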
The library supports finetuning for a number of tasks. A detailed description of all target models can be found in the finetune API reference.
from finetune import *
models = (Classifier, MultiLabelClassifier, MultiFieldClassifier, MultipleChoice, # Classify one or more inputs into one or more classes
          Regressor, OrdinalRegressor, MultiFieldRegressor, # Regress on one or more inputs
          SequenceLabeler, Association, # Extract tokens from a given class, or infer relationships between them
          Comparison, ComparisonRegressor, ComparisonOrdinalRegressor, # Compare two documents for a given task
          LanguageModel, MultiTask, # Further pretrain your base models
          DeploymentModel # Wrapper to optimize your serialized models for a production environment
)
For example usage of each of these target types, see the finetune/datasets directory. To keep the examples simple and quick to run, they use smaller versions of the published datasets.
If you have large amounts of unlabeled training data and only a small amount of labeled training data, you can finetune in two steps for best performance.
model = Classifier() # Load base model
model.fit(unlabeledX) # Finetune base model on unlabeled training data
model.fit(trainX, trainY) # Continue finetuning with a smaller amount of labeled data
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
model.save(path) # Serialize the model to disk
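The two-step procedure can be wrapped in a small helper. This is a sketch only: `two_stage_finetune` and `RecordingModel` are hypothetical names, and the stub merely mimics the `fit` calls shown above so the snippet runs without finetune installed.

```python
def two_stage_finetune(model, unlabeled_texts, train_texts, train_labels):
    """Fit on unlabeled text first (if any), then on the labeled set."""
    if unlabeled_texts:
        model.fit(unlabeled_texts)        # step 1: fit without labels
    model.fit(train_texts, train_labels)  # step 2: fit with labels
    return model


class RecordingModel:
    """Stand-in for a finetune model; it records how fit() was called."""

    def __init__(self):
        self.calls = []

    def fit(self, X, Y=None):
        self.calls.append("unlabeled" if Y is None else "labeled")


model = two_stage_finetune(
    RecordingModel(),
    ["some raw text", "more raw text"],
    ["great product", "terrible product"],
    ["positive", "negative"],
)
print(model.calls)
```

Skipping the first step when there is no unlabeled data keeps the helper usable for the single-step case as well.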
Finetune can be installed directly from PyPI using pip:
pip3 install finetune
or installed directly from source:
git clone -b master https://github.com/IndicoDataSolutions/finetune && cd finetune
python3 setup.py develop # symlinks the git directory to your python path
pip3 install tensorflow-gpu --upgrade # or tensorflow-cpu
python3 -m spacy download en # download spacy tokenizer
In order to run finetune on your host, you'll need a working copy of tensorflow-gpu >= 1.14.0 and up-to-date nvidia-driver versions.
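The version requirement above can be checked programmatically before training. The sketch below compares version strings using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers, not part of finetune or TensorFlow, and they deliberately ignore pre-release suffixes.

```python
import re


def parse_version(version):
    """Turn '1.14.0' or '1.15.0rc1' into a tuple of integers such as (1, 15, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    # Pad short versions such as '1.14' so they compare as (1, 14, 0).
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum="1.14.0"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.15.2"))
print(meets_minimum("1.13.1"))
```

On Python 3.8 or later, the installed version string could be obtained with `importlib.metadata.version("tensorflow-gpu")` and passed to `meets_minimum`.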
You can optionally run the provided test suite to ensure installation completed successfully.
pip3 install pytest
pytest
If you'd prefer, you can also run finetune in a docker container. The bash scripts provided assume you have a functional install of docker and nvidia-docker.
git clone https://github.com/IndicoDataSolutions/finetune && cd finetune
# For usage with NVIDIA GPUs
./docker/build_gpu_docker.sh # builds a docker image
./docker/start_gpu_docker.sh # starts a docker container in the background, forwards $PWD to /finetune
docker exec -it finetune bash # starts a bash session in the docker container
For CPU-only usage:
./docker/build_cpu_docker.sh
./docker/start_cpu_docker.sh
Full documentation and an API Reference for finetune are available at finetune.indico.io.