FerdinandZhong / punctuator

A small seq2seq punctuator tool based on DistilBERT
Apache License 2.0

Distilbert-punctuator


Introduction

Distilbert-punctuator is a Python package that provides a BERT-based punctuator (a fine-tuned version of the pretrained Hugging Face DistilBertForTokenClassification model) with the following three components, each described below: Data Process, Train, and Inference.

Installation

Data Process

Component for pre-processing the training data. To use this component, install the package with: pip install distilbert-punctuator[data_process]

The package provides a simple pipeline for generating training data in NER format.
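To make "NER format" concrete, here is a minimal sketch of how punctuated text can be turned into (token, label) pairs, where each label names the punctuation mark that follows the token. It does not use the package's pipeline. The function name to_ner_pairs and the label set in PUNCT_TO_LABEL are hypothetical, chosen only for illustration; the labels the package actually emits may differ.

```python
# Hypothetical label set for illustration; not taken from the package.
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
PUNCT_CHARS = "".join(PUNCT_TO_LABEL)


def to_ner_pairs(text):
    """Convert punctuated text into (token, label) pairs.

    Each token is lower-cased and stripped of trailing punctuation.
    Its label is the punctuation mark that followed it, or "O" if none did.
    """
    pairs = []
    for raw in text.split():
        # The last character decides the label; default is "O" (no punctuation).
        label = PUNCT_TO_LABEL.get(raw[-1], "O")
        token = raw.rstrip(PUNCT_CHARS)
        if token:  # skip items that were punctuation only
            pairs.append((token.lower(), label))
    return pairs


pairs = to_ner_pairs("Hello, how are you? I am fine.")
for token, label in pairs:
    print(f"{token}\t{label}")
```

A token classification model trained on such pairs learns to predict, for every input token, which punctuation mark (if any) should follow it.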

Example

examples/data_sample.py

Train

Component providing a training pipeline for fine-tuning a pretrained DistilBertForTokenClassification model from Hugging Face. The latest version implements R-Drop enhanced training; see the R-Drop GitHub repo and the R-Drop paper.
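For readers unfamiliar with R-Drop, the idea can be sketched for a single token in plain Python. The same input is passed through the model twice with dropout active, which gives two sets of logits. The loss is the sum of the two negative log-likelihood terms plus a weighted symmetric KL divergence between the two predicted distributions, as formulated in the R-Drop paper. This is an illustration only, not the trainer's code: the function names and the alpha default are assumptions, and a real implementation would operate on batched tensors in PyTorch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def r_drop_loss(logits_a, logits_b, target, alpha=1.0):
    """R-Drop loss for one token (sketch).

    logits_a, logits_b: logits from two forward passes of the same input
    target: index of the gold label
    alpha: weight of the KL regulariser (hypothetical default)
    """
    p, q = softmax(logits_a), softmax(logits_b)
    # Negative log-likelihood of the gold label under both passes.
    nll = -math.log(p[target]) - math.log(q[target])
    # Symmetric KL divergence between the two predictions.
    sym_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return nll + alpha * sym_kl


# Two passes that disagree are penalised more than the NLL terms alone.
print(r_drop_loss([2.0, 0.5], [1.0, 1.5], target=0, alpha=1.0))
print(r_drop_loss([2.0, 0.5], [1.0, 1.5], target=0, alpha=0.0))
```

When both passes produce identical logits the KL term vanishes, so the regulariser only acts where dropout makes the predictions inconsistent.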

Example

examples/english_train_sample.py

Training_arguments:

Arguments required for the training pipeline.

You can also train your own NER models with the trainer provided in this repo. An example can be found in notebooks/R-drop NER.ipynb

Evaluation

Validation of the fine-tuned model.

Example

examples/train_sample.py

Validation_arguments:

Inference

Component providing an inference interface for using the punctuator.

Architecture

 +----------------------+              (child process)
 |   user application   |             +-------------------+
 +                      + <---------->| punctuator server |
 |   +inference object  |             +-------------------+
 +----------------------+

The punctuator is deployed in a child process that communicates with the main process through a pipe connection. The user can therefore initialize an inference object and call its punctuation function whenever needed. The punctuator never blocks the main process except while performing punctuation. The punctuator shuts down gracefully, so the user doesn't need to handle the shutdown manually.
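The pattern described above can be sketched with the standard multiprocessing module. This is not the package's implementation: the class InferenceSketch, the worker function, and the None shutdown sentinel are hypothetical, and _fake_punctuate is a placeholder standing in for the real model. The sketch also assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp


def _fake_punctuate(text):
    """Placeholder for the model: capitalise and append a period."""
    text = text.strip()
    return (text[:1].upper() + text[1:] + ".") if text else text


def _server_loop(conn):
    """Runs in the child process: serve requests until a None sentinel arrives."""
    while True:
        request = conn.recv()
        if request is None:  # graceful shutdown signal
            break
        conn.send([_fake_punctuate(t) for t in request])
    conn.close()


class InferenceSketch:
    """Main-process handle that talks to the child process over a pipe."""

    def __init__(self):
        ctx = mp.get_context("fork")  # assumption: POSIX only
        self._conn, child_conn = ctx.Pipe()
        self._proc = ctx.Process(
            target=_server_loop, args=(child_conn,), daemon=True
        )
        self._proc.start()
        child_conn.close()  # the parent keeps only its own end of the pipe

    def punctuation(self, texts):
        """Send a list of strings and block until the child replies."""
        self._conn.send(texts)
        return self._conn.recv()

    def terminate(self):
        """Ask the child to exit, then wait for it."""
        if self._proc.is_alive():
            self._conn.send(None)
            self._proc.join(timeout=5)
        self._conn.close()


inference = InferenceSketch()
print(inference.punctuation(["hello world", "how are you"]))
inference.terminate()
```

The main process only blocks inside punctuation() while waiting for the reply; between calls the child sits idle on conn.recv() and costs the caller nothing.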

Example

examples/inference_sample.py

Inference_arguments:

Arguments required for the inference pipeline.