admk / sembr

⚡️ A semantic line breaker that truly breaks lines semantically. Powered by Transformers.
MIT License

Semantic Line Breaker (SemBr)


> When writing text
> with a compatible markup language,
> add a line break
> after each substantial unit of thought.

What is SemBr?

SemBr is a command-line tool powered by Transformer models that breaks lines in a text file at semantic boundaries.

Installation

SemBr is available as a Python package on PyPI. To install it, simply run the following command in your terminal, assuming that you have Python 3.10 or later installed:

pip install sembr

Supported Platforms

SemBr is supported on Linux, Mac and Windows. On machines with CUDA devices, or on Apple Silicon Macs, SemBr will use the GPU / Apple Neural Engine to accelerate inference.

Usage

To use SemBr, run the following command in your terminal:

sembr -i <input_file> -o <output_file>

where <input_file> and <output_file> are the paths to the input and output files respectively.

On the first run, it will download the SemBr model and cache it in ~/.cache/huggingface. Subsequent runs will check for updates and use the cached model if it is up-to-date.

Alternatively, you can pipe the input into sembr and have the output printed to the terminal:

cat <input_file> | sembr

This is especially useful if you want to use SemBr with clipboard managers, for instance, on a Mac:

pbpaste | sembr | pbcopy

Or on Linux:

xclip -o | sembr | xclip -i
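The same piping pattern can be driven from a Python script. The following is a minimal sketch, assuming only that the `sembr` executable is on your PATH and reads stdin and writes stdout as in the commands above; the helper name `run_sembr` and its `command` parameter are hypothetical and not part of SemBr itself.

```python
import subprocess


def run_sembr(text: str, command: tuple[str, ...] = ("sembr",)) -> str:
    """Pipe `text` to the command's stdin and return what it writes to stdout.

    `command` defaults to the bare `sembr` executable; it is a parameter
    only so that the helper can be tried with another stdin/stdout filter.
    """
    result = subprocess.run(
        list(command),
        input=text,           # sent to the child's stdin, like `cat file | sembr`
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling `run_sembr(paragraph)` should then return the re-broken text as a string, provided the model has been downloaded or can be fetched on first use.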

Additionally, you can specify the following options to customize the behavior of SemBr:

What are Semantic Line Breaks?

Semantic Line Breaks or Semantic Linefeeds describe a set of conventions for using insensitive vertical whitespace to structure prose along semantic boundaries.

Why use Semantic Line Breaks?

Semantic line breaks have the following advantages:

Why SemBr?

Converting existing text that was not written with semantic line breaks takes a long time to do manually, and it is surprisingly difficult to do automatically with rule-based methods.

Challenges of rule-based methods

Rule-based heuristics do not work well with the actual semantic structure of the text, often leading to incorrect semantic boundaries. Moreover, these boundaries are hierarchical and nested, and a rule-based approach cannot capture this structure. A semantic line break may occur after a dependent clause, but where to break clauses into lines is challenging to determine without syntactic and semantic reasoning capabilities. For example:

For this reason, I have created SemBr, which uses finetuned Transformer models to predict line breaks at semantic boundaries.

How does SemBr work?

SemBr uses a Transformer model to predict line breaks at semantic boundaries.

A small dataset of text with semantic line breaks was created from my existing LaTeX documents. The dataset was split into training (46,295 lines, 170,681 words and 1,492,952 characters) and test (2,187 lines, 7,564 words and 72,231 characters) datasets.

The data was prepared by extracting line breaks and indent levels from the files, and then converting the result into strings of paragraphs with line breaks removed. The data can then be tokenized using the tokenizer and converted into a dataset of tokens, where each token has a label denoting whether there is a line break before it, and the indent level of the token.
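The labeling step can be illustrated with a small sketch. It is not SemBr's actual data processor: it assumes whitespace splitting in place of the model's tokenizer and measures indent in leading spaces, and the function names are hypothetical. The indentation in the sample text is added purely for illustration.

```python
def lines_to_labeled_words(lines: list[str]) -> list[tuple[str, bool, int]]:
    """Flatten line-broken text into (word, break_before, indent) triples.

    `break_before` is True when the word started a new line in the source
    (never for the very first word), and `indent` is the number of leading
    spaces on the line the word came from.
    """
    labeled: list[tuple[str, bool, int]] = []
    for line in lines:
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        for position, word in enumerate(stripped.split()):
            # A break precedes the first word of every line except the first.
            break_before = position == 0 and bool(labeled)
            labeled.append((word, break_before, indent))
    return labeled


def to_paragraph(labeled: list[tuple[str, bool, int]]) -> str:
    """Rebuild the paragraph with all line breaks removed."""
    return " ".join(word for word, _, _ in labeled)


source = [
    "When writing text",
    "with a compatible markup language,",
    "    add a line break",
    "after each substantial unit of thought.",
]
labeled = lines_to_labeled_words(source)
```

Here `to_paragraph(labeled)` gives the single-line input text, while the `break_before` and `indent` fields are the per-token labels a classifier would be trained to predict.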

For LaTeX documents, there are two types of line breaks: one with a normal line break that adds implicit spacing (e.g. line a⏎line b) and one with no spacing (e.g. line a%⏎line b). The data processor also tries to preserve the LaTeX syntax of the text by adding and removing comment symbols (%), if necessary.
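The two break types can be sketched as a pair of helpers. This is a simplified illustration, not SemBr's implementation: it assumes a trailing `%` is a comment unless it is written as `\%`, ignores comments in the middle of a line, and uses hypothetical function names.

```python
def join_latex_lines(lines: list[str]) -> str:
    """Join lines the way LaTeX reads them.

    A plain newline acts as a space, while a trailing `%` comments out
    the newline so that no space is produced.
    """
    paragraph = ""
    for line in lines:
        if line.endswith("%") and not line.endswith("\\%"):
            paragraph += line[:-1]   # drop the comment symbol, add no space
        else:
            paragraph += line + " "  # the newline becomes implicit spacing
    return paragraph.rstrip(" ")


def break_latex_text(text: str, offsets: list[int]) -> str:
    """Insert a line break before each character offset in `offsets`.

    If the break falls right after a space, the newline replaces it;
    otherwise a `%` is added so LaTeX does not insert unwanted spacing.
    """
    pieces = []
    start = 0
    for offset in sorted(offsets):
        piece = text[start:offset]
        if piece.endswith(" "):
            pieces.append(piece[:-1])
        else:
            pieces.append(piece + "%")
        start = offset
    pieces.append(text[start:])
    return "\n".join(pieces)
```

With these helpers, breaking `line a line b` after the space yields two plain lines, whereas breaking `foo\emph{bar}` before the backslash yields `foo%` followed by `\emph{bar}`, so joining the lines again recovers the original text in both cases.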

The pretrained masked language model is then finetuned as a token classifier on the training dataset to predict the labels of the tokens. The model with the best F1 score for correctly predicting the existence of a line break on the test set is saved. The finetuning logs for the following models can be found in this WandB report:

Performance

Current inference speed on an M2 MacBook Pro is about 850 words per second on bert-small with the default options, and the memory usage is about 1.70 GB.

Line-breaking accuracy is difficult to measure, and the locations of line breaks can also be subjective. On the test set, the per-token line break accuracy of the models is >95%, with ~80% F1 scores. Because line breaks are sparse, accuracy is not a good metric for the performance of the model, so I used the F1 score instead to select the best models.
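To see why accuracy is misleading for sparse labels, consider the following sketch. The labels are synthetic and chosen for illustration only; they are not SemBr's test data or results, and the function name is hypothetical.

```python
def accuracy_and_f1(truth: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Return (accuracy, F1) for binary per-token line break labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)      # breaks correctly predicted
    fp = sum(1 for t, p in pairs if p and not t)  # breaks predicted where none exist
    fn = sum(1 for t, p in pairs if t and not p)  # breaks that were missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined here as 0.0 when there are no
    # true or predicted breaks at all.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# 100 synthetic tokens, of which every tenth one is preceded by a line break.
truth = [i % 10 == 0 for i in range(100)]

# A degenerate model that never predicts a break.
never_break = [False] * 100

# A model that misses two breaks and predicts two spurious ones.
imperfect = list(truth)
imperfect[0] = imperfect[10] = False
imperfect[1] = imperfect[11] = True

never_scores = accuracy_and_f1(truth, never_break)
imperfect_scores = accuracy_and_f1(truth, imperfect)
```

On this synthetic data, the model that never breaks a line still reaches 90% accuracy with an F1 score of 0, while the imperfect model has 96% accuracy and an F1 score of 0.8, which is why F1 separates useful models from trivial ones more clearly.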

Improvements and TODOs

Related Projects and References

Sentence splitting:

Semantic line breaking: