Neural CRF Model for Sentence Alignment in Text Simplification

This repository contains the code and resources from the paper "Neural CRF Model for Sentence Alignment in Text Simplification" (Jiang et al., ACL 2020).

Repo Structure:

aligner: Code for the neural CRF sentence aligner.
wiki-manual: The Wiki-Manual dataset. The columns are: label, index of the simple sentence, index of the complex sentence, simple sentence, complex sentence.
wiki-auto: The Wiki-Auto dataset.
annotation_tool: The tool used by in-house annotators to annotate sentence alignments.
simplification: Code for text simplification experiments.
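
The five-column layout of Wiki-Manual described above can be read with a few lines of Python. This is a minimal sketch, not code from this repository: it assumes the files are tab-separated with one sentence pair per line, and the `AlignmentRow` type, the `read_alignment_rows` helper, and the sample label and index values are hypothetical names chosen for illustration. Check the actual files for the delimiter and label vocabulary before relying on it.

```python
import csv
import io
from typing import IO, List, NamedTuple


class AlignmentRow(NamedTuple):
    """One Wiki-Manual row, in the column order listed in this README."""
    label: str
    simple_index: str
    complex_index: str
    simple_sentence: str
    complex_sentence: str


def read_alignment_rows(handle: IO[str]) -> List[AlignmentRow]:
    """Parse rows from a text stream (ASSUMPTION: tab-separated, no quoting)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Skip blank or malformed lines instead of guessing at their meaning.
        if len(fields) != 5:
            continue
        rows.append(AlignmentRow(*fields))
    return rows


# Hypothetical sample line; real label and index values may look different.
sample = io.StringIO(
    "aligned\tsimple-0\tcomplex-3\tThe cat sat.\tThe cat sat on the mat.\n"
)
rows = read_alignment_rows(sample)
```

For a real file, pass an open handle instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points at a file in the wiki-manual directory.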

Checkpoints

Update on Feb. 22, 2023

We have uploaded all fine-tuned BERT checkpoints to the Hugging Face Hub and provide sample code for using them.

We have released checkpoints of the BERT model fine-tuned on the Newsela-Manual and Wiki-Manual datasets: BERT_newsela and BERT_wiki. They were trained using the Hugging Face implementation of the BERT_base architecture in the package pytorch-transformers==1.1.0.
If you want to align other monolingual parallel data, please try the fine-tuned BERT models. They should be able to achieve competitive performance. The performance gain from adding the neural CRF model depends on the structure of the articles. We have some experience in designing the paragraph alignment algorithm and using the neural CRF model to align sentences; feel free to contact us if you would like to discuss.
We have also released the code for our neural CRF sentence alignment model, which you can use to train your own model.
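
As a rough illustration of aligning other monolingual parallel data with a pairwise scorer, the sketch below scores every simple-complex sentence pair and keeps the pairs above a threshold. It is not the repository's aligner and does not include the neural CRF: `align_sentences`, `token_overlap`, and the threshold value are hypothetical, and the token-overlap scorer is only a stand-in so the example runs without downloading a model. In practice, `score_pair` would be replaced by a function that wraps one of the fine-tuned BERT checkpoints.

```python
from typing import Callable, List, Tuple


def token_overlap(simple: str, complex_: str) -> float:
    """Stand-in scorer (ASSUMPTION): Jaccard overlap of lowercased tokens."""
    a = set(simple.lower().split())
    b = set(complex_.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_sentences(
    simple_sents: List[str],
    complex_sents: List[str],
    score_pair: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,  # hypothetical cut-off; tune on held-out data
) -> List[Tuple[int, int, float]]:
    """Return (simple_index, complex_index, score) for pairs at or above threshold."""
    alignments = []
    for i, simple in enumerate(simple_sents):
        for j, complex_ in enumerate(complex_sents):
            score = score_pair(simple, complex_)
            if score >= threshold:
                alignments.append((i, j, score))
    return alignments


pairs = align_sentences(
    ["the cat sat"],
    ["the cat sat down", "dogs bark loudly"],
)
```

Scoring pairs independently like this ignores the order of sentences within an article, which is the kind of structure the neural CRF model is meant to exploit.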

Instructions:

To request the Newsela-Manual and Newsela-Auto datasets, please first obtain access to the Newsela corpus, then contact the authors.
Please use Python 3 to run the code.
We also have pre-processed Wikipedia data, alignments between complex and simple Wikipedia articles, and original sentence and paragraph alignments between Wikipedia article pairs. Please contact us if you would like to use this data.
We also have the original sentence and paragraph alignments between the Newsela articles. Please contact us if you would like to use this data.

Citation

Please cite the following paper if you use the above resources in your research:

@inproceedings{jiang2020neural,
  title={Neural CRF Model for Sentence Alignment in Text Simplification},
  author={Jiang, Chao and Maddela, Mounica and Lan, Wuwei and Zhong, Yang and Xu, Wei},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2020}
}