This repository contains the code and resources from the paper "Neural CRF Model for Sentence Alignment in Text Simplification" (Jiang et al., ACL 2020).
aligner: Code for the neural CRF sentence aligner.
wiki-manual: The Wiki-Manual dataset. The columns are: label, index of the simple sentence, index of the complex sentence, simple sentence, complex sentence.
wiki-auto: The Wiki-Auto dataset.
annotation_tool: The tool used by in-house annotators to annotate sentence alignments.
simplification: Code for the text simplification experiments.
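The five-column layout described for wiki-manual can be read with a few lines of Python. The sketch below is illustrative only: it assumes the columns are tab-separated, which this README does not state, and the names `AlignmentRow` and `read_alignment_rows` are hypothetical helpers, not part of this repository. The sample values are made-up placeholders, not real dataset entries.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class AlignmentRow(NamedTuple):
    """One row, in the column order given above (hypothetical helper type)."""
    label: str
    simple_index: str
    complex_index: str
    simple_sentence: str
    complex_sentence: str


def read_alignment_rows(handle: TextIO) -> Iterator[AlignmentRow]:
    """Yield rows from a text stream, ASSUMING tab-separated columns."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 5:
            # Skip lines that do not match the five-column layout.
            continue
        yield AlignmentRow(*fields)


# Placeholder data standing in for a real file; values are invented.
sample = io.StringIO(
    "LABEL_A\tS-0\tC-0\tA short sentence.\tA somewhat longer sentence.\n"
    "malformed line without tabs\n"
)
rows = list(read_alignment_rows(sample))
print(len(rows), rows[0].simple_sentence)
```

To read an actual file, pass an open file handle instead of the `io.StringIO` object, and adjust the delimiter if the files turn out to use a different separator.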
We have uploaded all fine-tuned BERT checkpoints to the Hugging Face Hub and provide sample code for using them.
The checkpoints are BERT models fine-tuned on the Newsela-Manual and Wiki-Manual datasets: BERT_newsela and BERT_wiki. They were trained using the Hugging Face implementation of the BERT_base architecture in the package pytorch-transformers==1.1.0.
To request the Newsela-Manual and Newsela-Auto datasets, please first obtain access to the Newsela corpus, then contact the authors.
Please use Python 3 to run the code.
We also have pre-processed Wikipedia data, alignments between complex and simple Wikipedia articles, and the original sentence and paragraph alignments between Wikipedia article pairs. Please contact us if you would like to use this data.
We also have the original sentence and paragraph alignments between the Newsela articles. Please contact us if you would like to use this data.
Please cite the following paper if you use the above resources in your research:
@inproceedings{jiang2020neural,
title={Neural CRF Model for Sentence Alignment in Text Simplification},
author={Jiang, Chao and Maddela, Mounica and Lan, Wuwei and Zhong, Yang and Xu, Wei},
booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
year={2020}
}