fajri91 / sum_liputan6

The first large-scale summarization corpus for the Indonesian language. AACL 2020.

Liputan6: Summarization Corpus for Indonesian

About

Liputan6 is the first large-scale Indonesian corpus for abstractive and extractive summarization. The data covers the years 2000 to 2010 and comes in two sets:

Data Train Dev Test
Canonical 193,883 10,972 10,972
Xtreme 193,883 4,948 3,862

Liputan6 is registered as a new dataset in IndoLEM (Indonesian resource collection encompassing morpho-syntax, semantics, and discourse).

Paper

Fajri Koto, Jey Han Lau, and Timothy Baldwin. Liputan6: A Large-scale Indonesian Dataset for Text Summarization. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020)

Obtaining Liputan6 Data

Disclaimer

Although Liputan6 is a publicly available online news portal, under Indonesian Copyright Law Number 28 Year 2014 this corpus may only be used for non-commercial purposes such as academic research. It is STRONGLY FORBIDDEN to use this corpus, or any summarization model built with it, for commercial purposes. We strongly encourage other researchers not to redistribute the dataset.

Way 1 - By filling in the form

Please fill in this form. A URL to download the Liputan6 corpus will be sent to your email address.

Way 2 - By running the code

Training Neural Models

Test Set Output

We also provide the test set outputs reported in our paper. You can download them here.

Model ROUGE-1 ROUGE-2 ROUGE-L
Lead-2 36.68 20.23 33.71
PTGen 36.10 19.19 33.56
BertExt (mBERT) 37.51 20.15 34.57
BertAbs (mBERT) 39.48 21.59 36.72
BertExtAbs (mBERT) 39.81 21.84 37.02
BertExt (indoBERT) 38.03 20.72 35.07
BertAbs (indoBERT) 40.94 23.01 37.89
BertExtAbs (indoBERT) 41.08 22.85 38.01
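
For readers unfamiliar with the Lead-2 row: a Lead-N baseline simply takes the first N sentences of an article as its summary. The snippet below is a minimal sketch of that idea, not the code used for the paper. The function name `lead_n` and the assumption that an article arrives as a list of already-split sentences are illustrative only.

```python
def lead_n(sentences, n=2):
    """Return the first n sentences of an article joined into one summary string.

    `sentences` is assumed to be a list of sentence strings that have
    already been split; sentence splitting is outside this sketch.
    """
    return " ".join(sentences[:n])


# Hypothetical article, used only to show the call.
article = [
    "First sentence of the article.",
    "Second sentence of the article.",
    "Third sentence of the article.",
]
summary = lead_n(article)  # Lead-2: keeps only the first two sentences
```

Because the baseline involves no training, it is a useful sanity check for the neural models in the table.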

Evaluation

Please install pyrouge to evaluate the summaries.
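
As a rough illustration of what the ROUGE-1 (R1) and ROUGE-2 (R2) scores measure, the sketch below computes a simplified n-gram overlap F1 in pure Python. It is not a substitute for pyrouge: it assumes whitespace tokenization and lowercasing, applies no stemming, and will not reproduce the numbers reported above. The function names `ngrams` and `rouge_n_f1` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between two strings (whitespace tokens, lowercased)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Calling `rouge_n_f1(system_summary, gold_summary, n=2)` gives a quick bigram-overlap check during development; use pyrouge for any score that will be compared with the table.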