Liputan6 is the first large-scale Indonesian corpus for abstractive and extractive summarization. The data covers the years 2000 to 2010 and comes in two sets:
Data | Train | Dev | Test |
---|---|---|---|
Canonical | 193,883 | 10,972 | 10,972 |
Xtreme | 193,883 | 4,948 | 3,862 |
Liputan6 is registered as a new dataset in IndoLEM (an Indonesian resource collection encompassing morpho-syntax, semantics, and discourse).
Fajri Koto, Jey Han Lau, and Timothy Baldwin. Liputan6: A Large-scale Indonesian Dataset for Text Summarization. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020)
Although Liputan6 is a publicly available online news portal, under Indonesian Copyright Law Number 28 Year 2014 this corpus may only be used for non-commercial activities such as academic research. It is STRONGLY FORBIDDEN to use this corpus, or any summarization model created with it, for commercial activities. We strongly encourage other researchers not to redistribute the dataset.
Please fill out this form. A URL to download the Liputan6 corpus will be sent to your email address.
Place `url.json` in this repository. To change the number of threads, please adjust the code manually.
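Before running the scripts, it can help to confirm that `url.json` is present and parses as JSON. The following is a minimal sketch, not part of this repository: the helper name `load_url_file` is hypothetical, and it assumes nothing about the file's internal structure beyond it being valid JSON.

```python
import json
from pathlib import Path


def load_url_file(path="url.json"):
    """Return the parsed content of the URL file (hypothetical helper).

    Raises FileNotFoundError with a readable message if the file is missing.
    The schema of url.json is not assumed here; the parsed object is
    returned as-is.
    """
    url_path = Path(path)
    if not url_path.is_file():
        raise FileNotFoundError(
            f"{url_path} not found; place it in this repository first"
        )
    with url_path.open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_url_file()` from the repository directory either returns the parsed content or fails early with a clear message, which is cheaper than discovering the problem partway through a download.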
pip install -r requirements.txt
python 0_download.py
python 1_preprocessing.py
python 2_create_extractive_label.py
python 3_get_xtreme.py
python 4_make_data_files_pg.py
python 5_make_data_files_presumm_mbert.py
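The scripts above are numbered and appear intended to run in that order. As a convenience, they could be chained from a small driver that stops at the first failure. This is only a sketch: the script names come from the list above, while `build_commands` and `run_pipeline` are hypothetical helpers that are not part of this repository.

```python
import subprocess
import sys

# Script names as listed in the steps above, in their numbered order.
PIPELINE_SCRIPTS = [
    "0_download.py",
    "1_preprocessing.py",
    "2_create_extractive_label.py",
    "3_get_xtreme.py",
    "4_make_data_files_pg.py",
    "5_make_data_files_presumm_mbert.py",
]


def build_commands(scripts=PIPELINE_SCRIPTS, python=sys.executable):
    """Build one [interpreter, script] command per pipeline step."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True aborts on the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)
```

Running `run_pipeline(build_commands())` from the repository directory executes the six steps in sequence. Install the requirements first with `pip install -r requirements.txt`.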
We also provide the test set outputs reported in our paper. You can download them here.
Model | R1 | R2 | RL |
---|---|---|---|
Lead-2 | 36.68 | 20.23 | 33.71 |
PTGen | 36.10 | 19.19 | 33.56 |
BertExt (mBERT) | 37.51 | 20.15 | 34.57 |
BertAbs (mBERT) | 39.48 | 21.59 | 36.72 |
BertExtAbs (mBERT) | 39.81 | 21.84 | 37.02 |
BertExt (indoBERT) | 38.03 | 20.72 | 35.07 |
BertAbs (indoBERT) | 40.94 | 23.01 | 37.89 |
BertExtAbs (indoBERT) | 41.08 | 22.85 | 38.01 |
Please install pyrouge to evaluate the summaries.
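To give a rough idea of what the R1 and R2 columns measure, here is a simplified n-gram overlap F1 in plain Python, together with a lead-style baseline. It is an illustration only and assumes whitespace tokenization and lowercasing. It does not reproduce pyrouge, which wraps the official ROUGE-1.5.5 script and applies its own preprocessing, and it does not cover RL, which is based on the longest common subsequence. The `lead_n` helper reflects the usual meaning of a Lead-N baseline (the first N sentences of the article). All function names here are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 on whitespace tokens (not the official scorer)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lead_n(sentences, n=2):
    """Lead-N baseline: use the first n sentences as the summary."""
    return " ".join(sentences[:n])
```

For example, `rouge_n_f1(lead_n(article_sentences), gold_summary, n=2)` would give a rough ROUGE-2 style score for a Lead-2 summary, where `article_sentences` and `gold_summary` are placeholders for your own data. Use pyrouge for numbers comparable to the table above.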