Yale-LILY / SummerTime

An open-source text summarization toolkit for non-experts. EMNLP'2021 Demo
https://arxiv.org/abs/2108.12738
Apache License 2.0

Murori: Adds more datasets, including SUMMscreen, QMsum, Arxiv, XSum, PubmedQA #35

Closed (MuroriM closed this 3 years ago)

MuroriM commented 3 years ago

Added the SUMMscreen and QMsum datasets. These changes affect dataset/non_huggingface_datasets.py and dataset/__init__.py.

Made a few minor fixes in:
huggingface_datasets.py - corrects the types of certain SummInstance variables
tests/dataset_test.py - corrects the check for whether SummInstance is a list or string
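For illustration, a list-or-string check of the kind described for tests/dataset_test.py might look roughly like the sketch below. This is not the code from this PR: the SummInstance class here is a hypothetical stand-in, its field names (source, summary, query) are assumptions, and both helper functions are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class SummInstance:
    # Hypothetical stand-in for the toolkit's SummInstance.
    # The field names and types below are assumptions, not the real class.
    source: Union[str, List[str]]
    summary: Union[str, List[str]]
    query: Optional[str] = None


def is_str_or_str_list(value) -> bool:
    """Return True if value is a string, or a list containing only strings."""
    if isinstance(value, str):
        return True
    # An empty list passes, since all() over no items is True.
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def check_instance_types(instance: SummInstance) -> bool:
    """Check that both source and summary are a string or a list of strings."""
    return is_str_or_str_list(instance.source) and is_str_or_str_list(instance.summary)
```

A dataset test could then iterate over a few instances from each split and assert that check_instance_types returns True for every one of them.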

MuroriM commented 3 years ago

Adds the Arxiv and Xsum datasets. I am still testing the Arxiv dataset because of its enormous size, so if there is an error in the code, I'll push more commits to fix it.

The goal is to have all commits in by tomorrow morning.

MuroriM commented 3 years ago

Adds the Pubmed QA dataset to cover the medical domain.

Modifies: dataset/huggingface_datasets.py, dataset/__init__.py

niansong1996 commented 3 years ago

Nice work! Are all five datasets done and equipped with tests?