SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for ID Newspapers 2018 #516

Closed (SamuelCahyawijaya closed this issue 5 months ago)

SamuelCahyawijaya commented 5 months ago

Dataloader name: id_newspaper_2018/id_newspaper_2018.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?id_newspaper_2018

Dataset: id_newspaper_2018
Description: This dataset provides open public access to thousands of Indonesian-language articles from various sources. Each article comes with metadata (source URL, date, and title) along with the content of the article itself. The articles were taken from 7 primary sources (Detik, Kompas, Tempo, CNN Indonesia, Sindo, Republika, Poskota) over the period 01 January 2018 to 20 August 2018. The original dataset also contains data in HTML format, which includes raw data (such as images and CSS) from the online news websites.
Subsets: -
Languages: ind
Tasks: Language Modeling
License: Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage: https://github.com/feryandi/Dataset-Artikel
HF URL: https://huggingface.co/datasets/indonesian-nlp/id_newspapers_2018
Paper URL: -
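
For whoever picks this up, here is a rough sketch of how one article record might be flattened into a language-modeling example. It is not the actual loader: the `Article` field names (`url`, `date`, `title`, `content`) are guessed from the metadata listed in the description, the `id`/`text` output layout is an assumption, and `to_lm_example` is a hypothetical helper. The date check also assumes that the stated period refers to each article's own date.

```python
import datetime as dt
from dataclasses import dataclass

# Period stated in the dataset description.
WINDOW_START = dt.date(2018, 1, 1)
WINDOW_END = dt.date(2018, 8, 20)


@dataclass
class Article:
    """Hypothetical record layout; the field names are assumptions."""
    url: str
    date: dt.date
    title: str
    content: str


def to_lm_example(index: int, article: Article) -> dict:
    """Flatten one article into an (id, text) pair for language modeling.

    Assumes the article date should fall inside the stated period;
    raises ValueError otherwise.
    """
    if not (WINDOW_START <= article.date <= WINDOW_END):
        raise ValueError(f"date {article.date} is outside the stated period")
    # Title and body are joined with a newline; this is one possible choice.
    return {"id": str(index), "text": f"{article.title}\n{article.content}"}
```

In a real loader, the records would come from the HF dataset listed above rather than being constructed by hand, and the column names should be checked against the actual data first.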
raileymontalan commented 5 months ago

self-assign