IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for Indonesian News Dataset #366

Open SamuelCahyawijaya opened 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_news_dataset

Dataset: id_news_dataset
Description: The dataset compiles articles from seven prominent Indonesian news platforms: Tempo, CNN Indonesia, CNBC Indonesia, Okezone, Suara, Kumparan, and JawaPos. Together they form a broad repository of Indonesian news content. The dataset includes two additional columns: 'embedding', which holds text embeddings extracted with the OpenAI text-embedding-ada-002 model, and 'summary', which holds a concise article summary generated via the ChatGPT API.
License: CC-BY-NC-4.0
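
As a rough illustration of the row layout described above, here is a minimal sketch of a record type and a sanity check a loader could run on each row. Only the 'embedding' and 'summary' column names come from the description; the `source` and `text` fields, the `NewsRecord` and `validate` names, and the exact spelling of the source labels are assumptions made for illustration. The 1536-dimension check reflects the output size of text-embedding-ada-002.

```python
from dataclasses import dataclass
from typing import List

# text-embedding-ada-002 returns 1536-dimensional vectors.
ADA_002_DIM = 1536

# The seven platforms named in the description. How they are actually
# spelled inside the dataset is an assumption.
SOURCES = {
    "Tempo", "CNN Indonesia", "CNBC Indonesia",
    "Okezone", "Suara", "Kumparan", "JawaPos",
}


@dataclass
class NewsRecord:
    source: str             # hypothetical field name
    text: str               # hypothetical field name
    summary: str            # 'summary' column from the description
    embedding: List[float]  # 'embedding' column from the description


def validate(record: NewsRecord) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if record.source not in SOURCES:
        problems.append(f"unknown source: {record.source}")
    if len(record.embedding) != ADA_002_DIM:
        problems.append(
            f"embedding has {len(record.embedding)} dims, expected {ADA_002_DIM}"
        )
    if not record.summary.strip():
        problems.append("empty summary")
    return problems
```

A loader could call `validate` on each row as it is read and log or skip rows that return a non-empty list, which would surface truncated embeddings or missing summaries early.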