Create dataset loader for Indonesian News Dataset

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_news_dataset

Dataset	id_news_dataset
Description	The dataset compiles articles from seven prominent Indonesian news platforms: Tempo, CNN Indonesia, CNBC Indonesia, Okezone, Suara, Kumparan, and JawaPos. Each source contributes a diverse range of articles, together forming a broad collection of Indonesian news content. The dataset includes two special columns: 'embedding', which holds text embeddings extracted with the OpenAI text-embedding-ada-002 model, and 'summary', which holds a concise article summary generated via the ChatGPT API.
License	CC-BY-NC-4.0
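
As a starting point for the loader, here is a minimal sketch of how one row could be represented and validated, using only the Python standard library. The 'embedding' and 'summary' columns come from the description above; the `text` column name, the `NewsRecord` class, and the `parse_record` helper are hypothetical and should be adjusted to the actual file schema. The check on 1536 dimensions assumes the embeddings are stored as plain lists of numbers, which is the vector size text-embedding-ada-002 produces.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# text-embedding-ada-002 returns 1536-dimensional vectors.
ADA_002_DIM = 1536


@dataclass
class NewsRecord:
    text: str               # hypothetical column name for the article body
    summary: str            # 'summary' column named in the description
    embedding: List[float]  # 'embedding' column named in the description


def parse_record(row: Dict[str, Any]) -> NewsRecord:
    """Convert one raw row (a dict) into a NewsRecord, checking the embedding size."""
    embedding = [float(x) for x in row["embedding"]]
    if len(embedding) != ADA_002_DIM:
        raise ValueError(
            f"expected {ADA_002_DIM} dimensions, got {len(embedding)}"
        )
    return NewsRecord(
        text=str(row["text"]),
        summary=str(row["summary"]),
        embedding=embedding,
    )
```

A loader's example generator could call `parse_record` on each row before yielding it, so malformed embeddings fail early instead of surfacing downstream.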

IndoNLP / nusa-crowd

Create dataset loader for Indonesian News Dataset #366