SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Malaysia-AI Hansard Scrape #423

Status: Closed (SamuelCahyawijaya closed 5 months ago)

SamuelCahyawijaya commented 8 months ago

Dataloader name: malaysia_ai_hansard/malaysia_ai_hansard.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?malaysia_ai_hansard

Dataset: malaysia_ai_hansard
Description: The Malaysia AI Hansard Scrape dataset contains 142,766 PDFs from the Malaysian Parliament website (https://www.parlimen.gov.my/hansard-dewan-rakyat.html?uweb=dr). It includes a JSON file for each document with the text (labeled "original"), the page numbers ("no_page" and "actual_no_page"), the document's "date", and the "url" of the original PDF.
Subsets: -
Languages: zlm
Tasks: Language Modeling
License: Apache License 2.0 (apache-2.0)
Homepage: https://huggingface.co/datasets/mesolitica/crawl-malaysian-hansard/resolve/main/hansard.jsonl
HF URL: https://huggingface.co/datasets/mesolitica/crawl-malaysian-hansard/resolve/main/hansard.jsonl
Paper URL: -
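
For whoever picks this up, here is a minimal sketch of reading the records described above, assuming the data arrives as JSON Lines (one JSON object per line, as the `.jsonl` extension suggests). The field names come from the description; the helper name `iter_records`, the `None` default for missing fields, and the sample rows are hypothetical and not part of the dataset or the SEACrowd codebase.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

# Field names taken from the dataset description in this issue.
# Their types are not documented here, so values are passed through as-is.
FIELDS = ("original", "no_page", "actual_no_page", "date", "url")


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[int, Dict[str, object]]]:
    """Yield (index, record) pairs from JSON Lines input.

    Blank lines are skipped. Fields missing from a row are set to None
    (an assumption made for this sketch, not documented dataset behaviour).
    """
    idx = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        yield idx, {field: row.get(field) for field in FIELDS}
        idx += 1


if __name__ == "__main__":
    # Made-up sample rows for illustration only; not real Hansard data.
    sample = [
        json.dumps({"original": "teks contoh", "no_page": 1,
                    "actual_no_page": 1, "date": "2020-01-01",
                    "url": "https://example.com/a.pdf"}),
        "",
        json.dumps({"original": "teks lain", "no_page": 2}),
    ]
    for key, record in iter_records(sample):
        print(key, record["original"], record["url"])
```

In a real dataloader the same generator could be fed from an open file handle over the downloaded `hansard.jsonl`, with the (index, record) pairs mapped onto whatever schema the loader declares.
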
ilhamfp commented 8 months ago

self-assign

github-actions[bot] commented 7 months ago

Hi @ilhamfp, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

ilhamfp commented 7 months ago

It turns out life gets in the way :) Unassigning myself.

BinWang28 commented 7 months ago

self-assign