SamuelCahyawijaya commented 8 months ago

Dataloader name: malaysia_ai_hansard/malaysia_ai_hansard.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?malaysia_ai_hansard

Dataset	malaysia_ai_hansard
Description	The Malaysia AI Hansard Scrape dataset contains 142,766 PDFs from the Malaysian Parliament website (https://www.parlimen.gov.my/hansard-dewan-rakyat.html?uweb=dr). Each document has a JSON record containing the extracted text ("original"), the page numbers ("no_page" and "actual_no_page"), the document's "date", and the "url" of the original PDF.
Subsets	-
Languages	zlm
Tasks	Language Modeling
License	Apache License 2.0 (apache-2.0)
Homepage	https://huggingface.co/datasets/mesolitica/crawl-malaysian-hansard/resolve/main/hansard.jsonl
HF URL	https://huggingface.co/datasets/mesolitica/crawl-malaysian-hansard/resolve/main/hansard.jsonl
Paper URL	-
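
For whoever picks up the dataloader, the record layout described above could be read roughly as follows. This is only a sketch: it assumes `hansard.jsonl` holds one JSON object per line and that the keys are exactly the five named in the description. The helper name `iter_hansard_records` and the sample values are hypothetical, and the field types have not been checked against the real file.

```python
import json
from typing import Dict, Iterable, Iterator

# Field names taken from the dataset description; their types are not verified.
EXPECTED_FIELDS = ("original", "no_page", "actual_no_page", "date", "url")


def iter_hansard_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per non-empty JSONL line, keeping only the expected fields.

    Hypothetical helper: assumes one JSON object per line.
    Fields missing from a record are returned as None.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {field: record.get(field) for field in EXPECTED_FIELDS}


if __name__ == "__main__":
    # Made-up sample line for illustration only, not real dataset content.
    sample = [
        '{"original": "teks", "no_page": 1, "actual_no_page": 1, '
        '"date": "2020-01-01", "url": "https://example.com/a.pdf"}'
    ]
    for rec in iter_hansard_records(sample):
        print(rec["date"], rec["url"])
```

With a locally downloaded copy, the same function can be fed an open file handle, e.g. `iter_hansard_records(open("hansard.jsonl", encoding="utf-8"))`, so the whole file never has to sit in memory at once.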

ilhamfp commented 8 months ago

self-assign

github-actions[bot] commented 7 months ago

Hi @ilhamfp, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

ilhamfp commented 7 months ago

It turns out life gets in the way :) Unassigning myself.

BinWang28 commented 7 months ago

SEACrowd / seacrowd-datahub