SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Malaysia-AI government #421

Status: Closed (closed by SamuelCahyawijaya 5 months ago)

SamuelCahyawijaya commented 8 months ago

Dataloader name: malaysia_ai_government/malaysia_ai_government.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?malaysia_ai_government

Dataset: malaysia_ai_government
Description: This dataset contains PDFs scraped from 735 gov.my websites. It consists of thousands of unedited text records, each with a link to the URL from which it was retrieved and the name of the PDF.
Subsets: gov.my.jsonl, govdocs.jsonl, muftiwp.gov.my.jsonl, myjms.mohe.gov.my.jsonl
Languages: zlm
Tasks: Language Modeling
License: Apache License 2.0 (apache-2.0)
Homepage: https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gov.my.jsonl
HF URL: -
Paper URL: -
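
Since the subsets are JSON Lines files, a loader would presumably read them one record per line. Below is a minimal sketch of that parsing step, assuming each non-empty line is a standalone JSON object. The function name `iter_jsonl_records`, the field names `text`, `url`, and `name`, and the sample records are all hypothetical, chosen only to mirror the description above (text, source URL, PDF name). It is not the SEACrowd dataloader itself.

```python
import io
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_jsonl_records(lines: Iterable[str]) -> Iterator[Tuple[int, Dict]]:
    """Yield (index, record) pairs from an iterable of JSON Lines strings."""
    index = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        yield index, json.loads(line)
        index += 1


# Hypothetical in-memory sample standing in for one subset file.
# The keys "text", "url", and "name" are assumptions, not the real schema.
sample = io.StringIO(
    '{"text": "contoh teks", "url": "https://example.gov.my/a.pdf", "name": "a.pdf"}\n'
    "\n"
    '{"text": "teks lain", "url": "https://example.gov.my/b.pdf", "name": "b.pdf"}\n'
)

records = list(iter_jsonl_records(sample))
for idx, record in records:
    print(idx, record["name"])
```

Reading from an opened file works the same way, since a text file object is also an iterable of lines. The actual keys should be checked against the real files before mapping them to a schema.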

ilhamfp commented 8 months ago

self-assign

github-actions[bot] commented 7 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

ilhamfp commented 7 months ago

It turns out life gets in the way :) unassigning myself

BinWang28 commented 7 months ago

self-assign