SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for ViSoBERT #307

Closed SamuelCahyawijaya closed 6 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: visobert/visobert.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?visobert

Dataset: visobert
Description: ViSoBERT is textual data crawled from the three most well-known Vietnamese public social networks (Facebook, TikTok, and YouTube) via the research APIs of these platforms. The dataset contains Facebook posts, TikTok comments, and YouTube comments of Vietnamese-verified users, from Jan 2016 (Jan 2020 for TikTok) to Dec 2022. A post-processing mechanism is applied to handle hashtags, emojis, misspellings, hyperlinks, and other noncanonical text.
Subsets: -
Languages: vie
Tasks: Language Modeling
License: Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0)
Homepage: https://drive.google.com/drive/folders/1C144LOlkbH78m0-JoMckpRXubV7XT7Kb
HF URL: https://huggingface.co/uitnlp/visobert
Paper URL: https://aclanthology.org/2023.emnlp-main.315.pdf
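
For anyone picking this up, here is a rough sketch of the example-generation step for a language-modeling loader. It is not the final visobert.py. It assumes the raw data is plain text with one record per line, which this issue does not confirm, and the function name `generate_examples` and the `id`/`text` field names are illustrative placeholders rather than the required schema.

```python
from typing import Dict, Iterable, Iterator, Tuple


def generate_examples(lines: Iterable[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs, one per non-empty line of text.

    Assumption: each input line is one social-media post or comment.
    The "id" and "text" field names are hypothetical placeholders.
    """
    key = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines so keys stay contiguous
        yield key, {"id": str(key), "text": text}
        key += 1


if __name__ == "__main__":
    sample = ["xin chào\n", "\n", "  cảm ơn  "]
    for key, example in generate_examples(sample):
        print(key, example)
```

In an actual loader this logic would sit inside the builder's example-generation method and read from the downloaded files instead of an in-memory list.
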
revaldianggara commented 8 months ago

self-assign

github-actions[bot] commented 8 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

raileymontalan commented 7 months ago

self-assign