The ViSoBERT dataset consists of textual data crawled from the three most well-known Vietnamese public social networks (Facebook, TikTok, and YouTube) through the research APIs of these platforms. The dataset contains Facebook posts, TikTok comments, and YouTube comments from verified Vietnamese users, covering Jan 2016 (Jan 2020 for TikTok) to Dec 2022. A post-processing mechanism is applied to handle hashtags, emojis, misspellings, hyperlinks, and other noncanonical text.
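The description above does not specify how the post-processing works, so the following is only a minimal sketch of what such a step could look like, not the authors' actual pipeline. The function name `normalize_post`, the `<url>` placeholder token, and the specific choices (replace hyperlinks with a placeholder, strip the `#` from hashtags, keep emojis unchanged, collapse whitespace) are all assumptions made for illustration. Misspelling correction is left out because it would need a dictionary or model.

```python
import re

# Hypothetical patterns; the real ViSoBERT post-processing rules are not given here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_post(text: str, url_token: str = "<url>") -> str:
    """Normalize one social media post (illustrative assumption only)."""
    # Replace hyperlinks with a single placeholder token.
    text = URL_RE.sub(url_token, text)
    # Keep the hashtag word but drop the leading '#'.
    text = HASHTAG_RE.sub(r"\1", text)
    # Collapse runs of whitespace; emojis are left untouched.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_post("Xem  ngay https://example.com/a #ViSoBERT 😀"))
```

Keeping emojis and replacing links with a placeholder is one possible design choice for social media text; a different pipeline might drop links entirely or map emojis to text descriptions.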
Subsets
-
Languages
vie
Tasks
Language Modeling
License
Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0)
Dataloader name:
visobert/visobert.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?visobert