Create dataset loader for CulturaY

Dataset	culturay
Description	CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages. From the team that brought you CulturaX, we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed)/3TB (zstd-compressed) that applies the same dataset cleaning methodology to the HPLT v1.1 dataset. Please note that HPLT v1.2 has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model: Vistral-7B-Chat.
Subsets	-
Languages	mya, fil, zlm, vie, ind, tha
Tasks	Language Modeling
License	Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage	https://huggingface.co/datasets/ontocord/CulturaY
HF URL	https://huggingface.co/datasets/ontocord/CulturaY
Paper URL	https://huggingface.co/datasets/ontocord/CulturaY
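
As a starting point for the loader, here is a minimal sketch of how the six languages listed above could be wired into per-language configs. Several names are assumptions rather than facts from this issue: the two-letter subset names in `ISO3_TO_CULTURAY` (the actual config names on the Hub should be checked), the `culturay_<lang>_source` / `culturay_<lang>_seacrowd_ssp` naming pattern, and the helper names `config_names` and `stream_subset`, which are purely illustrative.

```python
# Hypothetical mapping from the ISO 639-3 codes listed in this issue to the
# two-letter subset names that ontocord/CulturaY is ASSUMED to use.
# Verify against the dataset card before relying on it.
ISO3_TO_CULTURAY = {
    "mya": "my",
    "fil": "tl",
    "zlm": "ms",
    "vie": "vi",
    "ind": "id",
    "tha": "th",
}

DATASET_NAME = "culturay"


def config_names(iso3: str) -> list[str]:
    """Return the assumed source and seacrowd-schema config names for a language."""
    if iso3 not in ISO3_TO_CULTURAY:
        raise ValueError(f"Unsupported language code: {iso3}")
    return [
        f"{DATASET_NAME}_{iso3}_source",
        f"{DATASET_NAME}_{iso3}_seacrowd_ssp",
    ]


def stream_subset(iso3: str):
    """Stream one language subset instead of downloading it in full.

    The dataset is about 3TB compressed, so streaming is used here.
    The import is local so the mapping helpers work without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset(
        "ontocord/CulturaY",
        ISO3_TO_CULTURAY[iso3],  # assumed subset name, see note above
        split="train",
        streaming=True,
    )


if __name__ == "__main__":
    for code in ISO3_TO_CULTURAY:
        print(code, config_names(code))
```

Streaming is suggested only because of the dataset size stated in the description; a real loader would still need to confirm the available subsets and field names on the dataset page.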

SEACrowd / seacrowd-datahub