Create dataset loader for CCMatrix

Dataset	ccmatrix
Description	The CCMatrix dataset was mined from web crawls and released by Meta. It was built with margin-based bitext mining, a method that can be applied to monolingual corpora of billions of sentences to produce high-quality aligned translation data.
Subsets	-
Languages	jav, eng, vie, ind, tgl, mya, zlm
Tasks	Language Modeling, Machine Translation
License	BSD license family (bsd)
Homepage	https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix
HF URL	https://huggingface.co/datasets/yhavinga/ccmatrix
Paper URL	https://aclanthology.org/2021.acl-long.507/
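
For context on the "margin-based bitext mining" named in the description, here is a minimal sketch of one common formulation, the ratio margin: the cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest neighbours in the other language. The function names, the choice of the ratio variant, the value of k, and the toy embeddings are all assumptions for illustration; this is not the pipeline used to build the dataset, and it is not part of the requested loader.

```python
import numpy as np


def l2_normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def ratio_margin_scores(src, tgt, k=2):
    """Score every (source, target) pair with a ratio margin (hypothetical helper).

    score(x, y) = cos(x, y) / ((mean cos of x to its k nearest targets
                                + mean cos of y to its k nearest sources) / 2)
    """
    src = l2_normalize(np.asarray(src, dtype=float))
    tgt = l2_normalize(np.asarray(tgt, dtype=float))
    sim = src @ tgt.T  # shape: (n_src, n_tgt)
    k = min(k, sim.shape[0], sim.shape[1])
    # Mean similarity of each source sentence to its k most similar targets.
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target sentence to its k most similar sources.
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    denom = (src_knn[:, None] + tgt_knn[None, :]) / 2.0
    return sim / denom


# Toy, made-up sentence embeddings: row i of src is meant to align with row i of tgt.
src_emb = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
tgt_emb = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 1.0]]

scores = ratio_margin_scores(src_emb, tgt_emb, k=2)
best_target = scores.argmax(axis=1)  # best-scoring target index per source row
```

A pair is typically kept only when its score clears a threshold; the threshold is a tuning choice and is left out of this sketch.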

SEACrowd / seacrowd-datahub