Open SamuelCahyawijaya opened 7 months ago
Question: I checked the datasets and the contents are mostly metadata.
An example: {"filename": "a_1696081", "url": "https://www.voacambodia.com/a/1696081.html", "url_origin": "https://www.voacambodia.com/sitemap_423_1.xml.gz", "content_type": "article", "site_language": "khm", "time_published": "2013-07-05T00:00:00", "time_modified": "2013-07-05T19:20:49", "time_retrieved": "2021-06-24T11:21:22.649000", "title": "Critics Say Hun Sen’s Land Title Program Is Biased", "authors": ["Khoun Theara"], "paragraphs": ["PHNOM PENH —", "PHNOM PENH —"], "n_paragraphs": 2, "n_chars": 24, "cld3_detected_languages": {"hin": {"cld3_language": "hi-Latn", "probability": 0.8355782628059387, "is_reliable": true, "proportion": 1.0}}, "predicted_language": "khm", "sentences": [["PHNOM PENH —"], ["PHNOM PENH —"]], "tokens": [[["PHNOM", "PENH", "—"]], [["PHNOM", "PENH", "—"]]], "n_tokens": 6, "n_sentences": 2, "keywords": ["Cambodia", "Human Rights"], "section": "cambodia"}
The "sentences" field seems to contain only the header (dateline) of the article, not the body, so I would have to extract the text from the article URL myself.
Moreover, the datasets provide metadata for multiple modalities per language:
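To check how widespread this is, a rough heuristic like the one below could flag records whose text is only a header. This is a minimal sketch: the function name `looks_like_stub` and the `min_chars` threshold are assumptions of mine, not part of the dataset or of SEACrowd, and the record is an abbreviated copy of the example above.

```python
import json

# Abbreviated copy of the record quoted above (only the fields used here).
record_json = """{"filename": "a_1696081",
 "content_type": "article",
 "paragraphs": ["PHNOM PENH —", "PHNOM PENH —"],
 "n_paragraphs": 2,
 "n_chars": 24}"""


def looks_like_stub(record, min_chars=200):
    """Hypothetical heuristic: treat a record as header-only when its
    de-duplicated paragraphs contain fewer than `min_chars` characters.
    The threshold of 200 is an arbitrary assumption."""
    unique_paragraphs = set(record.get("paragraphs", []))
    total_chars = sum(len(p) for p in unique_paragraphs)
    return total_chars < min_chars


record = json.loads(record_json)
print(looks_like_stub(record))
```

Running this over a whole language split would show what share of the articles actually carry body text.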
Article, Audio, Image, Video
The SEACrowd SSP schema cannot handle this, unless I am missing something.
Requesting input on what I should do: should I just create a dataloader for the metadata instead?
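For reference, a metadata-only loader could be sketched roughly as follows. Everything here is an assumption for discussion: the function name `generate_metadata_examples`, the selection of fields in `METADATA_FIELDS`, and the `"audio"` value for `content_type` (only `"article"` appears in the example above). It does not use any SEACrowd schema or API.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# Hypothetical subset of fields to expose; taken from the example record above.
METADATA_FIELDS = (
    "filename",
    "url",
    "content_type",
    "site_language",
    "time_published",
    "title",
)


def generate_metadata_examples(
    records: Iterable[Dict[str, Any]], content_type: str
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (key, metadata) pairs for records of one modality.

    Records of other modalities are skipped, so each modality could be
    exposed as its own subset.
    """
    for record in records:
        if record.get("content_type") != content_type:
            continue
        # Missing fields become None rather than raising.
        yield record["filename"], {f: record.get(f) for f in METADATA_FIELDS}


# Toy input; the second record and its values are made up for illustration.
sample = [
    {"filename": "a_1696081", "content_type": "article", "site_language": "khm"},
    {"filename": "hypothetical_audio_1", "content_type": "audio", "site_language": "khm"},
]
for key, meta in generate_metadata_examples(sample, "article"):
    print(key, meta["site_language"])
```

Filtering by `content_type` this way would give one subset per modality, which sidesteps the single-schema limitation at the cost of exposing only metadata.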
Dataloader name: mot/mot.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mot