SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Multilanguage Open Text (MOT) #611

Open · SamuelCahyawijaya opened this issue 3 months ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: mot/mot.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mot

Dataset mot
Description The Multilanguage Open Text (MOT) corpus comprises 2.8 million news articles in 44 languages from Voice of America (VOA) news websites, with each language provided as a separate .tgz file. Content covers different media types, such as articles and videos, and preserves the paragraph and sentence structure of the original HTML. Each file records details such as filename, URL, content type, site language, timestamps, title, authors, and the text split into paragraphs. Additional fields include language detection and analysis, keywords, sections, and sentence and token segmentation for select languages. Some languages also include links to the corresponding English articles for translated content. A sketch of reading these archives is given after the card below.

From the paper: "Each file contains the following fields:
• filename: the name of the file, derived from the URL
• url: the URL from which the document was retrieved
• url_origin: the sitemap from which the URL was retrieved
• content_type: the type of content (e.g., article, audio, photo, video) of the document
• site_language: the language of the VOA site
• time_published: the timestamp for when the document was published
• time_modified: the timestamp for when the document was last modified
• time_retrieved: the timestamp for when the document was retrieved from the sitemap
• title: the title of the document
• authors: the author(s) of the document
• paragraphs: the text extracted from the document
• n_paragraphs: the number of paragraphs in the document
• n_chars: the number of characters in the document
• cld3_detected_languages: the language(s) identified by CLD3 from the full extracted text of the document (see Section 4.3)
  – language: the language outputted by CLD3
  – probability: the probability that the language identified is correct (passed directly from CLD3)
  – is_reliable: if probability is above 0.7 (passed directly from CLD3)
  – proportion: the proportion of the text identified as the language (passed directly from CLD3)
• predicted_language: the language that we predict the document is in, based on rules that take into account the site, the CLD3 predictions, and whether the site language is supported by CLD3
• keywords: the terms relating to the text content of the document
• section: the subpage the document falls under

These additional fields are included only for a subset of languages:
• sentences: the text extracted from the document, segmented into sentences
• n_sentences: the number of sentences in the document
• tokens: the text extracted from the document, segmented into tokens
• n_tokens: the number of tokens in the document
• parallel_english_article: the URL of the English document from which the current document was translated into the site language (this currently only appears in Lao articles)"
Subsets -
Languages lao, tha, vie, ind, khm, mya
Tasks Language Modeling
License MIT (mit)
Homepage https://github.com/bltlab/mot/releases/tag/v1.0
HF URL -
Paper URL -
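
A minimal sketch (untested) of iterating over the documents in one of these archives, assuming each per-language .tgz packs one JSON file per document as the card describes; the archive filename below is illustrative, not the actual release filename:

```python
import json
import tarfile
from typing import Iterator


def iter_mot_documents(tgz_path: str) -> Iterator[dict]:
    """Yield one dict per document JSON file inside a MOT .tgz archive."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and any non-JSON members.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            fileobj = archive.extractfile(member)
            if fileobj is not None:
                yield json.load(fileobj)


# Example usage (hypothetical filename): inspect the documented fields.
for doc in iter_mot_documents("mot-khm-v1.0.tgz"):
    print(doc["filename"], doc["content_type"], doc["n_paragraphs"])
```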
luckysusanto commented 2 months ago

self-assign

luckysusanto commented 2 months ago

Question: I checked the datasets and the contents are mostly metadata.

An example:

{
  "filename": "a_1696081",
  "url": "https://www.voacambodia.com/a/1696081.html",
  "url_origin": "https://www.voacambodia.com/sitemap_423_1.xml.gz",
  "content_type": "article",
  "site_language": "khm",
  "time_published": "2013-07-05T00:00:00",
  "time_modified": "2013-07-05T19:20:49",
  "time_retrieved": "2021-06-24T11:21:22.649000",
  "title": "Critics Say Hun Sen’s Land Title Program Is Biased",
  "authors": ["Khoun Theara"],
  "paragraphs": ["PHNOM PENH —", "PHNOM PENH —"],
  "n_paragraphs": 2,
  "n_chars": 24,
  "cld3_detected_languages": {
    "hin": {
      "cld3_language": "hi-Latn",
      "probability": 0.8355782628059387,
      "is_reliable": true,
      "proportion": 1.0
    }
  },
  "predicted_language": "khm",
  "sentences": [["PHNOM PENH —"], ["PHNOM PENH —"]],
  "tokens": [[["PHNOM", "PENH", "—"]], [["PHNOM", "PENH", "—"]]],
  "n_tokens": 6,
  "n_sentences": 2,
  "keywords": ["Cambodia", "Human Rights"],
  "section": "cambodia"
}

The "Sentences" seems to be only the header of the article. I would have to extract the text from the article myself.

Moreover, the datasets provide metadata for multiple modalities per language:

Article, Audio, Image, Video

And the SEACrowd SSP schema cannot handle this, unless I am missing something.

Requesting input on what I should do: should I just create a dataloader for the metadata instead?
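
For concreteness, a rough sketch (untested) of the text-only mapping a metadata dataloader could perform, assuming the SSP schema carries just an "id" and a "text" field; the MOT field names come from the sample record above, and non-article or empty documents are skipped:

```python
from typing import Optional


def mot_to_ssp(doc: dict) -> Optional[dict]:
    """Map a MOT record to a text-only SSP-style example, or return None
    for non-article content and documents without extracted text."""
    if doc.get("content_type") != "article":
        return None
    paragraphs = doc.get("paragraphs") or []
    if not paragraphs:
        return None
    return {
        "id": doc["filename"],
        "text": "\n".join(paragraphs),
    }
```

Audio, image, and video entries would be dropped under this mapping, which is exactly the trade-off I am asking about.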