# xMIND

[![CC BY-NC-SA 4.0][cc-by-nc-sa-shield]][cc-by-nc-sa]

[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
[cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg

## Description

xMIND is a large-scale multilingual news dataset for multi- and cross-lingual news recommendation. It is derived from the English [MIND](https://msnews.github.io/) dataset using open-source neural machine translation (i.e., NLLB 3.3B). xMIND contains 130K news articles translated into 14 linguistically and geographically diverse languages with digital footprints of varying sizes. It is intended to serve as a benchmark dataset for news recommendation and to foster broader research into multilingual and cross-lingual news recommendation for speakers of both high- and low-resource languages.

The table below summarizes information about each language included in xMIND, according to the following criteria:

- Code: the three-letter ISO 639-3 code of the language;
- Language: the language name from WALS;
- Script: the English name of the script;
- Macro-area, Family, and Genus: the macro-area, language family, and genus from WALS and Glottolog;
- Res.: the classification of the language as low-resource or high-resource.

| Code | Language | Script | Macro-area | Family | Genus | Res. |
|------|----------|--------|------------|--------|-------|------|
| SWH | Swahili | Latin | Africa | Niger-Congo | Bantu | high |
| SOM | Somali | Latin | Africa | Afro-Asiatic | Lowland East Cushitic | low |
| CMN | Mandarin Chinese | Han | Eurasia | Sino-Tibetan | Sinitic | high |
| JPN | Japanese | Japanese | Eurasia | Japonic | Japanesic | high |
| TUR | Turkish | Latin | Eurasia | Altaic | Turkic | high |
| TAM | Tamil | Tamil | Eurasia | Dravidian | Dravidian | low |
| VIE | Vietnamese | Latin | Eurasia | Austro-Asiatic | Vietic | high |
| THA | Thai | Thai | Eurasia | Tai-Kadai | Kam-Tai | high |
| RON | Romanian | Latin | Eurasia | Indo-European | Romance | high |
| FIN | Finnish | Latin | Eurasia | Uralic | Finnic | high |
| KAT | Georgian | Georgian | Eurasia | Kartvelic | Georgian-Zan | low |
| HAT | Haitian Creole | Latin | North-America | Indo-European | Creoles and Pidgins | low |
| IND | Indonesian | Latin | Papunesia | Austronesian | Malayo-Sumbawan | high |
| GRN | Guarani | Latin | South-America | Tupian | Maweti-Guarani | low |

## Download

The xMIND dataset is free to download for research purposes.

We release xMIND in two versions, corresponding to the original splits of MIND: xMINDsmall (training and validation sets) and xMINDlarge (training, validation, and test sets).

The zip-compressed TSV file containing the translated news, for each language and each split, can be downloaded from xMIND.

### Automatic download

The download script automatically downloads the dataset for the chosen languages, dataset sizes, and dataset splits. By default, the script downloads the zipped dataset, extracts the TSV news file, and deletes the zip file.

The following commands can be used to choose which dataset version to download:

Download xMIND for all languages, all dataset sizes, all dataset splits (default setting):
```
    python download.py
```
Download only one or more languages:
```
    python download.py --languages {language_1} {language_2}
```
Use the ISO 639-3 code of the language from the table above to choose a specific language.

Download only one or more dataset sizes:
```
    python download.py --sizes {dataset_size_1} {dataset_size_2}
```
Supported dataset sizes: large or small.

Download only one or more dataset splits:

    python download.py --splits {dataset_split_1} {dataset_split_2} {dataset_split_3}

Supported dataset splits: train, dev, or test.

Download without extracting the zipped file:

    python download.py --extract_archive

Download without deleting the zipped file:
```
    python download.py --clean_archive 
```
By default, the downloaded dataset is stored in a newly created directory called xMIND. Change the destination directory as follows:
```
    python download.py --dst_dir 'my_folder' 
```
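
The options above can also be combined programmatically. The following is a minimal sketch, not part of the repository: `build_download_command` is a hypothetical helper that only assembles the `--languages`, `--sizes`, `--splits`, and `--dst_dir` options documented above. It assumes `download.py` is in the current working directory and that language codes are passed as written in the table; the exact letter case the script expects is an assumption.

```python
import sys


def build_download_command(languages=None, sizes=None, splits=None, dst_dir=None):
    """Assemble the argument list for download.py (hypothetical helper).

    Only options documented in this README are used; any option left as
    None is omitted, so the script falls back to its own default.
    """
    cmd = [sys.executable, "download.py"]
    if languages:
        cmd += ["--languages", *languages]
    if sizes:
        cmd += ["--sizes", *sizes]
    if splits:
        cmd += ["--splits", *splits]
    if dst_dir:
        cmd += ["--dst_dir", dst_dir]
    return cmd


# Example: Romanian and Finnish, small version, train and dev splits.
cmd = build_download_command(
    languages=["RON", "FIN"],
    sizes=["small"],
    splits=["train", "dev"],
    dst_dir="my_folder",
)
print(" ".join(cmd[1:]))
```

To actually start the download, the resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.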

## Data Format

Each news.tsv file contains the translated news. It has 3 tab-separated columns:

- nid: the news ID of the article, identical to its news ID in the MIND dataset.
- title: the title of the news, translated into the target language.
- abstract: the abstract of the news (when provided in the original MIND dataset), translated into the target language.

An example for Romanian (RON) is shown below:

| nid | title | abstract |
|-----|-------|----------|
| N49265 | Aceste reţete cu sos de afine sunt perfecte pentru cina de Ziua Recunoştinţei. | Nu vei mai vrea niciodată versiunea cumpărată din magazin. |
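
A file in this format can be read with the Python standard library alone. The sketch below is illustrative: `read_xmind_news` is a hypothetical helper, and whether the released files start with a `nid`/`title`/`abstract` header row is an assumption, so the code skips such a row only if it is present. Quoting is disabled because news text may contain literal quotation marks.

```python
import csv
import io


def read_xmind_news(handle):
    """Parse a tab-separated xMIND news file into {nid: (title, abstract)}."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    news = {}
    for row in reader:
        # Skip blank lines and a header row, if the file has one (assumption).
        if not row or row[0] == "nid":
            continue
        title = row[1] if len(row) > 1 else ""
        abstract = row[2] if len(row) > 2 else ""  # abstracts can be missing
        news[row[0]] = (title, abstract)
    return news


# In-memory stand-in for a news.tsv file, using the Romanian example above.
sample = io.StringIO(
    "nid\ttitle\tabstract\n"
    "N49265\tAceste reţete cu sos de afine sunt perfecte pentru cina de "
    "Ziua Recunoştinţei.\tNu vei mai vrea niciodată versiunea cumpărată "
    "din magazin.\n"
)
news = read_xmind_news(sample)
print(news["N49265"][0])
```

For a file on disk, pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object, where `path` points to the extracted news.tsv.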

## Integration with MIND

The news in xMIND can be easily combined with the corresponding source news in English from the MIND dataset based on the unique news IDs. This should help researchers use xMIND in conjunction with the additional news annotations (e.g., categories, subcategories, named entities) and user behavior information provided in MIND.
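
As a rough illustration of such a join on news IDs, the sketch below attaches translated titles and abstracts to MIND news records. It is not the loader provided with the dataset: the helper names are hypothetical, the MIND column order follows the public description of MIND's news.tsv (which has no header row), and the English row is a made-up placeholder rather than real MIND content.

```python
import csv
import io

# Column order of MIND's news.tsv, per the MIND documentation (no header row).
MIND_COLUMNS = [
    "nid", "category", "subcategory", "title", "abstract",
    "url", "title_entities", "abstract_entities",
]


def read_mind_news(handle):
    """Parse MIND news.tsv content into {nid: record dict}."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: dict(zip(MIND_COLUMNS, row)) for row in reader if row}


def attach_translations(mind_news, translations, lang):
    """Add title_<lang> and abstract_<lang> to records with a matching news ID."""
    merged = {}
    for nid, record in mind_news.items():
        if nid in translations:
            title, abstract = translations[nid]
            merged[nid] = {
                **record,
                f"title_{lang}": title,
                f"abstract_{lang}": abstract,
            }
    return merged


# Placeholder English row (not real MIND data) and one translated entry.
mind = read_mind_news(io.StringIO(
    "N49265\tplaceholder_category\tplaceholder_subcategory\t"
    "Placeholder English title\tPlaceholder English abstract\t"
    "https://example.com\t[]\t[]\n"
))
translations = {"N49265": ("Titlu tradus", "Rezumat tradus")}
merged = attach_translations(mind, translations, "ron")
print(merged["N49265"]["category"], "|", merged["N49265"]["title_ron"])
```

Each merged record keeps the English fields and MIND annotations next to the translated text, so category or entity information can be used with any target language.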

To facilitate a seamless integration of xMIND with the MIND data, we provide scripts for loading the dataset and constructing bilingual user consumption patterns in the NewsRecLib library.
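
To give a feel for what a bilingual consumption pattern might look like, the sketch below marks a fraction of a user's click history to be served in a target language and the rest in English. This is only one plausible reading and not the NewsRecLib implementation: the function name, the `target_ratio` parameter, and the random selection strategy are all assumptions made for illustration.

```python
import random


def make_bilingual_history(history, target_lang, target_ratio, seed=0):
    """Pair each clicked news ID with the language it would be shown in.

    A share of roughly `target_ratio` of the history is assigned to
    `target_lang`; the remaining items stay in English ("en").
    """
    if not 0.0 <= target_ratio <= 1.0:
        raise ValueError("target_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    n_target = round(len(history) * target_ratio)
    target_positions = set(rng.sample(range(len(history)), n_target))
    return [
        (nid, target_lang if i in target_positions else "en")
        for i, nid in enumerate(history)
    ]


# Hypothetical click history; the news IDs are placeholders.
history = ["N1", "N2", "N3", "N4"]
for nid, lang in make_bilingual_history(history, "ron", target_ratio=0.5):
    print(nid, lang)
```

The language tag can then be used to pick either the MIND text or the xMIND translation for each news ID when building model inputs.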

## License

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].

[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

If you intend to use, adapt, or share xMIND, particularly together with additional news and click behavior information from the original MIND dataset, please read and reference the Microsoft Research License Terms of MIND.

## Citation

If you use xMIND, please cite the following publication:

@misc{iana2024mind,
      title={MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation}, 
      author={Andreea Iana and Goran Glavaš and Heiko Paulheim},
      year={2024},
      eprint={2403.17876},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Also consider citing the following:

@inproceedings{wu2020mind,
  title={Mind: A large-scale dataset for news recommendation},
  author={Wu, Fangzhao and Qiao, Ying and Chen, Jiun-Hung and Wu, Chuhan and Qi, Tao and Lian, Jianxun and Liu, Danyang and Xie, Xing and Gao, Jianfeng and Wu, Winnie and others},
  booktitle={Proceedings of the 58th annual meeting of the association for computational linguistics},
  pages={3597--3606},
  year={2020}
}
