andreeaiana / xMIND

A Multilingual Dataset For Cross-lingual News Recommendation
# xMIND [![CC BY-NC-SA 4.0][cc-by-nc-sa-shield]][cc-by-nc-sa]

[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
[cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg

## Description

xMIND is a large-scale multilingual news dataset for multi- and cross-lingual news recommendation. It is derived from the English MIND (https://msnews.github.io/) dataset using open-source neural machine translation (i.e., NLLB 3.3B). xMIND contains 130K news articles translated into 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. The goal of xMIND is to serve as a benchmark dataset for news recommendation and to foster broader research into multilingual and cross-lingual news recommendation for speakers of both high- and low-resource languages.

The table below summarizes each language included in xMIND by language code, script, macro-area, language family, genus, and resource level (Res.):

| Code | Language | Script | Macro-area | Family | Genus | Res. |
|------|----------|--------|------------|--------|-------|------|
| SWH | Swahili | Latin | Africa | Niger-Congo | Bantu | high |
| SOM | Somali | Latin | Africa | Afro-Asiatic | Lowland East Cushitic | low |
| CMN | Mandarin Chinese | Han | Eurasia | Sino-Tibetan | Sinitic | high |
| JPN | Japanese | Japanese | Eurasia | Japonic | Japanesic | high |
| TUR | Turkish | Latin | Eurasia | Altaic | Turkic | high |
| TAM | Tamil | Tamil | Eurasia | Dravidian | Dravidian | low |
| VIE | Vietnamese | Latin | Eurasia | Austro-Asiatic | Vietic | high |
| THA | Thai | Thai | Eurasia | Tai-Kadai | Kam-Tai | high |
| RON | Romanian | Latin | Eurasia | Indo-European | Romance | high |
| FIN | Finnish | Latin | Eurasia | Uralic | Finnic | high |
| KAT | Georgian | Georgian | Eurasia | Kartvelic | Georgian-Zan | low |
| HAT | Haitian Creole | Latin | North-America | Indo-European | Creoles and Pidgins | low |
| IND | Indonesian | Latin | Papunesia | Austronesian | Malayo-Sumbawan | high |
| GRN | Guarani | Latin | South-America | Tupian | Maweti-Guarani | low |

## Download

The xMIND dataset is free to download for research purposes.

We release xMIND in two versions, corresponding to the original splits of MIND: xMINDsmall (training and validation sets) and xMINDlarge (training, validation, and test sets).

The zip-compressed TSV file containing the translated news, for each language and each split, can be downloaded from xMIND.

### Automatically download

The download script automatically downloads the dataset for the chosen language, dataset size, and dataset split. By default, the script downloads the zipped dataset, extracts the TSV news file, and deletes the zip file.
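For readers who want to see what the default behavior amounts to, here is a minimal sketch of the download, extract, and delete steps using only the Python standard library. It is not the repository's download script: the function names, the `keep_zip` parameter, the temporary file name `download.zip`, and the `url` argument are all hypothetical, and the actual archive locations must be taken from the xMIND release.

```python
import urllib.request
import zipfile
from pathlib import Path


def extract_news_tsv(zip_path, out_dir, keep_zip=False):
    """Extract every .tsv member of a zip archive, then delete the archive.

    Hypothetical helper: mirrors the described default behavior
    (extract the TSV news file, remove the zip), not the actual script.
    """
    zip_path, out_dir = Path(zip_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the TSV members; anything else in the archive is skipped.
        tsv_members = [name for name in archive.namelist() if name.endswith(".tsv")]
        archive.extractall(out_dir, members=tsv_members)
    if not keep_zip:
        zip_path.unlink()
    return [out_dir / name for name in tsv_members]


def download_and_extract(url, out_dir):
    """Download a zip from `url` (assumed to be supplied by the caller) and unpack it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / "download.zip"  # hypothetical temporary file name
    urllib.request.urlretrieve(url, zip_path)
    return extract_news_tsv(zip_path, out_dir)
```

Splitting the extraction into its own function makes it possible to reuse it on an archive that was downloaded manually.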

The following commands can be used to choose which dataset version to download:

## Data Format

Each news.tsv file contains the translated news and has 3 tab-separated columns: the news ID (nid), the title, and the abstract.

An example for Romanian (RON) is shown below:

| nid | title | abstract |
|-----|-------|----------|
| N49265 | Aceste reţete cu sos de afine sunt perfecte pentru cina de Ziua Recunoştinţei. | Nu vei mai vrea niciodată versiunea cumpărată din magazin. |
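A file in this format can be read with Python's standard `csv` module. The sketch below is illustrative only: the function name `read_news` is hypothetical, and whether the released files start with a header row is an assumption exposed through the `has_header` parameter, so check the first line of your file before relying on the default.

```python
import csv


def read_news(path, has_header=False):
    """Read a three-column news.tsv into a dict keyed by news ID.

    Assumption: each row has exactly three tab-separated fields
    (nid, title, abstract). A row with a different number of fields
    raises ValueError when it is unpacked.
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside titles and abstracts as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row if the file has one
        return {
            nid: {"title": title, "abstract": abstract}
            for nid, title, abstract in reader
        }
```

Opening the file explicitly as UTF-8 matters here, because most of the 14 languages use characters outside ASCII.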

## Integration with MIND

The news in xMIND can be easily combined with the corresponding source news in English from the MIND dataset based on the unique news IDs. This should help researchers use xMIND in conjunction with the additional news annotations (e.g., categories, subcategories, named entities) and user behavior information provided in MIND.

To facilitate a seamless integration of xMIND with the MIND data, we provide scripts for loading the dataset and constructing bilingual user consumption patterns in the NewsRecLib library.
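Independently of NewsRecLib, the join on news IDs can be sketched in a few lines of plain Python. This is not the NewsRecLib implementation. The MIND column order below follows MIND's documented news.tsv layout (no header row); the xMIND layout is assumed to be the three columns described under Data Format, also without a header row. The names `read_tsv`, `merge_on_nid`, and the `title_<lang>` / `abstract_<lang>` keys are hypothetical.

```python
import csv

# Column order of MIND's news.tsv as documented by MIND (the file has no header row).
MIND_COLUMNS = [
    "nid", "category", "subcategory", "title", "abstract",
    "url", "title_entities", "abstract_entities",
]
# Assumed xMIND layout: three columns, no header row.
XMIND_COLUMNS = ["nid", "title", "abstract"]


def read_tsv(path, columns):
    """Read a headerless TSV file into a list of dicts with the given column names."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [dict(zip(columns, row)) for row in reader]


def merge_on_nid(mind_rows, xmind_rows, lang):
    """Attach translated title and abstract to each MIND row with a matching news ID."""
    translated = {row["nid"]: row for row in xmind_rows}
    merged = []
    for row in mind_rows:
        target = translated.get(row["nid"])
        if target is None:
            continue  # no translation for this news ID; skip it
        merged.append({
            **row,  # English text plus categories, entities, and URL from MIND
            f"title_{lang}": target["title"],
            f"abstract_{lang}": target["abstract"],
        })
    return merged
```

A call such as `merge_on_nid(read_tsv(mind_path, MIND_COLUMNS), read_tsv(xmind_path, XMIND_COLUMNS), "ron")` would then yield rows that carry both the English and the Romanian text, where `mind_path` and `xmind_path` are placeholders for local file paths.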

## License

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].

[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

If you intend to use, adapt, or share xMIND, particularly together with additional news and click behavior information from the original MIND dataset, please read and reference the Microsoft Research License Terms of MIND.

## Citation

If you use xMIND, please cite the following publication:

@misc{iana2024mind,
      title={MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation}, 
      author={Andreea Iana and Goran Glavaš and Heiko Paulheim},
      year={2024},
      eprint={2403.17876},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Also consider citing the following:

@inproceedings{wu2020mind,
  title={Mind: A large-scale dataset for news recommendation},
  author={Wu, Fangzhao and Qiao, Ying and Chen, Jiun-Hung and Wu, Chuhan and Qi, Tao and Lian, Jianxun and Liu, Danyang and Xie, Xing and Gao, Jianfeng and Wu, Winnie and others},
  booktitle={Proceedings of the 58th annual meeting of the association for computational linguistics},
  pages={3597--3606},
  year={2020}
}