Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

This repository contains the dataset for the paper "Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages", published at the COLING 2022 SIGHUM workshop.

File organization

Data: contains the dataset with train/dev/test splits
Statistics: contains statistics of the dataset
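
As a starting point for working with the splits, the sketch below groups the files under the Data directory by split name. It assumes only that each file's relative path mentions "train", "dev", or "test" somewhere; the actual file names and formats are not specified here, so the helper `group_files_by_split` is a hypothetical illustration and not part of this repository.

```python
from collections import defaultdict
from pathlib import Path

# Split names taken from the file organization above.
SPLITS = ("train", "dev", "test")


def group_files_by_split(data_dir):
    """Map each split name to the files under data_dir whose relative path mentions it.

    Assumption: split membership is visible in a file or folder name.
    Files that mention no split name are left out of the result.
    """
    root = Path(data_dir)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Only look at the path below data_dir, so parent folder names do not interfere.
        parts = [part.lower() for part in path.relative_to(root).parts]
        for split in SPLITS:
            if any(split in part for part in parts):
                groups[split].append(path)
                break  # assign each file to the first split name that matches
    return dict(groups)
```

Calling `group_files_by_split("Data")` from the repository root returns a dictionary keyed by split name. The substring match is deliberately loose, so check the result against the real layout before relying on it.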

Dataset

You can find the download link and a fuller description of the dataset HERE

Citation

If you use our dataset, we'd appreciate it if you cited our paper:

@inproceedings{sandhan-etal-2022-prabhupadavani,
    title = "Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages",
    author = "Sandhan, Jivnesh  and Daksh, Ayush  and Paranjay, Om Adideva  and Behera, Laxmidhar  and Goyal, Pawan",
    booktitle = "Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Conference on Computational Linguistics",
    url = "https://aclanthology.org/2022.latechclfl-1.4",
    pages = "24--29",
    abstract = "Nowadays, the interest in code-mixing has become ubiquitous in Natural Language Processing (NLP); however, not much attention has been given to address this phenomenon for Speech Translation (ST) task. This can be solely attributed to the lack of code-mixed ST task labelled data. Thus, we introduce Prabhupadavani, which is a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, containing 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. The Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching in the case of quotation from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadvani is the first multi-lingual code-mixed ST dataset available in the ST literature. This data also can be used for a code-mixed machine translation task. All the dataset can be accessed at: https://github.com/frozentoad9/CMST.",
}

Acknowledgements

We would like to thank the Vanipedia team of 700+ translators for establishing this multi-lingual database for us to develop. We thank the Bhaktivedanta Book Trust International for permitting us to use Prabhupadavani audio in our dataset.

Copyright

Audio of 1080 audio clips provided by Vanipedia courtesy Bhaktivedanta Book Trust International, Inc. www.Krishna.com. Used with permission.

License

This project is licensed under the terms of the Apache License 2.0.
