gentaiscool / code-switching-papers

A curated list of research papers and resources on code-switching
Apache License 2.0
289 stars 36 forks

open source speech dataset for code switching #6

Closed ashu5644 closed 3 years ago

ashu5644 commented 3 years ago

Hi, I am unable to find any open-source speech corpus related to the papers mentioned. Can you list some good open-source speech datasets for code-switching?

gentaiscool commented 3 years ago

Hi @ragnarlbrok. I will add information about open-source speech corpora to this repository. Thanks for your suggestion.

ashu5644 commented 3 years ago

Hi @gentaiscool, can you provide some links to code-switching speech datasets here? Currently I am working with TTS-generated datasets, and I need real ones urgently for my experiments.

gentaiscool commented 3 years ago

Sorry, I was very busy with other projects. Have you checked the datasets from this competition? https://www.microsoft.com/en-us/research/event/workshop-on-speech-technologies-for-code-switching-2020/

There is also another shared task prepared by Microsoft this year: https://twitter.com/Ashish_Mittal1/status/1358664855616557061/photo/1

Or, if you have a subscription to LDC, I recommend checking https://catalog.ldc.upenn.edu/LDC2015S04 , but it is not free. Hopefully this helps.

ashu5644 commented 3 years ago

@gentaiscool, thanks for providing the links. I have already checked the LDC dataset, but unfortunately it is not free. For https://www.microsoft.com/en-us/research/event/workshop-on-speech-technologies-for-code-switching-2020/ I am not able to find a dataset download link. The dataset at https://twitter.com/Ashish_Mittal1/status/1358664855616557061/photo/1 asks for a password; I think I will need to register for the competition to get access. A Hindi-English dataset would be good for me, but a Chinese-English dataset is my priority. If you have any links for Chinese-English or Spanish-English datasets, please share them; that would be great.

gentaiscool commented 3 years ago

Hi @ragnarlbrok, I don't currently have any datasets that I can share with you. I suggest trying the LDC dataset; it has ~200 hours of data. If you are studying at a university, it probably has access to LDC, so you should check with the admin.

ashu5644 commented 3 years ago

Ok, thank you, @gentaiscool, for the information.