Sreyan88 / MMER

Code for the InterSpeech 2023 paper: MMER: Multimodal Multi-task learning for Speech Emotion Recognition
https://arxiv.org/abs/2203.16794

Regarding data preprocessing issues #11

Closed: 736958408 closed this issue 1 month ago

736958408 commented 1 month ago

Dear Author,

This project is excellent, but I have run into some issues with data preprocessing and would like to ask for your advice. I want to train this model on other datasets, but I am unsure which models you used for back-translation and text-to-speech (TTS). I saw a Coqui-TTS example in the code for TTS; is that the model you used for text-to-speech? For back-translation, which model did you use to augment the text? If possible, could you please share the data preprocessing code? Thank you very much. I look forward to your reply.

ramaneswaran commented 1 month ago

Hi,

Thanks for your interest in the project. We don't have the back-translation script available anymore, but I can give you a high-level outline of the pipeline:

- Model used: NLLB 600M Distilled
- Source language: English
- Target language: French

The pipeline first translates the source texts into the target language. The Hugging Face implementation exposes num_return_sequences; set it to a value greater than 1 (we used 5) and use sampling parameters that promote diversity.

After that, we translate the candidates back into the source language, again with sampling parameters that promote diversity. The reason for 5 sequences is that back-translation often just reproduces the original text (the translation models keep getting better and better), so out of the 5 candidates we keep the one that shows the most diversity.
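For reference, here is a rough sketch of that pipeline with the Hugging Face transformers library; the exact sampling values and the word-overlap diversity heuristic below are illustrative assumptions, not the settings from our runs.

```python
# Back-translation sketch: English -> French -> English with NLLB-200 distilled 600M.
# Sampling values and the diversity heuristic are assumptions, not the original settings.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate(texts, src_lang, tgt_lang, num_return_sequences=5):
    """Translate a list of sentences, sampling several candidates per input."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            do_sample=True,          # sampling promotes diverse paraphrases
            top_p=0.9,
            num_return_sequences=num_return_sequences,
            max_new_tokens=128,
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def word_overlap(a, b):
    """Jaccard overlap between word sets; lower means more diverse."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

source = "I am so happy to see you again."
french = translate([source], "eng_Latn", "fra_Latn")                       # 5 French candidates
back = translate(french, "fra_Latn", "eng_Latn", num_return_sequences=1)   # 1 English candidate each
augmented = min(back, key=lambda c: word_overlap(source, c))               # keep the most diverse one
print(augmented)
```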

Regarding TTS, yes we used Coqui-TTS.

Let us know if you have any further questions.

Sreyan88 commented 1 month ago

Additionally, once you have successfully back-translated, you can use this file for TTS synthesis. I have tested this and it works as is!

736958408 commented 1 month ago

Thank you for your attention and response. I have generated augmented text using NLLB as suggested. However, we encountered an issue while running Coqui-TTS. We found that the files config.json, language.json, speakers.json, and best_model.pth.tar are no longer available for download. Could you please provide the download links for these files? Thank you very much!

Sreyan88 commented 1 month ago

Hi @736958408 ,

Unfortunately, it looks like I no longer have the checkpoints. You are also right that the original YourTTS repository no longer hosts them.

However, have you tried searching in Coqui? You might be able to get it there:

https://github.com/coqui-ai/tts?tab=readme-ov-file#command-line-tts

tts --list_models

Please let me know if you see the model here! If not, I can look for an alternative!

Additionally, it is worth mentioning that MMER performs almost as well with any other TTS+VC model in place of YourTTS, so you can also use stronger alternatives if YourTTS is not available. The code I have provided remains the same for most models.
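If the model does show up in that list, here is a minimal sketch of driving it through Coqui's Python API; the model name, reference wav, and paths are placeholders rather than the exact setup from our pipeline.

```python
# Minimal sketch of synthesizing an augmented utterance with Coqui TTS.
# Model name, reference wav, and output path below are placeholders.
from TTS.api import TTS

print(TTS().list_models())  # same listing as `tts --list_models` on the command line

tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="This is the back-translated utterance to synthesize.",
    speaker_wav="reference_speaker.wav",   # reference clip for voice cloning
    language="en",
    file_path="augmented_utterance.wav",
)
```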

736958408 commented 1 month ago

I am very glad to discuss these issues with you, and I appreciate the help you have provided. Regarding TTS, the latest version indeed has significant changes, which makes some issues harder to resolve. I installed the latest version directly with pip install TTS; note that the Python version must be >= 3.9.0 to avoid dependency errors. For the model, I am using "vocoder_models--en--ljspeech--multiband-melgan" to generate audio from the augmented text. The data is currently being processed, and I will evaluate the processed dataset with MMER later. Thank you for your guidance and attention. If I have further questions, I will leave you a message. Thank you very much!