0nutation / SpeechGPT

SpeechGPT Series: Speech Large Language Models
https://0nutation.github.io/SpeechGPT.github.io/
Apache License 2.0

Pretraining on multiple languages #17

Closed · aquorio15 closed this issue 8 months ago

aquorio15 commented 8 months ago

Hi, thank you for the amazing work.

Is there any way I could pre-train on multiple languages? I tried doing that, but it does not work as intended. For example, if I discretize features from a pre-trained German HuBERT and a pre-trained English HuBERT separately, I get the same unit for different phones: unit <1> may represent a specific phone in English, while the same unit <1> represents a different phone in German.

I tried appending unique language IDs for both English and German during the data preparation stage (these become additional tokens during training), as in the examples below, but that is not working either:

<sosp><en><189><247><922>......................<9><4><en><eosp>
<sosp><de><20><333><245>......................<78><999><de><eosp>
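
For reference, here is a minimal scikit-learn sketch of the collision (purely synthetic features, not the actual SpeechGPT pipeline): two k-means codebooks fit independently, one per language-specific HuBERT, assign the same integer ID to unrelated clusters, so unit <1> is not comparable across languages.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
en_feats = rng.normal(loc=0.0, scale=1.0, size=(5000, 16))   # stand-in for English HuBERT features
de_feats = rng.normal(loc=3.0, scale=1.0, size=(5000, 16))   # stand-in for German HuBERT features

# Two independently fit codebooks, one per language-specific HuBERT.
en_km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(en_feats)
de_km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(de_feats)

# Unit 1 exists in both codebooks, but its centroid lives in a different region
# of feature space for each model, so "<1>" means different things per language.
print(np.linalg.norm(en_km.cluster_centers_[1] - de_km.cluster_centers_[1]))
```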
0nutation commented 8 months ago

Thank you for reaching out. Regarding your question, I would suggest using distinct speech tokens for each language to prevent any conflicts. For example:

<sosp><en189><en247><en922>...<eosp>
<sosp><de189><de247><de922>...<eosp>
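
A minimal sketch of that remapping, assuming the per-language HuBERT + k-means units stay as they are and only the token strings change (the function names here are illustrative, not part of the SpeechGPT codebase):

```python
def units_to_tokens(units, lang):
    """Wrap a sequence of integer k-means unit IDs in language-prefixed speech tokens."""
    body = "".join(f"<{lang}{u}>" for u in units)
    return f"<sosp>{body}<eosp>"

def speech_token_vocab(num_units, langs):
    """All speech tokens to register with the tokenizer: one per (language, unit) pair."""
    tokens = ["<sosp>", "<eosp>"]
    for lang in langs:
        tokens += [f"<{lang}{u}>" for u in range(num_units)]
    return tokens

print(units_to_tokens([189, 247, 922], "en"))        # <sosp><en189><en247><en922><eosp>
print(units_to_tokens([20, 333, 245], "de"))         # <sosp><de20><de333><de245><eosp>
print(len(speech_token_vocab(1000, ["en", "de"])))   # 2002 tokens for two 1000-unit codebooks
```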

However, if you prefer to use the same speech tokens for multiple languages, as you mentioned in your approach, I believe training a multilingual HuBERT model and subsequently performing clustering on it would be a suitable solution.
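
A rough sketch of that second route, assuming a multilingual HuBERT checkpoint is available: pool hidden features from clips in all target languages and fit a single k-means codebook, so the same unit inventory covers every language. The checkpoint path and audio paths are placeholders, and the layer index and cluster count are common defaults, not anything prescribed by SpeechGPT.

```python
import torch
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor
from sklearn.cluster import MiniBatchKMeans

CKPT = "path/to/multilingual-hubert"   # placeholder: a HuBERT trained on multilingual audio
LAYER = 6                              # middle transformer layer, a common choice for unit extraction

model = HubertModel.from_pretrained(CKPT).eval()
fe = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)

@torch.no_grad()
def features(wav_path):
    """Return frame-level hidden features (frames, dim) for one 16 kHz mono clip."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = fe(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    hidden = model(inputs.input_values, output_hidden_states=True).hidden_states
    return hidden[LAYER].squeeze(0)

# Fit one codebook on features pooled across both languages ...
feats = torch.cat([features(p) for p in ["en_clip.wav", "de_clip.wav"]])  # placeholder paths
km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000)
km.fit(feats.numpy())

# ... then discretize any clip, English or German, with the same codebook.
units = km.predict(features("en_clip.wav").numpy())
print(units[:20])
```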