0nutation / SpeechGPT

SpeechGPT Series: Speech Large Language Models
https://0nutation.github.io/SpeechGPT.github.io/
Apache License 2.0

Pretraining on multiple languages #17

Closed · aquorio15 closed this issue 8 months ago

aquorio15 commented 8 months ago

Hi, thank you for the amazing work.

Is there any way I could pre-train on multiple languages? I tried doing that, but it does not work as intended. For example, if I discretize features from a pre-trained German HuBERT and a pre-trained English HuBERT separately, I get the same unit for different phones: unit <1> may represent a specific phone in English, while the same unit <1> represents a different phone in German.

I tried appending unique language IDs for both English and German during the data preparation stage (these become additional tokens during training), as in the examples below, but that is not working either:

<sosp><en><189><247><922>......................<9><4><en><eosp>
<sosp><de><20><333><245>......................<78><999><de><eosp>
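
For reference, here is a minimal scikit-learn sketch of the collision (purely synthetic features, not the actual SpeechGPT pipeline): two k-means codebooks fit independently, one per language-specific HuBERT, assign the same integer ID to unrelated clusters, so unit <1> is not comparable across languages.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
en_feats = rng.normal(loc=0.0, scale=1.0, size=(5000, 16))   # stand-in for English HuBERT features
de_feats = rng.normal(loc=3.0, scale=1.0, size=(5000, 16))   # stand-in for German HuBERT features

# Two independently fit codebooks, one per language-specific HuBERT.
en_km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(en_feats)
de_km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(de_feats)

# Unit 1 exists in both codebooks, but its centroid lives in a different region
# of feature space for each model, so "<1>" means different things per language.
print(np.linalg.norm(en_km.cluster_centers_[1] - de_km.cluster_centers_[1]))
```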
0nutation commented 8 months ago

Thank you for reaching out. Regarding your question, I would suggest using distinct speech tokens for each language to prevent any conflicts. For example:

<sosp><en189><en247><en922>...<eosp>
<sosp><de189><de247><de922>...<eosp>
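
A minimal sketch of that remapping, assuming the per-language HuBERT + k-means units stay as they are and only the token strings change (the function names here are illustrative, not part of the SpeechGPT codebase):

```python
def units_to_tokens(units, lang):
    """Wrap a sequence of integer k-means unit IDs in language-prefixed speech tokens."""
    body = "".join(f"<{lang}{u}>" for u in units)
    return f"<sosp>{body}<eosp>"

def speech_token_vocab(num_units, langs):
    """All speech tokens to register with the tokenizer: one per (language, unit) pair."""
    tokens = ["<sosp>", "<eosp>"]
    for lang in langs:
        tokens += [f"<{lang}{u}>" for u in range(num_units)]
    return tokens

print(units_to_tokens([189, 247, 922], "en"))        # <sosp><en189><en247><en922><eosp>
print(units_to_tokens([20, 333, 245], "de"))         # <sosp><de20><de333><de245><eosp>
print(len(speech_token_vocab(1000, ["en", "de"])))   # 2002 tokens for two 1000-unit codebooks
```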

However, if you prefer to use the same speech tokens for multiple languages, as you mentioned in your approach, I believe training a multilingual HuBERT model and subsequently performing clustering on it would be a suitable solution.
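
A rough sketch of that second route, assuming a multilingual HuBERT checkpoint is available: pool hidden features from clips in all target languages and fit a single k-means codebook, so the same unit inventory covers every language. The checkpoint path and audio paths are placeholders, and the layer index and cluster count are common defaults, not anything prescribed by SpeechGPT.

```python
import torch
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor
from sklearn.cluster import MiniBatchKMeans

CKPT = "path/to/multilingual-hubert"   # placeholder: a HuBERT trained on multilingual audio
LAYER = 6                              # middle transformer layer, a common choice for unit extraction

model = HubertModel.from_pretrained(CKPT).eval()
fe = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)

@torch.no_grad()
def features(wav_path):
    """Return frame-level hidden features (frames, dim) for one 16 kHz mono clip."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = fe(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    hidden = model(inputs.input_values, output_hidden_states=True).hidden_states
    return hidden[LAYER].squeeze(0)

# Fit one codebook on features pooled across both languages ...
feats = torch.cat([features(p) for p in ["en_clip.wav", "de_clip.wav"]])  # placeholder paths
km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000)
km.fit(feats.numpy())

# ... then discretize any clip, English or German, with the same codebook.
units = km.predict(features("en_clip.wav").numpy())
print(units[:20])
```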