Closed — aquorio15 closed this issue 8 months ago
Thank you for reaching out. Regarding your question, I would suggest using distinct speech tokens for each language to prevent any conflicts. For example:
<sosp><en189><en247><en922>...<eosp>
<sosp><de189><de247><de922>...<eosp>
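The per-language token scheme above can be sketched as a small remapping step during data preparation. This is a minimal illustration, not the repo's actual preprocessing code; the function name and the `<sosp>`/`<eosp>`/`<{lang}{id}>` token shapes simply follow the example given here:

```python
def units_to_tokens(unit_ids, lang):
    """Wrap a sequence of discrete unit IDs in language-prefixed speech
    tokens, so unit 189 becomes <en189> for English and <de189> for
    German and the two vocabularies can never collide.

    Hypothetical helper for illustration; adapt the token naming to
    whatever your tokenizer's vocabulary actually uses."""
    body = "".join(f"<{lang}{u}>" for u in unit_ids)
    return f"<sosp>{body}<eosp>"


print(units_to_tokens([189, 247, 922], "en"))
# <sosp><en189><en247><en922><eosp>
print(units_to_tokens([189, 247, 922], "de"))
# <sosp><de189><de247><de922><eosp>
```

Because each `(language, unit)` pair maps to its own token string, the model sees `<en1>` and `<de1>` as entirely distinct vocabulary entries even though both came from cluster index 1.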
However, if you prefer to share the same speech tokens across languages, as in your approach, I believe training a multilingual HuBERT model and then clustering its features would be a suitable solution.
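The shared-codebook alternative amounts to fitting a single k-means model over features pooled from all languages, so that a given unit ID denotes the same acoustic cluster everywhere. Below is a minimal sketch assuming scikit-learn and random arrays as stand-ins for frame-level hidden states from one multilingual HuBERT encoder (the feature dimension 768 and cluster count 100 are illustrative choices, not values from this repo):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-ins for frame-level features from a *shared* multilingual
# HuBERT encoder; in practice these come from the model's hidden states.
en_feats = rng.normal(size=(500, 768))
de_feats = rng.normal(size=(500, 768))

# Fit one codebook on the pooled features so unit <1> refers to the
# same cluster regardless of the input language.
km = MiniBatchKMeans(n_clusters=100, random_state=0, n_init=3)
km.fit(np.concatenate([en_feats, de_feats]))

en_units = km.predict(en_feats)  # discrete units for English frames
de_units = km.predict(de_feats)  # same codebook applied to German frames
```

The key point is that both languages are quantized by the same `km` model; the collision described in the question arises precisely when each language is clustered separately, so that cluster index 1 is assigned independently in each codebook.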
Hi, thank you for the amazing work.
Is there any way I could pre-train on multiple languages? I tried doing that, but it's not working as intended. For example, if I discretize the features from a pre-trained German HuBERT and an English HuBERT, I get the same unit for different phones: unit <1> may represent a specific phone in English, while the same unit <1> represents a different phone in German.
I tried appending unique IDs for both English and German during the data-preparation stage, which become additional tokens during training, but that isn't working either.