coding-phoenix-12 / LIMMITS25_evaluation

Questions about Track2 rules #2

Closed: aluminumbox closed this issue 4 days ago

aluminumbox commented 6 days ago

Hi,

In the example json1-3 files, the data format in json1/json3 is clear:

    {
        "id": "bho_m_social_02577-bho_f_food_02681",
        "text": "प्रायोगिक पुरातत्व सिध्द करै ला की कवन समाज कै हजार साल पुरान बा. चाऊमीन चाइनीस नूडल ब्रेस्ड पोर्क सेंचुरी एग कुंग पाओ चिकन अउर बुद्धा डिलाइट ई सब चाइनीज व्यंजन के सूची में आवेला. लुची एक्टे गहिराह तले अऊर बहुत पसंद कइल जाए वाला बंगाली रोटी हऊवे,लुची ऑल पर्पस आटा, नमक अऊरई घी से बनावल जाला",
        "language": "bhojpuri",
        "speaker": "Gujarathi_F",
        "save_file_name": "Gujarathi_F-bho_m_social_02577-bho_f_food_02681-bhojpuri"
    }

I believe this means synthesizing the given text with the specified (trained) speaker; a rough sketch of how I read such an entry follows.
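
For concreteness, this is my current reading of a json1/json3 entry as a rough Python sketch; `PlaceholderTTS` and the file name are made up for illustration and are not part of the challenge kit:

    import json

    class PlaceholderTTS:
        """Stand-in for a multi-speaker TTS model; not part of the challenge kit."""
        def synthesize(self, text, speaker, language):
            # A real model would return audio; this stub just records the request.
            return f"<audio: {speaker} / {language} / {len(text)} chars>"

    def render_entry(entry, tts):
        # "speaker" names a voice the model was trained on, "text" is what to say,
        # and "save_file_name" gives the required output file name.
        audio = tts.synthesize(entry["text"], entry["speaker"], entry["language"])
        print(entry["save_file_name"], audio)

    with open("json1.json") as f:  # file name is illustrative
        for entry in json.load(f):
            render_entry(entry, PlaceholderTTS())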

However, in Track 2 we need to do zero-shot TTS. Take, for example, the json2 entry below (with a zero-shot sketch after it):

    {
        "id": "bho_f_politics_02008",
        "text": "मंदिरे के उद्घाटन मे राज्य के कैबिनेट मंत्री के नेउता देवल रहल",
        "language": "bhojpuri",
        "speaker": "Gujarathi_F_IndicTTS",
        "save_file_name": "Gujarathi_F_IndicTTS-bho_f_politics_02008-bhojpuri"
    },
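
My reading of a json2 entry, by contrast, is that the named speaker is not a trained voice, so it would have to be cloned zero-shot from a reference recording; again only a rough sketch, with `PlaceholderZeroShotTTS` and the prompt path made up for illustration:

    class PlaceholderZeroShotTTS:
        """Stand-in for a zero-shot voice-cloning model; not a real challenge API."""
        def clone_and_synthesize(self, text, reference_audio, language):
            # A real model would condition on the reference audio; stub only.
            return f"<cloned audio: ref={reference_audio} / {language} / {len(text)} chars>"

    def render_zero_shot(entry, tts):
        # "speaker" here (e.g. Gujarathi_F_IndicTTS) is an unseen voice, so the
        # model clones it from a prompt recording instead of a trained speaker ID.
        prompt_wav = f'prompts/{entry["speaker"]}.wav'  # illustrative path
        audio = tts.clone_and_synthesize(entry["text"], prompt_wav, entry["language"])
        print(entry["save_file_name"], audio)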

Question 1. On the website, it says "For track 2, the evaluation will involve synthesis using zero-shot voice cloning". I thought this meant that both specific-speaker TTS and zero-shot TTS would be evaluated, but it seems there is only zero-shot TTS? I don't see any instruction for specific-speaker TTS in track2.json.

Question 2. On the website, it says "with codecs representing at least 3 attributes like speaker identity, content, pitch, energy etc". I thought this meant the codec needs to embed at least 3 attributes, but not necessarily 3 different codebooks. Our system uses only 1 codebook, but it covers speaker identity / emotion / content; is this okay?

Thanks, looking forward to your reply!

coding-phoenix-12 commented 4 days ago

For the first question, we will be evaluating only zero-shot TTS voice cloning, since attribute-specific codecs must be trained. Regarding the second question, it is fine to have a single codebook as long as each attribute maps to a different set of embeddings. For example, if embeddings 0-511 represent speaker identity and embeddings 512-1023 represent emotions, that is fine too.
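
For illustration, a minimal sketch of that kind of split within one codebook; the ranges simply mirror the example above (a content range would follow the same pattern), and how the partitioning is actually done is up to the participant:

    # Single shared codebook of 1024 codes, with index ranges reserved per
    # attribute following the example split above (ranges are illustrative).
    ATTRIBUTE_RANGES = {
        "speaker_identity": range(0, 512),     # codes 0-511
        "emotion":          range(512, 1024),  # codes 512-1023
    }

    def attribute_of(code_index):
        """Return which attribute a code index from the shared codebook encodes."""
        for name, idx_range in ATTRIBUTE_RANGES.items():
            if code_index in idx_range:
                return name
        raise ValueError(f"code index {code_index} is outside the codebook")

    print(attribute_of(37))   # -> speaker_identity
    print(attribute_of(700))  # -> emotion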