coding-phoenix-12 / LIMMITS25_evaluation

Questions about Track2 rules #2

Closed: aluminumbox closed this issue 4 days ago

aluminumbox commented 6 days ago

Hi,

In the example json1-3 files, the data format in json1/json3 is clear:

    {
        "id": "bho_m_social_02577-bho_f_food_02681",
        "text": "प्रायोगिक पुरातत्व सिध्द करै ला की कवन समाज कै हजार साल पुरान बा. चाऊमीन चाइनीस नूडल ब्रेस्ड पोर्क सेंचुरी एग कुंग पाओ चिकन अउर बुद्धा डिलाइट ई सब चाइनीज व्यंजन के सूची में आवेला. लुची एक्टे गहिराह तले अऊर बहुत पसंद कइल जाए वाला बंगाली रोटी हऊवे,लुची ऑल पर्पस आटा, नमक अऊरई घी से बनावल जाला",
        "language": "bhojpuri",
        "speaker": "Gujarathi_F",
        "save_file_name": "Gujarathi_F-bho_m_social_02577-bho_f_food_02681-bhojpuri"
    }

I believe this means synthesizing the given text with the specified (trained) speaker; a rough sketch of how I read such an entry follows.
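
For concreteness, this is my current reading of a json1/json3 entry as a rough Python sketch; `PlaceholderTTS` and the file name are made up for illustration and are not part of the challenge kit:

    import json

    class PlaceholderTTS:
        """Stand-in for a multi-speaker TTS model; not part of the challenge kit."""
        def synthesize(self, text, speaker, language):
            # A real model would return audio; this stub just records the request.
            return f"<audio: {speaker} / {language} / {len(text)} chars>"

    def render_entry(entry, tts):
        # "speaker" names a voice the model was trained on, "text" is what to say,
        # and "save_file_name" gives the required output file name.
        audio = tts.synthesize(entry["text"], entry["speaker"], entry["language"])
        print(entry["save_file_name"], audio)

    with open("json1.json") as f:  # file name is illustrative
        for entry in json.load(f):
            render_entry(entry, PlaceholderTTS())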

However, in Track 2 we need to do zero-shot TTS. Take, for example, the json2 entry below (with a zero-shot sketch after it):

    {
        "id": "bho_f_politics_02008",
        "text": "मंदिरे के उद्घाटन मे राज्य के कैबिनेट मंत्री के नेउता देवल रहल",
        "language": "bhojpuri",
        "speaker": "Gujarathi_F_IndicTTS",
        "save_file_name": "Gujarathi_F_IndicTTS-bho_f_politics_02008-bhojpuri"
    },
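
My reading of a json2 entry, by contrast, is that the named speaker is not a trained voice, so it would have to be cloned zero-shot from a reference recording; again only a rough sketch, with `PlaceholderZeroShotTTS` and the prompt path made up for illustration:

    class PlaceholderZeroShotTTS:
        """Stand-in for a zero-shot voice-cloning model; not a real challenge API."""
        def clone_and_synthesize(self, text, reference_audio, language):
            # A real model would condition on the reference audio; stub only.
            return f"<cloned audio: ref={reference_audio} / {language} / {len(text)} chars>"

    def render_zero_shot(entry, tts):
        # "speaker" here (e.g. Gujarathi_F_IndicTTS) is an unseen voice, so the
        # model clones it from a prompt recording instead of a trained speaker ID.
        prompt_wav = f'prompts/{entry["speaker"]}.wav'  # illustrative path
        audio = tts.clone_and_synthesize(entry["text"], prompt_wav, entry["language"])
        print(entry["save_file_name"], audio)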

Question 1. On the website, it says "For track 2, the evaluation will involve synthesis using zero-shot voice cloning". I thought this meant that both specific-speaker TTS and zero-shot TTS would be evaluated, but it seems there is only zero-shot TTS? I don't see any instruction for specific-speaker TTS in track2.json.

Question 2. On the website, it says "with codecs representing at least 3 attributes like speaker identity, content, pitch, energy etc". I thought this meant the codec needs to embed at least 3 attributes, but not necessarily 3 different codebooks. Our system uses only 1 codebook, but it covers speaker identity / emotion / content; is this okay?

Thanks, looking forward to your reply!

coding-phoenix-12 commented 4 days ago

For the first question, we will be evaluating only zero-shot TTS voice cloning, since attribute-specific codecs must be trained. Regarding the second question, it is fine to have a single codebook as long as each attribute maps to a different set of embeddings. For example, if embeddings 0-511 represent speaker identity and embeddings 512-1023 represent emotions, that is fine too.
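
For illustration, a minimal sketch of that kind of split within one codebook; the ranges simply mirror the example above (a content range would follow the same pattern), and how the partitioning is actually done is up to the participant:

    # Single shared codebook of 1024 codes, with index ranges reserved per
    # attribute following the example split above (ranges are illustrative).
    ATTRIBUTE_RANGES = {
        "speaker_identity": range(0, 512),     # codes 0-511
        "emotion":          range(512, 1024),  # codes 512-1023
    }

    def attribute_of(code_index):
        """Return which attribute a code index from the shared codebook encodes."""
        for name, idx_range in ATTRIBUTE_RANGES.items():
            if code_index in idx_range:
                return name
        raise ValueError(f"code index {code_index} is outside the codebook")

    print(attribute_of(37))   # -> speaker_identity
    print(attribute_of(700))  # -> emotion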