MoonInTheRiver / DiffSinger

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code
MIT License

The model takes the phoneme duration as input when inference? #28

Closed by YawYoung 2 years ago

YawYoung commented 2 years ago

Thanks for your wonderful work! I was running the inference of 0128_opencpop_ds58_midi, but there's a problem that bothers me.

https://github.com/MoonInTheRiver/DiffSinger/blob/master/tasks/tts/fs2.py#L348

    ############
    # infer
    ############
    def test_step(self, sample, batch_idx):
        spk_embed = sample.get('spk_embed') if not hparams['use_spk_id'] else sample.get('spk_ids')
        txt_tokens = sample['txt_tokens']
        mel2ph, uv, f0 = None, None, None
        ref_mels = None
        if hparams['profile_infer']:
            pass
        else:
            if hparams['use_gt_dur']:
                mel2ph = sample['mel2ph']
            if hparams['use_gt_f0']:
                f0 = sample['f0']
                uv = sample['uv']
                print('Here using gt f0!!')
            if hparams.get('use_midi') is not None and hparams['use_midi']:
                outputs = self.model(
                    txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True,
                    pitch_midi=sample['pitch_midi'], midi_dur=sample.get('midi_dur'), is_slur=sample.get('is_slur'))
            else:
                outputs = self.model(
                    txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True)

The parameter use_gt_dur is True, which means the model takes the ground-truth phoneme durations (mel2ph) as input at inference time. Is that correct?

MoonInTheRiver commented 2 years ago

I mentioned this in https://github.com/MoonInTheRiver/DiffSinger/blob/master/usr/configs/midi/readme.md , under issue b): "in this version of codes, we used the melody frontend ([lyric + MIDI]->[F0]) to predict F0 contour, but used the ground truth ph-durations."

I think a fine-grained music score can already include phoneme durations, so a duration predictor is not strictly necessary. If I have time, I may add a duration predictor. You can also add one yourself.
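For readers who take phoneme durations directly from a score, the following is a minimal sketch of turning per-phoneme durations in seconds into a frame-level alignment like the `mel2ph` value read in `test_step`. It is an illustration, not the repository's implementation: the function name `durations_to_mel2ph` is hypothetical, the 1-based phoneme indexing is an assumption about the alignment format, and the default sample rate and hop size are assumed values that should be replaced with those from your own config.

```python
from typing import List


def durations_to_mel2ph(
    ph_durs_sec: List[float],
    sample_rate: int = 24000,  # assumed value; take it from your config
    hop_size: int = 128,       # assumed value; take it from your config
) -> List[int]:
    """Expand per-phoneme durations (seconds) into a per-frame phoneme index.

    Hypothetical helper. Frame i of the result holds the 1-based index of
    the phoneme that covers it (1-based indexing is an assumption here).
    """
    frames_per_sec = sample_rate / hop_size
    mel2ph: List[int] = []
    elapsed = 0.0   # running end time of the current phoneme, in seconds
    prev_end = 0    # end frame of the previous phoneme
    for ph_idx, dur in enumerate(ph_durs_sec, start=1):
        elapsed += dur
        # Round the cumulative end time rather than each duration, so that
        # rounding errors do not accumulate over a long phrase.
        end = int(round(elapsed * frames_per_sec))
        mel2ph.extend([ph_idx] * (end - prev_end))
        prev_end = end
    return mel2ph
```

A list built this way would still need to be converted to a batched tensor before it could stand in for `sample['mel2ph']`; that step depends on the data pipeline and is not shown.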

YawYoung commented 2 years ago

Got it, thanks!

YawYoung commented 2 years ago

So DiffSinger (PopCS) does not need the ground-truth phoneme durations as input at inference time?

YawYoung commented 2 years ago

If a score has no phoneme duration annotation, is there an automatic phoneme duration annotation method?

MoonInTheRiver commented 2 years ago

I have updated the code. The frontend now predicts phoneme durations as well: [lyric + MIDI]->[F0 + ph_dur].