MoonInTheRiver / DiffSinger

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code
MIT License
4.35k stars, 715 forks

DiffSinger infer problem #41

Closed: leon2milan closed this issue 2 years ago

leon2milan commented 2 years ago

I want to test the Opencpop pretrained model on an unseen song, but I don't know how to generate the wav file.

  1. What data should I prepare for the model?
  2. How do I run it? I saw test_step in FastSpeech2Task, but it seems to be for the TTS task. Do I need to override test_step in DiffSingerMIDITask? Or is there another way, without packing the data into a dataloader: just load the model and infer?
newportchen commented 2 years ago

You should prepare: phoneme | pitch_midi | pitch_dur | is_slur, then write them to the 'test' data. Just use IndexedDatasetBuilder, like process_data() in

You need to fix some data loading problems (__getitem__, collater in Just set it to None. They are not necessary in the synthesis stage.

leon2milan commented 2 years ago

This is my first exposure to singing synthesis, so I have some questions about the terminology. Do pitch_midi | pitch_dur mean note and note duration? Should I set is_slur from the staff notation?
Also, I don't know how to set pitch_dur for an unseen song. Should I use Logic Pro to label it, or can I get it from some model or similar tool?

newportchen commented 2 years ago

Wait a minute. I'll find you a picture

newportchen commented 2 years ago


We use the data marked by the yellow box: phoneme | pitch_midi | pitch_dur

newportchen commented 2 years ago

pitch_dur = 60 * NoteBeats / bpm

bpm: beats per minute (the tempo); pitch_dur is then in seconds
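
As a minimal sketch of the formula above, the conversion from note lengths in beats to durations in seconds could look like this in Python. The function name `note_duration_seconds` and the three-note score are made up for illustration and are not part of DiffSinger; `bpm` stands for beats per minute.

```python
def note_duration_seconds(note_beats: float, bpm: float) -> float:
    """Duration in seconds of a note lasting `note_beats` beats at tempo `bpm`."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return 60.0 * note_beats / bpm


# Hypothetical score fragment: (note name, length in beats), read off the sheet music.
score = [("C4", 1.0), ("D4", 0.5), ("rest", 2.0)]
bpm = 120.0  # assumed tempo marking of the song

pitch_dur = [note_duration_seconds(beats, bpm) for _, beats in score]
print(pitch_dur)  # [0.5, 0.25, 1.0]
```

At 120 beats per minute one beat lasts half a second, so a one-beat note gets 0.5 s and a half-beat note gets 0.25 s.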

leon2milan commented 2 years ago

Thank you very much. I know how to do this now. But I have another question. There are silences in the music, so I assume it won't work if I simply convert the text into pinyin? Should I do singing-to-lyrics alignment?

newportchen commented 2 years ago

2001000005|面对浩瀚的星海我们微小得像尘埃|m ian d ui h ao h an an d e x ing h ai ai ai AP w o m en w ei x iao d e x iang ch en ai ai ai SP|C#4/Db4 C#4/Db4 D#4/Eb4 D#4/Eb4 C#4/Db4 C#4/Db4 D#4/Eb4 D#4/Eb4 E4 D#4/Eb4 D#4/Eb4 E4 E4 G#4/Ab4 G#4/Ab4 A4 G#4/Ab4 rest C#4/Db4 C#4/Db4 C#4/Db4 C#4/Db4 D#4/Eb4 D#4/Eb4 C#4/Db4 C#4/Db4 D#4/Eb4 D#4/Eb4 E4 E4 E4 E4 G#4/Ab4 A4 G#4/Ab4 rest|0.196990 0.196990 0.102120 0.102120 0.304680 0.304680 0.096780 0.096780 0.100220 0.150010 0.150010 0.361460 0.361460 0.221070 0.221070 0.183240 0.478670 0.384620 0.106510 0.106510 0.143020 0.143020 0.169480 0.169480 0.224180 0.224180 0.089360 0.089360 0.414460 0.414460 0.378050 0.378050 0.162790 0.207380 0.317260 0.297040|0.02765 0.16934 0.01874 0.08338 0.0821 0.22258 0.0693 0.02748 0.10022 0.07137 0.07864 0.12471 0.23675 0.12356 0.09751 0.18324 0.47867 0.38462 0.0405 0.06601 0.08303 0.05999 0.04687 0.12261 0.09778 0.1264 0.02321 0.06615 0.11958 0.29488 0.06723 0.31082 0.16279 0.20738 0.31726 0.29704|0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0

You should learn from transcriptions.txt
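
To make the layout of such a line easier to see, here is a rough Python sketch that splits one `|`-separated record into its parts. It assumes seven fields in the order shown in the example above: item id, lyrics, phonemes, note names, note durations, a second per-phoneme duration list, and slur flags. Reading the sixth field as phoneme durations is an assumption, and `Utterance`, `parse_line`, `note_to_midi`, and the short sample line are hypothetical names and data, not DiffSinger code. Mapping `rest` to 0 is also only a placeholder convention.

```python
from dataclasses import dataclass
from typing import List

# Semitone offset of each natural note within an octave.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


@dataclass
class Utterance:
    item_id: str
    text: str
    phonemes: List[str]
    notes: List[str]
    note_durs: List[float]
    ph_durs: List[float]  # assumed meaning of the sixth field
    is_slur: List[int]


def note_to_midi(name: str) -> int:
    """Convert names like 'C#4/Db4' or 'E4' to a MIDI number (C4 = 60).

    'rest' maps to 0 here, which is an assumed placeholder.
    Only single-digit octaves are handled.
    """
    if name == "rest":
        return 0
    name = name.split("/")[0]  # 'C#4/Db4' -> 'C#4'
    letter, accidental, octave = name[0], name[1:-1], int(name[-1])
    semitone = NOTE_OFFSETS[letter] + accidental.count("#") - accidental.count("b")
    return 12 * (octave + 1) + semitone


def parse_line(line: str) -> Utterance:
    fields = line.strip().split("|")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    item_id, text, ph, notes, note_durs, ph_durs, slur = fields
    utt = Utterance(
        item_id,
        text,
        ph.split(),
        notes.split(),
        [float(x) for x in note_durs.split()],
        [float(x) for x in ph_durs.split()],
        [int(x) for x in slur.split()],
    )
    # Every per-phoneme sequence should have one entry per phoneme.
    n = len(utt.phonemes)
    for seq in (utt.notes, utt.note_durs, utt.ph_durs, utt.is_slur):
        if len(seq) != n:
            raise ValueError("per-phoneme fields have mismatched lengths")
    return utt


# Hypothetical short record in the same layout as the example above.
sample = "0001|你好|n i h ao SP|C4 C4 D4 D4 rest|0.3 0.3 0.4 0.4 0.2|0.1 0.2 0.15 0.25 0.2|0 0 0 0 0"
utt = parse_line(sample)
pitch_midi = [note_to_midi(n) for n in utt.notes]
print(utt.phonemes, pitch_midi, utt.note_durs, utt.is_slur)
```

Note that in the example record each note and note duration is repeated once per phoneme, and silences appear as the AP and SP phonemes paired with `rest`, so all five sequences line up one-to-one.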

leon2milan commented 2 years ago

OK. Thank you so much. I'll try.

imiskolee commented 2 years ago

@leon2milan Did you succeed? Can you share some example code?