VOICEVOX / voicevox

The editor for VOICEVOX, a free-to-use, medium-quality text-to-speech software
https://voicevox.hiroshiba.jp/

Continuous Tuning #1026

Open Patchethium opened 1 year ago

Patchethium commented 1 year ago

Description

I have been making a tuning panel that lets users drag to set the pitch continuously; by that I mean a 1:1 copy of VOICEPEAK's tuning panel.

Screen recording 2022-11-22 00.38.05.webm

It was originally written in Svelte for my own project, but I can easily port it to Vue and make some adjustments to fit the Japanese task. What do you think, shall I open a PR for it?

Hiroshiba commented 1 year ago

I think it's great!!! I would love to see your pull request!!!

Patchethium commented 1 year ago

Great. There is a problem, though: the distribution of predicted pitch is highly unbalanced. I tested the engine with the following code:

Test code:

```python
import collections
import json
import urllib.parse
import urllib.request

import matplotlib.pyplot as plt


def main():
    # Pre-fill histogram bins from 0.0 to 6.4 in steps of 0.1.
    pitches = {}
    for i in range(0, 65):
        pitches[round(i / 10, 2)] = 0

    # Split the input text into sentences on the Japanese full stop.
    with open("data.txt") as f:
        lines = f.readlines()
    sentences = []
    for line in lines:
        for sent in line.split("。"):
            sentences.append(sent.rstrip().replace("\u3000", "") + "。")

    # Request an audio_query from the local engine for each sentence and
    # accumulate the mora pitches (rounded to one decimal) into the histogram.
    url = "http://localhost:50021/audio_query"
    for sentence in sentences:
        params = {"speaker": "1", "text": sentence}
        full_url = url + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(full_url, method="POST")
        response = urllib.request.urlopen(req)
        data = json.loads(response.read().decode("utf8"))
        try:
            for accent_phrase in data["accent_phrases"]:
                for mora in accent_phrase["moras"]:
                    pitch = round(mora["pitch"], 1)
                    if pitch not in pitches:
                        pitches[pitch] = 1
                    else:
                        pitches[pitch] += 1
        except:  # skip sentences whose query could not be parsed
            continue

    # Print and plot the histogram.
    od = collections.OrderedDict(sorted(pitches.items()))
    print(od)
    plt.bar(range(len(od)), list(od.values()), align="center")
    plt.xticks(range(len(od)), list(od.keys()), rotation=60)
    plt.show()


if __name__ == "__main__":
    main()
```

data.txt contains about 10,000 characters from 人間失格, and the result is:

Figure_1: bar chart of the pitch histogram (values printed below)

```
OrderedDict([(0.0, 763), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.6, 0), (0.7, 0), (0.8, 0), (0.9, 0), (1.0, 0), (1.1, 0), (1.2, 0), (1.3, 0), (1.4, 0), (1.5, 0), (1.6, 0), (1.7, 0), (1.8, 0), (1.9, 0), (2.0, 0), (2.1, 0), (2.2, 0), (2.3, 0), (2.4, 0), (2.5, 0), (2.6, 0), (2.7, 0), (2.8, 0), (2.9, 0), (3.0, 0), (3.1, 0), (3.2, 0), (3.3, 0), (3.4, 0), (3.5, 0), (3.6, 0), (3.7, 0), (3.8, 0), (3.9, 0), (4.0, 0), (4.1, 0), (4.2, 0), (4.3, 0), (4.4, 0), (4.5, 0), (4.6, 0), (4.7, 0), (4.8, 0), (4.9, 0), (5.0, 0), (5.1, 0), (5.2, 0), (5.3, 0), (5.4, 14), (5.5, 179), (5.6, 645), (5.7, 1086), (5.8, 1634), (5.9, 1866), (6.0, 2448), (6.1, 2406), (6.2, 790), (6.3, 27), (6.4, 0)])
```

As you can see, the pitches all land in roughly 5.4-6.3, or at exactly 0. For precise control I should limit the panel to this range, but users may want a value lower than 5.4; also, male speakers like Ryusei tend to have a lower pitch. Is there any statistical magic to normalize this distribution?

Hiroshiba commented 1 year ago

One option is to compute the mean μ and standard deviation σ of the non-zero pitches and limit the range to (μ - ασ, μ + ασ) with an appropriate factor α. It seems we can tune α so that values a little below 5.4 still fall inside the range.

However, that would require computing μ and σ for every speaker and including them in SpeakerInfo, which cannot be prepared right away. How about first implementing it with the same control range as before?
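
For reference, here is a minimal sketch of that idea in Python (the helper name, the α value, and the way pitches are passed in are all assumptions for illustration, not existing engine code). It takes the non-zero mora pitches, e.g. the ones collected by the test script above, and derives slider bounds:

```python
import statistics

# Illustrative only: derive per-speaker slider bounds (mu - alpha*sigma, mu + alpha*sigma)
# from a list of mora pitches. Unvoiced moras are reported with pitch == 0 and are dropped.
ALPHA = 3.0  # assumed tuning factor; would be chosen so values a bit below 5.4 still fit

def pitch_range(pitches, alpha=ALPHA):
    voiced = [p for p in pitches if p > 0.0]
    mu = statistics.mean(voiced)
    sigma = statistics.stdev(voiced)
    return (mu - alpha * sigma, mu + alpha * sigma)
```

The per-speaker μ and σ would then have to be precomputed and shipped, e.g. in SpeakerInfo as mentioned above.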

Patchethium commented 1 year ago

Good idea, but I doubt we need that $\sigma$, since $\alpha \sigma$ can simply be any value. And btw, should we calculate the mean $\mu$ with weights? Assuming the probability of the pitch taking the value $a_i$ is $p_i$, $i \in 0 \dots k$, we can calculate the weighted mean as $\mu = \displaystyle \sum_{i=0}^{k} p_i a_i$.
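
For what it's worth, with the histogram the test script above already builds ({pitch value: count}), that weighted mean could be computed like this (the helper name is made up for illustration):

```python
# Treat the histogram as an empirical distribution: p_i = count_i / total,
# then mu = sum_i p_i * a_i. The unvoiced bin at 0.0 is excluded.
def weighted_mean(histogram):
    voiced = {a: n for a, n in histogram.items() if a > 0.0}
    total = sum(voiced.values())
    return sum(a * n / total for a, n in voiced.items())
```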

Hiroshiba commented 1 year ago

> Good idea, but I doubt we need that $\sigma$, since $\alpha \sigma$ can simply be any value.

For example, if a speaker speaks expressively, σ (which roughly corresponds to intonation) will be larger. With ασ the range is adjusted automatically, so there is less to worry about.

> And btw, should we calculate the mean $\mu$ with weights? Assuming the probability of the pitch taking the value $a_i$ is $p_i$, $i \in 0 \dots k$, we can calculate the weighted mean as $\mu = \displaystyle \sum_{i=0}^{k} p_i a_i$.

There may be many ways to do this, but I think simply averaging the pitch of the voice data used for machine learning is a good idea!
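
As a quick illustration of how ασ adapts the range (the numbers below are made up, not measured from any actual speaker):

```python
import statistics

def mu_sigma(pitches):
    return statistics.mean(pitches), statistics.stdev(pitches)

alpha = 2.0
for label, samples in [("flat", [5.8, 5.9, 6.0, 6.0, 6.1]),
                       ("expressive", [5.2, 5.6, 6.0, 6.4, 6.8])]:
    mu, sigma = mu_sigma(samples)
    # The expressive speaker's larger sigma automatically widens the tuning range.
    print(label, (round(mu - alpha * sigma, 2), round(mu + alpha * sigma, 2)))
```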

Patchethium commented 1 year ago

Okay, let's take it one step at a time.