Patchethium opened this issue 1 year ago
I think it's so great!!! I would love to see your pull request!!!
Great, so there's a problem: the distribution of the predicted pitch is highly unbalanced. I tested the engine with the following code:
The `data.txt` contains about 10,000 characters from 人間失格 (*No Longer Human*), and the result is:
OrderedDict([(0.0, 763), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.6, 0), (0.7, 0), (0.8, 0), (0.9, 0), (1.0, 0), (1.1, 0), (1.2, 0), (1.3, 0), (1.4, 0), (1.5, 0), (1.6, 0), (1.7, 0), (1.8, 0), (1.9, 0), (2.0, 0), (2.1, 0), (2.2, 0), (2.3, 0), (2.4, 0), (2.5, 0), (2.6, 0), (2.7, 0), (2.8, 0), (2.9, 0), (3.0, 0), (3.1, 0), (3.2, 0), (3.3, 0), (3.4, 0), (3.5, 0), (3.6, 0), (3.7, 0), (3.8, 0), (3.9, 0), (4.0, 0), (4.1, 0), (4.2, 0), (4.3, 0), (4.4, 0), (4.5, 0), (4.6, 0), (4.7, 0), (4.8, 0), (4.9, 0), (5.0, 0), (5.1, 0), (5.2, 0), (5.3, 0), (5.4, 14), (5.5, 179), (5.6, 645), (5.7, 1086), (5.8, 1634), (5.9, 1866), (6.0, 2448), (6.1, 2406), (6.2, 790), (6.3, 27), (6.4, 0)])
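The test script itself isn't included in the thread; the binning step that produced the dump above could look roughly like this sketch (the engine query that yields `pitches` is assumed and not shown here):

```python
from collections import Counter, OrderedDict

def pitch_histogram(pitches, lo=0.0, hi=6.4, step=0.1):
    """Bin predicted mora pitches into 0.1-wide buckets, producing an
    OrderedDict shaped like the dump above. In the real test, `pitches`
    would come from running the engine over data.txt; here it is just
    a list of floats."""
    counts = Counter(round(p, 1) for p in pitches)
    n_bins = int(round((hi - lo) / step)) + 1
    keys = [round(lo + i * step, 1) for i in range(n_bins)]
    return OrderedDict((k, counts.get(k, 0)) for k in keys)

# Tiny fake input instead of the ~10,000-character test:
print(pitch_histogram([0.0, 0.0, 5.9, 6.0, 6.0, 6.1]))
```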
As you can see, the pitches all land in the 5.4–6.3 bins or at exactly 0.0. For precise control, I should limit the panel to this range, but users may want a value lower than 5.4, and male speakers like Ryusei tend to have a lower pitch anyway. Is there any statistical magic to normalize this distribution?
There is a way to calculate the mean μ and standard deviation σ of the non-zero pitches and limit the panel to the range (μ − ασ, μ + ασ) with an appropriate factor α.
It seems that we can adjust α so that the range also includes values a little smaller than 5.4.
However, this would require obtaining each speaker's μ and σ and putting them into SpeakerInfo, which is not immediately available.
How about first implementing the same range of controls as before?
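A minimal sketch of that μ ± ασ idea, assuming a plain list of predicted pitches (the function name and the default α here are made up for illustration):

```python
import statistics

def pitch_slider_range(pitches, alpha=2.0):
    """Return (mu - alpha*sigma, mu + alpha*sigma) computed over the
    non-zero (voiced) pitches, as proposed above; a larger alpha
    widens the slider range."""
    voiced = [p for p in pitches if p > 0.0]
    mu = statistics.mean(voiced)
    sigma = statistics.stdev(voiced)
    return (mu - alpha * sigma, mu + alpha * sigma)

# e.g. with values concentrated like the dump above:
print(pitch_slider_range([0.0, 5.5, 6.0, 6.5]))  # -> (5.0, 7.0)
```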
Good idea, but I doubt we need that $\sigma$, since $\alpha \sigma$ can simply be any value. And btw, should we calculate the mean $\mu$ with weights? Assume the probability of the pitch taking the value $a_i$ is $p_i$, $i \in 0 \dots k$; then the weighted mean is $\mu = \displaystyle \sum_{i=0}^{k} p_i a_i$.
> Good idea, but I doubt we need that $\sigma$, since $\alpha \sigma$ can simply be any value.
For example, if a speaker speaks expressively, σ (roughly, the intonation) will be larger. With ασ, the range is adjusted automatically, so there may be less to worry about.
> And btw, should we calculate the mean $\mu$ with weights? Assume the probability of the pitch taking the value $a_i$ is $p_i$, $i \in 0 \dots k$; then the weighted mean is $\mu = \displaystyle \sum_{i=0}^{k} p_i a_i$.
There may be many ways to do this, but I think simply averaging the pitch of the voice data used for machine learning is a good idea!
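To make the two options concrete, here is a tiny sketch of both the weighted mean over a predicted distribution and the plain average over training-data pitches (both function names are purely illustrative):

```python
def weighted_mean_pitch(values, probs):
    """mu = sum_i p_i * a_i: the expectation under the predicted distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * a for p, a in zip(probs, values))

def simple_mean_pitch(training_pitches):
    """Plain average over the voiced pitches of a speaker's training data."""
    voiced = [p for p in training_pitches if p > 0.0]
    return sum(voiced) / len(voiced)

print(weighted_mean_pitch([5.0, 6.0], [0.25, 0.75]))  # -> 5.75
```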
Okay, let's take it one step at a time.
Content
I was making a tuning panel that allows users to drag to set the pitch continuously; by that I mean a 1:1 copy of VOICEPEAK's tuning panel.
Screen recording 2022-11-22 00.38.05.webm
It was originally written for my own project, in Svelte, but I can easily port it to Vue and make some adjustments to fit the Japanese task. What do you guys think, shall I make a PR for it?