How to map to sound time series?

williamFalcon commented 6 years ago

Hi, I don't have a lot of experience with audio processing, what's unclear to me is how to map these features to the corresponding sound sequence?

I'm feeding these to an RNN along with the corresponding sound chunk. Do you know how I would set that up?

Say I have a sound file with 32,000 values (16mhz for 2 seconds). I'm feeding the RNN a sequence of 1024 items at a time. BUT I'm grouping them by frames where each frame has 16 sound steps.

So

wav = load(...)
wav.shape  
# [1 x 32000]

sub_seq = wav[0:1024]
sub_seq = suq_seq.reshape(1, 64, 16)

f0_contour, spectral_envelope, aperiodicity = pw.wav2world(wav, sample_rate=16000, frame_period=16)

f0_contour.shape
# [157]
# ??? unclear how to match to the (1, 64, 16) piece of sound

Thanks for an awesome package!!

williamFalcon commented 6 years ago

Would also appreciate pointers about where I can learn about this? (ie, audio processing, binning, the padding, etc...)

JeremyCCHsu commented 6 years ago

Hi,

The frames in WORLD vocoder actually have variable lengths, so they cannot be easily mapped into the format you specified. See the official repo for more description: https://github.com/mmorise/World/issues/51 You might also consult mmorise's advice on this issue.

It's still possible to align the features with the sample sequence by exploiting the time bounding boxes from pw.dio output. However, if you really need to fit fixed length frames, then you might need to consider short-time Fourier Transform (STFT) which is available in many signal processing toolkits (e.g., scipy, Tensorflow, etc).

As I'm not sure what kind of application you're doing and why you need to align the features to the sample sequence, I'm afraid that I cannot provide further help. In some application scenario, treating [f0, ap, sp] as a time sequence and applying RNN to it is a common approach.

Lastly, as for the material, I studied these on a speech signal processing course several years ago.

williamFalcon commented 6 years ago

Ok... makes sense. So, if I'm using this to synthesize speech from these features (x = vocoder features, y=audio, model=rnn), I understand that now I can use [f0, ap, sp] as features, but i need to know the amount of time that corresponds to since the sequences are extremely long and I need to do TBPTT.

I noticed if I use frame period = 5 with audio sample rate of 16khz, then i get 13 frames for every 952 audio samples. Q1. Does that mean I can just reliably go 13 frames and 952 steps at a time to do what I need to do?

The formula I came up with which (i think) estimates the number of frames is

    def number_of_frames(self, audio, sample_rate=16000, frame_period=5):
        frames = np.ceil(((len(audio) / sample_rate) / frame_period) * 1000)
        return frames

The paper (char2wav) I'm going off of says:

First, we pretrained the reader and the neural vocoder separately. We used normalized WORLD
vocoder features (Morise et al., 2016; Wu et al., 2016) as targets for the reader and as inputs for the
neural vocoder. Finally, we fine-tuned the whole model end-to-end. Our code is available online.1

Q2: Does pyworld have a normalize function? Or is it just (x-mean(x)) / var(x) ? For each feature.

JeremyCCHsu commented 6 years ago

I'm so sorry that I got your question wrong and answered you the wrong thing. The frames are actually aligned with the waveform.* In your case, you messed up with the frame_period argument-- the unit is in millisecond. Therefore, instead of setting it to 16, you should've set it to 1. (The sampling rate is 16 kHz => 1 ms = 16 samples).**

As for Q2: how Char2wav normalize the features, you might want to contact the authors. There isn't a normalizing function in WORLD as far as I know. Z-normalization makes sense (I assume you meant std in the denominator).

Hope this helps.

footnotes: *You might get output features (f0, ap, sp) of length 2001; in this case, the last frame should be ignored.

**Strictly speaking, the frames have overlapping, so a frame of (f0, sp, ap) actually contributes to more than 16 samples (here I'm assuming that frame_period=1). Nonetheless, the starting point of every frame is aligned with the signal.

williamFalcon commented 6 years ago

thanks! Yeah meant std.

with frame_period = 1, will the f0, sp, ap still capture enough information? My (shallow) understanding was that frame_period needed to be > 1 for it to capture anything meaningful?

That’s all! Thank you

JeremyCCHsu commented 6 years ago

Take this example. Q=frame_period, K=frame_length (frame_length cannot be specified and is variable-length in WORLD), so actually, forframe_period, the smaller, the better (typically 5 or 1 ms).

williamFalcon commented 6 years ago

Thanks!

JeremyCCHsu / Python-Wrapper-for-World-Vocoder

How to map to sound time series? #34