deepgram / kur

Descriptive Deep Learning
Apache License 2.0

ValueError: operands could not be broadcast together with shapes (155,81) (161,) #34

Closed · michaelcapizzi closed 7 years ago

michaelcapizzi commented 7 years ago

When trying to use my own data for a speech example, I get this issue very early on:

ValueError: operands could not be broadcast together with shapes (155,81) (161,) 

I looked through the log, and I see that the model inferred an input dimension of 161. And so it's clear that when it goes to load a batch of data with a different dimension (in this case 155), it fails.

So I have two questions:

  1. What does that dimension represent (I'm using data.type=spec)?
  2. How does the model "infer" that the dimension should be 161?
ajsyp commented 7 years ago

Hi @michaelcapizzi!

  1. With data.type=spec, Kur is creating spectrograms of your audio data. For 16kHz audio and a 10ms timestep in the STFT, you end up with 161 frequency bins. The fact that you have 81 in your error suggests that you're using 8kHz audio.
  2. The model can infer that dimension in two ways: from the data itself (measuring the shape that comes out of the spectrogram), or from the Kurfile (if the input layer in the model has a shape explicitly specified).
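The bin counts above can be sketched in a few lines. This is a back-of-the-envelope calculation, not Kur's actual code, and it assumes a 20 ms STFT window (a common default that yields exactly these numbers); Kur's real parameters may differ.

```python
# Sketch: number of one-sided STFT frequency bins for a given sample rate.
# Assumption (not from the thread): a 20 ms analysis window.
def freq_bins(sample_rate_hz, window_ms=20):
    n_fft = int(sample_rate_hz * window_ms / 1000)  # samples per window
    return n_fft // 2 + 1  # one-sided spectrum: n_fft/2 + 1 bins

print(freq_bins(16000))  # 161 bins for 16 kHz audio
print(freq_bins(8000))   # 81 bins for 8 kHz audio
```

This matches the shapes in the error message: 161 expected (16 kHz) versus 81 observed (8 kHz).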

Do you have one or two files that you can share? Even if you just generated two files, each with 5 seconds of white noise, for example, then I may be able to help debug.

michaelcapizzi commented 7 years ago

Thank you for the quick reply @ajsyp .

You are correct that I'm using 8kHz data.

So a few follow-up questions:

  1. Is 8kHz data supported?
  2. Is it really "inferring" 161 by somehow looking at my data, or is it more like it's "assuming" I have 16kHz data?
  3. If I explicitly set the shape as you suggest in your #2 above, would that solve the problem? When I tried to do this:
- input: utterance
    shape: 81

I got an error:

yaml.scanner.ScannerError: mapping values are not allowed here
  in "<unicode string>", line 115, column 12:
          shape: 81

I've also attached two audio samples. Thanks again, and any further guidance is greatly appreciated.

audio_samples.zip

scottstephenson commented 7 years ago

You can upsample with ffmpeg. This command converts 8kHz audio to 16kHz (mono, 16-bit, 16kHz):

ffmpeg -i thefile_8k.wav -acodec pcm_s16le -ac 1 -ar 16000 thefile_16k.wav

Upsampling will take a little longer to train but it's a good way to mix 8k and 16k audio training data.
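Before or after converting, it's worth verifying what sample rate your files actually carry. A minimal check using only Python's standard-library wave module (the file path below is hypothetical):

```python
# Read the sample rate recorded in a WAV file's header.
import wave

def sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()

# Files reporting 8000 here would need the ffmpeg upsampling step above
# before being mixed with 16 kHz training data, e.g.:
# print(sample_rate("thefile_8k.wav"))
```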

ajsyp commented 7 years ago

Two comments:

  1. Are you sure there isn't a rogue norm.yml file lying around? The normalization features (stored in that file) might suggest to Kur that the audio is 16kHz. If you just delete/rename that file, it should regenerate a new one for your 8k data.

  2. To specify an explicit shape, do it like this:

    - input:
        shape: [null, 81]
        name: utterance

    shape is a parameter of the input layer, and is therefore indented underneath. The [null, 81] means "variable length utterances (in the time domain), but 81 frequency features."
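The indentation distinction explains the earlier ScannerError. A small sketch (assuming PyYAML is installed) contrasting the failing and working forms from this thread:

```python
import yaml

# Fails: "input: utterance" already maps input to a scalar string, so a
# nested "shape:" mapping underneath it is not allowed.
broken = """
- input: utterance
    shape: 81
"""
try:
    yaml.safe_load(broken)
except yaml.YAMLError as err:
    print(type(err).__name__)  # the "mapping values are not allowed" error

# Works: "input" is itself a mapping, with shape and name as its keys.
fixed = """
- input:
    shape: [null, 81]
    name: utterance
"""
layers = yaml.safe_load(fixed)
print(layers[0]["input"]["shape"])  # [None, 81]
```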

michaelcapizzi commented 7 years ago

That must be it, @ajsyp !

I didn't realize that the norm.yml file held information like that. It was lying around from when I ran your original example. I've since removed it, and the model now properly "infers" a dimension of 81. Thank you.

And thanks for the bit of .yml syntax help as well.