SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License
259 stars, 73 forks

Question about alternative sample rates #62

AdolfVonKleist closed this issue 7 years ago

AdolfVonKleist commented 8 years ago

I'm interested in using another dataset with 8kHz audio, but I'm having a bit of trouble. I also tried resampling all the AN4 data to 8kHz, but I see the same issue. This is probably a parameter issue, but maybe you have a couple of hints.

For the AN4 set, as stated, I resampled all test and train audio files to 8kHz with sox:

# Resample every 16kHz .sph file to 8kHz, writing the result to the current directory.
for path in 16k/*.sph; do
    f=$(basename "$path" .sph)
    echo "$f"
    sox "$path" -r 8000 "${f}.sph"
done

and then deleted sort_ids_test.t7, sort_ids_train.t7, and the lmdb directory, and ran the MakeLMDB.lua command:

th MakeLMDB.lua -rootPath prepare_datasets/an4_dataset -lmdbPath prepare_datasets/an4_lmdb -windowSize 0.02 -stride 0.01 -sampleRate 8000

This all worked fine, but upon running the train command, I get a dimension mismatch:

$ th Train.lua
Number of parameters:   112854204
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 7 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/nn/View.lua:47: input view (20x32x1x13) and desired view (1056x-1) do not match
stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/nn/View.lua:47: in function 'batchsize'
        /home/ubuntu/torch/install/share/lua/5.1/nn/View.lua:79: in function </home/ubuntu/torch/install/share/lua/5.1/nn/View.lua:77>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:41>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        ./Network.lua:133: in function 'opfunc'
        /home/ubuntu/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
        ./Network.lua:155: in function 'trainNetwork'
        Train.lua:43: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405d50

In DeepSpeechModel.lua I see this line:

    local rnnInputsize = 32 * 41 -- based on the above convolutions and 16khz audio.

But I am not sure if/how I should modify this to accommodate 8kHz audio.

Edit: On the other hand, resampling the audio but leaving all training settings at 16kHz somehow 'works', although it produces very bad results.

SeanNaren commented 8 years ago

The error message says: input view (20x32x1x13) and desired view (1056x-1) do not match, so I assume that changing local rnnInputsize = 32 * 41 to local rnnInputsize = 32 * 1 would make it work.

For what it's worth, I think Baidu's models are built around 16kHz audio, and they upsample all their training data to 16kHz.

AdolfVonKleist commented 8 years ago

Thanks! It seems to match now. I would be interested to know how the 41 corresponds to 16kHz audio as opposed to 8kHz. While I can understand why they might upsample given the mixed corpora they use in the papers, some of which are natively recorded at 16kHz, I don't think this makes sense for a pure 8kHz corpus.

SeanNaren commented 8 years ago

That's the size of the output channels after the convolutions (after the striding etc). And I agree with you on that one :)
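As a rough illustration of where the 41 comes from, the arithmetic can be sketched in plain Python (no Torch needed). This is a sketch under assumptions, not the repository's code: it assumes the spectrogram has `window_samples // 2 + 1` frequency bins, and that the two convolutions use kernel sizes 41 and 21 with strides 2 and 1 along that axis and no padding. The strides are not stated in this thread; they are chosen because they reproduce the shapes reported above. The helper names are hypothetical.

```python
def freq_bins(sample_rate, window_size=0.02):
    """One-sided spectrogram bins for a window of window_size seconds (assumed layout)."""
    window_samples = int(round(sample_rate * window_size))
    return window_samples // 2 + 1


def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def rnn_input_height(sample_rate, kernels=(41, 21), strides=(2, 1)):
    """Size of the frequency axis after both convolutions (assumed kernels/strides)."""
    h = freq_bins(sample_rate)
    for k, s in zip(kernels, strides):
        h = conv_out(h, k, s)
    return h


print(rnn_input_height(16000))  # 161 bins -> 61 -> 41, matching 32 * 41
print(rnn_input_height(8000))   # 81 bins -> 21 -> 1, matching the 20x32x1x13 in the error
```

Under these assumptions, the second factor in rnnInputsize is the size of this axis after the convolutions, which is why it shrinks from 41 to 1 when only the sample rate changes.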

AdolfVonKleist commented 8 years ago

OK, so there was a bit more I had to do in the deepSpeech function to get it to work right. I also had to remap the kernel width [dH] from 20 + 1 + 20 = 41 to 10 + 1 + 10 = 21 in the first SpatialConvolution, and from 10 + 1 + 10 = 21 to 5 + 1 + 5 = 11 in the second. I then reset rnnInputsize = 32 * 21 to match. This way we look at the same actual span in the time dimension [I guess].
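The halved kernel sizes can be checked with the same kind of arithmetic. The sketch below is plain Python and rests on assumptions not stated in the thread: the kernels act on an axis with 81 bins at 8kHz (161 at 16kHz), the strides along that axis are 2 and 1, and there is no padding. The function name is hypothetical.

```python
def stack_height(bins, kernels, strides=(2, 1)):
    """Size of one axis after a stack of unpadded convolutions (assumed strides)."""
    for kernel, stride in zip(kernels, strides):
        bins = (bins - kernel) // stride + 1
    return bins


# Original 16kHz setup: 161 bins, kernels 41 and 21.
print(stack_height(161, (41, 21)))  # 41

# 8kHz audio with the original kernels collapses the axis to a single row.
print(stack_height(81, (41, 21)))   # 1

# 8kHz audio with the halved kernels 21 and 11.
print(stack_height(81, (21, 11)))   # 21
```

With the halved kernels the axis ends up with 21 rows, which agrees with the rnnInputsize = 32 * 21 reported in the comment above.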

SeanNaren commented 8 years ago

Thanks for that. I think it would make sense to clarify this somewhere on the main branch. I'll try to figure out the best place to put it!

AdolfVonKleist commented 8 years ago

Awesome! It would be great to also include some further information about how the rest of the stride and window parameters interact with the model. Some of this is described in the architecture wiki page, but it would be really nice to have the remaining relationships between the ideas and the Torch model parameterization all enumerated.

chanil1218 commented 7 years ago

I hit a similar bug when training on audio files with a sample rate of 22050, but I got a different error:

/home/ubuntu/Developer/torch/install/bin/luajit: ...u/Developer/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 1 module of nn.Sequential:
...ntu/Developer/torch/install/share/lua/5.1/cudnn/init.lua:118: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)
stack traceback:
        [C]: in function 'error'
        ...ntu/Developer/torch/install/share/lua/5.1/cudnn/init.lua:118: in function 'errcheck'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:143: in function 'createIODescriptors'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:374: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:371>
        [C]: in function 'xpcall'
        ...u/Developer/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        .../Developer/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function <.../Developer/torch/install/share/lua/5.1/nn/Sequential.lua:41>
        [C]: in function 'xpcall'
        ...u/Developer/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        .../Developer/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        ./Network.lua:136: in function 'opfunc'
        ...untu/Developer/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
        ./Network.lua:171: in function 'trainNetwork'
        Train.lua:42: in main chunk
        [C]: in function 'dofile'
        ...oper/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405d50

I didn't change anything except the MakeLMDB.lua -sampleRate option, which I set to 22050 (same rnnInputsize, windowSize, and stride).

I printed the input size: 20x1x221x5.

Should I change anything other than the MakeLMDB.lua sampleRate?

SeanNaren commented 7 years ago

@chanil1218 The cuDNN debug info is very confusing; hopefully this gets better in the future!

What you need to check is the size of the output from the convolutions, which you can do by adding require 'dpnn' and conv:add(nn.PrintSize("Conv Size")) here.

Based on the size it gives you (the two middle numbers), modify this line to reflect the actual size of the output given your new sample rate.
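For the 20x1x221x5 input reported above, the expected sizes can be estimated before running the model. The plain-Python sketch below relies on assumptions that the thread does not confirm: unpadded convolutions with kernel sizes 41 and 21 and strides 2 and 1 along the 221-long axis, and a kernel width of 11 with stride 2 along the 5-long axis in the first convolution.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


# Reported input: batch x planes x height x width.
batch, planes, height, width = 20, 1, 221, 5

# Assumed kernels 41 then 21, strides 2 then 1, along the 221-long axis.
h = conv_out(conv_out(height, 41, 2), 21, 1)

# Assumed kernel width 11, stride 2, along the 5-long axis (first convolution only).
w1 = conv_out(width, 11, 2)

print("height after both convolutions:", h)
print("width after the first convolution:", w1)
if w1 < 1:
    print("the input is narrower than the assumed kernel, so no valid output size exists")
```

If these assumptions hold, two things follow. The 221-long axis would come out at 71, suggesting rnnInputsize = 32 * 71 for 22050Hz audio. Separately, a width of 5 is smaller than the assumed kernel width, which might explain why cuDNN rejects the convolution outright rather than failing later in nn.View. The printed size from nn.PrintSize remains the authoritative check.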

EDIT: Also to keep to the idea of looking at a context of either side in the convolutional layers, check Adolf's comment here.