dspavankumar / keras-kaldi

Keras Interface for Kaldi ASR
GNU General Public License v3.0

Different form of input? #7

Closed Miail closed 7 years ago

Miail commented 7 years ago

Is it possible, given this implementation, to train an acoustic model on a different kind of input than audio frames? In my case, spectrograms of audio files.

I am currently seeking a way to implement a CNN-HMM using the Kaldi interface. Training the CNN part is possible in Keras, but connecting it to Kaldi seems to cause some problems.

Is it possible to create such an acoustic model using your implementation and still be able to decode using the Kaldi interface?

dspavankumar commented 7 years ago

It is possible to train acoustic models with any kind of input, as long as you can store the features in Kaldi format.
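As a minimal sketch of that step (not part of this repo), arbitrary per-utterance feature matrices such as spectrograms can be written to a Kaldi ark file with the third-party kaldi_io package; the utterance keys and dimensions below are illustrative:

```python
import numpy as np
import kaldi_io  # third-party package: vesis84/kaldi-io-for-python

# Illustrative spectrogram features: one (frames x bins) float matrix per utterance
feats = {'utt1': np.random.rand(300, 257).astype(np.float32),
         'utt2': np.random.rand(250, 257).astype(np.float32)}

# Write a binary ark readable by Kaldi tools
# (an scp index can be generated afterwards with Kaldi's copy-feats)
with open('spec.ark', 'wb') as f:
    for key, mat in feats.items():
        kaldi_io.write_mat(f, mat, key=key)
```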

A CNN acoustic model trained in Keras can be used to extract posterior features by a forward pass on the test features (similar to nnet-forward). For this, your network should, after some convolutional layers, convert the 3D signal to 2D and use a Dense layer with a softmax at the output. Its outputs can then be converted to likelihoods and sent to the decoder using latgen-faster-mapped.
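As a rough, untested sketch of that conversion (not from this repo): the softmax posteriors become pseudo log-likelihoods once the state priors are divided out, which is what nnet-forward does with --class-frame-counts. The helper name and the counts array here are assumptions:

```python
import numpy as np

def posteriors_to_loglikes(post, state_counts, floor=1e-10):
    """Convert softmax posteriors (frames x pdfs) to pseudo log-likelihoods
    by subtracting the log state priors estimated from frame counts."""
    priors = state_counts / state_counts.sum()
    return np.log(np.maximum(post, floor)) - np.log(np.maximum(priors, floor))

# post = m.predict(test_feats)                      # CNN forward pass
# loglikes = posteriors_to_loglikes(post, counts)   # counts: per-pdf frame counts
# Written per utterance to an ark, these can be decoded with, e.g.:
#   latgen-faster-mapped final.mdl HCLG.fst ark:loglikes.ark ark:lat.ark
```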

yh1008 commented 7 years ago

Just curious: if I have fbank features extracted and GMM forced alignments trained, and I would like to train a CNN on top of them, can I simply replace your

m = keras.models.Sequential([
                keras.layers.LSTM(256, input_shape=(learning['spliceSize'],trGen.inputFeatDim), activation='tanh', return_sequences=True),
                keras.layers.LSTM(256, activation='tanh', return_sequences=True),
                keras.layers.LSTM(256, activation='tanh'),
                keras.layers.Dense(trGen.outputFeatDim, activation='softmax')])

(in train*.py) with something like the following

m = Sequential()
# input_shape belongs to the first layer and must be a 3D tuple
# (time, frequency, channels), not a scalar feature dimension
m.add(Convolution2D(150, 8, 8,
                    input_shape=(learning['spliceSize'], trGen.inputFeatDim, 1)))
m.add(MaxPooling2D(pool_size=(6, 6)))
m.add(Flatten())
m.add(Dense(1024))
m.add(Activation('relu'))
m.add(Dense(trGen.outputFeatDim))
m.add(Activation('softmax'))

and have the rest of the files remain the same?

Is there anything else we should know that requires modification for this pipeline to work? For example, do I need to modify dataGenerator?

Thanks in advance!

dspavankumar commented 7 years ago

Yes, but I guess Convolution1D makes more sense for filterbank features: we want each kernel filter to slide across time and capture sound patterns by looking at the frequencies, so the kernel should not move along the frequency axis. You could try that. The batch_size can be kept None, the size (number of time steps) would be your context, and input_dim would be the number of filters in the filterbank. You can then flatten the layer's output and use Dense layer(s) with a softmax at the output. You can use dataGenSequences for this purpose. I haven't tested any code, though. I will try to include a CNN example in a later revision.
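A minimal, untested sketch of that suggestion, reusing the learning and trGen names from the train*.py snippet quoted above; the filter count, kernel width, and pooling size are illustrative assumptions:

```python
m = keras.models.Sequential([
        # kernels slide along time only; the filterbank channels form input_dim
        keras.layers.Convolution1D(150, 8, activation='relu',
            input_shape=(learning['spliceSize'], trGen.inputFeatDim)),
        keras.layers.MaxPooling1D(3),
        keras.layers.Flatten(),
        keras.layers.Dense(1024, activation='relu'),
        keras.layers.Dense(trGen.outputFeatDim, activation='softmax')])
```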