Miail closed this issue 7 years ago.
It is possible to train acoustic models with any kind of input, as long as you can store the features in Kaldi format.

A CNN acoustic model trained in Keras can be used to extract posterior features by a forward pass over the test features (similar to nnet-forward). For this, your network should, after some convolutional layers, convert the 3D signal to 2D and use a Dense layer with a softmax at the output. Its outputs can then be converted to likelihoods and sent to the decoder using latgen-faster-mapped.
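A rough sketch of that conversion step (the array values and shapes here are purely illustrative, not from the repo): the softmax posteriors are typically divided by the state priors in the log domain to get pseudo log-likelihoods before decoding.

```python
import numpy as np

# Illustrative posteriors from a softmax output: 2 frames, 3 senones
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])

# State priors p(s), usually estimated from the training alignment counts
priors = np.array([0.5, 0.3, 0.2])

# Pseudo log-likelihoods: log p(x|s) is proportional to log p(s|x) - log p(s)
eps = 1e-10  # guard against log(0)
loglikes = np.log(posteriors + eps) - np.log(priors + eps)
```

The resulting matrix can then be written out in Kaldi matrix format (for instance with a Kaldi I/O wrapper) and passed to latgen-faster-mapped; the exact writing step depends on your setup.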
Just curious: if I have fbank features extracted and a GMM forced alignment trained, and I would like to train a CNN on top of it, can I simply replace your
m = keras.models.Sequential([
    keras.layers.LSTM(256, input_shape=(learning['spliceSize'], trGen.inputFeatDim), activation='tanh', return_sequences=True),
    keras.layers.LSTM(256, activation='tanh', return_sequences=True),
    keras.layers.LSTM(256, activation='tanh'),
    keras.layers.Dense(trGen.outputFeatDim, activation='softmax')])
(in train*.py) with something like the following
m = Sequential()
# input_shape goes inside the first layer, and must be 3D for Convolution2D
m.add(Convolution2D(150, 8, 8, input_shape=trGen.inputFeatDim))
m.add(MaxPooling2D(pool_size=(6, 6)))
m.add(Flatten())
m.add(Dense(1024))
m.add(Activation('relu'))
m.add(Dense(trGen.outputFeatDim))
m.add(Activation('softmax'))
and keep the rest of the files the same?
Is there anything else that requires modification for this pipeline to work? For example, do I need to modify dataGenerator?
Thanks in advance!
Yes, but I guess Convolution1D makes more sense for filterbank features: we want each filter of the kernel to move across time and capture sound patterns by looking at the frequencies, so we don't want the kernel to move along the frequency axis. You could try that. The batch_size could be kept None, size could be your context, and input_dim could be the number of filters in the filterbank. Then you can flatten the layer's output and use Dense layer(s) with a softmax at the output. You can use dataGenSequences for this purpose. I haven't tested any code, though; I will try to include a CNN example in a later revision.
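A minimal sketch of such a 1D-convolutional model, written with tf.keras. The context size, filterbank size, layer widths, and senone count below are placeholder values, not taken from the repo:

```python
from tensorflow import keras
from tensorflow.keras import layers

context = 11       # splice size in frames (assumed)
num_filt = 40      # filterbank coefficients per frame (assumed)
num_states = 1000  # number of senones / output targets (assumed)

# Conv1D slides its kernel along the time axis only; each frame's
# filterbank vector acts as the channel dimension, so the kernel
# never moves along the frequency axis.
m = keras.models.Sequential([
    layers.Conv1D(150, 8, activation='relu', input_shape=(context, num_filt)),
    layers.Flatten(),
    layers.Dense(1024, activation='relu'),
    layers.Dense(num_states, activation='softmax'),
])
```

With a batch dimension left as None, each training example is a (context, num_filt) window, which matches the kind of spliced input dataGenSequences is meant to produce.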
Is it possible, given this implementation, to train an acoustic model on a different kind of input than audio frames? In my case, spectrograms of audio files.
I am currently looking for a way to implement a CNN-HMM using the Kaldi interface. Training the CNN part is possible in Keras, but connecting it to Kaldi seems to cause some problems.
Is it possible to create such an acoustic model using your implementation and still be able to decode using the Kaldi interface?