HawkAaron / RNN-Transducer

MXNet implementation of RNN Transducer (Graves 2012): Sequence Transduction with Recurrent Neural Networks

Why add a while loop when decoding on the GPU? #13

Closed Marcovaldong closed 5 years ago

Marcovaldong commented 5 years ago

I want to know why there is a while loop when decoding on the GPU. I have introduced an embedding layer into the prediction network to replace the one-hot encoding; however, decoding now falls into the while loop and cannot get out.

HawkAaron commented 5 years ago

@Marcovaldong You need to change the one_hot call to your embedding layer here.

Marcovaldong commented 5 years ago

@HawkAaron Yes, I know that; I replaced every one-hot with the embedding layer. There was no logic error in the code when I hit this problem.

I am trying to understand the meaning of the while loop. Does it mean that more than one label (word/phone) may be decoded from a single feature vector?

However, I think there is a problem, judging by what I observed. Before replacing the one-hot with an embedding layer, I never ran into an infinite loop. When the program falls into the infinite loop, I print every output, and they are all the same: the current feature vector keeps emitting the same label and never emits a blank, so the while loop cannot terminate. I think there should be an if-else to prevent this case.

My English is not great, so to make sure this is clear: I replaced one-hot with an embedding. The code itself should be fine; I replaced one-hot everywhere and the code runs. What I don't quite understand is the purpose of this while loop. Does it mean that several labels may be decoded from a single input feature vector? Before the replacement the code never fell into an infinite loop; after replacing one-hot, it does. I printed the labels emitted inside the while loop: during an infinite loop the model decodes the same label every time, which suggests the model can only decode one label from the current frame, but it may never output a blank, so it gets stuck. Looking closely at the output, some samples do not hit an infinite loop, yet the while loop still emits several results, and they are all the same label. Should there be a check here that breaks out of the while loop when two consecutive outputs are the same? I plan to try that now (see the sketch at the end of this comment).

I am not sure whether my understanding is correct, or whether replacing one-hot with an embedding requires some additional changes. Any pointers would be appreciated.
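
Or, instead of breaking on a repeated label, maybe cap the number of non-blank emissions per frame? A rough sketch (max_symbols_per_step is just a name I made up, and the other variables are the ones from greedy_decode; this is not code from the repo):

    # Sketch only: cap the number of non-blank labels emitted per acoustic
    # frame so greedy decoding always terminates, even if the model never
    # emits a blank. `max_symbols_per_step` is a hypothetical parameter.
    max_symbols_per_step = 30
    for xi in h:
        emitted = 0
        while emitted < max_symbols_per_step:
            ytu = mx.nd.log_softmax(self.joint(xi, y[0][0]))
            yi = mx.nd.argmax(ytu, axis=0)
            pred = int(yi.asscalar())
            logp += float(ytu[yi].asscalar())
            if pred == self.blank:
                break  # advance to the next acoustic frame
            y_seq.append(pred)
            emitted += 1
            y = mx.nd.one_hot(yi.reshape((1, 1)) - 1, self.vocab_size - 1).as_in_context(ctx)
            y, hid = self.decoder(y, hid)  # feed the predicted label back in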

Marcovaldong commented 5 years ago

@HawkAaron I modified it according to my understanding; it looks roughly like this:

        for xi in h:
            flag = None
            while True:
                ytu = self.joint(xi, y[0][0])
                ytu = mx.nd.log_softmax(ytu)
                yi = mx.nd.argmax(ytu, axis=0)  # for Graves2012 transducer
                pred = int(yi.asscalar())
                logp += float(ytu[yi].asscalar())
                if pred == self.blank or pred == flag:
                    break  # stop on blank or on a repeated label
                flag = pred
                y_seq.append(pred)
                y = mx.nd.one_hot(yi.reshape((1, 1))-1, self.vocab_size-1).as_in_context(ctx)
                y, hid = self.decoder(y, hid)  # feed the predicted label back in

After this change the infinite loop no longer occurs.

There is also another problem. I am training on a 3000-hour Mandarin dataset. During training the embedding version's loss drops faster than the one-hot version's, but the CER after decoding does not reflect this: after one epoch, one-hot gives a CER of 18.28% on the test set, while after switching to the embedding layer (embedding into a 250-dimensional vector) the CER is above 52%. Details:

method         one-hot CER (%)   embedding CER (%)
while          100               52.10
while          18.28             infinite loop
while + flag   18.27             55.34

Is this result because the embedding dimension is too low? Intuitively, switching to an embedding should help. The paper Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer uses an embedding, and after all, the prediction network is supposed to act as a language model.

HawkAaron commented 5 years ago

@Marcovaldong Note that I used a zero vector to represent blank here, and the embedding vocab size excludes blank. If you instead want the blank vector to come from the embedding layer, then the initial state should be changed as well.
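
Roughly, the convention is the following (toy sizes, just to illustrate; not code from the repo):

    import mxnet as mx
    from mxnet.gluon import nn

    vocab_size, num_hidden = 10, 4
    ys = mx.nd.array([[3, 1, 2]])      # non-blank labels are 1 .. vocab_size-1

    # the embedding excludes blank, so it has vocab_size-1 rows and
    # every label index shifts down by one
    emb = nn.Embedding(input_dim=vocab_size - 1, output_dim=num_hidden)
    emb.initialize()
    g_in = emb(ys - 1)                 # label l -> row l-1; blank has no row
    sos = mx.nd.zeros((ys.shape[0], 1, num_hidden))
    g_in = mx.nd.concat(sos, g_in, dim=1)  # zero vector stands for blank / <sos>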

Marcovaldong commented 5 years ago

@HawkAaron Thanks for your reply. The embedding layer's vocab size includes blank (this may be a mistake, though I don't think it affects the result). However, I still use a zero vector to represent blank here, and when decoding I still use a zero vector to represent the start of the sentence. The details are as follows.

class Transducer(gluon.Block):
    ''' When joint training, remove RNNModel decoder layer '''
    def __init__(self, vocab_size, num_hidden, num_layers, dropout=0, blank=0, bidirectional=False):
        super(Transducer, self).__init__()
        self.num_hidden = num_hidden
        self.num_layers = num_layers
        self.vocab_size = vocab_size
        self.loss = RNNTLoss(blank_label=blank)
        self.blank = blank
        with self.name_scope():
            # acoustic model NOTE only initialize encoder.rnn, we can reuse encoder.decoder
            self.encoder = RNNModel(num_hidden, num_hidden, num_layers, dropout, bidirectional)
            # prediction model
            self.decoder = rnn.LSTM(num_hidden, 1, 'NTC', dropout=dropout)
            # joint 
            self.fc1 = nn.Dense(num_hidden, flatten=False, in_units=2*num_hidden)
            self.fc2 = nn.Dense(vocab_size, flatten=False, in_units=num_hidden)

    def joint(self, f, g):
        ''' `f`: encoder lstm output (B,T,U,2H) expanded
        `g`: decoder lstm output (B,T,U,H) expanded
        NOTE f and g must have the same size except the last dim '''
        dim = len(f.shape) - 1
        out = mx.nd.concat(f, g, dim=dim)
        out = mx.nd.tanh(self.fc1(out))
        return self.fc2(out)

    def forward(self, xs, ys, xlen, ylen):
        # forward acoustic model
        f = self.encoder(xs)
        # forward prediction model
        ymat = mx.nd.one_hot(ys-1, self.vocab_size-1) # prediction network input size
        ymat = mx.nd.concat(mx.nd.zeros((ymat.shape[0], 1, ymat.shape[2]), ctx=ymat.context), ymat, dim=1) # concat zero vector
        g = self.decoder(ymat)
        # rnnt loss
        f1 = mx.nd.expand_dims(f, axis=2) # BT1H
        g1 = mx.nd.expand_dims(g, axis=1) # B1UH
        f1 = mx.nd.broadcast_axis(f1, 2, g1.shape[2])
        g1 = mx.nd.broadcast_axis(g1, 1, f1.shape[1])
        ytu = mx.nd.log_softmax(self.joint(f1, g1), axis=3)
        loss = self.loss(ytu, ys, xlen, ylen)
        return loss

    def greedy_decode(self, xs):
        ctx = xs.context
        f = self.encoder(xs)
        h = f[0]
        y = mx.nd.zeros((1, 1, self.vocab_size-1), ctx=ctx) # first zero vector 
        hid = [mx.nd.zeros((1, 1, self.num_hidden), ctx=ctx)] * 2 # support for one sequence
        y, hid = self.decoder(y, hid) # forward first zero
        y_seq = []; logp = 0
        for xi in h:
            flag = None
            while True:
                ytu = self.joint(xi, y[0][0])
                ytu = mx.nd.log_softmax(ytu)
                yi = mx.nd.argmax(ytu, axis=0)  # for Graves2012 transducer
                pred = int(yi.asscalar())
                logp += float(ytu[yi].asscalar())
                if pred == self.blank or pred == flag:
                    break  # stop on blank or on a repeated label
                flag = pred
                y_seq.append(pred)
                y = mx.nd.one_hot(yi.reshape((1, 1))-1, self.vocab_size-1).as_in_context(ctx)
                y, hid = self.decoder(y, hid)  # feed the predicted label back in
        return y_seq, -logp
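
For reference, the broadcasting in forward lines up f and g like this (toy shapes, just a sanity check, not code from the repo):

    import mxnet as mx

    B, T, U, H = 2, 5, 3, 4
    f = mx.nd.zeros((B, T, H))          # encoder output
    g = mx.nd.zeros((B, U + 1, H))      # prediction network output (with <sos>)
    f1 = mx.nd.broadcast_axis(mx.nd.expand_dims(f, axis=2), 2, U + 1)
    g1 = mx.nd.broadcast_axis(mx.nd.expand_dims(g, axis=1), 1, T)
    print(f1.shape, g1.shape)           # both (B, T, U+1, H), ready for joint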

Is there something wrong?

HawkAaron commented 5 years ago

@Marcovaldong Where is your embedding layer?

Marcovaldong commented 5 years ago

@HawkAaron Sorry, I pasted your original code; the embedding version is as follows:

class Transducer2(gluon.Block):
    ''' When joint training, remove RNNModel decoder layer '''

    def __init__(self, vocab_size, num_hidden, num_layers, dropout=0, blank=0, bidirectional=False):
        super(Transducer2, self).__init__()
        self.num_hidden = num_hidden
        self.num_layers = num_layers
        self.vocab_size = vocab_size
        self.loss = RNNTLoss(blank_label=blank)
        self.blank = blank
        with self.name_scope():
            # acoustic model NOTE only initialize encoder.rnn, we can reuse encoder.decoder
            self.encoder = RNNModel(num_hidden, num_hidden, num_layers, dropout, bidirectional)
            # prediction model
            self.embedding = nn.Embedding(input_dim=vocab_size, output_dim=num_hidden,
                                          weight_initializer=mx.init.Uniform(0.3))
            self.decoder = rnn.LSTM(num_hidden, 1, 'NTC', dropout=dropout)
            # joint
            self.fc1 = nn.Dense(num_hidden, flatten=False, in_units=2 * num_hidden)
            self.fc2 = nn.Dense(vocab_size, flatten=False, in_units=num_hidden)

    def joint(self, f, g):
        ''' `f`: encoder lstm output (B,T,U,2H) expanded
        `g`: decoder lstm output (B,T,U,H) expanded
        NOTE f and g must have the same size except the last dim '''
        dim = len(f.shape) - 1
        out = mx.nd.concat(f, g, dim=dim)
        out = mx.nd.tanh(self.fc1(out))
        return self.fc2(out)

    def forward(self, xs, ys, xlen, ylen):
        # forward acoustic model
        f = self.encoder(xs)
        # forward prediction model
        # ymat = mx.nd.one_hot(ys - 1, self.vocab_size - 1)  # pm input size
        # ymat = mx.nd.concat(mx.nd.zeros((ymat.shape[0], 1, ymat.shape[2]), ctx=ymat.context), ymat,
        #                     dim=1)  # concat zero vector
        ymat = self.embedding(ys)
        ymat = mx.nd.concat(mx.nd.zeros((ymat.shape[0], 1, ymat.shape[2]), ctx=ymat.context), ymat,
                            dim=1)  # concat zero vector which means a "zero" beginning
        g = self.decoder(ymat)
        # rnnt loss
        f1 = mx.nd.expand_dims(f, axis=2)  # BT1H
        g1 = mx.nd.expand_dims(g, axis=1)  # B1UH
        f1 = mx.nd.broadcast_axis(f1, 2, g1.shape[2])
        g1 = mx.nd.broadcast_axis(g1, 1, f1.shape[1])
        ytu = mx.nd.log_softmax(self.joint(f1, g1), axis=3)
        loss = self.loss(ytu, ys, xlen, ylen)
        return loss

    def greedy_decode(self, xs):
        ctx = xs.context
        f = self.encoder(xs)
        h = f[0]
        # y = mx.nd.zeros((1, 1, self.vocab_size - 1), ctx=ctx)  # first zero vector
        y = mx.nd.zeros((1, 1, self.num_hidden), ctx=ctx)  # first zero vector, which does not go through the embedding layer
        hid = [mx.nd.zeros((1, 1, self.num_hidden), ctx=ctx)] * 2  # support for one sequence
        y, hid = self.decoder(y, hid)  # forward first zero
        y_seq = []
        logp = 0
        # for xi in h:
        #     flag = None
        #     while True:
        #         ytu = self.joint(xi, y[0][0])
        #         ytu = mx.nd.log_softmax(ytu)
        #         yi = mx.nd.argmax(ytu, axis=0)  # for Graves2012 transducer
        #         pred = int(yi.asscalar())
        #         # print(pred)
        #         logp += float(ytu[yi].asscalar())
        #         if pred == self.blank or pred == flag:
        #             break
        #         flag = pred
        #         y_seq.append(pred)
        #         # y = mx.nd.one_hot(yi.reshape((1, 1)) - 1, self.vocab_size - 1).as_in_context(ctx)
        #         y = self.embedding(yi.reshape((1, 1))-1).as_in_context(ctx)  # use embedding instead of one-hot
        #         # print(y.shape)
        #         y, hid = self.decoder(y, hid)  # forward first zero

        for xi in h:
            # print(h.shape)
            ytu = self.joint(xi, y[0][0])
            ytu = mx.nd.log_softmax(ytu)
            yi = mx.nd.argmax(ytu, axis=0)  # for Graves2012 transducer
            pred = int(yi.asscalar())
            # print(i, pred)
            logp += float(ytu[yi].asscalar())
            if pred != self.blank:
                y_seq.append(pred)
                # y = mx.nd.one_hot(yi.reshape((1, 1)) - 1, self.vocab_size - 1).as_in_context(ctx)
                y = self.embedding(yi.reshape((1, 1))-1).as_in_context(ctx)  # use embedding instead of one-hot
                y, hid = self.decoder(y, hid)  # feed the predicted label back in

        return y_seq, -logp

HawkAaron commented 5 years ago

@Marcovaldong Your embedding vocab size is vocab_size, which includes blank at position 0. So the row for any non-blank label already starts at index 1 and no shift is needed, which means y = self.embedding(yi.reshape((1, 1))-1).as_in_context(ctx) should be y = self.embedding(yi.reshape((1, 1))).as_in_context(ctx)
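
Applied to greedy_decode, the update step becomes:

    # the embedding table includes blank at row 0, so the predicted
    # non-blank label index can be used directly, without the -1 shift
    y = self.embedding(yi.reshape((1, 1))).as_in_context(ctx)
    y, hid = self.decoder(y, hid)  # feed the predicted label back in

This also matches the training path, where forward already calls self.embedding(ys) with unshifted labels.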

Marcovaldong commented 5 years ago

@HawkAaron I really appreciate you reviewing my code. How careless of me: I forgot to change the code in that place. I will compare the performance of one-hot and the embedding layer again. Thank you very much.

HawkAaron commented 5 years ago

@Marcovaldong You are welcome. Please feel free to raise any questions.