Gu-Youngfeng / LearningTensorflow

Some code written during the study process.

How to Pad the Sequential Data with Zeros in an LSTM Model? #3

Open Gu-Youngfeng opened 5 years ago

Gu-Youngfeng commented 5 years ago

To construct an LSTM model, we have to pre-process sequential data of varying lengths by padding it with zeros. For example, we want to change data_1 into data_2,

# data_1 (sequences of varying length; the longest has 3 steps)
[ [[1, 2, 3, 4], [5, 6, 7, 8]],
  [[1, 2, 3, 4], [6, 7, 8, 9], [5, 1, 1, 2]],
  [[1, 2, 3, 4]],
  ...
]

# data_2
[ [[1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 0]],
  [[1, 2, 3, 4], [6, 7, 8, 9], [5, 1, 1, 2]],
  [[1, 2, 3, 4], [0, 0, 0, 0], [0, 0, 0, 0]],
  ...
]
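Before reaching for a library helper, the padding above can be done by hand in plain Python. A minimal sketch (the names `pad_batch`, `max_len`, and `pad_value` are illustrative, not from any library):

```python
def pad_batch(batch, max_len, pad_value=0):
    """Pad each sequence of feature vectors with constant rows up to max_len."""
    feature_size = len(batch[0][0])
    padded = []
    for seq in batch:
        # rows of pad_value appended after the real timesteps ("post" padding)
        pad_rows = [[pad_value] * feature_size] * (max_len - len(seq))
        padded.append(list(seq) + pad_rows)
    return padded

data_1 = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[1, 2, 3, 4], [6, 7, 8, 9], [5, 1, 1, 2]],
    [[1, 2, 3, 4]],
]
data_2 = pad_batch(data_1, max_len=3)
```

This assumes every sequence is no longer than `max_len`; a library helper additionally handles truncation.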
Gu-Youngfeng commented 5 years ago

I found that the function keras.preprocessing.sequence.pad_sequences() solves this problem; try the following code,

# import the keras
from keras.preprocessing.sequence import pad_sequences

# data_1
data_1 = [[[1, 2, 3, 4], [5, 6, 7, 8]],[[1, 2, 3, 4], [6, 7, 8, 9], [5, 1, 1, 2]],[[1, 2, 3, 4]]]
# data_2
data_2 = pad_sequences(data_1, padding='post', maxlen=3)

print(data_1)
print(data_2)

The result looks like this,

Using TensorFlow backend.
[[[1 2 3 4]
  [5 6 7 8]
  [0 0 0 0]]

 [[1 2 3 4]
  [6 7 8 9]
  [5 1 1 2]]

 [[1 2 3 4]
  [0 0 0 0]
  [0 0 0 0]]]
Gu-Youngfeng commented 5 years ago

pad_sequences() seems to solve the problem, but after you pad the sequential data, the LSTM will still run its cells over the padded zero steps and include them in the computation. A safer method is to pass the sequence_length parameter to tf.nn.dynamic_rnn; the partial code is as follows,

seq_length = tf.placeholder(tf.int32)
outputs, states = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, sequence_length=seq_length)

sess = tf.Session()
feed = {
    seq_length: 20,
    #other feeds
}
sess.run(outputs, feed_dict=feed)
Gu-Youngfeng commented 5 years ago

Note that the parameter sequence_length of tf.nn.dynamic_rnn() is not an integer but a vector; that is, TensorFlow wants to know how many steps of each sequence in the batch should be computed. The official explanation is at https://tensorflow.google.cn/api_docs/python/tf/nn/dynamic_rnn.

sequence_length: (optional) An int32/int64 vector sized [batch_size]. Used to copy-through state and zero-out outputs when past a batch element's sequence length. So it's more for performance than correctness.
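In other words, the feed should contain one length per example, not a single scalar. If you still have the unpadded batch around, the length vector can be built in plain Python before padding (a sketch; `raw_batch` is an illustrative name):

```python
# unpadded batch: each inner list is one sequence of feature vectors
raw_batch = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[1, 2, 3, 4], [6, 7, 8, 9], [5, 1, 1, 2]],
    [[1, 2, 3, 4]],
]

# one entry per example, sized [batch_size]
lengths = [len(seq) for seq in raw_batch]
print(lengths)  # -> [2, 3, 1]
```

If only the padded tensor is available, the lengths can instead be recovered inside the graph by counting non-zero timesteps, as done below.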

So the complete code is as follows,

# step-1: padding the train set and define the placeholder x and y
features_train = np.array(features_train) # features
features_train = pad_sequences(features_train, padding='post', maxlen=sequence_size) # padding with 0
labels_train = np.array(labels_train) # labels
# x has the shape (None, sequence_size, feature_size), e.g. (4, 10, 45)
x = tf.placeholder(tf.float32, shape=(None, sequence_size, feature_size), name="features")
# y has the shape of (None, 1)
y = tf.placeholder(tf.float32, shape=(None,1), name="labels")

# step-2: set function to define the parameter sequence_length
# this function counts the non-zero timesteps in each padded sequence, for example,
# if seq = [[[1,2,3], [3,4,5], [0,0,0], [0,0,0]],
#           [[3,4,5], [0,0,0], [0,0,0], [0,0,0]],
#           [[2,2,2], [4,5,6], [7,7,7], [8,9,0]]]
# then length = [2, 1, 4]
def cal_length(seq):
    used = tf.sign(tf.reduce_max(tf.abs(seq), 2))
    length = tf.reduce_sum(used, 1)
    length = tf.cast(length, tf.int32)
    return length
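The same computation can be sanity-checked outside the graph with a NumPy re-implementation (my own check, not part of the model; note the trick assumes a real timestep is never an all-zero vector):

```python
import numpy as np

def cal_length_np(seq):
    # a timestep counts as "used" if any of its features is non-zero
    used = np.sign(np.max(np.abs(seq), axis=2))
    # summing the 0/1 flags per sequence gives its true length
    return used.sum(axis=1).astype(np.int32)

padded = np.array([
    [[1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 0]],
    [[1, 2, 3, 4], [6, 7, 8, 9], [5, 1, 1, 2]],
    [[1, 2, 3, 4], [0, 0, 0, 0], [0, 0, 0, 0]],
])
print(cal_length_np(padded))  # -> [2 3 1]
```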

# step-3: build the LSTM model
lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
outputs, state = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=x, sequence_length=cal_length(x), dtype=tf.float32)