GuangmingZhu / AttentionConvLSTM

"Attention in Convolutional LSTM for Gesture Recognition" in NIPS 2018
http://papers.nips.cc/paper/7465-attention-in-convolutional-lstm-for-gesture-recognition
MIT License

function error #12

Open JwDong2019 opened 5 years ago

JwDong2019 commented 5 years ago

Thanks for your code. Was keras.layers.ConvLSTM2D changed into keras.layers.GatedConvLSTM2D? I can't find a keras.layers.GatedConvLSTM2D function in Keras. Can you provide a reference? Thank you. Another question: when I add an LSTM to the I3D code, the loss is NaN. Do you know why?

GuangmingZhu commented 5 years ago

@JianweiDong (1) I modified keras.layers.GatedConvLSTM2D from keras.layers.ConvLSTM2D myself; it is released only in this repo. (2) I have no idea why the loss is NaN. My previous research shows that a deeper 3DCNN component in our "Res3D+ConvLSTM+MobileNet" architecture does not always result in better performance.

JwDong2019 commented 5 years ago

@GuangmingZhu Thank you, teacher. I see the modification in your code. Now I know why the loss was NaN: it is probably because I forgot to set 'kernel_initializer' and 'recurrent_initializer'. My architecture works now. I noticed that the input of keras.layers.ConvLSTM2D is described as "if data_format='channels_last', 5D tensor with shape: (samples, time, rows, cols, channels)", and I don't know the meaning of 'samples' and 'time'. My input to keras.layers.ConvLSTM2D is a 5D tensor with shape (batchsize, num_frames, rows, cols, channels), where num_frames is the number of frames selected from a video. Is that the right input?

GuangmingZhu commented 5 years ago

@JianweiDong I think 'samples' and 'time' do mean 'batchsize' and 'num_frames'. However, I am not sure whether num_frames is the count of frames selected from a video in your case, since temporal pooling is used in my 3DCNN component prior to ConvLSTM2D.

JwDong2019 commented 5 years ago

@GuangmingZhu Yes, my structure is 3DCNN + ConvLSTM2D, with the 3DCNN component placed before ConvLSTM2D. In my case, num_frames is the count of frames selected from a video. I see that the input of your 3DCNN is (batch_size, seq_len, 112, 112, 3) and the output of the 3DCNN is (samples, new_conv_dim1, new_conv_dim2, new_conv_dim3, filters). So I'm confused about the input of ConvLSTM2D and don't know the meaning of 'samples' and 'time' (I can't find a clear translation of them). How do you understand them? I am referring to your code to learn how to connect a 3DCNN and an LSTM for my undergraduate project on video recognition. I haven't run your code, so I don't know exactly what the input of your ConvLSTM2D is. Thank you.

GuangmingZhu commented 5 years ago

@JianweiDong It is simple. Assume that the 3DCNN component has one temporal pooling and two spatial poolings, all with a stride of 2. Then, for the (batch_size, seq_len, 112, 112, 3) input, the output of the 3DCNN has the shape (batchsize, seq_len/2, 28, 28, filters). Therefore, the input of ConvLSTM2D also has the shape (batchsize, seq_len/2, 28, 28, filters).
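
The shape arithmetic in this reply can be checked with a short sketch. The helper `pooled_shape` is hypothetical and not part of the repo. It assumes each pooling divides its dimension by the stride (rounding down), and the values `seq_len=32`, `batch_size=16`, and `filters=256` are only example numbers.

```python
def pooled_shape(batch_size, seq_len, height, width, filters,
                 temporal_pools=1, spatial_pools=2, stride=2):
    """Return the 5D output shape after strided poolings (hypothetical helper).

    Assumes every pooling divides its dimension by `stride`, rounding down.
    """
    time = seq_len // (stride ** temporal_pools)   # one temporal pooling: seq_len / 2
    rows = height // (stride ** spatial_pools)     # two spatial poolings: 112 / 4 = 28
    cols = width // (stride ** spatial_pools)
    return (batch_size, time, rows, cols, filters)


# Example values only: a batch of 16 clips, 32 frames each, 112x112 pixels.
shape = pooled_shape(batch_size=16, seq_len=32, height=112, width=112, filters=256)
print(shape)  # (16, 16, 28, 28, 256)
```

The resulting tuple is laid out as (samples, time, rows, cols, channels), which is the order ConvLSTM2D expects with data_format='channels_last'.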

JwDong2019 commented 5 years ago

@GuangmingZhu Oh, I see. I use ConvLSTM2D as you describe. So in (samples, time, rows, cols, channels), 'samples' corresponds to 'batchsize' and 'time' corresponds to the temporal sequence length? Thanks for your warm-hearted help.

GuangmingZhu commented 5 years ago

@JianweiDong YES.
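
As an illustration of the layout confirmed above, the following NumPy sketch stacks several clips into one 5D array. The helper `stack_clips` and the sizes used (2 clips, 8 time steps, 28x28 feature maps, 4 channels) are assumptions made for the example, not values from the repo.

```python
import numpy as np


def stack_clips(clips):
    """Stack equal-length clips into a (samples, time, rows, cols, channels) array.

    Each clip is expected to have shape (time, rows, cols, channels).
    """
    return np.stack(clips, axis=0)


# Two dummy clips, each with 8 time steps of 28x28 feature maps and 4 channels.
clips = [np.zeros((8, 28, 28, 4), dtype=np.float32) for _ in range(2)]
batch = stack_clips(clips)

# Axis 0 is 'samples' (the batch size); axis 1 is 'time' (the sequence length).
samples, time, rows, cols, channels = batch.shape
print(batch.shape)  # (2, 8, 28, 28, 4)
```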

double2b commented 4 years ago

@GuangmingZhu Hi! In the code, the output feature map of GatedConvLSTM2D is (16, ?, 28, 28, 256). The "?" in the second dimension causes an error in the dense layer. Is the "?" appropriate here? (I have used the proper versions of Python and TF and replaced the __init__.py files.) Looking forward to your reply.