abduallahmohamed / Social-STGCNN

Code for "Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction" CVPR 2020
MIT License

Question about TXP-CNN #29

Closed DarkstartsUp closed 4 years ago

DarkstartsUp commented 4 years ago

I have a question about the choice of a CNN as the time-sequence predictor. The input of TXP-CNN has the shape (time length T x embedding length P x node number N), and the time dimension is treated as the feature channels, so the height and width of the CNN's input map are P and N, respectively. Because a CNN extracts features within its receptive field, I don't understand what the physical meaning of the information in the receptive field is under your setting. Are adjacent nodes related, or are adjacent values in the embedding related? If not, what is the meaning of the convolution? Will different convolution kernel sizes have an impact?

Looking forward to your reply, thank you very much!

abduallahmohamed commented 4 years ago

Hi, thanks for asking!

The main goal of TXP-CNN is to receive (TxPxN) (T = time frame, P = position, N = number of pedestrians) and output a prediction (T*xPxN), where T* is the number of next time steps, of arbitrary size, which is set to 12 in our case. To understand the meaning of the input to TXP-CNN, we need to go back to the previous layer, where the ST-GCNN operates on the graph. The output of ST-GCNN is an embedding of this temporal graph, which you can think of as a dense vector that holds valuable information about the interaction between pedestrians, where each PxN holds a representation of this information that is relative to the specific pedestrian living in PxN. The next step is to expand this "dense" information and predict the next steps. As you may notice in our implementation, TXP-CNN uses a kernel size of (3) and a padding of (1). This configuration is set to keep the PxN dimensions as they are. So the convolution operation in our case tries to extract helpful features along the time dimension T, using neighboring information in the PxN map to predict the next steps. We didn't try different kernel sizes, and I would be happy to see the results. Also, the embedding coming out of ST-GCNN is not permutation invariant. This shows that the pieces of information depend on each other, so the use of convolution is helpful.
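To make the shapes concrete, here is a minimal pure-Python sketch of a convolution that treats the T observed frames as input channels and slides a 3x3 kernel with padding 1 over the PxN map. It is not the repository's PyTorch code: the function name `conv_time_as_channels`, the random weights, and the sizes T = 8, P = 5, N = 3 are assumptions for illustration. Only the kernel size of 3, the padding of 1 and the 12 output steps come from the explanation above.

```python
import random


def conv_time_as_channels(x, weights, pad=1):
    """Convolve a [T_in][P][N] tensor, using the time axis as channels.

    weights has shape [T_out][T_in][k][k]. With k == 2 * pad + 1 the
    output keeps the P x N size and has shape [T_out][P][N].
    """
    t_in, p, n = len(x), len(x[0]), len(x[0][0])
    k = len(weights[0][0])
    out_p = p + 2 * pad - k + 1  # stride 1 output height
    out_n = n + 2 * pad - k + 1  # stride 1 output width
    out = []
    for w_out in weights:  # one output plane per predicted time step
        plane = [[0.0] * out_n for _ in range(out_p)]
        for i in range(out_p):
            for j in range(out_n):
                acc = 0.0
                for c in range(t_in):  # sum over all observed frames
                    for di in range(k):
                        for dj in range(k):
                            pi, nj = i + di - pad, j + dj - pad
                            # positions outside the map act as zero padding
                            if 0 <= pi < p and 0 <= nj < n:
                                acc += w_out[c][di][dj] * x[c][pi][nj]
                plane[i][j] = acc
        out.append(plane)
    return out


# Hypothetical sizes: 8 observed frames, 12 predicted frames, P = 5, N = 3.
rng = random.Random(0)
T_IN, T_OUT, P, N = 8, 12, 5, 3
x = [[[rng.random() for _ in range(N)] for _ in range(P)] for _ in range(T_IN)]
w = [[[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
      for _ in range(T_IN)] for _ in range(T_OUT)]

y = conv_time_as_channels(x, w)
print(len(y), len(y[0]), len(y[0][0]))  # 12 5 3
```

Each output value mixes all T input frames with a 3x3 neighborhood of the PxN map, which is why the ordering of the P and N axes matters for what the kernel sees.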

I hope I answered your question.

DarkstartsUp commented 4 years ago

Thanks a lot for your reply!

I think that if the embedding coming out of ST-GCNN is not permutation invariant, convolution may indeed be useful. However, if we want the convolution to obtain useful information from adjacent pedestrians on the feature map, I think the arrangement of the feature map would need a special design, such as placing pedestrians that are spatially close next to each other, which I did not find in the paper. If there is no evidence that the current PxN arrangement guarantees correlation between adjacent pixels, I suggest comparing against time-dimension prediction based on a fully connected layer or a 1x1 convolution, so as to prove the necessity of the convolution operation.
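As a sketch of what the 1x1 baseline would look like: a 1x1 convolution over the time channels is a per-cell linear map from T input steps to T* output steps, so it never mixes neighboring entries of the PxN map. The example below is pure Python with hypothetical names (`pointwise_time_mix`, `permute_pedestrians`) and assumed sizes. It shows two properties of that baseline: reordering the pedestrians simply reorders the output, and the same weights apply to any number of pedestrians.

```python
import random


def pointwise_time_mix(x, w):
    """1x1 convolution over time channels.

    x: [T_in][P][N], w: [T_out][T_in]; returns [T_out][P][N].
    """
    t_in, p, n = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_row[c] * x[c][i][j] for c in range(t_in))
              for j in range(n)]
             for i in range(p)]
            for w_row in w]


def permute_pedestrians(x, order):
    """Reorder the N axis of a [T][P][N] tensor."""
    return [[[row[j] for j in order] for row in plane] for plane in x]


def random_tensor(rng, t, p, n):
    return [[[rng.random() for _ in range(n)] for _ in range(p)]
            for _ in range(t)]


# Hypothetical sizes for illustration.
rng = random.Random(0)
T_IN, T_OUT, P = 8, 12, 5
w = [[rng.uniform(-1, 1) for _ in range(T_IN)] for _ in range(T_OUT)]

# Property 1: permuting pedestrians commutes with the 1x1 convolution.
x4 = random_tensor(rng, T_IN, P, 4)
order = [2, 0, 3, 1]
a = permute_pedestrians(pointwise_time_mix(x4, w), order)
b = pointwise_time_mix(permute_pedestrians(x4, order), w)
print(a == b)  # True

# Property 2: the same weights work for a different number of pedestrians.
x7 = random_tensor(rng, T_IN, P, 7)
y7 = pointwise_time_mix(x7, w)
print(len(y7), len(y7[0]), len(y7[0][0]))  # 12 5 7
```

A 3x3 kernel does not have the first property in general, because it combines adjacent columns of the map. That is the reason the arrangement of pedestrians becomes a question only for kernels larger than 1x1.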

Another point I'm curious about is that, after the convolution, each pedestrian has P predicted time series. How does the final result fuse these P outputs? Are the P output sequences used to estimate the parameters of a bi-variate Gaussian distribution?

Looking forward to your reply, thank you very much!

abduallahmohamed commented 4 years ago

Thanks for your reply! Though the embedding after ST-GCNN is not permutation invariant, the original graph input {V,A} is permutation invariant. That is, if you change the order of the pedestrians in the input graph, you will get the same results. The values of the adjacency matrix A are what govern the relationships between pedestrians and emphasize proximity as a concept. We might not be able to compare against a fully connected layer, because the number of graph nodes varies, which is an advantage of the design. The 1x1 CNN could be a good comparison and we will take it into consideration (thanks for hinting at this!). Each pedestrian has a predicted time series over T. For each time step we predict the parameters of a bi-variate Gaussian distribution (2 means, 2 variances and a correlation), then sample from these predicted distributions.
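The sampling step can be sketched as follows: a minimal pure-Python example that draws one position per time step from a bi-variate Gaussian given its five parameters. The function name `sample_bivariate_gaussian` and the parameter values are hypothetical. The sketch takes standard deviations (the square roots of the variances) and assumes the parameters are already valid, that is, positive standard deviations and a correlation strictly between -1 and 1. How the raw network outputs are mapped to valid values is not covered here.

```python
import math
import random


def sample_bivariate_gaussian(mu_x, mu_y, sigma_x, sigma_y, rho, rng):
    """Draw one (x, y) sample from a bi-variate Gaussian.

    Uses two independent standard normals z1, z2 and the Cholesky
    factor of the 2x2 covariance matrix to introduce the correlation.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu_x + sigma_x * z1
    y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


# Hypothetical predicted parameters for one pedestrian over 12 time steps:
# (mu_x, mu_y, sigma_x, sigma_y, rho) per step.
rng = random.Random(0)
params = [(0.1 * t, 0.05 * t, 0.2, 0.1, 0.3) for t in range(12)]
trajectory = [sample_bivariate_gaussian(*p, rng) for p in params]
print(len(trajectory))  # 12
```

Drawing several such trajectories per pedestrian gives a set of plausible futures rather than a single point estimate.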

DarkstartsUp commented 4 years ago

Thanks a lot for your reply!