basilevh / occlusions-4d

Revealing Occlusions with 4D Neural Fields (CVPR 2022 Oral) - Official Implementation
https://occlusions.cs.columbia.edu/
MIT License

Question about the attention feature aggregation equation 6 #2

Closed · xiexh20 closed this issue 2 years ago

xiexh20 commented 2 years ago

Dear authors,

Thanks a lot for releasing the dataset and code, amazing work! I have a specific question regarding the feature aggregation for a given query point. In Equation 6 of the main paper, the feature vector is refined by its nearest neighbors. When I checked the implementation, I found that the nearest neighbors are defined in 3D space only, with no temporal information used (code here). I am wondering: wouldn't this lead to aggregating features from different, irrelevant objects? For example, the space around point $p$ at time $t$ is occupied by object $i$, but at some other moment $t_q$ the same space might be occupied by another object $j$. When querying a prediction for a point near $p$, we would then obtain features from both objects $i$ and $j$. Is this the desired behavior?
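
To make the concern concrete, here is a toy sketch of what I mean (illustrative only, not the actual code from the repo; the tensors and values are made up):

```python
# Toy illustration of the concern: if nearest neighbors are selected by
# 3D Euclidean distance only, key points from any time step can be
# returned for a given query point.
import torch

# Hypothetical spatiotemporal key point cloud: rows are (x, y, z, t).
# Object i occupies the region near the origin at t = 0;
# object j occupies the same region at t = 5.
pcl_key = torch.tensor([
    [0.0, 0.0, 0.0, 0.0],   # object i at t = 0
    [0.1, 0.0, 0.0, 0.0],   # object i at t = 0
    [0.0, 0.1, 0.0, 5.0],   # object j at t = 5
    [5.0, 5.0, 5.0, 5.0],   # far-away point
])

query_xyz = torch.tensor([0.0, 0.0, 0.05])  # query near p, at some time t_q

# kNN over the spatial coordinates only (columns 0..2), ignoring t.
dists = torch.norm(pcl_key[:, :3] - query_xyz, dim=1)
knn_idx = torch.topk(dists, k=3, largest=False).indices
print(pcl_key[knn_idx])  # includes points from both t = 0 and t = 5
```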

In the screenshot below, the green ball seems to be fully coupled with the red cylinder, which might be why the prediction is poor at this moment:

[screenshot]

Or do you have other comments regarding this? Thanks a lot!

Best, Xianghui

basilevh commented 2 years ago

Hi Xianghui,

Thanks for your comments and the great question!

Essentially, the best answer is that, despite how the referenced attention equation works, all point clouds are really 4D rather than 3D: they contain not merely (x, y, z) coordinates but also time indices, i.e. (x, y, z, t). Because it is already hard enough to visualize 3D representations, let alone four dimensions, we typically visualize only one frame (t = constant) at a time. However, a single array or tensor can still contain data from multiple points in time. In particular, pcl_key is the encoded version of a spatiotemporal input point cloud, i.e. a collection of point cloud frames that have been stacked together after attaching the appropriate value of the time t to every point. This featurized point cloud therefore contains contextual information along the three spatial dimensions as well as the temporal axis. Even pcl_query contains 4D coordinates in principle, but in practice we pick one fixed frame index at a time when querying the implicit representation.
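
For intuition, here is a minimal sketch of that stacking step (the helper name stack_frames is illustrative, not the repository's actual preprocessing):

```python
# A minimal sketch of how per-frame 3D point clouds can be stacked into
# one 4D spatiotemporal point cloud by attaching the frame's time index
# to every point.
import torch

def stack_frames(frames):  # hypothetical helper
    """frames: list of (N_t, 3) tensors, one per video frame."""
    stacked = []
    for t, pts in enumerate(frames):
        time_col = torch.full((pts.shape[0], 1), float(t))
        stacked.append(torch.cat([pts, time_col], dim=1))  # (N_t, 4)
    return torch.cat(stacked, dim=0)  # (sum_t N_t, 4): rows are (x, y, z, t)

frames = [torch.randn(100, 3) for _ in range(4)]  # 4 frames of 100 points
pcl_4d = stack_frames(frames)
print(pcl_4d.shape)  # torch.Size([400, 4])
```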

You are correct in pointing out that the feature interpolation (and the relative positional encoding for the attention mechanism) works with respect to spatial difference vectors only, which are 3D, and at first glance this seems to ignore the temporal dimension. However, for the nearest neighbors step (which we need because we do not have infinite memory), this is intentional: I want to ensure that all the available temporal context is attended to, as long as it lies within a certain spatial neighborhood in terms of Euclidean distance. If I were to include the time index in this distance calculation, then temporally far-away points would be included less frequently in the featurization during the downsampling and/or attention steps. Despite all this, the time indices of both the query and key points are always attached to all points and are therefore part of the representation, since they constitute the 4D coordinates of those respective embeddings, even if only 3 out of those 4 coordinates are actually used for the spatial kNN / positional embedding calculations; see the sketch below.
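
As a rough sketch of this design (function and variable names are illustrative, not our actual API): neighbor selection and the relative positional encoding use only (x, y, z), while t stays attached to every point and thus remains visible to the attention features.

```python
# Hedged sketch: spatial-only neighbor gathering over a 4D point cloud.
import torch

def gather_neighbors(pcl_key, query_xyzt, k=16):
    """pcl_key: (N, 4 + C) rows of (x, y, z, t, features...);
    query_xyzt: (4,) query coordinates."""
    # Spatial-only distances: temporally distant points are not penalized,
    # so the full temporal context within the spatial neighborhood is kept.
    dists = torch.norm(pcl_key[:, :3] - query_xyzt[:3], dim=1)
    idx = torch.topk(dists, k=min(k, pcl_key.shape[0]), largest=False).indices
    neighbors = pcl_key[idx]                     # (k, 4 + C), t still included
    rel_pos = neighbors[:, :3] - query_xyzt[:3]  # 3D offsets for pos. encoding
    return neighbors, rel_pos

pcl_key = torch.randn(1024, 4 + 32)          # toy encoded 4D point cloud
query = torch.tensor([0.0, 0.0, 0.0, 2.0])   # (x, y, z, t_q)
neighbors, rel_pos = gather_neighbors(pcl_key, query)
print(neighbors[:, 3])  # neighbor time indices can span multiple frames
```

Because the returned neighbors keep their t coordinate, the attention step can still distinguish (and weight) contributions from different moments in time.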

This is quite a mouthful, so I hope it makes sense. The most important takeaway is that attention works across time, so the query points for the teal sphere have the chance to cross-attend to points corresponding to the same sphere earlier in time (when it was still visible in the input video). This contextualization mechanism exists regardless of whether the nearest neighbor calculation uses spatial positions only or all spatiotemporal coordinates.

I'll be closing this issue, but feel free to continue replying in this thread if you would like any clarification!