drprojects / superpoint_transformer

Official PyTorch implementation of Superpoint Transformer introduced in [ICCV'23] "Efficient 3D Semantic Segmentation with Superpoint Transformer" and SuperCluster introduced in [3DV'24 Oral] "Scalable 3D Panoptic Segmentation As Superpoint Graph Clustering"

about transformer block #82

Closed: jing-zhao9 closed this issue 5 months ago

jing-zhao9 commented 5 months ago

Hi drprojects! Thank you for your excellent work! I am using your project and have two questions:

1. When processing node features, neighborhood features are not considered; the dimensionality is simply increased from (N, 11) to (N, 128). Most point cloud feature extraction starts by finding the K-nearest neighbors and then performs feature aggregation to learn contextual features, e.g., (N, K, 11) -> (N, K, 128) -> (maxpool) -> (N, 128).
2. When using a self-attention mechanism to learn superpoint contextual information, a 2D-image-like approach (as in Swin Transformer) is used. Attention mechanisms for 3D point clouds are now well developed, so why not consider using one of them? For example: Point Cloud Transformer or Point Transformer.

drprojects commented 5 months ago

Hi

1. When processing node features, neighborhood features are not considered; the dimensionality is simply increased from (N, 11) to (N, 128). Most point cloud feature extraction starts by finding the K-nearest neighbors and then performs feature aggregation to learn contextual features, e.g., (N, K, 11) -> (N, K, 128) -> (maxpool) -> (N, 128).

Indeed, we use a PointNet-like structure, which encodes the superpoints in a lightweight fashion. The focus of our Superpoint Transformer paper was efficiency, so minimizing pointwise operations was a must. Other design choices could certainly be explored for better expressivity (we are actually considering some at the moment, but this is not publicly released yet).
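
For intuition, here is a hypothetical PyTorch sketch contrasting the two designs: a PointNet-like pointwise encoder versus the KNN-aggregation encoder the question describes. The module names, layer widths, and k value are assumptions for illustration, not the actual superpoint_transformer code.

```python
import torch
import torch.nn as nn


class PointwiseMLP(nn.Module):
    """PointNet-like encoding: each point is lifted independently,
    (N, 11) -> (N, 128), with no neighborhood gathering (hypothetical sketch)."""
    def __init__(self, d_in=11, d_out=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

    def forward(self, x):       # x: (N, d_in)
        return self.mlp(x)      # (N, d_out)


class KNNEncoder(nn.Module):
    """KNN-style encoding described in the question:
    (N, 11) -> (N, K, 11) -> (N, K, 128) -> maxpool -> (N, 128)."""
    def __init__(self, d_in=11, d_out=128, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

    def forward(self, x, pos):  # x: (N, d_in), pos: (N, 3)
        # Brute-force KNN for illustration: O(N^2) distances, then gather.
        dist = torch.cdist(pos, pos)                     # (N, N)
        idx = dist.topk(self.k, largest=False).indices   # (N, K)
        neigh = x[idx]                                   # (N, K, d_in)
        return self.mlp(neigh).max(dim=1).values         # (N, d_out)
```

The pointwise variant skips both the neighbor search and the (N, K, ·) intermediate tensor, which is where the savings in compute and memory over KNN-style aggregation come from.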

2. When using a self-attention mechanism to learn superpoint contextual information, a 2D-image-like approach (as in Swin Transformer) is used. Attention mechanisms for 3D point clouds are now well developed, so why not consider using one of them? For example: Point Cloud Transformer or Point Transformer.

Our self-attention mechanism should not be compared to Swin Transformer (nor to Point Transformer): we reason on a data structure that specifically follows the geometric/radiometric complexity of the scene. Swin Transformer, like most other transformer-based architectures, builds its nodes and edges from regular XY pixel grids (or XYZ voxel grids in 3D). Similarly, Stratified Transformer applies the same shifted-window strategy to 3D grids. These approaches are compute- and memory-intensive and do not scale to large scenes. The very essence of Superpoint Transformer, and what explains both its high performance and its efficiency, is that we instead reason on a hierarchical data structure informed by the scene's geometric and radiometric complexity.
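
To make the distinction concrete, here is a minimal hypothetical sketch of self-attention restricted to the edges of a superpoint graph. The class name, the `edge_index` convention, and all shapes are assumptions for illustration; this is not the repository's actual attention module.

```python
import torch
import torch.nn as nn


class SuperpointGraphAttention(nn.Module):
    """Attention over graph edges: each superpoint attends only to its
    neighbors in the superpoint graph, so the cost per layer is O(E) rather
    than O(S^2) dense (or windowed-grid) attention (hypothetical sketch)."""
    def __init__(self, d=128):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, x, edge_index):
        # x: (S, d) superpoint features; edge_index: (2, E), edges src -> dst
        src, dst = edge_index
        q, k, v = self.q(x), self.k(x), self.v(x)
        # One attention logit per graph edge.
        logits = (q[dst] * k[src]).sum(-1) * self.scale          # (E,)
        w = (logits - logits.max()).exp()                        # stable exp
        # Softmax normalization per destination superpoint.
        denom = torch.zeros(x.size(0), device=x.device).scatter_add_(0, dst, w)
        alpha = w / denom[dst].clamp_min(1e-12)                  # (E,)
        # Weighted aggregation of neighbor values into each superpoint.
        out = torch.zeros_like(v).index_add_(0, dst, alpha.unsqueeze(-1) * v[src])
        return out                                               # (S, d)
```

Since the graph follows the scene's partition into superpoints, the number of attention pairs grows with the scene's actual complexity rather than with the resolution of a fixed pixel or voxel grid.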