Keyword: pruning
Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training
Abstract
Deep neural networks have achieved great success in many data processing applications. However, their high computational complexity and storage costs make deep learning hard to deploy on resource-constrained devices, and the heavy power consumption is not environmentally friendly. In this paper, we focus on low-rank optimization for efficient deep learning. In the spatial domain, deep neural networks are compressed by low-rank approximation of the network parameters, which directly reduces the storage requirement through a smaller number of parameters. In the temporal domain, the network parameters can be trained in a few subspaces, which enables efficient training and fast convergence. Model compression in the spatial domain is summarized into three categories: pre-train, pre-set, and compression-aware methods. We also discuss a series of complementary techniques, such as sparse pruning, quantization, and entropy coding, which can be combined with low-rank approximation in an integrated framework to further reduce computational complexity and storage. Beyond summarizing recent technical advances, we present two findings to motivate future work: first, the effective rank outperforms other sparsity measures for network compression; second, there is a balance to be struck between the spatial and temporal complexity of tensorized neural networks.
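To make the spatial-domain idea concrete, here is a minimal sketch (our illustration, not the paper's code) of compressing a single weight matrix by truncated SVD; the layer shape and the rank of 32 are arbitrary assumptions.

```python
import numpy as np

def low_rank_compress(W, rank):
    """Approximate W (m x n) by two factors U_r (m x rank) and V_r (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))        # a hypothetical layer weight
U_r, V_r = low_rank_compress(W, rank=32)

# Storage drops from 512*256 to (512+256)*32 parameters.
err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
print(f"params: {W.size} -> {U_r.size + V_r.size}, relative error: {err:.3f}")
```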
How Does Attention Work in Vision Transformers? A Visual Analytics Attempt
Authors: Yiran Li, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yan Zheng, Wei Zhang, Kwan-Liu Ma
Abstract
The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
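As a rough illustration of profiling attention strength between patches and their spatial neighbors, the snippet below computes a per-head mean attention distance over a patch grid; the grid size, head count, and random attention maps are stand-ins of ours, not outputs of the paper's tool.

```python
import numpy as np

def mean_attention_distance(attn, grid):
    """attn: (heads, N, N) patch-to-patch attention; grid: patches per side.
    Returns the per-head average spatial distance, weighted by attention."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1).astype(float)
    # pairwise Euclidean distances between patch centers, shape (N, N)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (attn * dist[None]).sum(axis=(1, 2)) / attn.sum(axis=(1, 2))

heads, grid = 12, 14                      # e.g. ViT-B/16 on 224x224 inputs
attn = np.random.rand(heads, grid**2, grid**2)
attn /= attn.sum(-1, keepdims=True)       # each row is a softmax distribution
print(mean_attention_distance(attn, grid))  # small value -> local head
```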
Efficient Execution of SPARQL Queries with OPTIONAL and UNION Expressions
Authors: Lei Zou, Yue Pang, M. Tamer Özsu, Jiaqi Chen
Abstract
The proliferation of RDF datasets has resulted in studies focusing on optimizing SPARQL query processing. Most existing work focuses on basic graph patterns (BGPs) and ignores other vital operators in SPARQL, such as UNION and OPTIONAL. SPARQL queries with these operators, which we abbreviate as SPARQL-UO, pose serious query plan generation challenges. In this paper, we propose techniques for executing SPARQL-UO queries using BGP execution as a building block, based on a novel BGP-based Evaluation (BE)-Tree representation of query plans. On top of this, we propose a series of cost-driven BE-tree transformations to generate more efficient plans by reducing the search space and intermediate result sizes, and a candidate pruning technique that further enhances efficiency at query time. Experiments confirm that our method outperforms the state-of-the-art by orders of magnitude.
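For readers unfamiliar with the operators, the following sketch runs a small SPARQL-UO query (a BGP wrapped in UNION and OPTIONAL) with rdflib; it only illustrates the query semantics, not the paper's BE-tree-based engine, and the example data is invented.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix : <http://example.org/> .
:alice :knows :bob ; :email "alice@example.org" .
:bob   :knows :carol .
""", format="turtle")

# The BGP { ?a :knows ?b } is the building block; UNION and OPTIONAL
# wrap around it, which is what makes plan generation hard.
q = """
PREFIX : <http://example.org/>
SELECT ?a ?b ?email WHERE {
  { ?a :knows ?b } UNION { ?b :knows ?a }
  OPTIONAL { ?a :email ?email }
}
"""
for row in g.query(q):
    print(row.a, row.b, row.email)   # ?email may stay unbound (None)
```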
LINe: Out-of-Distribution Detection by Leveraging Important Neurons
Authors: Yong Hyun Ahn, Gyeong-Moon Park, Seong Tae Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
It is important to quantify the uncertainty of input samples, especially in mission-critical domains such as autonomous driving and healthcare, where failed predictions on out-of-distribution (OOD) data can have severe consequences. The OOD detection problem fundamentally stems from the fact that a model cannot express what it does not know. Post-hoc OOD detection approaches are widely explored because they do not require an additional re-training process, which might degrade the model's performance and increase the training cost. In this study, we analyze the difference in model outputs between in-distribution and OOD data from the perspective of neurons in the deep layers of the model, which represent high-level features. We propose a novel method, Leveraging Important Neurons (LINe), for post-hoc OOD detection. Shapley value-based pruning reduces the effect of noisy outputs by selecting only the neurons that contribute strongly to predicting specific classes and masking the rest. Activation clipping fixes all values above a certain threshold to the same value, allowing LINe to treat all class-specific features equally and consider only the difference in the number of activated features between in-distribution and OOD data. Comprehensive experiments verify the effectiveness of the proposed method, which outperforms state-of-the-art post-hoc OOD detection methods on the CIFAR-10, CIFAR-100, and ImageNet datasets.
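The two post-hoc ingredients are easy to picture in code. The sketch below is our paraphrase under stated assumptions (a precomputed contribution mask standing in for the Shapley-value selection, and an energy-style score); it is not the authors' implementation.

```python
import torch

def activation_clip(feats, threshold):
    """Clip penultimate-layer activations at a threshold so that all
    strongly activated features contribute equally to the score."""
    return feats.clamp(max=threshold)

def ood_score(feats, weights, mask, threshold=1.0):
    """feats: (B, D) penultimate activations; weights: (D, C) classifier;
    mask: (D, C) binary, keeping only high-contribution neurons per class
    (assumed precomputed; LINe derives it from Shapley values)."""
    clipped = activation_clip(feats, threshold)
    logits = clipped @ (weights * mask)   # prune low-contribution neurons
    # energy-style score: higher -> more likely in-distribution (our assumption)
    return torch.logsumexp(logits, dim=-1)
```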
PowerPruning: Selecting Weights and Activations for Power-Efficient Neural Network Acceleration
Authors: Richard Petri, Grace Li Zhang, Yiran Chen, Ulf Schlichtmann, Bing Li
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Abstract
Deep neural networks (DNNs) have been successfully applied in various fields. A major challenge in deploying DNNs, especially on edge devices, is power consumption, due to the large number of multiply-and-accumulate (MAC) operations. To address this challenge, we propose PowerPruning, a novel method to reduce power consumption in digital neural network accelerators by selecting weights that lead to less power consumption in MAC operations. In addition, the timing characteristics of the selected weights together with all activation transitions are evaluated, and the weights and activations that lead to small delays are further selected. Consequently, the maximum delay of the sensitized circuit paths in the MAC units is reduced without modifying the MAC units themselves, which allows the supply voltage to be scaled down flexibly to reduce power consumption further. Together with retraining, the proposed method can reduce the power consumption of DNNs on hardware by up to 78.3% with only a slight accuracy loss.
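A toy way to picture the weight-selection step: given a set of weight values measured offline to be power-efficient (the hypothetical values below), project each weight to the nearest allowed value before retraining. The paper's actual selection is driven by measured MAC power and timing, which this sketch does not model.

```python
import torch

def project_to_low_power(weights, allowed_values):
    """Map each weight to the nearest value in the allowed set (values
    whose MAC power/delay, characterized offline, is assumed low)."""
    flat = weights.reshape(-1, 1)                      # (N, 1)
    vals = allowed_values.reshape(1, -1)               # (1, K)
    idx = (flat - vals).abs().argmin(dim=1)
    return allowed_values[idx].reshape(weights.shape)

w = torch.randn(4, 4)
allowed = torch.tensor([-0.5, -0.25, 0.0, 0.25, 0.5])  # hypothetical subset
print(project_to_low_power(w, allowed))
```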
Keyword: neural architecture search
There is no result
Keyword: 3d object detection
MoGDE: Boosting Mobile Monocular 3D Object Detection with Ground Depth Estimation
Abstract
Monocular 3D object detection (Mono3D) in mobile settings (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Due to the near-far disparity phenomenon of monocular vision and the ever-changing camera pose, it is hard to achieve high detection accuracy, especially for far objects. Inspired by the insight that the depth of an object can be well determined from the depth of the ground on which it stands, in this paper we propose a novel Mono3D framework, called MoGDE, which constantly estimates the ground depth of an image and then uses the estimated ground depth to guide Mono3D. To this end, we utilize a pose detection network to estimate the pose of the camera and then construct a feature map portraying pixel-level ground depth according to the 3D-to-2D perspective geometry. Moreover, to improve Mono3D with the estimated ground depth, we design an RGB-D feature fusion network based on the transformer structure, where the long-range self-attention mechanism is utilized to effectively identify ground-contacting points and pin the corresponding ground depth to the image feature map. We conduct extensive experiments on the real-world KITTI dataset. The results demonstrate that MoGDE effectively improves Mono3D accuracy and robustness for both near and far objects. MoGDE outperforms state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.
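The ground-depth prior follows from 3D-to-2D perspective geometry: for a flat ground plane and a pinhole camera at height h, a ground pixel in image row v has depth z = f_y · h / (v − c_y). The snippet below sketches this for a zero-pitch camera with KITTI-like intrinsics; MoGDE additionally estimates the camera pose, which is omitted here.

```python
import numpy as np

def ground_depth_map(fy, cy, cam_height, rows, cols):
    """Per-pixel depth of a flat ground plane under a pinhole camera
    (zero pitch assumed, so the horizon sits at image row cy)."""
    v = np.arange(rows, dtype=float)[:, None]          # pixel rows
    z = np.full((rows, 1), np.inf)                     # at/above horizon
    below_horizon = v > cy
    z[below_horizon] = fy * cam_height / (v[below_horizon] - cy)
    return np.broadcast_to(z, (rows, cols))

# KITTI-like values: focal length, principal point row, camera height (m)
depth = ground_depth_map(fy=721.5, cy=172.9, cam_height=1.65,
                         rows=375, cols=1242)
print(depth[200, 0], depth[374, 0])    # nearer rows -> smaller depth
```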
Keyword: voxel
UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields
Abstract
Generating photorealistic images with controllable camera pose and scene contents is essential for many applications, including AR/VR and simulation. Although rapid progress has been made in 3D-aware generative models, most existing methods focus on object-centric images and are not applicable to generating urban scenes with free camera viewpoint control and scene editing. To address this challenging task, we propose UrbanGIRAFFE, which uses a coarse 3D panoptic prior, including the layout distribution of uncountable stuff and countable objects, to guide a 3D-aware generative model. Our model is compositional and controllable, as it breaks down the scene into stuff, objects, and sky. Using the stuff prior in the form of semantic voxel grids, we build a conditioned stuff generator that effectively incorporates coarse semantic and geometric information. The object layout prior further allows us to learn an object generator from cluttered scenes. With proper loss functions, our approach facilitates photorealistic 3D-aware image synthesis with diverse controllability, including large camera movement, stuff editing, and object manipulation. We validate the effectiveness of our model on both synthetic and real-world datasets, including the challenging KITTI-360 dataset.
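To illustrate what "conditioned on a semantic voxel grid" means, here is a toy generator that concatenates a one-hot voxel grid with a broadcast latent code; all layer sizes and names are ours and purely illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StuffGenerator(nn.Module):
    """Toy conditioned generator: encodes a one-hot semantic voxel grid
    plus a style code into a feature volume (from which an image could
    later be rendered). Hypothetical sizes throughout."""
    def __init__(self, num_classes=8, latent_dim=64, feat_dim=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv3d(num_classes + latent_dim, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, 3, padding=1),
        )

    def forward(self, sem_voxels, z):
        # z: (B, latent_dim) style code, broadcast over the voxel grid
        B, _, D, H, W = sem_voxels.shape
        z_vol = z[:, :, None, None, None].expand(B, -1, D, H, W)
        return self.encode(torch.cat([sem_voxels, z_vol], dim=1))

gen = StuffGenerator()
sem = torch.zeros(1, 8, 16, 16, 16); sem[:, 0] = 1.0   # all "road"
feat = gen(sem, torch.randn(1, 64))                    # (1, 32, 16, 16, 16)
```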
Keyword: lidar
Collaboration Helps Camera Overtake LiDAR in 3D Detection
Abstract
Camera-only 3D detection provides an economical solution with a simple configuration for localizing objects in 3D space, compared to LiDAR-based detection systems. However, a major challenge lies in precise depth estimation due to the lack of direct 3D measurements in the input. Many previous methods attempt to improve depth estimation through network designs, e.g., deformable layers and larger receptive fields. This work proposes an orthogonal direction, improving camera-only 3D detection by introducing multi-agent collaboration. Our proposed collaborative camera-only 3D detection framework (CoCa3D) enables agents to share complementary information with each other through communication; meanwhile, we optimize communication efficiency by selecting the most informative cues. The shared messages from multiple viewpoints disambiguate the single-agent depth estimates and complement the occluded and long-range regions in the single-agent view. We evaluate CoCa3D on one real-world dataset and two new simulation datasets. Results show that CoCa3D improves previous state-of-the-art performance by 44.21% on DAIR-V2X, 30.60% on OPV2V+, and 12.59% on CoPerception-UAVs+ in AP@70. Our preliminary results suggest that, with sufficient collaboration, cameras might overtake LiDAR in some practical scenarios. We have released the dataset and code at https://siheng-chen.github.io/dataset/CoPerception+ and https://github.com/MediaBrain-SJTU/CoCa3D.
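The communication-efficiency idea, reduced to its simplest form: each agent shares only its most useful cues under a bandwidth budget. The ranking criterion below (lowest depth uncertainty) is a stand-in of ours; CoCa3D's actual selection of informative cues is more involved.

```python
import torch

def select_messages(depth_uncertainty, features, budget):
    """Pick the spatial locations the sender is most confident about.
    depth_uncertainty: (N,), features: (N, C), budget: number of cues."""
    idx = torch.argsort(depth_uncertainty)[:budget]   # most confident first
    return idx, features[idx]

unc = torch.rand(1000)                 # hypothetical per-location uncertainty
feats = torch.randn(1000, 64)          # hypothetical per-location features
idx, msg = select_messages(unc, feats, budget=100)   # ~10x less bandwidth
```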
ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data
Authors: Haojie Zhao, Junsong Chen, Lijun Wang, Huchuan Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Compared with traditional RGB-only visual tracking, few datasets have been constructed for RGB-D tracking. In this paper, we propose ARKitTrack, a new RGB-D tracking dataset for both static and dynamic scenes captured by the consumer-grade LiDAR scanners equipped on Apple's iPhone and iPad. ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total. Along with the bounding box annotations and frame-level attributes, we also annotate this dataset with 123.9K pixel-level target masks. In addition, the camera intrinsics and camera pose of each frame are provided for future developments. To demonstrate the potential usefulness of this dataset, we further present a unified baseline for both box-level and pixel-level tracking, which integrates RGB features with bird's-eye-view representations to better explore cross-modality 3D geometry. In-depth empirical analysis verifies that the ARKitTrack dataset can significantly facilitate RGB-D tracking and that the proposed baseline method compares favorably against state-of-the-art methods. The code and dataset are available at https://arkittrack.github.io.
CCL: Continual Contrastive Learning for LiDAR Place Recognition
Authors: Jiafeng Cui, Xieyuanli Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Place recognition is an essential and challenging task in loop closing and global localization for robotics and autonomous driving applications. Benefiting from recent advances in deep learning techniques, the performance of LiDAR place recognition (LPR) has been greatly improved. However, current deep learning-based methods suffer from two major problems: poor generalization ability and catastrophic forgetting. In this paper, we propose a continual contrastive learning method, named CCL, to tackle the catastrophic forgetting problem and generally improve the robustness of LPR approaches. Our CCL constructs a contrastive feature pool and utilizes a contrastive loss to train more transferable representations of places. When transferred into new environments, CCL continuously reviews the contrastive memory bank and applies distribution-based knowledge distillation to maintain retrieval ability on past data while continually learning to recognize new places from new data. We thoroughly evaluate our approach on the Oxford, MulRan, and PNV datasets using three different LPR methods. The experimental results show that CCL consistently improves the performance of different methods in different environments, outperforming the state-of-the-art continual learning method. The implementation of our method has been released at https://github.com/cloudcjf/CCL.
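The two losses at the core of such a pipeline can be sketched as follows: an InfoNCE-style contrastive loss over a feature pool, plus a distribution-based distillation term that keeps the new model's similarity distribution over the memory bank close to the old model's. Shapes and temperatures below are assumptions, not the released configuration.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, tau=0.07):
    """Contrastive loss: query, positive: (B, D); negatives: (K, D)."""
    q = F.normalize(query, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    l_pos = (q * p).sum(-1, keepdim=True)          # (B, 1) matching pair
    l_neg = q @ n.t()                              # (B, K) non-matching pool
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

def distill(student_logits, teacher_logits, T=4.0):
    """Distribution-based distillation over memory-bank similarities."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
```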
StereoScene: BEV-Assisted Stereo Matching Empowers 3D Semantic Scene Completion
Authors: Bohan Li, Yasheng Sun, Xin Jin, Wenjun Zeng, Zheng Zhu, Xiaofeng Wang, Yunpeng Zhang, James Okae, Hang Xiao, Dalong Du
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
3D semantic scene completion (SSC) is an ill-posed task that requires inferring a dense 3D scene from incomplete observations. Previous methods either explicitly incorporate 3D geometric input or rely on a learned 3D prior from monocular RGB images. However, 3D sensors such as LiDAR are expensive and intrusive, while monocular cameras struggle to model precise geometry due to the inherent ambiguity. In this work, we propose StereoScene, which takes full advantage of lightweight camera inputs for SSC without resorting to any external 3D sensors. Our key insight is to leverage stereo matching to resolve geometric ambiguity. To improve robustness in unmatched areas, we introduce a bird's-eye-view (BEV) representation that provides hallucination ability with rich context information. On top of the stereo and BEV representations, a mutual interactive aggregation (MIA) module is carefully devised to fully unleash their power. Specifically, a Bi-directional Interaction Transformer (BIT) augmented with confidence re-weighting is used to encourage reliable prediction through mutual guidance, while a Dual Volume Aggregation (DVA) module is designed to facilitate complementary aggregation. Experimental results on SemanticKITTI demonstrate that the proposed StereoScene outperforms state-of-the-art camera-based methods by a large margin, with a relative improvement of 26.9% in geometry and 38.6% in semantics.
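One direction of a confidence-re-weighted cross-attention step, in miniature: query features attend to context features whose values are scaled by a per-location confidence. This is our simplification of the BIT idea, and the feature shapes are invented.

```python
import torch
import torch.nn.functional as F

def cross_attend(query, context, confidence):
    """query: (Nq, D), context: (Nc, D), confidence: (Nc,) in [0, 1].
    Returns attended features of shape (Nq, D), with context values
    down-weighted wherever confidence is low."""
    attn = F.softmax(query @ context.t() / query.size(-1) ** 0.5, dim=-1)
    return attn @ (context * confidence[:, None])

stereo = torch.randn(100, 32)   # stereo-volume features (hypothetical)
bev = torch.randn(64, 32)       # BEV features (hypothetical)
stereo_updated = cross_attend(stereo, bev, torch.rand(64))
bev_updated = cross_attend(bev, stereo, torch.rand(100))  # other direction
```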