New submissions for Fri, 7 Apr 23

Keyword: efficient

Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks

Authors: Michael Weiss, Paolo Tonella
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2304.02654
Pdf link: https://arxiv.org/pdf/2304.02654
Abstract Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of artificial intelligence tasks. Often consisting of hundreds of millions, if not hundreds of billions of parameters, these DNNs are too large to be deployed to, or efficiently run on, resource-constrained devices such as mobile phones or IoT microcontrollers. Systems relying on large-scale DNNs thus have to call the corresponding model over the network, leading to substantial costs for hosting and running the large-scale remote model, costs which are often charged on a per-use basis. In this paper, we propose BiSupervised, a novel architecture, where, before relying on a large remote DNN, a system attempts to make a prediction on a small-scale local model. A DNN supervisor monitors said prediction process and identifies easy inputs for which the local prediction can be trusted. For these inputs, the remote model does not have to be invoked, thus saving costs, while only marginally impacting the overall system accuracy. Our architecture furthermore foresees a second supervisor to monitor the remote predictions and identify inputs for which not even these can be trusted, allowing the system to raise an exception or run a fallback strategy instead. We evaluate the cost savings, and the ability to detect incorrectly predicted inputs, on four diverse case studies: IMDB movie review sentiment classification, Github issue triaging, Imagenet image classification, and SQuADv2 free-text question answering.
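The two-supervisor control flow described in this abstract can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `bisupervised_predict`, the use of a single confidence score as the "supervisor", the threshold values, and the toy models are all assumptions made for the example.

```python
from typing import Any, Callable, Optional, Tuple

# A model returns (label, confidence in [0, 1]); this interface is an assumption.
Model = Callable[[str], Tuple[Any, float]]


def bisupervised_predict(
    x: str,
    local_model: Model,
    remote_model: Model,
    local_threshold: float = 0.9,   # hypothetical threshold for the 1st supervisor
    remote_threshold: float = 0.5,  # hypothetical threshold for the 2nd supervisor
) -> Tuple[Optional[Any], str]:
    """Return (label, source), where source is 'local', 'remote' or 'fallback'."""
    # First supervisor: trust the cheap local prediction on "easy" inputs.
    label, confidence = local_model(x)
    if confidence >= local_threshold:
        return label, "local"

    # Otherwise pay for the remote call.
    label, confidence = remote_model(x)

    # Second supervisor: reject remote predictions that still look untrustworthy.
    if confidence >= remote_threshold:
        return label, "remote"
    return None, "fallback"


# Toy stand-ins for the local and remote models (purely illustrative).
def toy_local(text: str) -> Tuple[str, float]:
    return ("positive", 0.95) if "great" in text else ("negative", 0.55)


def toy_remote(text: str) -> Tuple[str, float]:
    return ("negative", 0.80) if "bad" in text else ("positive", 0.30)


if __name__ == "__main__":
    for review in ["great movie", "bad plot", "hmm"]:
        print(review, "->", bisupervised_predict(review, toy_local, toy_remote))
```

In this sketch the remote model is only invoked when the local confidence falls below the first threshold, which is where the cost saving would come from; the caller decides what "fallback" means (for example, raising an exception).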
nD-PDPA: nDimensional Probability Density Profile Analysis
Authors: Arjang Fahim, Stephanie Irausquin, Homayoun Valafar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2304.02682
Pdf link: https://arxiv.org/pdf/2304.02682
Abstract Despite the recent advances in various Structural Genomics Projects, a large gap remains between the number of sequenced and structurally characterized proteins. Some reasons for this discrepancy include technical difficulties, labor, and the cost related to determining a structure by experimental methods such as NMR spectroscopy. Several computational methods have been developed to expand the applicability of NMR spectroscopy by addressing temporal and economical problems more efficiently. While these methods demonstrate successful outcomes in solving more challenging and structurally novel proteins, the cost has not been reduced significantly. Probability Density Profile Analysis (PDPA) has been previously introduced by our lab to directly address the economics of structure determination of routine proteins and the identification of novel structures from a minimal set of unassigned NMR data. 2D-PDPA (in which 2D denotes incorporation of data from two alignment media) has been successful in identifying the structural homolog of an unknown protein within a library of ~1000 decoy structures. In order to further expand the selectivity and sensitivity of PDPA, the incorporation of additional data was necessary. However, the expansion of the original PDPA approach was limited by its computational requirements, where the inclusion of additional data would render it computationally intractable. Here we present the most recent development of the PDPA method (nD-PDPA: n Dimensional Probability Density Profile Analysis), which eliminates 2D-PDPA's computational limitations and allows inclusion of RDC data from multiple vector types in multiple alignment media.
A Certified Radius-Guided Attack Framework to Image Segmentation Models
Authors: Wenjie Qu, Youqi Li, Binghui Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02693
Pdf link: https://arxiv.org/pdf/2304.02693
Abstract Image segmentation is an important problem in many safety-critical applications. Recent studies show that modern image segmentation models are vulnerable to adversarial perturbations, while existing attack methods mainly follow the idea of attacking image classification models. We argue that image segmentation and classification have inherent differences, and design an attack framework specifically for image segmentation models. Our attack framework is inspired by certified radius, which was originally used by defenders to defend against adversarial perturbations to classification models. We are the first, from the attacker perspective, to leverage the properties of certified radius and propose a certified radius guided attack framework against image segmentation models. Specifically, we first adapt randomized smoothing, the state-of-the-art certification method for classification models, to derive the pixel's certified radius. We then focus more on disrupting pixels with relatively smaller certified radii and design a pixel-wise certified radius guided loss that, when plugged into any existing white-box attack, yields our certified radius-guided white-box attack. Next, we propose the first black-box attack to image segmentation models via bandit. We design a novel gradient estimator, based on bandit feedback, which is query-efficient and provably unbiased and stable. We use this gradient estimator to design a projected bandit gradient descent (PBGD) attack, as well as a certified radius-guided PBGD (CR-PBGD) attack. We prove our PBGD and CR-PBGD attacks can achieve asymptotically optimal attack performance with an optimal rate. We evaluate our certified-radius guided white-box and black-box attacks on multiple modern image segmentation models and datasets. Our results validate the effectiveness of our certified radius-guided attack framework.
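To make "certified radius" concrete, the sketch below uses the commonly cited L2 radius for randomized smoothing of classifiers, R = sigma * Phi^{-1}(p_lower), and ranks pixels so that those with the smallest radius come first. It is only an illustration under stated assumptions: the abstract does not give the paper's per-pixel derivation or loss, and the function names and the per-pixel probabilities are made up for the example.

```python
from statistics import NormalDist
from typing import List


def certified_radius(p_lower: float, sigma: float) -> float:
    """L2 certified radius for a smoothed classifier.

    p_lower: lower bound on the probability of the top class under Gaussian noise.
    sigma:   standard deviation of the smoothing noise.
    Uses the simplified bound R = sigma * Phi^{-1}(p_lower); returns 0 when
    nothing can be certified (p_lower <= 0.5).
    """
    if not 0.0 < p_lower < 1.0:
        raise ValueError("p_lower must lie strictly between 0 and 1")
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def rank_pixels_by_radius(p_lowers: List[float], sigma: float) -> List[int]:
    """Return pixel indices ordered from smallest to largest certified radius."""
    radii = [certified_radius(p, sigma) for p in p_lowers]
    return sorted(range(len(radii)), key=lambda i: radii[i])


if __name__ == "__main__":
    # Hypothetical per-pixel lower bounds on the top-class probability.
    pixel_p = [0.99, 0.60, 0.75]
    print(rank_pixels_by_radius(pixel_p, sigma=0.25))
```

Pixels near the front of the ranking are the ones an attacker would, following the abstract's intuition, weight more heavily.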
Recovering Continuous Scene Dynamics from A Single Blurry Image with Events
Authors: Zhangyi Cheng, Xiang Zhang, Lei Yu, Jianzhuang Liu, Wen Yang, Gui-Song Xia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02695
Pdf link: https://arxiv.org/pdf/2304.02695
Abstract This paper aims at demystifying a single motion-blurred image with events and revealing temporally continuous scene dynamics encrypted behind motion blurs. To achieve this end, an Implicit Video Function (IVF) is learned to represent a single motion blurred image with concurrent events, enabling the latent sharp image restoration of arbitrary timestamps in the range of imaging exposures. Specifically, a dual attention transformer is proposed to efficiently leverage merits from both modalities, i.e., the high temporal resolution of event features and the smoothness of image features, alleviating temporal ambiguities while suppressing the event noise. The proposed network is trained only with the supervision of ground-truth images of limited referenced timestamps. Motion- and texture-guided supervisions are employed simultaneously to enhance restorations of the non-referenced timestamps and improve the overall sharpness. Experiments on synthetic, semi-synthetic, and real-world datasets demonstrate that our proposed method outperforms state-of-the-art methods by a large margin in terms of both objective PSNR and SSIM measurements and subjective evaluations.
Agnostic proper learning of monotone functions: beyond the black-box correction barrier
Authors: Jane Lange, Arsen Vasilyan
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02700
Pdf link: https://arxiv.org/pdf/2304.02700
Abstract We give the first agnostic, efficient, proper learning algorithm for monotone Boolean functions. Given $2^{\tilde{O}(\sqrt{n}/\varepsilon)}$ uniformly random examples of an unknown function $f:\{\pm 1\}^n \rightarrow \{\pm 1\}$, our algorithm outputs a hypothesis $g:\{\pm 1\}^n \rightarrow \{\pm 1\}$ that is monotone and $(\mathrm{opt} + \varepsilon)$-close to $f$, where $\mathrm{opt}$ is the distance from $f$ to the closest monotone function. The running time of the algorithm (and consequently the size and evaluation time of the hypothesis) is also $2^{\tilde{O}(\sqrt{n}/\varepsilon)}$, nearly matching the lower bound of Blais et al. (RANDOM '15). We also give an algorithm for estimating up to additive error $\varepsilon$ the distance of an unknown function $f$ to monotone using a run-time of $2^{\tilde{O}(\sqrt{n}/\varepsilon)}$. Previously, for both of these problems, sample-efficient algorithms were known, but these algorithms were not run-time efficient. Our work thus closes this gap in our knowledge between the run-time and sample complexity. This work builds upon the improper learning algorithm of Bshouty and Tamon (JACM '96) and the proper semiagnostic learning algorithm of Lange, Rubinfeld, and Vasilyan (FOCS '22), which obtains a non-monotone Boolean-valued hypothesis, then "corrects" it to monotone using query-efficient local computation algorithms on graphs. This black-box correction approach can achieve no error better than $2\mathrm{opt} + \varepsilon$ information-theoretically; we bypass this barrier by a) augmenting the improper learner with a convex optimization step, and b) learning and correcting a real-valued function before rounding its values to Boolean. Our real-valued correction algorithm solves the "poset sorting" problem of [LRV22] for functions over general posets with non-Boolean labels.
A Unified Taxonomy for Automated Vehicles: Individual, Cooperative, Collaborative, On-Road, and Off-Road
Authors: Fredrik Warg, Anders Thorsén, Victoria Vu, Carl Bergenhem
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2304.02705
Pdf link: https://arxiv.org/pdf/2304.02705
Abstract Various types of vehicle automation are increasingly used in a variety of environments including road vehicles such as cars or automated shuttles, confined areas such as mines or harbours, or in agriculture and forestry. In many use cases, the benefits are greater if several automated vehicles (AVs) cooperate to help each other reach their goals more efficiently, or collaborate to complete a common task. Taxonomies and definitions create a common framework that helps researchers and practitioners advance the field. However, most existing work focuses on road vehicles. In this paper, we review and extend taxonomies and definitions to encompass individually acting as well as cooperative and collaborative AVs for both on-road and off-road use cases. In particular, we introduce classes of collaborative vehicles not defined in existing literature, and define levels of automation suitable for vehicles where automation applies to functions beyond the driving task.
Efficient OCR for Building a Diverse Digital History
Authors: Jacob Carlson, Tom Bryan, Melissa Dell
Subjects: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2304.02737
Pdf link: https://arxiv.org/pdf/2304.02737
Abstract Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) - which jointly learns a vision and language model - is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters' visual features, it is more sample efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.
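Treating OCR as character-level image retrieval can be pictured as a nearest-neighbour lookup over character embeddings, as in the sketch below. The embeddings here are hand-written toy vectors standing in for the output of a contrastively trained vision encoder; the function names and the tiny index are assumptions for illustration, not the paper's code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(crop_embedding: Sequence[float], index: Dict[str, List[float]]) -> str:
    """Return the character whose reference embedding is closest to the crop."""
    return max(index, key=lambda ch: cosine(crop_embedding, index[ch]))


if __name__ == "__main__":
    # Toy reference index: one embedding per known character (assumed values).
    index = {"a": [1.0, 0.0, 0.1], "b": [0.0, 1.0, 0.1]}
    print(recognize([0.9, 0.2, 0.0], index))

    # Extending to a new character only needs a new reference embedding,
    # with no retraining of a language model.
    index["ß"] = [0.0, 0.0, 1.0]
    print(recognize([0.1, 0.0, 0.9], index))
```

The second half of the usage example hints at why such a design could be easier to extend to low-resource scripts: supporting a new glyph amounts to adding an entry to the index.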
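Sejarah dan Perkembangan Teknik Natural Language Processing (NLP) Bahasa Indonesia: Tinjauan tentang sejarah, perkembangan teknologi, dan aplikasi NLP dalam bahasa Indonesia [English: History and Development of Indonesian-Language Natural Language Processing (NLP) Techniques: A Review of the History, Technological Development, and Applications of NLP in the Indonesian Language]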
Authors: Mukhlis Amien
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2304.02746
Pdf link: https://arxiv.org/pdf/2304.02746
Abstract This study provides an overview of the history of the development of Natural Language Processing (NLP) in the context of the Indonesian language, with a focus on the basic technologies, methods, and practical applications that have been developed. This review covers developments in basic NLP technologies such as stemming, part-of-speech tagging, and related methods; practical applications in cross-language information retrieval systems, information extraction, and sentiment analysis; and methods and techniques used in Indonesian language NLP research, such as machine learning, statistics-based machine translation, and conflict-based approaches. This study also explores the application of NLP in Indonesian language industry and research and identifies challenges and opportunities in Indonesian language NLP research and development. Recommendations for future Indonesian language NLP research and development include developing more efficient methods and technologies, expanding NLP applications, increasing sustainability, further research into the potential of NLP, and promoting interdisciplinary collaboration. It is hoped that this review will help researchers, practitioners, and the government to understand the development of Indonesian language NLP and identify opportunities for further research and development.
Robust, privacy-preserving, transparent, and auditable on-device blocklisting
Authors: Kurt Thomas, Sarah Meiklejohn, Michael A. Specter, Xiang Wang, Xavier Llorà, Stephan Somogyi, David Kleidermacher
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2304.02810
Pdf link: https://arxiv.org/pdf/2304.02810
Abstract With the accelerated adoption of end-to-end encryption, there is an opportunity to re-architect security and anti-abuse primitives in a manner that preserves new privacy expectations. In this paper, we consider two novel protocols for on-device blocklisting that allow a client to determine whether an object (e.g., URL, document, image, etc.) is harmful based on threat information possessed by a so-called remote enforcer in a way that is both privacy-preserving and trustworthy. Our protocols leverage a unique combination of private set intersection to promote privacy, cryptographic hashes to ensure resilience to false positives, cryptographic signatures to improve transparency, and Merkle inclusion proofs to ensure consistency and auditability. We benchmark our protocols -- one that is time-efficient, and the other space-efficient -- to demonstrate their practical use for applications such as email, messaging, and storage. We also highlight remaining challenges, such as privacy and censorship tensions that exist with logging or reporting. We consider our work to be a critical first step towards enabling complex, multi-stakeholder discussions on how best to provide on-device protections.
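As background on one of the building blocks named in this abstract, the sketch below shows how a Merkle inclusion proof lets a client check that an entry belongs to a set committed to by a single root hash. The construction (SHA-256, 0x00/0x01 domain prefixes, duplicating the last node on odd levels) and all function names are assumptions chosen for brevity; they are not taken from the paper's protocols, and duplicating the last node is a simplification that production designs usually avoid.

```python
import hashlib
from typing import List, Tuple

Proof = List[Tuple[bytes, bool]]  # (sibling hash, True if sibling is on the right)


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(item: bytes) -> bytes:
    return _sha256(b"\x00" + item)  # prefix separates leaves from inner nodes


def node_hash(left: bytes, right: bytes) -> bytes:
    return _sha256(b"\x01" + left + right)


def build_levels(items: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, leaves first; the last level holds only the root."""
    level = [leaf_hash(item) for item in items]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by repeating the last node
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels: List[List[bytes]], index: int) -> Proof:
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof: Proof = []
    for level in levels[:-1]:
        sibling_index = index ^ 1
        # A missing sibling means this node was paired with its own copy.
        sibling = level[sibling_index] if sibling_index < len(level) else level[index]
        proof.append((sibling, index % 2 == 0))
        index //= 2
    return proof


def verify(item: bytes, proof: Proof, root: bytes) -> bool:
    """Recompute the root from the item and its proof, then compare."""
    current = leaf_hash(item)
    for sibling, sibling_on_right in proof:
        current = node_hash(current, sibling) if sibling_on_right else node_hash(sibling, current)
    return current == root


if __name__ == "__main__":
    blocklist = [b"bad.example/a", b"bad.example/b", b"bad.example/c"]
    levels = build_levels(blocklist)
    root = levels[-1][0]
    proof = inclusion_proof(levels, 2)
    print(verify(blocklist[2], proof, root))       # entry is in the committed set
    print(verify(b"benign.example", proof, root))  # a different item fails
```

The proof length grows with the logarithm of the list size, which is why such proofs suit clients that cannot store the whole list.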
GIF: A General Graph Unlearning Strategy via Influence Function
Authors: Jiancan Wu, Yi Yang, Yuchun Qian, Yongduo Sui, Xiang Wang, Xiangnan He
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2304.02835
Pdf link: https://arxiv.org/pdf/2304.02835
Abstract With the greater emphasis on privacy and security in our society, the problem of graph unlearning, i.e., revoking the influence of specific data on the trained GNN model, is drawing increasing attention. However, ranging from machine unlearning to recently emerged graph unlearning methods, existing efforts either resort to the retraining paradigm, or perform approximate erasure that fails to consider the inter-dependency between connected neighbors or imposes constraints on GNN structure, and therefore struggle to achieve satisfying performance-complexity trade-offs. In this work, we explore the influence function tailored for graph unlearning, so as to improve the unlearning efficacy and efficiency for graph unlearning. We first present a unified problem formulation of diverse graph unlearning tasks with respect to node, edge, and feature. Then, we recognize the crux of the inability of the traditional influence function for graph unlearning, and devise Graph Influence Function (GIF), a model-agnostic unlearning method that can efficiently and accurately estimate parameter changes in response to an $\epsilon$-mass perturbation in deleted data. The idea is to supplement the objective of the traditional influence function with an additional loss term of the influenced neighbors due to the structural dependency. Further deductions on the closed-form solution of parameter changes provide a better understanding of the unlearning mechanism. We conduct extensive experiments on four representative GNN models and three benchmark datasets to justify the superiority of GIF for diverse graph unlearning tasks in terms of unlearning efficacy, model utility, and unlearning efficiency. Our implementations are available at \url{https://github.com/wujcan/GIF-torch/}.
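For readers unfamiliar with the "traditional influence function" this abstract builds on, its usual textbook form is recalled below. This is the standard formulation for i.i.d. training points, not GIF's graph-specific objective; the symbols $\hat{\theta}$, $H_{\hat{\theta}}$, $\ell$, $z$ and $n$ are notation assumed here rather than taken from the abstract.

```latex
% Upweight a training point z by a small mass \epsilon:
\hat{\theta}_{\epsilon, z}
  = \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta) + \epsilon \, \ell(z, \theta)

% First-order response of the parameters at \epsilon = 0:
\left. \frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon} \right|_{\epsilon = 0}
  = - H_{\hat{\theta}}^{-1} \, \nabla_{\theta} \ell(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} \ell(z_i, \hat{\theta})

% Deleting z corresponds to \epsilon = -1/n, giving the estimate:
\hat{\theta}_{-z} - \hat{\theta}
  \approx \frac{1}{n} \, H_{\hat{\theta}}^{-1} \, \nabla_{\theta} \ell(z, \hat{\theta})
```

According to the abstract, GIF extends this objective with an extra loss term for the neighbors affected by the deletion, since removing a node, edge, or feature in a graph also changes the representations of connected nodes.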
Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets
Authors: Jonas Ngnawe, Marianne ABEMGNIGNI NJIFON, Jonathan Heek, Yann Dauphin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02847
Pdf link: https://arxiv.org/pdf/2304.02847
Abstract Deep networks have achieved impressive results on a range of well-curated benchmark datasets. Surprisingly, their performance remains sensitive to perturbations that have little effect on human performance. In this work, we propose a novel extension of Mixup called Robustmix that regularizes networks to classify based on lower-frequency spatial features. We show that this type of regularization improves robustness on a range of benchmarks such as Imagenet-C and Stylized Imagenet. It adds little computational overhead and, furthermore, does not require a priori knowledge of a large set of image transformations. We find that this approach further complements recent advances in model architecture and data augmentation, attaining a state-of-the-art mCE of 44.8 with an EfficientNet-B8 model and RandAugment, which is a reduction of 16 mCE compared to the baseline.
Towards an Effective and Efficient Transformer for Rain-by-snow Weather Removal
Authors: Tao Gao, Yuanbo Wen, Kaihao Zhang, Peng Cheng, Ting Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02860
Pdf link: https://arxiv.org/pdf/2304.02860
Abstract Rain-by-snow weather removal is a specialized task in weather-degraded image restoration aiming to eliminate coexisting rain streaks and snow particles. In this paper, we propose RSFormer, an efficient and effective Transformer that addresses this challenge. Initially, we explore the proximity of convolution networks (ConvNets) and vision Transformers (ViTs) in hierarchical architectures and experimentally find they perform approximately at intra-stage feature learning. On this basis, we utilize a Transformer-like convolution block (TCB) that replaces the computationally expensive self-attention while preserving attention characteristics for adapting to input content. We also demonstrate that cross-stage progression is critical for performance improvement, and propose a global-local self-attention sampling mechanism (GLASM) that down-/up-samples features while capturing both global and local dependencies. Finally, we synthesize two novel rain-by-snow datasets, RSCityScape and RS100K, to evaluate our proposed RSFormer. Extensive experiments verify that RSFormer achieves the best trade-off between performance and time-consumption compared to other restoration methods. For instance, it outperforms Restormer with a 1.53% reduction in the number of parameters and a 15.6% reduction in inference time. Datasets, source code and pre-trained models are available at \url{https://github.com/chdwyb/RSFormer}.
VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection
Authors: Yuhao Huang, Sanping Zhou, Junjie Zhang, Jinpeng Dong, Nanning Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02867
Pdf link: https://arxiv.org/pdf/2304.02867
Abstract Efficient point cloud representation is a fundamental element of Lidar-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in Bird's Eye View. However, these point cloud encoding paradigms underestimate the point representation in the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects like pedestrians and cyclists. In this paper, we propose an explicit vertical multi-scale representation learning framework, VPFusion, to combine the complementary information from both voxel and pillar streams. Specifically, VPFusion first builds upon a sparse voxel-pillar-based backbone. The backbone divides point clouds into voxels and pillars, then encodes features with 3D and 2D sparse convolution simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway for sparse voxel and pillar features to enable the interaction between them. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the dense feature maps from voxel and pillar branches at multiple scales. Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach
Authors: Zhixuan Xu, Kechun Xu, Yue Wang, Rong Xiong
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02893
Pdf link: https://arxiv.org/pdf/2304.02893
Abstract We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial relational constraints in language instructions. Previous works based on rule-based language parsing or scene-centric visual representation have restrictions on the form of instructions and reference objects or require large amounts of training data. We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample efficient and generalizable. Experiments indicate that our model can achieve a 97.75% success rate of placement with only ~0.26M trainable parameters. Besides, our method generalizes better to both unseen objects and instructions. Moreover, with only 25% training data, we still outperform the top competing approach.
Affect as a proxy for literary mood
Authors: Emily Öhman, Riikka Rossi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2304.02894
Pdf link: https://arxiv.org/pdf/2304.02894
Abstract We propose to use affect as a proxy for mood in literary texts. In this study, we explore the differences in computationally detecting tone versus detecting mood. Methodologically we utilize affective word embeddings to look at the affective distribution in different text segments. We also present a simple yet efficient and effective method of enhancing emotion lexicons to take both semantic shift and the domain of the text into account producing real-world congruent results closely matching both contemporary and modern qualitative analyses.
LSketch: A Label-Enabled Graph Stream Sketch Toward Time-Sensitive Queries
Authors: Yiling Zeng, Chunyao Song, Yuhan Li, Tingjian Ge
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2304.02897
Pdf link: https://arxiv.org/pdf/2304.02897
Abstract Graph streams represent data interactions in real applications. The mining of graph streams plays an important role in network security, social network analysis, and traffic control, among others. However, the sheer volume and high dynamics cause great challenges for efficient storage and subsequent query analysis on them. Current studies apply sketches to summarize graph streams. We propose LSketch that works for heterogeneous graph streams, which effectively preserves the label information carried by the streams in real scenes, thereby enriching the expressive ability of sketches. In addition, as graph streams continue to evolve over time, edges too old may lose their practical significance. Therefore, we introduce the sliding window model into LSketch to eliminate the expired edges automatically. LSketch uses sub-linear storage space and can support structure based queries and time-sensitive queries with high accuracy. We perform extensive experiments over four real datasets, demonstrating the superiority of the proposed method over state-of-the-art methods, in aspects of query accuracy and time efficiency.
InterFormer: Real-time Interactive Image Segmentation
Authors: You Huang, Hao Yang, Ke Sun, Shengchuan Zhang, Guannan Jiang, Rongrong Ji, Liujuan Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2304.02942
Pdf link: https://arxiv.org/pdf/2304.02942
Abstract Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks. However, the existing interactive segmentation pipeline suffers from inefficient computations of interactive models because of the following two issues. First, each annotator click depends on the model's feedback on the previous click. This serial interaction is unable to utilize the model's parallelism capabilities. Second, the model has to repeatedly process the image, the annotator's current click, and the model's feedback on the annotator's former clicks at each step of interaction, resulting in redundant computations. For efficient computation, we propose a method named InterFormer that follows a new pipeline to address these issues. InterFormer extracts and preprocesses the computationally time-consuming part, i.e., image processing, from the existing process. Specifically, InterFormer employs a large vision transformer (ViT) on high-performance devices to preprocess images in parallel, and then uses a lightweight module called interactive multi-head self attention (I-MSA) for interactive segmentation. Furthermore, the I-MSA module's deployment on low-power devices extends the practical application of interactive segmentation. The I-MSA module utilizes the preprocessed features to efficiently respond to the annotator inputs in real time. The experiments on several datasets demonstrate the effectiveness of InterFormer, which outperforms previous interactive segmentation models in terms of computational efficiency and segmentation quality, achieving real-time high-quality interactive segmentation on CPU-only devices.
When approximate design for fast homomorphic computation provides differential privacy guarantees
Authors: Arnaud Grivet Sébert, Martin Zuber, Oana Stan, Renaud Sirdey, Cédric Gouy-Pailler
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02959
Pdf link: https://arxiv.org/pdf/2304.02959
Abstract While machine learning has become pervasive in fields as diverse as industry, healthcare, and social networks, privacy concerns regarding the training data have gained critical importance. In settings where several parties wish to collaboratively train a common model without jeopardizing their sensitive data, the need for a private training protocol is particularly stringent and requires protecting the data against both the model's end-users and the actors of the training phase. Differential privacy (DP) and cryptographic primitives are complementary popular countermeasures against privacy attacks. Among these cryptographic primitives, fully homomorphic encryption (FHE) offers ciphertext malleability at the cost of time-consuming operations in the homomorphic domain. In this paper, we design SHIELD, a probabilistic approximation algorithm for the argmax operator which is both fast when homomorphically executed and whose inaccuracy is used as a feature to ensure DP guarantees. Even if SHIELD could have other applications, we here focus on one setting and seamlessly integrate it in the SPEED collaborative training framework from "SPEED: Secure, PrivatE, and Efficient Deep learning" (Grivet Sébert et al., 2021) to improve its computational efficiency. After thoroughly describing the FHE implementation of our algorithm and its DP analysis, we present experimental results. To the best of our knowledge, it is the first work in which relaxing the accuracy of a homomorphic calculation is constructively usable as a degree of freedom to achieve better FHE performance.
A Fast and Lightweight Network for Low-Light Image Enhancement
Authors: Yu Zhang, Xiaoguang Di, Junde Wu, RAO FU, Yong Li, Yue Wang, Yanwu Xu, Guohui YANG, Chunhui Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2304.02978
Pdf link: https://arxiv.org/pdf/2304.02978
Abstract Low-light images often suffer from severe noise, low brightness, low contrast, and color deviation. While several low-light image enhancement methods have been proposed, there remains a lack of efficient methods that can simultaneously solve all of these problems. In this paper, we introduce FLW-Net, a Fast and LightWeight Network for low-light image enhancement that significantly improves processing speed and overall effect. To achieve efficient low-light image enhancement, we recognize the challenges of the lack of an absolute reference and the need for a large receptive field to obtain global contrast. Therefore, we propose an efficient global feature information extraction component and design loss functions based on relative information to overcome these challenges. Finally, we conduct comparative experiments to demonstrate the effectiveness of the proposed method, and the results confirm that FLW-Net can significantly reduce the complexity of supervised low-light image enhancement networks while improving processing effect. Code is available at https://github.com/hitzhangyu/FLW-Net
IoT Federated Blockchain Learning at the Edge
Authors: James Calo, Benny Lo
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2304.03006
Pdf link: https://arxiv.org/pdf/2304.03006
Abstract IoT devices are sorely underutilized in the medical field, especially within machine learning for medicine, yet they offer unrivaled benefits. IoT devices are low-cost, energy-efficient, small and intelligent devices. In this paper, we propose a distributed federated learning framework for IoT devices, more specifically for IoMT (Internet of Medical Things), using blockchain to allow for a decentralized scheme improving privacy and efficiency over a centralized system; this allows us to move from the cloud-based architectures, that are prevalent, to the edge. The system is designed for three paradigms: 1) Training neural networks on IoT devices to allow for collaborative training of a shared model whilst decoupling the learning from the dataset to ensure privacy. Training is performed in an online manner simultaneously amongst all participants, allowing for the training of actual data that may not have been present in a dataset collected in the traditional way and dynamically adapt the system whilst it is being trained. 2) Training of an IoMT system in a fully private manner such as to mitigate the issue with confidentiality of medical data and to build robust, and potentially bespoke, models where not much, if any, data exists. 3) Distribution of the actual network training, something federated learning itself does not do, to allow hospitals, for example, to utilize their spare computing resources to train network models.
PointCAT: Cross-Attention Transformer for point cloud
Authors: Xincheng Yang, Mingze Jin, Weiji He, Qian Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03012
Pdf link: https://arxiv.org/pdf/2304.03012
Abstract Transformer-based models have significantly advanced natural language processing and computer vision in recent years. However, due to the irregular and disordered structure of point cloud data, transformer-based models for 3D deep learning are still in their infancy compared to other methods. In this paper we present Point Cross-Attention Transformer (PointCAT), a novel end-to-end network architecture using a cross-attention mechanism for point cloud representation. Our approach combines multi-scale features via two separate cross-attention transformer branches. To reduce the computational increase brought by the multi-branch structure, we further introduce an efficient model for shape classification, which only processes a single class token of one branch as a query to calculate the attention map with the other. Extensive experiments demonstrate that our method outperforms or achieves comparable performance to several approaches in shape classification, part segmentation and semantic segmentation tasks.
Tensor Slicing and Optimization for Multicore NPUs
Authors: Rafael Sousa, Marcio Pereira, Yongin Kwon, Taeho Kim, Namsoon Jung, Chang Soo Kim, Michael Frank, Guido Araujo
Subjects: Performance (cs.PF); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03013
Pdf link: https://arxiv.org/pdf/2304.03013
Abstract Although code generation for Convolution Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly-constrained Multicore Neural Processor Units (NPUs) is still a challenging problem. Given the size of convolutions' input/output tensors and the small footprint of NPU on-chip memories, minimizing memory transactions while maximizing parallelism and MAC utilization are central to any effective solution. This paper proposes a TensorFlow XLA/LLVM compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO), which: (a) maximizes convolution parallelism and memory usage across NPU cores; and (b) reduces data transfers between host and NPU on-chip memories by using DRAM memory burst time estimates to guide tensor slicing. To evaluate the proposed approach, a set of experiments was performed using the NeuroMorphic Processor (NMP), a multicore NPU containing 32 RISC-V cores extended with novel CNN instructions. Experimental results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models. Speed-ups of up to 21.7% result when comparing the TSO burst-based technique to a no-burst data slicing approach. To validate the generality of the TSO approach, the algorithm was also ported to the Glow Machine Learning framework. The performance of the models was measured on both Glow and TensorFlow XLA/LLVM compilers, revealing similar results.
A computation of D(9) using FPGA Supercomputing
Authors: Lennart Van Hirtum, Patrick De Causmaecker, Jens Goemaere, Tobias Kenter, Heinrich Riebler, Michael Lass, Christian Plessl
Subjects: Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Arxiv link: https://arxiv.org/abs/2304.03039
Pdf link: https://arxiv.org/pdf/2304.03039
Abstract This preprint makes the claim of having computed the $9^{th}$ Dedekind Number. This was done by building an efficient FPGA Accelerator for the core operation of the process, and parallelizing it on the Noctua 2 Supercluster at Paderborn University. The resulting value is 286386577668298411128469151667598498812366. This value can be verified in two steps. We have made the data file containing the 490M results available, each of which can be verified separately on CPU, and the whole file sums to our proposed value.
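For readers unfamiliar with the quantity, the Dedekind number D(n) counts the monotone Boolean functions on n variables. The brute-force sketch below only illustrates that definition for tiny n. It is not the authors' FPGA method, and `dedekind_bruteforce` is a hypothetical helper name.

```python
def dedekind_bruteforce(n: int) -> int:
    """Count monotone Boolean functions on n variables by full enumeration.

    A function is encoded as an integer truth table: bit x of `table`
    holds f(x), where x is the input assignment read as a bitmask.
    Only practical for n <= 4, since 2**(2**n) tables are enumerated.
    """
    size = 1 << n  # number of input assignments
    count = 0
    for table in range(1 << size):
        monotone = True
        for x in range(size):
            if not (table >> x) & 1:
                continue  # f(x) = 0 imposes no constraint on larger inputs
            # f(x) = 1: turning any single input bit on must keep f at 1.
            for i in range(n):
                y = x | (1 << i)
                if not (table >> y) & 1:
                    monotone = False
                    break
            if not monotone:
                break
        if monotone:
            count += 1
    return count


if __name__ == "__main__":
    print([dedekind_bruteforce(n) for n in range(5)])
```

Because the enumeration visits 2^(2^n) truth tables, this approach already stops being practical at n = 5, which gives some sense of why computing the ninth value calls for specialised hardware and a far better algorithm.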
Data-driven HVAC Control Using Symbolic Regression: Design and Implementation
Authors: Yuki Ozawa, Dafang Zhao, Daichi Watari, Ittetsu Taniguchi, Toshihiro Suzuki, Yoshiyuki Shimoda, Takao Onoye
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2304.03078
Pdf link: https://arxiv.org/pdf/2304.03078
Abstract The large amount of data collected in buildings makes energy management smarter and more energy efficient. This study proposes a design and implementation methodology of data-driven heating, ventilation, and air conditioning (HVAC) control. Building thermodynamics is modeled using a symbolic regression model (SRM) built from the collected data. Additionally, an HVAC system model is also developed with a data-driven approach. A model predictive control (MPC) based HVAC scheduling is formulated with the developed models to minimize energy consumption and peak power demand and maximize thermal comfort. The performance of the proposed framework is demonstrated in a workspace in an actual campus building. The HVAC system using the proposed framework reduces the peak power by 16.1% compared to the widely used thermostat controller.
Offline Uncertainty Sampling in Data-driven Stochastic MPC
Authors: Johannes Teutsch, Sebastian Kerz, Tim Brüdigam, Dirk Wollherr, Marion Leibold
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2304.03088
Pdf link: https://arxiv.org/pdf/2304.03088
Abstract In this work, we exploit an offline-sampling based strategy for the constrained data-driven predictive control of an unknown linear system subject to random measurement noise. The strategy uses only past measured, potentially noisy data in a non-parametric system representation and does not require any prior model identification. The approximation of chance constraints using uncertainty sampling leads to efficient constraint tightening. Under mild assumptions, robust recursive feasibility and closed-loop constraint satisfaction are shown. In a simulation example, we provide evidence for the improved control performance of the proposed control scheme in comparison to a purely robust data-driven predictive control approach.
Inductive Graph Unlearning
Authors: Cheng-Long Wang, Mengdi Huai, Di Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2304.03093
Pdf link: https://arxiv.org/pdf/2304.03093
Abstract As a way to implement the "right to be forgotten" in machine learning, machine unlearning aims to completely remove the contributions and information of the samples to be deleted from a trained model without affecting the contributions of other samples. Recently, many frameworks for machine unlearning have been proposed, and most of them focus on image and text data. To extend machine unlearning to graph data, GraphEraser has been proposed. However, a critical issue is that GraphEraser is specifically designed for the transductive graph setting, where the graph is static and attributes and edges of test nodes are visible during training. It is unsuitable for the inductive setting, where the graph could be dynamic and the test graph information is invisible in advance. Such inductive capability is essential for production machine learning systems with evolving graphs like social media and transaction networks. To fill this gap, we propose the GUIded InDuctivE Graph Unlearning framework (GUIDE). GUIDE consists of three components: guided graph partitioning with fairness and balance, efficient subgraph repair, and similarity-based aggregation. Empirically, we evaluate our method on several inductive benchmarks and evolving transaction graphs. Generally speaking, GUIDE can be implemented efficiently on inductive graph learning tasks thanks to its low graph partition cost, in terms of both computation and structure information. The code will be available here: https://github.com/Happy2Git/GUIDE.
FABRID: Flexible Attestation-Based Routing for Inter-Domain Networks
Authors: Cyrill Krähenbühl (ETH Zürich), Marc Wyss (ETH Zürich), David Basin (ETH Zürich), Vincent Lenders (armasuisse), Adrian Perrig (ETH Zürich), Martin Strohmeier (armasuisse)
Subjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2304.03108
Pdf link: https://arxiv.org/pdf/2304.03108
Abstract In its current state, the Internet does not provide end users with transparency and control regarding on-path forwarding devices. In particular, the lack of network device information reduces the trustworthiness of the forwarding path and prevents end-user applications requiring specific router capabilities from reaching their full potential. Moreover, the inability to influence the traffic's forwarding path results in applications communicating over undesired routes, while alternative paths with more desirable properties remain unusable. In this work, we present FABRID, a system that enables applications to forward traffic flexibly, potentially on multiple paths selected to comply with user-defined preferences, where information about forwarding devices is exposed and transparently attested by autonomous systems (ASes). The granularity of this information is chosen by each AS individually, protecting them from leaking sensitive network details, while the secrecy and authenticity of preferences embedded within the users' packets are protected through efficient cryptographic operations. We show the viability of FABRID by deploying it on a global SCION network test bed, and we demonstrate high throughput on commodity hardware.
Simplifying Content-Based Neural News Recommendation: On User Modeling and Training Objectives
Authors: Andreea Iana, Goran Glavaš, Heiko Paulheim
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2304.03112
Pdf link: https://arxiv.org/pdf/2304.03112
Abstract The advent of personalized news recommendation has given rise to increasingly complex recommender architectures. Most neural news recommenders rely on user click behavior and typically introduce dedicated user encoders that aggregate the content of clicked news into user embeddings (early fusion). These models are predominantly trained with standard point-wise classification objectives. The existing body of work exhibits two main shortcomings: (1) despite general design homogeneity, direct comparisons between models are hindered by varying evaluation datasets and protocols; (2) it leaves alternative model designs and training objectives vastly unexplored. In this work, we present a unified framework for news recommendation, allowing for a systematic and fair comparison of news recommenders across several crucial design dimensions: (i) candidate-awareness in user modeling, (ii) click behavior fusion, and (iii) training objectives. Our findings challenge the status quo in neural news recommendation. We show that replacing sizable user encoders with parameter-efficient dot products between candidate and clicked news embeddings (late fusion) often yields substantial performance gains. Moreover, our results render contrastive training a viable alternative to point-wise classification objectives.
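The late-fusion idea mentioned above, scoring a candidate by dot products with the clicked-news embeddings instead of running a learned user encoder, can be sketched as follows. Aggregating the per-click scores by their mean is an assumption made here for illustration, as are the function names and the plain-list embeddings; the paper may aggregate differently.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def late_fusion_score(candidate: Vector, clicked: Sequence[Vector]) -> float:
    """Score a candidate as the mean dot product with the clicked embeddings.

    Mean aggregation is an assumption of this sketch. No parameters are
    learned at this stage; only the news embeddings themselves are used.
    """
    if not clicked:
        return 0.0
    return sum(dot(candidate, c) for c in clicked) / len(clicked)


def rank_candidates(candidates: Sequence[Vector], clicked: Sequence[Vector]) -> List[int]:
    """Return candidate indices ordered from highest to lowest score."""
    scores = [late_fusion_score(c, clicked) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

With mean aggregation the score equals the dot product between the candidate and the average clicked embedding, so the user side can be precomputed once per user.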
Zero-shot Generative Model Adaptation via Image-specific Prompt Learning
Authors: Jiayi Guo, Chaofei Wang, You Wu, Eric Zhang, Kai Wang, Xingqian Xu, Shiji Song, Humphrey Shi, Gao Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03119
Pdf link: https://arxiv.org/pdf/2304.03119
Abstract Recently, CLIP-guided image synthesis has shown appealing performance on adapting a pre-trained source-domain generator to an unseen target domain. It does not require any target-domain samples but only the textual domain labels. The training is highly efficient, e.g., a few minutes. However, existing methods still have some limitations in the quality of generated images and may suffer from the mode collapse issue. A key reason is that a fixed adaptation direction is applied for all cross-domain image pairs, which leads to identical supervision signals. To address this issue, we propose an Image-specific Prompt Learning (IPL) method, which learns specific prompt vectors for each source-domain image. This produces a more precise adaptation direction for every cross-domain image pair, endowing the target-domain generator with greatly enhanced flexibility. Qualitative and quantitative evaluations on various domains demonstrate that IPL effectively improves the quality and diversity of synthesized images and alleviates the mode collapse. Moreover, IPL is independent of the structure of the generative model, such as generative adversarial networks or diffusion models. Code is available at https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation.
BotTriNet: A Unified and Efficient Embedding for Social Bots Detection via Metric Learning
Authors: Jun Wu, Xuesong Ye, Man Yan Yuet
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2304.03144
Pdf link: https://arxiv.org/pdf/2304.03144
Abstract A persistently popular topic in online social networks is the rapid and accurate discovery of bot accounts to prevent their invasion and harassment of genuine users. We propose a unified embedding framework called BOTTRINET, which utilizes textual content posted by accounts for bot detection based on the assumption that contexts naturally reveal account personalities and habits. Content is abundant and valuable if the system efficiently extracts bot-related information using embedding techniques. Beyond the general embedding framework that generates word, sentence, and account embeddings, we design a triplet network to tune the raw embeddings (produced by traditional natural language processing techniques) for better classification performance. We evaluate detection accuracy and F1 score on the real-world dataset CRESCI2017, comprising three bot account categories and five bot sample sets. Our system achieves the highest average accuracy of 98.34% and F1 score of 97.99% on two content-intensive bot sets, outperforming previous work and becoming state-of-the-art. It also makes a breakthrough on four content-less bot sets, with an average accuracy improvement of 11.52% and an average F1 score increase of 16.70%.
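A triplet network of the kind described above is commonly trained with a margin-based triplet loss. The sketch below shows that standard loss on plain Python lists. The Euclidean distance, the margin of 1.0, and the function names are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not from the paper
) -> float:
    """Standard margin-based triplet loss.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In a bot-detection setting one would presumably pick the anchor and positive from the same class (both bots or both genuine accounts) and the negative from the other class, so that tuning pulls same-class embeddings together.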
Parameterized Approximation Schemes for Clustering with General Norm Objectives
Authors: Fateme Abbasi, Sandip Banerjee, Jarosław Byrka, Parinya Chalermsook, Ameet Gadekar, Kamyar Khodamoradi, Dániel Marx, Roohani Sharma, Joachim Spoerhase
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.03146
Pdf link: https://arxiv.org/pdf/2304.03146
Abstract This paper considers the well-studied algorithmic regime of designing a $(1+\epsilon)$-approximation algorithm for a $k$-clustering problem that runs in time $f(k,\epsilon)\,\mathrm{poly}(n)$ (sometimes called an efficient parameterized approximation scheme or EPAS for short). Notable results of this kind include EPASes in the high-dimensional Euclidean setting for $k$-center [Bădoiu, Har-Peled, Indyk; STOC'02] as well as $k$-median, and $k$-means [Kumar, Sabharwal, Sen; J. ACM 2010]. However, existing EPASes handle only basic objectives (such as $k$-center, $k$-median, and $k$-means) and are tailored to the specific objective and metric space. Our main contribution is a clean and simple EPAS that settles more than ten clustering problems (across multiple well-studied objectives as well as metric spaces) and unifies well-known EPASes. Our algorithm gives EPASes for a large variety of clustering objectives (for example, $k$-means, $k$-center, $k$-median, priority $k$-center, $\ell$-centrum, ordered $k$-median, socially fair $k$-median aka robust $k$-median, or more generally monotone norm $k$-clustering) and metric spaces (for example, continuous high-dimensional Euclidean spaces, metrics of bounded doubling dimension, bounded treewidth metrics, and planar metrics). Key to our approach is a new concept that we call bounded $\epsilon$-scatter dimension, an intrinsic complexity measure of a metric space that is a relaxation of the standard notion of bounded doubling dimension. Our main technical result shows that two conditions are essentially sufficient for our algorithm to yield an EPAS on the input metric $M$ for any clustering objective: (i) The objective is described by a monotone (not necessarily symmetric!) norm, and (ii) the $\epsilon$-scatter dimension of $M$ is upper bounded by a function of $\epsilon$.
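The phrase "monotone norm $k$-clustering" can be made concrete with a small sketch: assign every point to its nearest center, collect the distances into a vector, and apply a norm to that vector. The code below only evaluates such objectives for given centers; it is not the paper's approximation scheme. The function names and the choice of Euclidean points are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]


def nearest_center_distances(points: Sequence[Point], centers: Sequence[Point]) -> List[float]:
    """Distance from each point to its closest center (Euclidean metric assumed)."""
    return [min(math.dist(p, c) for c in centers) for p in points]


def l1_norm(d: Sequence[float]) -> float:
    """Sum of distances: the k-median objective."""
    return sum(d)


def linf_norm(d: Sequence[float]) -> float:
    """Largest distance: the k-center objective."""
    return max(d)


def l2_norm(d: Sequence[float]) -> float:
    """Euclidean norm of the distance vector; its square is the usual k-means cost."""
    return math.sqrt(sum(x * x for x in d))


def clustering_cost(
    points: Sequence[Point],
    centers: Sequence[Point],
    norm: Callable[[Sequence[float]], float],
) -> float:
    """Evaluate a norm-based clustering objective for fixed centers."""
    return norm(nearest_center_distances(points, centers))
```

For example, with points at 0, 3, 10 and 14 on a line and centers at 0 and 10, the distance vector is (0, 3, 0, 4), so the three objectives evaluate to 7, 4 and 5 respectively.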
Spectral Toolkit of Algorithms for Graphs: Technical Report (1)
Authors: Peter Macgregor, He Sun
Subjects: Social and Information Networks (cs.SI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Arxiv link: https://arxiv.org/abs/2304.03170
Pdf link: https://arxiv.org/pdf/2304.03170
Abstract Spectral Toolkit of Algorithms for Graphs (STAG) is an open-source library for efficient spectral graph algorithms, whose development started in September 2022. We have so far finished the component on local graph clustering, and this technical report presents a user's guide to STAG, showcase studies, and several technical considerations behind our development.
Instant-NVR: Instant Neural Volumetric Rendering for Human-object Interactions from Monocular RGBD Stream
Authors: Yuheng Jiang, Kaixin Yao, Zhuo Su, Zhehao Shen, Haimin Luo, Lan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03184
Pdf link: https://arxiv.org/pdf/2304.03184
Abstract Convenient 4D modeling of human-object interactions is essential for numerous applications. However, monocular tracking and rendering of complex interaction scenarios remain challenging. In this paper, we propose Instant-NVR, a neural approach for instant volumetric human-object tracking and rendering using a single RGBD camera. It bridges traditional non-rigid tracking with recent instant radiance field techniques via a multi-thread tracking-rendering mechanism. In the tracking front-end, we adopt a robust human-object capture scheme to provide sufficient motion priors. We further introduce a separated instant neural representation with a novel hybrid deformation module for the interacting scene. We also provide an on-the-fly reconstruction scheme of the dynamic/static radiance fields via efficient motion-prior searching. Moreover, we introduce an online key frame selection scheme and a rendering-aware refinement strategy to significantly improve the appearance details for online novel-view synthesis. Extensive experiments demonstrate the effectiveness and efficiency of our approach for the instant generation of human-object radiance fields on the fly, notably achieving real-time photo-realistic novel view synthesis under complex human-object interactions.
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
Authors: Nolan Dey, Gurpreet Gosal, Zhiming (Charles) Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2304.03208
Pdf link: https://arxiv.org/pdf/2304.03208
Abstract We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly-available models to show all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and downstream objectives. We describe our learnings including how Maximal Update Parameterization ($\mu$P) can further improve large model scaling, improving accuracy and hyperparameter predictability at scale. We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace: https://huggingface.co/cerebras.
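As a rough illustration of what compute-optimal sizing means, the sketch below uses two widely quoted approximations: about 20 training tokens per parameter (a rule of thumb associated with the Chinchilla results) and training compute of about 6·N·D FLOPs for N parameters and D tokens. Both constants are assumptions here, not figures taken from this abstract, and the function names are hypothetical.

```python
def tokens_for_params(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count for a model size.

    The default ratio of 20 tokens per parameter is a rule-of-thumb
    assumption, not a value stated in the abstract.
    """
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Common estimate of training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    # The two ends of the model family named in the abstract.
    for n in (111e6, 13e9):
        d = tokens_for_params(n)
        print(f"{n:.3g} params -> {d:.3g} tokens, {training_flops(n, d):.3g} FLOPs")
```

Under these assumptions the required compute grows with the square of the model size, since the token count itself scales linearly with the parameter count.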
Hierarchical Graph Neural Network with Cross-Attention for Cross-Device User Matching
Authors: Ali Taghibakhshi, Mingyuan Ma, Ashwath Aithal, Onur Yilmaz, Haggai Maron, Matthew West
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2304.03215
Pdf link: https://arxiv.org/pdf/2304.03215
Abstract Cross-device user matching is a critical problem in numerous domains, including advertising, recommender systems, and cybersecurity. It involves identifying and linking different devices belonging to the same person, utilizing sequence logs. Previous data mining techniques have struggled to address the long-range dependencies and higher-order connections between the logs. Recently, researchers have modeled this problem as a graph problem and proposed a two-tier graph contextual embedding (TGCE) neural network architecture, which outperforms previous methods. In this paper, we propose a novel hierarchical graph neural network architecture (HGNN), which has a more computationally efficient second level design than TGCE. Furthermore, we introduce a cross-attention (Cross-Att) mechanism in our model, which improves performance by 5% compared to the state-of-the-art TGCE method.
FedBot: Enhancing Privacy in Chatbots with Federated Learning
Authors: Addi Ait-Mlouk, Sadi Alawadi, Salman Toor, Andreas Hellander
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.03228
Pdf link: https://arxiv.org/pdf/2304.03228
Abstract Chatbots are mainly data-driven and usually based on utterances that might be sensitive. However, training deep learning models on shared data can violate user privacy. Such issues have commonly existed in chatbots since their inception. In the literature, there have been many approaches to deal with privacy, such as differential privacy and secure multi-party computation, but most of them need to have access to users' data. In this context, Federated Learning (FL) aims to protect data privacy through distributed learning methods that keep the data in its location. This paper presents FedBot, a proof-of-concept (POC) privacy-preserving chatbot that leverages large-scale customer support data. The POC combines Deep Bidirectional Transformer models and federated learning algorithms to protect customer data privacy during collaborative model training. The results of the proof-of-concept showcase the potential for privacy-preserving chatbots to transform the customer support industry by delivering personalized and efficient customer service that meets data privacy regulations and legal requirements. Furthermore, the system is specifically designed to improve its performance and accuracy over time by leveraging its ability to learn from previous interactions.
DiffMimic: Efficient Motion Mimicking with Differentiable Physics
Authors: Jiawei Ren, Cunjun Yu, Siwei Chen, Xiao Ma, Liang Pan, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.03274
Pdf link: https://arxiv.org/pdf/2304.03274
Abstract Motion mimicking is a foundational task in physics-based character animation. However, most existing motion mimicking methods are built upon reinforcement learning (RL) and suffer from heavy reward engineering, high variance, and slow convergence with hard explorations. Specifically, they usually take tens of hours or even days of training to mimic a simple motion sequence, resulting in poor scalability. In this work, we leverage differentiable physics simulators (DPS) and propose an efficient motion mimicking method dubbed DiffMimic. Our key insight is that DPS casts a complex policy learning task to a much simpler state matching problem. In particular, DPS learns a stable policy by analytical gradients with ground-truth physical priors hence leading to significantly faster and stabler convergence than RL-based methods. Moreover, to escape from local optima, we utilize a Demonstration Replay mechanism to enable stable gradient backpropagation in a long horizon. Extensive experiments on standard benchmarks show that DiffMimic has a better sample efficiency and time efficiency than existing methods (e.g., DeepMimic). Notably, DiffMimic allows a physically simulated character to learn Backflip after 10 minutes of training and be able to cycle it after 3 hours of training, while the existing approach may require about a day of training to cycle Backflip. More importantly, we hope DiffMimic can benefit more differentiable animation systems with techniques like differentiable clothes simulation in future research.
Keyword: faster

DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model
Authors: Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2304.02827
Pdf link: https://arxiv.org/pdf/2304.02827
Abstract The increasing demand for high-quality 3D content creation has motivated the development of automated methods for creating 3D object models from a single image and/or from a text prompt. However, the reconstructed 3D objects using state-of-the-art image-to-3D methods still exhibit low correspondence to the given image and low multi-view consistency. Recent state-of-the-art text-to-3D methods are also limited, yielding 3D samples with low diversity per prompt and long synthesis times. To address these challenges, we propose DITTO-NeRF, a novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a single image. Our DITTO-NeRF consists of constructing a high-quality partial 3D object for limited in-boundary (IB) angles using the given or text-generated 2D image from the frontal view and then iteratively reconstructing the remaining 3D NeRF using an inpainting latent diffusion model. We propose progressive 3D object reconstruction schemes in terms of scales (low to high resolution), angles (IB angles initially to outer-boundary (OB) later), and masks (object to background boundary) in our DITTO-NeRF so that high-quality information on IB can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods in terms of fidelity and diversity, both qualitatively and quantitatively, with much faster training times than prior art on image/text-to-3D such as DreamFusion and NeuralLift-360.
Convolutional neural networks for crack detection on flexible road pavements
Authors: Hermann Tapamo, Anna Bosman, James Maina, Emile Horak
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02933
Pdf link: https://arxiv.org/pdf/2304.02933
Abstract Flexible road pavements deteriorate primarily due to traffic and adverse environmental conditions. Cracking is the most common deterioration mechanism; the surveying thereof is typically conducted manually using internationally defined classification standards. In South Africa, the use of high-definition video images has been introduced, which allows for safer road surveying. However, surveying is still a tedious manual process. Automation of the detection of defects such as cracks would allow for faster analysis of road networks and potentially reduce human bias and error. This study performs a comparison of six state-of-the-art convolutional neural network models for the purpose of crack detection. The models are pretrained on the ImageNet dataset, and fine-tuned using a new real-world binary crack dataset consisting of 14000 samples. The effects of dataset augmentation are also investigated. Of the six models trained, five achieved accuracy above 97%. The highest recorded accuracy was 98%, achieved by the ResNet and VGG16 models. The dataset is available at the following URL: https://zenodo.org/record/7795975
Boundary-Denoising for Video Activity Localization
Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Pérez-Rúa, Bernard Ghanem
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02934
Pdf link: https://arxiv.org/pdf/2304.02934
Abstract Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, but with far fewer predictions compared to other current methods.
Training a Two Layer ReLU Network Analytically
Authors: Adrian Barbu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2304.02972
Pdf link: https://arxiv.org/pdf/2304.02972
Abstract Neural networks are usually trained with different variants of gradient descent based optimization algorithms such as stochastic gradient descent or the Adam optimizer. Recent theoretical work states that the critical points (where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are not all local minima. However, in this work we will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss that alternately finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than Stochastic Gradient Descent or the Adam optimizer, obtaining significantly smaller training loss values on four out of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.
Patch-wise Features for Blur Image Classification
Authors: Sri Charan Kattamuru, Kshitij Agrawal, Shyam Prasad Adhikari, Abhishek Bose, Hemant Misra
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03156
Pdf link: https://arxiv.org/pdf/2304.03156
Abstract Images captured through smartphone cameras often suffer from degradation, blur being one of the major ones, posing a challenge in processing these images for downstream tasks. In this paper we propose low-compute, lightweight patch-wise features for image quality assessment. Using our method we can discriminate between blurred and sharp images. To this end, we train a decision-tree-based XGBoost model on various intuitive image features like gray-level variance, first- and second-order gradients, and texture features like local binary patterns. Experiments conducted on an open dataset show that the proposed low-compute method results in 90.1% mean accuracy on the validation set, which is comparable to the accuracy of a compute-intensive VGG16 network with 94% mean accuracy fine-tuned to this task. To demonstrate the generalizability of our proposed features and model, we test the model on the BHBID dataset and an internal dataset, where we attain accuracy of 98% and 91%, respectively. The proposed method is 10x faster than the VGG16-based model on CPU and scales linearly with the input image size, making it suitable for implementation on low-compute edge devices.
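To make the kind of patch-wise features named in this abstract concrete, here is a minimal pure-Python sketch that computes gray-level variance, a first-order gradient magnitude, and a second-order (Laplacian) variance for each patch. The function name `patch_features`, the patch size, and the exact gradient definitions are assumptions for illustration; the paper's actual feature set and XGBoost training are not reproduced here.

```python
from statistics import pvariance


def patch_features(gray, patch=4):
    """Return one (variance, mean |gradient|, Laplacian variance) tuple per patch.

    `gray` is a 2D list of pixel intensities. `patch` must be at least 3 so
    that every patch has interior pixels for the Laplacian.
    """
    h, w = len(gray), len(gray[0])
    feats = []
    for r0 in range(0, h - patch + 1, patch):
        for c0 in range(0, w - patch + 1, patch):
            block = [gray[r][c0:c0 + patch] for r in range(r0, r0 + patch)]
            pixels = [v for row in block for v in row]
            # First-order gradient: horizontal and vertical forward differences.
            grads = [abs(block[r][c + 1] - block[r][c])
                     for r in range(patch) for c in range(patch - 1)]
            grads += [abs(block[r + 1][c] - block[r][c])
                      for r in range(patch - 1) for c in range(patch)]
            # Second-order gradient: 4-neighbour Laplacian on interior pixels.
            lap = [block[r - 1][c] + block[r + 1][c] + block[r][c - 1]
                   + block[r][c + 1] - 4 * block[r][c]
                   for r in range(1, patch - 1) for c in range(1, patch - 1)]
            feats.append((pvariance(pixels), sum(grads) / len(grads), pvariance(lap)))
    return feats


if __name__ == "__main__":
    sharp = [[255 * ((r + c) % 2) for c in range(8)] for r in range(8)]
    flat = [[128] * 8 for _ in range(8)]
    print(patch_features(sharp)[0], patch_features(flat)[0])
```

Sharp, high-contrast patches score high on all three features while featureless patches score zero; per-patch tuples like these could then be fed to a tree-based classifier.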
DiffMimic: Efficient Motion Mimicking with Differentiable Physics
Authors: Jiawei Ren, Cunjun Yu, Siwei Chen, Xiao Ma, Liang Pan, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.03274
Pdf link: https://arxiv.org/pdf/2304.03274
Abstract Motion mimicking is a foundational task in physics-based character animation. However, most existing motion mimicking methods are built upon reinforcement learning (RL) and suffer from heavy reward engineering, high variance, and slow convergence with hard explorations. Specifically, they usually take tens of hours or even days of training to mimic a simple motion sequence, resulting in poor scalability. In this work, we leverage differentiable physics simulators (DPS) and propose an efficient motion mimicking method dubbed DiffMimic. Our key insight is that DPS casts a complex policy learning task to a much simpler state matching problem. In particular, DPS learns a stable policy by analytical gradients with ground-truth physical priors, hence leading to significantly faster and more stable convergence than RL-based methods. Moreover, to escape from local optima, we utilize a Demonstration Replay mechanism to enable stable gradient backpropagation in a long horizon. Extensive experiments on standard benchmarks show that DiffMimic has better sample efficiency and time efficiency than existing methods (e.g., DeepMimic). Notably, DiffMimic allows a physically simulated character to learn Backflip after 10 minutes of training and be able to cycle it after 3 hours of training, while the existing approach may require about a day of training to cycle Backflip. More importantly, we hope DiffMimic can benefit more differentiable animation systems with techniques like differentiable clothes simulation in future research.
Keyword: mobile

Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks
Authors: Michael Weiss, Paolo Tonella
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2304.02654
Pdf link: https://arxiv.org/pdf/2304.02654
Abstract Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of artificial intelligence tasks. Often consisting of hundreds of millions, if not hundreds of billions of parameters, these DNNs are too large to be deployed to, or efficiently run on, resource-constrained devices such as mobile phones or IoT microcontrollers. Systems relying on large-scale DNNs thus have to call the corresponding model over the network, leading to substantial costs for hosting and running the large-scale remote model, costs which are often charged on a per-use basis. In this paper, we propose BiSupervised, a novel architecture, where, before relying on a large remote DNN, a system attempts to make a prediction on a small-scale local model. A DNN supervisor monitors said prediction process and identifies easy inputs for which the local prediction can be trusted. For these inputs, the remote model does not have to be invoked, thus saving costs, while only marginally impacting the overall system accuracy. Our architecture furthermore foresees a second supervisor to monitor the remote predictions and identify inputs for which not even these can be trusted, allowing the system to raise an exception or run a fallback strategy instead. We evaluate the cost savings and the ability to detect incorrectly predicted inputs on four diverse case studies: IMDB movie review sentiment classification, GitHub issue triaging, ImageNet image classification, and SQuADv2 free-text question answering.
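The two-supervisor control flow described in this abstract can be sketched as a simple cascade. This is a minimal illustration under stated assumptions: each model is a hypothetical callable returning `(prediction, confidence)`, the confidence is used directly as the supervisor score, and the thresholds and the names `bisupervised_predict` and `UntrustedPrediction` are invented for the example rather than taken from the paper.

```python
class UntrustedPrediction(Exception):
    """Raised when neither the local nor the remote prediction is trusted."""


def bisupervised_predict(x, local_model, remote_model,
                         local_threshold=0.9, remote_threshold=0.6):
    """Return (prediction, source), trying the cheap local model first.

    Using model confidence as the supervisor score is an assumption of this sketch.
    """
    pred, conf = local_model(x)
    if conf >= local_threshold:       # first supervisor: easy input, trust local
        return pred, "local"
    pred, conf = remote_model(x)      # otherwise pay for the remote call
    if conf >= remote_threshold:      # second supervisor: trust remote
        return pred, "remote"
    raise UntrustedPrediction(x)      # caller may run a fallback strategy


if __name__ == "__main__":
    # Toy stand-ins for the local and remote models (hypothetical).
    local = lambda text: ("pos", 0.95) if len(text) < 5 else ("pos", 0.5)
    remote = lambda text: ("neg", 0.8) if len(text) < 10 else ("neg", 0.3)
    print(bisupervised_predict("abc", local, remote))
    print(bisupervised_predict("abcdefg", local, remote))
```

Raising the local threshold sends more inputs to the remote model (higher cost, typically higher accuracy); lowering it saves more remote calls.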
Adaptive Headway Motion Control and Motion Prediction for Safe Unicycle Motion Design
Authors: Aykut İşleyen, Nathan van de Wouw, Ömür Arslan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2304.02760
Pdf link: https://arxiv.org/pdf/2304.02760
Abstract Differential drive robots that can be modeled as a kinematic unicycle are a standard mobile base platform for many service and logistics robots. Safe and smooth autonomous motion around obstacles is a crucial skill for unicycle robots to perform diverse tasks in complex environments. A classical control approach for unicycle control is feedback linearization using a headway point at a fixed headway distance in front of the unicycle. The unicycle headway control brings the headway point to a desired goal location by embedding a linear headway reference dynamics, which often results in an undesired offset for the actual unicycle position. In this paper, we introduce a new unicycle headway control approach with an adaptive headway distance that overcomes this limitation, i.e., when the headway point reaches the goal the unicycle position is also at the goal. By systematically analyzing the closed-loop unicycle motion under the adaptive headway controller, we design analytical feedback motion prediction methods that bound the closed-loop unicycle position trajectory and so can be effectively used for safety assessment and safe unicycle motion design around obstacles. We present an application of adaptive headway motion control and motion prediction for safe unicycle path following around obstacles in numerical simulations.
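The classical fixed-headway controller that this abstract takes as its starting point can be sketched as follows, to show where the position offset comes from. This is the textbook feedback-linearization scheme, not the adaptive controller proposed in the paper; the gain `k`, headway distance `eps`, Euler step, and function names are illustrative assumptions.

```python
import math


def headway_control(x, y, theta, goal, eps=0.5, k=1.0):
    """Classical fixed-headway feedback linearization for a kinematic unicycle."""
    # Headway point at distance eps in front of the unicycle.
    px, py = x + eps * math.cos(theta), y + eps * math.sin(theta)
    # Linear reference dynamics for the headway point: p_dot = -k (p - goal).
    ux, uy = -k * (px - goal[0]), -k * (py - goal[1])
    # Invert p_dot = v [cos, sin] + eps * omega [-sin, cos] for (v, omega).
    v = math.cos(theta) * ux + math.sin(theta) * uy
    omega = (-math.sin(theta) * ux + math.cos(theta) * uy) / eps
    return v, omega


def simulate(goal, eps=0.5, k=1.0, dt=0.01, steps=3000):
    """Forward-Euler simulation from the origin with zero heading."""
    x = y = theta = 0.0
    for _ in range(steps):
        v, omega = headway_control(x, y, theta, goal, eps, k)
        x += dt * v * math.cos(theta)
        y += dt * v * math.sin(theta)
        theta += dt * omega
    return x, y, theta


if __name__ == "__main__":
    x, y, theta = simulate((2.0, 1.0))
    print("distance of unicycle position to goal:", math.hypot(x - 2.0, y - 1.0))
```

In this sketch the headway point converges to the goal while the unicycle itself stops a distance `eps` short of it, which is the offset the adaptive headway distance is meant to remove.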
Evaluating Customization of Remote Tele-operation Interfaces for Assistive Robots
Authors: Vinitha Ranganeni, Noah Ponto, Maya Cakmak
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2304.02771
Pdf link: https://arxiv.org/pdf/2304.02771
Abstract Mobile manipulator platforms, like the Stretch RE1 robot, make the promise of in-home robotic assistance feasible. For people with severe physical limitations, like those with quadriplegia, the ability to tele-operate these robots themselves means that they can perform physical tasks they cannot otherwise do themselves, thereby increasing their level of independence. In order for users with physical limitations to operate these robots, their interfaces must be accessible and cater to the specific needs of all users. As physical limitations vary amongst users, it is difficult to make a single interface that will accommodate all users. Instead, such interfaces should be customizable to each individual user. In this paper we explore the value of customization of a browser-based interface for tele-operating the Stretch RE1 robot. More specifically, we evaluate the usability and effectiveness of a customized interface in comparison to the default interface configurations from prior work. We present a user study in which participants with motor impairments (N=10) and participants without motor impairments who could serve as caregivers (N=13) use the robot to perform mobile manipulation tasks in a real kitchen environment. Our study demonstrates that no single interface configuration satisfies all users' needs and preferences. Users perform better when using the customized interface for navigation, but not for manipulation due to higher complexity of learning to manipulate through the robot. All participants are able to use the robot to complete all tasks and participants with motor impairments believe that having the robot in their home would make them more independent.
Gotta Assess 'Em All: A Risk Analysis of Criminal Offenses Facilitated through PokemonGO
Authors: Ashly Fuller, Martin Lo, Angelica Holmes, Lu Lemanski, Marie Vasek, Enrico Mariconti
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2304.02952
Pdf link: https://arxiv.org/pdf/2304.02952
Abstract Location-based games have come to the forefront of popularity in casual and mobile gaming over the past six years. However, there is no hard data on crimes that these games enable, ranging from assault to cyberstalking to grooming. Given these potential harms, we conduct a risk assessment and quasi-experiment on the game features of location-based games. Using PokemonGO as a case study, we identify and establish cyber-enabled stalking as the main risk event where in-game features such as an innocent function to share in-game postcards can be exploited by malicious users. Users obtain postcards that are unique to each Pokestop and represent gifts that can be shared with in-game friends. The number of postcards that each user can retain is limited, so they send the excess to their friends with items that boost their friends' game activities. The postcard often also unintentionally leaks the users' commonly visited locations to their in-game friends. We analyze these in-game features using risk assessment and identify cyber-enabled stalking as one of the main threats. We further evaluate the feasibility of this crime through a quasi-experiment. Our results show that participants' routine locations such as home and work can be reliably re-identified within days from the first gift exchange. This exploitation of a previously unconsidered in-game feature enables physical stalking of previously unknown persons which can escalate into more serious crimes. Given current data protection legislation in Europe, further preventive measures are required by Niantic to protect pseudonymized users from being re-identified by in-game features and (potentially) stalked.
SwarmGear: Heterogeneous Swarm of Drones with Reconfigurable Leader Drone and Virtual Impedance Links for Multi-Robot Inspection
Authors: Zhanibek Darush, Mikhail Martynov, Aleksey Fedoseev, Aleksei Shcherbak, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2304.02956
Pdf link: https://arxiv.org/pdf/2304.02956
Abstract The continuous monitoring by drone swarms remains a challenging problem due to the lack of power supply and the inability of drones to land on uneven surfaces. Heterogeneous swarms, including ground and aerial vehicles, can support longer inspections and carry a higher number of sensors on board. However, their capabilities are limited by the mobility of wheeled and legged robots in a cluttered environment. In this paper, we propose a novel concept for autonomous inspection that we call SwarmGear. SwarmGear utilizes a heterogeneous swarm that investigates the environment in a leader-follower formation. The leader drone is able to land on rough terrain and traverse it by four compliant robotic legs, possessing both the functionalities of an aerial and mobile robot. To preserve the formation of the swarm during its motion, virtual impedance links were developed between the leader and the follower drones. We evaluated experimentally the accuracy of the hybrid leader drone's ground locomotion. By changing the step parameters, the optimal step configuration was found. Two types of gaits were evaluated. The experiments revealed low crosstrack error (mean of 2 cm and max of 4.8 cm) and the ability of the leader drone to move with a 190 mm step length and a 3 degree standard yaw deviation. Four types of drone formations were considered. The best formation was used for experiments with SwarmGear, and it showed low overall crosstrack error for the swarm (mean 7.9 cm for the type 1 gait and 5.1 cm for the type 2 gait). The proposed system can potentially improve the performance of autonomous swarms in cluttered and unstructured environments by allowing all agents of the swarm to switch between aerial and ground formations to overcome various obstacles and perform missions over a large area.
Spritz-PS: Validation of Synthetic Face Images Using a Large Dataset of Printed Documents
Authors: Ehsan Nowroozi, Yoosef Habibi, Mauro Conti
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02982
Pdf link: https://arxiv.org/pdf/2304.02982
Abstract The capability of doing effective forensic analysis on printed and scanned (PS) images is essential in many applications. PS documents may be used to conceal the artifacts that reveal the synthetic nature of images, since these artifacts are typically present in manipulated images and the main artifacts in synthetic images can be removed after the PS process. Due to the appeal of Generative Adversarial Networks (GANs), synthetic face images generated with GAN models are difficult to differentiate from genuine human faces and may be used to create counterfeit identities. Additionally, since GAN models do not account for physiological constraints for generating human faces and their impact on human IRISes, distinguishing genuine from synthetic IRISes in the PS scenario becomes extremely difficult. As a result of the lack of large-scale reference IRIS datasets in the PS scenario, we aim at developing a novel dataset to become a standard for Multimedia Forensics (MFs) investigation which is available at [45]. In this paper, we provide a novel dataset made up of a large number of synthetic and natural printed IRISes taken from VIPPrint Printed and Scanned face images. We extracted irises from face images, and it is possible that, due to eyelid occlusion, the model captured incomplete irises. To fill the missing pixels of the extracted irises, we applied techniques to discover the complex link between the iris images. To highlight the problems involved with the evaluation of the dataset's IRIS images, we conducted a large number of analyses employing Siamese Neural Networks, such as ResNet50, Xception, VGG16, and MobileNet-v2, to assess the similarities between genuine and synthetic human IRISes. For instance, using the Xception network, we achieved 56.76% similarity of IRISes for synthetic images and 92.77% similarity of IRISes for real images.
Keyword: pruning

To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency
Authors: Daniel Campos, ChengXiang Zhai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2304.02721
Pdf link: https://arxiv.org/pdf/2304.02721
Abstract Sequence-to-sequence language models can be used to produce abstractive summaries which are coherent, relevant, and concise. Still, model sizes can make deployment in latency-sensitive or web-scale implementations difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to the encoder size while inference efficiency is connected to the decoder. Using asymmetric pruning can lead to nearly 3x improvement in inference latency with ~1 point loss in Rouge-2. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and variations in datasets.
NTK-SAP: Improving neural network pruning by aligning training dynamics
Authors: Yite Wang, Dawei Li, Ruoyu Sun
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2304.02840
Pdf link: https://arxiv.org/pdf/2304.02840
Abstract Pruning neural networks before training has received increasing interest due to its potential to reduce training time and memory. One popular method is to prune the connections based on a certain metric, but it is not entirely clear what metric is the best choice. Recent advances in neural tangent kernel (NTK) theory suggest that the training dynamics of large enough neural networks is closely related to the spectrum of the NTK. Motivated by this finding, we propose to prune the connections that have the least influence on the spectrum of the NTK. This method can help maintain the NTK spectrum, which may help align the training dynamics to that of its dense counterpart. However, one possible issue is that the fixed-weight-NTK corresponding to a given initial point can be very different from the NTK corresponding to later iterates during the training phase. We further propose to sample multiple realizations of random weights to estimate the NTK spectrum. Note that our approach is weight-agnostic, which is different from most existing methods that are weight-dependent. In addition, we use random inputs to compute the fixed-weight-NTK, making our method data-agnostic as well. We name our foresight pruning algorithm Neural Tangent Kernel Spectrum-Aware Pruning (NTK-SAP). Empirically, our method achieves better performance than all baselines on multiple datasets.
Learning to Learn with Indispensable Connections
Authors: Sambhavi Tiwari, Manas Gogoi, Shekhar Verma, Krishna Pratap Singh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02862
Pdf link: https://arxiv.org/pdf/2304.02862
Abstract Meta-learning aims to solve unseen tasks with few labelled instances. Nevertheless, despite its effectiveness for quick learning in existing optimization-based methods, it has several flaws. Inconsequential connections are frequently seen during meta-training, which results in an over-parameterized neural network. Because of this, meta-testing observes unnecessary computations and extra memory overhead. To overcome such flaws, we propose a novel meta-learning method called Meta-LTH that includes indispensable (necessary) connections. We applied the lottery ticket hypothesis technique known as magnitude pruning to generate these crucial connections that can effectively solve few-shot learning problems. We aim to perform two things: (a) to find a sub-network capable of more adaptive meta-learning and (b) to learn new low-level features of unseen tasks and recombine those features with the already learned features during the meta-test phase. Experimental results show that our proposed Meta-LTH method outperforms the existing first-order MAML algorithm on three different classification datasets. Our method improves the classification accuracy by approximately 2% (20-way 1-shot task setting) for the Omniglot dataset.
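Magnitude pruning, the technique this abstract refers to, removes the connections with the smallest absolute weights. The sketch below shows a single pruning step on a flat list of weights; it is a generic illustration, and the function name `magnitude_prune`, the `sparsity` parameter, and the flat-list representation are assumptions rather than details of Meta-LTH.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns (pruned_weights, mask), where mask[i] is 1 for a kept connection.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return [w * m for w, m in zip(weights, mask)], mask


if __name__ == "__main__":
    pruned, mask = magnitude_prune([0.1, -0.5, 0.05, 0.9], sparsity=0.5)
    print(pruned, mask)
```

In lottery-ticket-style procedures, a step like this is usually repeated over several rounds, with the surviving weights reset or retrained between rounds; the mask identifies the sub-network that is kept.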
Keyword: voxel

VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection
Authors: Yuhao Huang, Sanping Zhou, Junjie Zhang, Jinpeng Dong, Nanning Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02867
Pdf link: https://arxiv.org/pdf/2304.02867
Abstract Efficient point cloud representation is a fundamental element of Lidar-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in Bird's Eye View. However, these point cloud encoding paradigms underestimate the point representation in the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects like pedestrians and cyclists. In this paper, we propose an explicit vertical multi-scale representation learning framework, VPFusion, to combine the complementary information from both voxel and pillar streams. Specifically, VPFusion first builds upon a sparse voxel-pillar-based backbone. The backbone divides point clouds into voxels and pillars, then encodes features with 3D and 2D sparse convolution simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway for sparse voxel and pillar features to enable the interaction between them. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the dense feature maps from voxel and pillar branches at multiple scales. Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
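The difference between the voxel and pillar partitions mentioned in this abstract can be illustrated with a small grouping routine: a voxel is indexed by all three grid coordinates, whereas a pillar drops the height index and spans the full vertical extent. The function name `group_points`, the uniform cell size, and the dictionary layout are assumptions for illustration, not part of VPFusion.

```python
import math
from collections import defaultdict


def group_points(points, cell=1.0):
    """Group (x, y, z) points into voxels (3D cells) and pillars (2D columns)."""
    voxels, pillars = defaultdict(list), defaultdict(list)
    for x, y, z in points:
        ix, iy, iz = (math.floor(c / cell) for c in (x, y, z))
        voxels[(ix, iy, iz)].append((x, y, z))
        pillars[(ix, iy)].append((x, y, z))  # a pillar ignores the height index
    return dict(voxels), dict(pillars)


if __name__ == "__main__":
    cloud = [(0.2, 0.3, 0.1), (0.4, 0.1, 1.7), (1.5, 0.2, 0.3)]
    voxels, pillars = group_points(cloud)
    print(len(voxels), "voxels,", len(pillars), "pillars")
```

Two points stacked vertically land in different voxels but in the same pillar, which is why a pillar-only encoding loses vertical detail that a voxel stream keeps.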
Keyword: lidar

VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection
Authors: Yuhao Huang, Sanping Zhou, Junjie Zhang, Jinpeng Dong, Nanning Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02867
Pdf link: https://arxiv.org/pdf/2304.02867
Abstract Efficient point cloud representation is a fundamental element of Lidar-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in Bird's Eye View. However, these point cloud encoding paradigms underestimate the point representation in the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects like pedestrians and cyclists. In this paper, we propose an explicit vertical multi-scale representation learning framework, VPFusion, to combine the complementary information from both voxel and pillar streams. Specifically, VPFusion first builds upon a sparse voxel-pillar-based backbone. The backbone divides point clouds into voxels and pillars, then encodes features with 3D and 2D sparse convolution simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway for sparse voxel and pillar features to enable the interaction between them. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the dense feature maps from voxel and pillar branches at multiple scales. Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
Geometric-aware Pretraining for Vision-centric 3D Object Detection
Authors: Linyan Huang, Huijie Wang, Jia Zeng, Shengchuan Zhang, Liujuan Cao, Rongrong Ji, Junchi Yan, Hongyang Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03105
Pdf link: https://arxiv.org/pdf/2304.03105
Abstract Multi-camera 3D object detection for autonomous driving is a challenging problem that has garnered notable attention from both academia and industry. An obstacle encountered in vision-based techniques involves the precise extraction of geometry-conscious features from RGB images. Recent approaches have utilized geometric-aware image backbones pretrained on depth-relevant tasks to acquire spatial information. However, these approaches overlook the critical aspect of view transformation, resulting in inadequate performance due to the misalignment of spatial knowledge between the image backbone and view transformation. To address this issue, we propose a novel geometric-aware pretraining framework called GAPretrain. Our approach incorporates spatial and structural cues to camera networks by employing the geometric-rich modality as guidance during the pretraining phase. The transference of modal-specific attributes across different modalities is non-trivial, but we bridge this gap by using a unified bird's-eye-view (BEV) representation and structural hints derived from LiDAR point clouds to facilitate the pretraining process. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method. We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation
Authors: Bjoern Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03251
Pdf link: https://arxiv.org/pdf/2304.03251
Abstract Learning models on one labeled dataset that generalize well on another domain is a difficult task, as several shifts might happen between the data domains. This is notably the case for lidar data, for which models can exhibit large performance discrepancies due for instance to different lidar patterns or changes in acquisition conditions. This paper addresses the corresponding Unsupervised Domain Adaptation (UDA) task for semantic segmentation. To mitigate this problem, we introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data. As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data. This novel strategy differs from classical minimization of statistical divergences or lidar-specific state-of-the-art domain adaptation techniques. Our experiments demonstrate that our method achieves a better performance than the current state of the art in synthetic-to-real and real-to-real scenarios.
Keyword: diffusion

DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model
Authors: Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2304.02827
Pdf link: https://arxiv.org/pdf/2304.02827
Abstract The increasing demand for high-quality 3D content creation has motivated the development of automated methods for creating 3D object models from a single image and/or from a text prompt. However, the reconstructed 3D objects using state-of-the-art image-to-3D methods still exhibit low correspondence to the given image and low multi-view consistency. Recent state-of-the-art text-to-3D methods are also limited, yielding 3D samples with low diversity per prompt with long synthesis time. To address these challenges, we propose DITTO-NeRF, a novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a single image. Our DITTO-NeRF consists of constructing a high-quality partial 3D object for limited in-boundary (IB) angles using the given or text-generated 2D image from the frontal view and then iteratively reconstructing the remaining 3D NeRF using an inpainting latent diffusion model. We propose progressive 3D object reconstruction schemes in terms of scales (low to high resolution), angles (IB angles initially to outer-boundary (OB) later), and masks (object to background boundary) in our DITTO-NeRF so that high-quality information on IB can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods in terms of fidelity and diversity qualitatively and quantitatively with much faster training times than prior art on image/text-to-3D such as DreamFusion and NeuralLift-360.
Benchmarking Robustness to Text-Guided Corruptions
Authors: Mohammadreza Mofayezi, Yasamin Medghalchi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02963
Pdf link: https://arxiv.org/pdf/2304.02963
Abstract This study investigates the robustness of image classifiers to text-guided corruptions. We utilize diffusion models to edit images to different domains. Unlike other works that use synthetic or hand-picked data for benchmarking, we use diffusion models as they are generative models capable of learning to edit images while preserving their semantic content. Thus, the corruptions will be more realistic and the comparison will be more informative. Also, there is no need for manual labeling and we can create large-scale benchmarks with less effort. We define a prompt hierarchy based on the original ImageNet hierarchy to apply edits in different domains. As well as introducing a new benchmark we try to investigate the robustness of different vision models. The results of this study demonstrate that the performance of image classifiers decreases significantly in different language-based corruptions and edit domains. We also observe that convolutional models are more robust than transformer architectures. Additionally, we see that common data augmentation techniques can improve the performance on both the original data and the edited images. The findings of this research can help improve the design of image classifiers and contribute to the development of more robust machine learning systems. The code for generating the benchmark will be made available online upon publication.
DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance
Authors: Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, Jingyi Yu
Subjects: Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2304.03117
Pdf link: https://arxiv.org/pdf/2304.03117
Abstract Emerging Metaverse applications demand accessible, accurate, and easy-to-use tools for 3D digital human creations in order to depict different cultures and societies as if in the physical world. Recent large-scale vision-language advances pave the way for novices to conveniently customize 3D content. However, the generated CG-friendly assets still cannot represent the desired facial traits for human characteristics. In this paper, we present DreamFace, a progressive scheme to generate personalized 3D faces under text guidance. It enables layman users to naturally customize 3D facial assets that are compatible with CG pipelines, with desired shapes, textures, and fine-grained animation capabilities. From a text input to describe the facial traits, we first introduce a coarse-to-fine scheme to generate the neutral facial geometry with a unified topology. We employ a selection strategy in the CLIP embedding space, and subsequently optimize both the detail displacements and normals using Score Distillation Sampling (SDS) from a generic Latent Diffusion Model (LDM). Then, for neutral appearance generation, we introduce a dual-path mechanism, which combines the generic LDM with a novel texture LDM to ensure both the diversity and textural specification in the UV space. We also employ a two-stage optimization to perform SDS in both the latent and image spaces, which provides compact priors for fine-grained synthesis. Our generated neutral assets naturally support blendshapes-based facial animations. We further improve the animation ability with personalized deformation characteristics by learning the universal expression prior using the cross-identity hypernetwork. Notably, DreamFace can generate realistic 3D facial assets with physically-based rendering quality and rich animation ability from video footage, even for fashion icons or exotic characters in cartoons and fiction movies.
Zero-shot Generative Model Adaptation via Image-specific Prompt Learning
Authors: Jiayi Guo, Chaofei Wang, You Wu, Eric Zhang, Kai Wang, Xingqian Xu, Shiji Song, Humphrey Shi, Gao Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03119
Pdf link: https://arxiv.org/pdf/2304.03119
Abstract Recently, CLIP-guided image synthesis has shown appealing performance on adapting a pre-trained source-domain generator to an unseen target domain. It does not require any target-domain samples but only the textual domain labels. The training is highly efficient, e.g., a few minutes. However, existing methods still have some limitations in the quality of generated images and may suffer from the mode collapse issue. A key reason is that a fixed adaptation direction is applied for all cross-domain image pairs, which leads to identical supervision signals. To address this issue, we propose an Image-specific Prompt Learning (IPL) method, which learns specific prompt vectors for each source-domain image. This produces a more precise adaptation direction for every cross-domain image pair, endowing the target-domain generator with greatly enhanced flexibility. Qualitative and quantitative evaluations on various domains demonstrate that IPL effectively improves the quality and diversity of synthesized images and alleviates the mode collapse. Moreover, IPL is independent of the structure of the generative model, such as generative adversarial networks or diffusion models. Code is available at https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation.
SketchFFusion: Sketch-guided image editing with diffusion model
Authors: Weihang Mao, Bo Han, Zihao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03174
Pdf link: https://arxiv.org/pdf/2304.03174
Abstract Sketch-guided image editing aims to achieve local fine-tuning of the image based on the sketch information provided by the user, while maintaining the original status of the unedited areas. Due to the high cost of acquiring human sketches, previous works mostly relied on edge maps as a substitute for sketches, but sketches possess richer structural information. In this paper, we propose a sketch generation scheme that can preserve the main contours of an image and closely adhere to the actual sketch style drawn by the user. At the same time, current image editing methods often face challenges such as image distortion, training cost, and loss of fine details in the sketch. To address these limitations, we propose a conditional diffusion model (SketchFFusion) based on the sketch structure vector. We evaluate the generative performance of our model and demonstrate that it outperforms existing methods.
Face Animation with an Attribute-Guided Diffusion Model
Authors: Bohan Zeng, Xuhui Liu, Sicheng Gao, Boyu Liu, Hong Li, Jianzhuang Liu, Baochang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03199
Pdf link: https://arxiv.org/pdf/2304.03199
Abstract Face animation has achieved much progress in computer vision. However, prevailing GAN-based methods suffer from unnatural distortions and artifacts due to sophisticated motion deformation. In this paper, we propose a Face Animation framework with an attribute-guided Diffusion Model (FADM), which is the first work to exploit the superior modeling capacity of diffusion models for photo-realistic talking-head generation. To mitigate the uncontrollable synthesis effect of the diffusion model, we design an Attribute-Guided Conditioning Network (AGCN) to adaptively combine the coarse animation features and 3D face reconstruction results, which can incorporate appearance and motion conditions into the diffusion process. These specific designs help FADM rectify unnatural artifacts and distortions, and also enrich high-fidelity facial details through iterative diffusion refinements with accurate animation attributes. FADM can flexibly and effectively improve existing animation videos. Extensive experiments on widely used talking-head benchmarks validate the effectiveness of FADM over prior arts.
Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
Authors: Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03246
Pdf link: https://arxiv.org/pdf/2304.03246
Abstract The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove, which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object is to be removed based on natural language input and simultaneously removes it. For this purpose, first, we construct a dataset named GQA-Inpaint for this task, which will be released soon. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.
Diffusion Models as Masked Autoencoders
Authors: Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03283
Pdf link: https://arxiv.org/pdf/2304.03283
Abstract There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
Keyword: dynamic

Abstraction-based Probabilistic Stability Analysis of Polyhedral Probabilistic Hybrid Systems
Authors: Spandan Das, Pavithra Prabhakar
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2304.02647
Pdf link: https://arxiv.org/pdf/2304.02647
Abstract In this paper, we consider the problem of probabilistic stability analysis of a subclass of Stochastic Hybrid Systems, namely, Polyhedral Probabilistic Hybrid Systems (PPHS), where the flow dynamics is given by a polyhedral inclusion, the discrete switching between modes happens probabilistically at the boundaries of their invariant regions, and the continuous state is not reset during switching. We present an abstraction-based analysis framework that consists of constructing a finite Markov Decision Process (MDP) such that verification of a certain property on the finite MDP ensures the satisfaction of probabilistic stability on the PPHS. Further, we present a polynomial-time algorithm for verifying the corresponding property on the MDP. Our experimental analysis demonstrates the feasibility of the approach in successfully verifying probabilistic stability on PPHS of various dimensions and sizes.
Emergent Coordination through Game-Induced Nonlinear Opinion Dynamics
Authors: Haimin Hu, Kensuke Nakamura, Kai-Chieh Hsu, Naomi Ehrich Leonard, Jaime Fernández Fisac
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2304.02687
Pdf link: https://arxiv.org/pdf/2304.02687
Abstract We present a multi-agent decision-making framework for the emergent coordination of autonomous agents whose intents are initially undecided. Dynamic non-cooperative games have been used to encode multi-agent interaction, but ambiguity arising from factors such as goal preference or the presence of multiple equilibria may lead to coordination issues, ranging from the "freezing robot" problem to unsafe behavior in safety-critical events. The recently developed nonlinear opinion dynamics (NOD) provide guarantees for breaking deadlocks. However, choosing the appropriate model parameters automatically in general multi-agent settings remains a challenge. In this paper, we first propose a novel and principled procedure for synthesizing NOD based on the value functions of dynamic games conditioned on agents' intents. In particular, we provide for the two-player two-option case precise stability conditions for equilibria of the game-induced NOD based on the mismatch between agents' opinions and their game values. We then propose an optimization-based trajectory optimization algorithm that computes agents' policies guided by the evolution of opinions. The efficacy of our method is illustrated with a simulated toll station coordination example.
Going Further: Flatness at the Rescue of Early Stopping for Adversarial Example Transferability
Authors: Martin Gubri, Maxime Cordy, Yves Le Traon
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2304.02688
Pdf link: https://arxiv.org/pdf/2304.02688
Abstract Transferability is the property of adversarial examples to be misclassified by other models than the surrogate model for which they were crafted. Previous research has shown that transferability is substantially increased when the training of the surrogate model has been early stopped. A common hypothesis to explain this is that the later training epochs are when models learn the non-robust features that adversarial attacks exploit. Hence, an early stopped model is more robust (hence, a better surrogate) than fully trained models. We demonstrate that the reasons why early stopping improves transferability lie in the side effects it has on the learning dynamics of the model. We first show that early stopping benefits transferability even on models learning from data with non-robust features. We then establish links between transferability and the exploration of the loss landscape in the parameter space, on which early stopping has an inherent effect. More precisely, we observe that transferability peaks when the learning rate decays, which is also the time at which the sharpness of the loss significantly drops. This leads us to propose RFN, a new approach for transferability that minimizes loss sharpness during training in order to maximize transferability. We show that by searching for large flat neighborhoods, RFN always improves over early stopping (by up to 47 points of transferability rate) and is competitive to (if not better than) strong state-of-the-art baselines.
ACTION++: Improving Semi-supervised Medical Image Segmentation with Adaptive Anatomical Contrast
Authors: Chenyu You, Weicheng Dai, Yifei Min, Lawrence Staib, Jas Sekhon, James S. Duncan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2304.02689
Pdf link: https://arxiv.org/pdf/2304.02689
Abstract Medical data often exhibits long-tail distributions with heavy class imbalance, which naturally leads to difficulty in classifying the minority classes (i.e., boundary regions or rare objects). Recent work has significantly improved semi-supervised medical image segmentation in long-tailed scenarios by equipping them with unsupervised contrastive criteria. However, it remains unclear how well they will perform in the labeled portion of data where class distribution is also highly imbalanced. In this work, we present ACTION++, an improved contrastive learning framework with adaptive anatomical contrast for semi-supervised medical segmentation. Specifically, we propose an adaptive supervised contrastive loss, where we first compute the optimal locations of class centers uniformly distributed on the embedding space (i.e., off-line), and then perform online contrastive matching training by encouraging different class features to adaptively match these distinct and uniformly distributed class centers. Moreover, we argue that blindly adopting a constant temperature $\tau$ in the contrastive loss on long-tailed medical data is not optimal, and propose to use a dynamic $\tau$ via a simple cosine schedule to yield better separation between majority and minority classes. Empirically, we evaluate ACTION++ on ACDC and LA benchmarks and show that it achieves state-of-the-art across two semi-supervised settings. Theoretically, we analyze the performance of adaptive anatomical contrast and confirm its superiority in label efficiency.
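Illustrative sketch (not taken from the paper): the abstract mentions a dynamic $\tau$ driven by a simple cosine schedule but does not give its form. The snippet below shows one plausible shape of such a schedule. The function names, the bounds `tau_min` and `tau_max`, and the period are all assumptions chosen for illustration and may differ from what the authors use.

```python
import math


def cosine_temperature(step, period, tau_min=0.1, tau_max=1.0):
    """Hypothetical cosine schedule for the contrastive temperature.

    Starts at tau_max, reaches tau_min half-way through each period,
    and returns to tau_max at the end of the period.
    """
    phase = 2.0 * math.pi * (step % period) / period
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(phase))


def scaled_logits(similarities, tau):
    """Divide similarity scores by the temperature, as in a contrastive loss."""
    return [s / tau for s in similarities]


if __name__ == "__main__":
    for step in (0, 25, 50, 75):
        tau = cosine_temperature(step, period=100)
        print(step, round(tau, 3), scaled_logits([0.2, 0.8], tau))
```

A small temperature sharpens the softmax over similarities, while a large one flattens it, so cycling between the two alternates the emphasis the loss places on hard versus easy pairs.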
Recovering Continuous Scene Dynamics from A Single Blurry Image with Events
Authors: Zhangyi Cheng, Xiang Zhang, Lei Yu, Jianzhuang Liu, Wen Yang, Gui-Song Xia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02695
Pdf link: https://arxiv.org/pdf/2304.02695
Abstract This paper aims at demystifying a single motion-blurred image with events and revealing temporally continuous scene dynamics encrypted behind motion blurs. To achieve this end, an Implicit Video Function (IVF) is learned to represent a single motion blurred image with concurrent events, enabling the latent sharp image restoration of arbitrary timestamps in the range of imaging exposures. Specifically, a dual attention transformer is proposed to efficiently leverage merits from both modalities, i.e., the high temporal resolution of event features and the smoothness of image features, alleviating temporal ambiguities while suppressing the event noise. The proposed network is trained only with the supervision of ground-truth images of limited referenced timestamps. Motion- and texture-guided supervisions are employed simultaneously to enhance restorations of the non-referenced timestamps and improve the overall sharpness. Experiments on synthetic, semi-synthetic, and real-world datasets demonstrate that our proposed method outperforms state-of-the-art methods by a large margin in terms of both objective PSNR and SSIM measurements and subjective evaluations.
Efficient and Accurate Automatic Python Bindings with cppyy & Cling
Authors: Baidyanath Kundu (1 and 2), Vassil Vassilev (1 and 2), Wim Lavrijsen (3) ((1) European Council for Nuclear Research, (2) Princeton University (US), (3) LBNL (US))
Subjects: Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2304.02712
Pdf link: https://arxiv.org/pdf/2304.02712
Abstract The simplicity of Python and the power of C++ force stark choices on a scientific software stack. There have been multiple developments to mitigate language boundaries by implementing language bindings, but the impedance mismatch between the static nature of C++ and the dynamic one of Python hinders their implementation; examples include the use of user-defined Python types with templated C++ and advanced memory management. The development of the C++ interpreter Cling has changed the way we can think of language bindings as it provides an incremental compilation infrastructure available at runtime. That is, Python can interrogate C++ on demand, and bindings can be lazily constructed at runtime. This automatic binding provision requires no direct support from library authors and offers better performance than alternative solutions, such as PyBind11. ROOT pioneered this approach with PyROOT, which was later enhanced with its successor, cppyy. However, until now, cppyy relied on the reflection layer of ROOT, which is limited in terms of provided features and performance. This paper presents the next step for language interoperability with cppyy, enabling research into uniform cross-language execution environments and boosting optimization opportunities across language boundaries. We illustrate the use of advanced C++ in Numba-accelerated Python through cppyy. We outline a path forward for re-engineering parts of cppyy to use upstream LLVM components to improve performance and sustainability. We demonstrate cppyy purely based on a C++ reflection library, InterOp, which offers interoperability primitives based on Cling and Clang-Repl.
Software and Analysis for Dynamic Voronoi Diagrams in the Hilbert Metric
Authors: Madeline Bumpus, Caesar Dai, Auguste H. Gezalyan, Sam Munoz, Renita Santhoshkumar, Songyu Ye, David M. Mount
Subjects: Computational Geometry (cs.CG)
Arxiv link: https://arxiv.org/abs/2304.02745
Pdf link: https://arxiv.org/pdf/2304.02745
Abstract The Hilbert metric is a projective metric defined on a convex body which generalizes the Cayley-Klein model of hyperbolic geometry to any convex set. In this paper, we analyze Hilbert Voronoi diagrams in the dynamic setting. In addition, we introduce dynamic visualization software for Voronoi diagrams in the Hilbert metric on user-specified convex polygons.
Adaptive Headway Motion Control and Motion Prediction for Safe Unicycle Motion Design
Authors: Aykut İşleyen, Nathan van de Wouw, Ömür Arslan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2304.02760
Pdf link: https://arxiv.org/pdf/2304.02760
Abstract Differential drive robots that can be modeled as a kinematic unicycle are a standard mobile base platform for many service and logistics robots. Safe and smooth autonomous motion around obstacles is a crucial skill for unicycle robots to perform diverse tasks in complex environments. A classical control approach for unicycle control is feedback linearization using a headway point at a fixed headway distance in front of the unicycle. The unicycle headway control brings the headway point to a desired goal location by embedding a linear headway reference dynamics, which often results in an undesired offset for the actual unicycle position. In this paper, we introduce a new unicycle headway control approach with an adaptive headway distance that overcomes this limitation, i.e., when the headway point reaches the goal the unicycle position is also at the goal. By systematically analyzing the closed-loop unicycle motion under the adaptive headway controller, we design analytical feedback motion prediction methods that bound the closed-loop unicycle position trajectory and so can be effectively used for safety assessment and safe unicycle motion design around obstacles. We present an application of adaptive headway motion control and motion prediction for safe unicycle path following around obstacles in numerical simulations.
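Illustrative sketch (not taken from the paper): the classical fixed-headway controller described in the abstract can be simulated in a few lines. The headway point sits a fixed distance `eps` ahead of the unicycle and is driven to the goal by linear reference dynamics. The gain, step size, and function names are assumptions for illustration, and the adaptive headway distance proposed by the authors is not implemented here.

```python
import math


def headway_step(x, y, theta, goal, eps=0.5, gain=1.0, dt=0.01):
    """One Euler step of classical fixed-headway feedback linearization."""
    # Headway point at distance eps in front of the unicycle.
    hx = x + eps * math.cos(theta)
    hy = y + eps * math.sin(theta)
    # Desired linear dynamics of the headway point: h_dot = -gain * (h - goal).
    ux = -gain * (hx - goal[0])
    uy = -gain * (hy - goal[1])
    # Map the desired headway velocity to linear and angular velocity.
    v = math.cos(theta) * ux + math.sin(theta) * uy
    w = (-math.sin(theta) * ux + math.cos(theta) * uy) / eps
    # Kinematic unicycle update.
    x += dt * v * math.cos(theta)
    y += dt * v * math.sin(theta)
    theta += dt * w
    return x, y, theta


def simulate(start, goal, steps=3000, eps=0.5):
    """Run the closed loop from start = (x, y, theta) towards goal = (gx, gy)."""
    x, y, theta = start
    for _ in range(steps):
        x, y, theta = headway_step(x, y, theta, goal, eps=eps)
    return x, y, theta


if __name__ == "__main__":
    goal = (2.0, 1.0)
    x, y, theta = simulate((0.0, 0.0, 0.0), goal)
    print("distance from unicycle to goal:", math.hypot(x - goal[0], y - goal[1]))
```

In this simulation the headway point converges to the goal while the unicycle itself stops roughly `eps` short of it, which is the offset that the abstract identifies as the limitation of a fixed headway distance.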
A Robust Observer with Gyroscopic Bias Correction for Rotational Dynamics
Authors: Erjen Lefeber, Marcus Greiff, Anders Robertsson
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2304.02763
Pdf link: https://arxiv.org/pdf/2304.02763
Abstract We propose an observer for rotational dynamics subject to directional and gyroscopic measurements, which simultaneously estimates the gyroscopic biases and attitude rates. We show uniform almost global asymptotic and local exponential stability of the resulting error dynamics, implying robustness against bounded disturbances. This robustness is quantified with respect to a popular nonlinear complementary filter in quantitative simulation studies, and we explore how the measurement noise propagates to the asymptotic errors as a function of tuning. This is an extended version of a paper with the same title (to appear at IFAC WC 2023). Additional mathematical details are provided in this extended version.
MoStGAN-V: Video Generation with Temporal Motion Styles
Authors: Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.02777
Pdf link: https://arxiv.org/pdf/2304.02777
Abstract Video generation remains a challenging task due to spatiotemporal complexity and the requirement of synthesizing diverse motions with temporal consistency. Previous works attempt to generate videos in arbitrary lengths either in an autoregressive manner or regarding time as a continuous signal. However, they struggle to synthesize detailed and diverse motions with temporal coherence and tend to generate repetitive scenes after a few time steps. In this work, we argue that a single time-agnostic latent vector of style-based generator is insufficient to model various and temporally-consistent motions. Hence, we introduce additional time-dependent motion styles to model diverse motion patterns. In addition, a Motion Style Attention modulation mechanism, dubbed as MoStAtt, is proposed to augment frames with vivid dynamics for each specific scale (i.e., layer), which assigns attention score for each motion style w.r.t deconvolution filter weights in the target synthesis layer and softly attends different motion styles for weight modulation. Experimental results show our model achieves state-of-the-art performance on four unconditional $256^2$ video synthesis benchmarks trained with only 3 frames per clip and produces better qualitative results with respect to dynamic motions. Code and videos have been made available at https://github.com/xiaoqian-shen/MoStGAN-V.
Enhanced Grid Following Inverter: A Uniform Control Design Framework
Authors: Alireza Askarian, Jaesang Park, Srinivasa Salapaka
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2304.02792
Pdf link: https://arxiv.org/pdf/2304.02792
Abstract This article presents a novel grid following (GFL) inverter control design framework that exploits the line dynamics structure in the $dq$ frame and treats the inverter as an actuator. The proposed framework imposes a structure on the line's coupled dynamics and captures the effect of coupling on the GFL inverter's closed-loop stability and performance. One of the main features of our work is using the Bode sensitivity integral to characterize the fundamental limitations of control design. These constraints translate into fundamental trade-offs between performance objectives such as reference tracking, closed-loop bandwidth, robust synchronization, and resilience to grid anomalies. The article develops design considerations to ensure specific trade-offs. We assess the performance of our proposed framework through simulation and experimental results.
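Background note (not taken from the paper): for reference, the textbook single-input single-output form of the Bode sensitivity integral is given below. It assumes a stable closed loop and a loop transfer function $L(s)$ of relative degree at least two, with sensitivity $S(s) = 1/(1 + L(s))$ and open-loop right-half-plane poles $p_k$. The exact variant used by the authors may differ.

```latex
\int_{0}^{\infty} \ln \lvert S(j\omega) \rvert \, d\omega
  \;=\; \pi \sum_{k} \operatorname{Re}(p_k)
```

With no unstable open-loop poles the right-hand side is zero, so any frequency band where $\lvert S \rvert < 1$ must be balanced by a band where $\lvert S \rvert > 1$. This is the kind of trade-off between tracking performance and robustness that the abstract refers to.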
Unveiling the Dynamics of Censorship, COVID-19 Regulations, and Protest: An Empirical Study of Chinese Subreddit r/china_irl
Authors: Siyi Zhou, Luca Luceri, Emilio Ferrara
Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2304.02800
Pdf link: https://arxiv.org/pdf/2304.02800
Abstract The COVID-19 pandemic has intensified numerous social issues that warrant academic investigation. Although information dissemination has been extensively studied, the silenced voices and censored content also merit attention due to their role in mobilizing social movements. In this paper, we provide empirical evidence to explore the relationships among COVID-19 regulations, censorship, and protest through a series of social incidents that occurred in China during 2022. We analyze the similarities and differences between censored articles and discussions on r/china_irl, the most popular Chinese-speaking subreddit, and scrutinize the temporal dynamics of government censorship activities and their impact on user engagement within the subreddit. Furthermore, we examine users' linguistic patterns under the influence of a censorship-driven environment. Our findings reveal patterns in topic recurrence, the complex interplay between censorship activities, user subscription, and collective commenting behavior, as well as potential linguistic adaptation strategies to circumvent censorship. These insights hold significant implications for researchers interested in understanding the survival mechanisms of marginalized groups within censored information ecosystems.
Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
Authors: Haotao Wang, Ziyu Jiang, Yan Han, Zhangyang Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2304.02806
Pdf link: https://arxiv.org/pdf/2304.02806
Abstract Graph neural networks (GNNs) have been widely applied to learning over graph data. Yet, real-world graphs commonly exhibit diverse graph structures and contain heterogeneous nodes and edges. Moreover, to enhance the generalization ability of GNNs, it has become common practice to further increase the diversity of training graph structures by incorporating graph augmentations and/or performing large-scale pre-training on more graphs. Therefore, it becomes essential for a GNN to simultaneously model diverse graph structures. Yet, naively increasing the GNN model capacity will suffer from both higher inference costs and the notorious trainability issue of GNNs. This paper introduces the Mixture-of-Expert (MoE) idea to GNNs, aiming to enhance their ability to accommodate the diversity of training graph structures, without incurring computational overheads. Our new Graph Mixture of Expert (GMoE) model enables each node in the graph to dynamically select its own optimal \textit{information aggregation experts}. These experts are trained to model different subgroups of graph structures in the training set. Additionally, GMoE includes information aggregation experts with varying aggregation hop sizes, where the experts with larger hop sizes are specialized in capturing information over longer ranges. The effectiveness of GMoE is verified through experimental results on a large variety of graph, node, and link prediction tasks in the OGB benchmark. For instance, it enhances ROC-AUC by $1.81\%$ in ogbg-molhiv and by $1.40\%$ in ogbg-molbbbp, as compared to the non-MoE baselines. Our code is available at https://github.com/VITA-Group/Graph-Mixture-of-Experts.
Causal Repair of Learning-enabled Cyber-physical Systems
Authors: Pengyuan Lu, Ivan Ruchkin, Matthew Cleaveland, Oleg Sokolsky, Insup Lee
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2304.02813
Pdf link: https://arxiv.org/pdf/2304.02813
Abstract Models of actual causality leverage domain knowledge to generate convincing diagnoses of events that caused an outcome. It is promising to apply these models to diagnose and repair run-time property violations in cyber-physical systems (CPS) with learning-enabled components (LEC). However, given the high diversity and complexity of LECs, it is challenging to encode domain knowledge (e.g., the CPS dynamics) in a scalable actual causality model that could generate useful repair suggestions. In this paper, we focus causal diagnosis on the input/output behaviors of LECs. Specifically, we aim to identify which subset of I/O behaviors of the LEC is an actual cause for a property violation. An important by-product is a counterfactual version of the LEC that repairs the run-time property by fixing the identified problematic behaviors. Based on these insights, we design a two-step diagnostic pipeline: (1) construct a Halpern-Pearl causality model that reflects the dependency of property outcome on the component's I/O behaviors, and (2) perform a search for an actual cause and corresponding repair on the model. We prove that our pipeline has the following guarantee: if an actual cause is found, the system is guaranteed to be repaired; otherwise, we have high probabilistic confidence that the LEC under analysis did not cause the property violation. We demonstrate that our approach successfully repairs learned controllers on a standard OpenAI Gym benchmark.
NTK-SAP: Improving neural network pruning by aligning training dynamics
Authors: Yite Wang, Dawei Li, Ruoyu Sun
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2304.02840
Pdf link: https://arxiv.org/pdf/2304.02840
Abstract Pruning neural networks before training has received increasing interest due to its potential to reduce training time and memory. One popular method is to prune the connections based on a certain metric, but it is not entirely clear what metric is the best choice. Recent advances in neural tangent kernel (NTK) theory suggest that the training dynamics of large enough neural networks is closely related to the spectrum of the NTK. Motivated by this finding, we propose to prune the connections that have the least influence on the spectrum of the NTK. This method can help maintain the NTK spectrum, which may help align the training dynamics to that of its dense counterpart. However, one possible issue is that the fixed-weight-NTK corresponding to a given initial point can be very different from the NTK corresponding to later iterates during the training phase. We further propose to sample multiple realizations of random weights to estimate the NTK spectrum. Note that our approach is weight-agnostic, which is different from most existing methods that are weight-dependent. In addition, we use random inputs to compute the fixed-weight-NTK, making our method data-agnostic as well. We name our foresight pruning algorithm Neural Tangent Kernel Spectrum-Aware Pruning (NTK-SAP). Empirically, our method achieves better performance than all baselines on multiple datasets.
Design and Control of a Ballbot Drivetrain with High Agility, Minimal Footprint, and High Payload
Authors: Chenzhang Xiao, Mahshid Mansouri, David Lam, Joao Ramos, Elizabeth T. Hsiao-Wecksler
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2304.02887
Pdf link: https://arxiv.org/pdf/2304.02887
Abstract This paper presents the design and control of a ballbot drivetrain that aims to achieve high agility, minimal footprint, and high payload capacity while maintaining dynamic stability. Two hardware platforms and analytical models were developed to test design and control methodologies. The full-scale ballbot prototype (MiaPURE) was constructed using off-the-shelf components and designed to have agility, footprint, and balance similar to that of a walking human. The planar inverted pendulum testbed (PIPTB) was developed as a reduced-order testbed for quick validation of system performance. We then proposed a simple yet robust LQR-PI controller to balance and maneuver the ballbot drivetrain with a heavy payload. This is crucial because the drivetrain is often subject to high stiction due to elastomeric components in the torque transmission system. This controller was first tested in the PIPTB to compare with traditional LQR and cascaded PI-PD controllers, and then implemented in the ballbot drivetrain. The MiaPURE drivetrain was able to carry a payload of 60 kg, achieve a maximum speed of 2.3 m/s, and come to a stop from a speed of 1.4 m/s in 2 seconds in a selected translation direction. Finally, we demonstrated the omnidirectional movement of the ballbot drivetrain in an indoor environment as a payload-carrying robot and a human-riding mobility device. Our experiments demonstrated the feasibility of using the ballbot drivetrain as a universal mobility platform with agile movements, minimal footprint, and high payload capacity using our proposed design and control methodologies.
LSketch: A Label-Enabled Graph Stream Sketch Toward Time-Sensitive Queries
Authors: Yiling Zeng, Chunyao Song, Yuhan Li, Tingjian Ge
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2304.02897
Pdf link: https://arxiv.org/pdf/2304.02897
Abstract Graph streams represent data interactions in real applications. The mining of graph streams plays an important role in network security, social network analysis, and traffic control, among others. However, the sheer volume and high dynamics cause great challenges for efficient storage and subsequent query analysis on them. Current studies apply sketches to summarize graph streams. We propose LSketch that works for heterogeneous graph streams, which effectively preserves the label information carried by the streams in real scenes, thereby enriching the expressive ability of sketches. In addition, as graph streams continue to evolve over time, edges too old may lose their practical significance. Therefore, we introduce the sliding window model into LSketch to eliminate the expired edges automatically. LSketch uses sub-linear storage space and can support structure based queries and time-sensitive queries with high accuracy. We perform extensive experiments over four real datasets, demonstrating the superiority of the proposed method over state-of-the-art methods, in aspects of query accuracy and time efficiency.
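The sliding-window idea in the abstract above can be illustrated with a small sketch. This is not LSketch's actual data structure, which the abstract does not specify. It is a hypothetical count-min-style summary in which the class name `SlidingEdgeSketch`, its parameters, and the slice-based expiry are all assumptions made for illustration. Labeled edges are hashed into fixed-size counter arrays, and counters are grouped into time slices so that expired edges are dropped in bulk.

```python
import hashlib
from collections import deque


class SlidingEdgeSketch:
    """Hypothetical count-min-style sketch of labeled edge weights over a
    sliding time window (illustrative only, not the LSketch structure)."""

    def __init__(self, width=64, depth=3, window=10, slices=5):
        self.width, self.depth = width, depth
        self.slices = slices
        self.slice_len = window / slices  # time span covered by one slice
        self.live = deque()               # (slice_id, counters), oldest first

    def _cells(self, edge):
        # One counter index per row, from a row-prefixed hash of the edge key.
        key = "|".join(map(str, edge)).encode()
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}|".encode() + key).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def _expire(self, now):
        # Drop whole slices that have fallen out of the window.
        newest = int(now // self.slice_len)
        while self.live and self.live[0][0] <= newest - self.slices:
            self.live.popleft()
        return newest

    def add(self, edge, now, weight=1):
        # Assumes timestamps arrive in non-decreasing order.
        newest = self._expire(now)
        if not self.live or self.live[-1][0] != newest:
            fresh = [[0] * self.width for _ in range(self.depth)]
            self.live.append((newest, fresh))
        counters = self.live[-1][1]
        for row, col in self._cells(edge):
            counters[row][col] += weight

    def estimate(self, edge, now):
        # Minimum over rows of the per-cell sums across live slices.
        self._expire(now)
        cells = list(self._cells(edge))
        return min(sum(c[row][col] for _, c in self.live) for row, col in cells)


sketch = SlidingEdgeSketch(width=64, depth=3, window=10, slices=5)
edge = ("u", "user", "v", "item")  # (source, source label, target, target label)
for t in (0, 1, 6):
    sketch.add(edge, now=t)
print(sketch.estimate(edge, now=6), sketch.estimate(edge, now=11))
```

Storage stays fixed at `slices * depth * width` counters regardless of stream length. The trade-off in this sketch is that estimates can overcount when distinct edges collide, and expiry is only accurate to within one slice length.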
Quantifying and Defending against Privacy Threats on Federated Knowledge Graph Embedding
Authors: Yuke Hu, Wei Liang, Ruofan Wu, Kai Xiao, Weiqiang Wang, Xiaochen Li, Jinfei Liu, Zhan Qin
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2304.02932
Pdf link: https://arxiv.org/pdf/2304.02932
Abstract Knowledge Graph Embedding (KGE) is a fundamental technique that extracts expressive representations from knowledge graphs (KGs) to facilitate diverse downstream tasks. The emerging federated KGE (FKGE) collaboratively trains from distributed KGs held among clients while avoiding exchanging clients' sensitive raw KGs, which can still suffer from privacy threats as evidenced in other federated model trainings (e.g., neural networks). However, quantifying and defending against such privacy threats remain unexplored for FKGE, which possesses unique properties not shared by previously studied models. In this paper, we conduct the first holistic study of the privacy threat on FKGE from both attack and defense perspectives. For the attack, we quantify the privacy threat by proposing three new inference attacks, which reveal substantial privacy risk by successfully inferring the existence of the KG triple from victim clients. For the defense, we propose DP-Flames, a novel differentially private FKGE with private selection, which offers a better privacy-utility tradeoff by exploiting the entity-binding sparse gradient property of FKGE and comes with a tight privacy accountant by incorporating the state-of-the-art private selection technique. We further propose an adaptive privacy budget allocation policy to dynamically adjust defense magnitude across the training procedure. Comprehensive evaluations demonstrate that the proposed defense can successfully mitigate the privacy threat by effectively reducing the success rate of inference attacks from 83.1% to 59.4% on average with only a modest utility decrease.
Adaptable and Interpretable Framework for Novelty Detection in Real-Time IoT Systems
Authors: Marek Wadinger, Michal Kvasnica
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2304.02947
Pdf link: https://arxiv.org/pdf/2304.02947
Abstract This paper presents the Real-time Adaptive and Interpretable Detection (RAID) algorithm. The novel approach addresses the limitations of state-of-the-art anomaly detection methods for multivariate dynamic processes, which are restricted to detecting anomalies within the scope of the model training conditions. The RAID algorithm adapts to non-stationary effects such as data drift and change points that may not be accounted for during model development, resulting in prolonged service life. A dynamic model based on joint probability distribution handles anomalous behavior detection in a system and the root cause isolation based on adaptive process limits. The RAID algorithm does not require changes to existing process automation infrastructures, making it highly deployable across different domains. Two case studies involving real dynamic system data demonstrate the benefits of the RAID algorithm, including change point adaptation, root cause isolation, and improved detection accuracy.
FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead
Authors: Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, Yuanzheng Ci, Bin Li, Xiaokang Yang, Wanli Ouyang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Arxiv link: https://arxiv.org/abs/2304.02948
Pdf link: https://arxiv.org/pdf/2304.02948
Abstract We present FengWu, an advanced data-driven global medium-range weather forecast system based on Artificial Intelligence (AI). Different from existing data-driven weather forecast methods, FengWu solves the medium-range forecast problem from a multi-modal and multi-task perspective. Specifically, a deep learning architecture equipped with model-specific encoder-decoders and cross-modal fusion Transformer is elaborately designed, which is learned under the supervision of an uncertainty loss to balance the optimization of different predictors in a region-adaptive manner. Besides this, a replay buffer mechanism is introduced to improve medium-range forecast performance. With 39-year data training based on the ERA5 reanalysis, FengWu is able to accurately reproduce the atmospheric dynamics and predict the future land and atmosphere states at 37 vertical levels on a 0.25° latitude-longitude resolution. Hindcasts of 6-hourly weather in 2018 based on ERA5 demonstrate that FengWu performs better than GraphCast in predicting 80% of the 880 reported predictands, e.g., reducing the root mean square error (RMSE) of 10-day lead global z500 prediction from 733 to 651 $m^2/s^2$. In addition, the inference cost of each iteration is merely 600ms on NVIDIA Tesla A100 hardware. The results suggest that FengWu can significantly improve the forecast skill and extend the skillful global medium-range weather forecast out to 10.75 days lead (with ACC of z500 > 0.6) for the first time.
Deep Long-Short Term Memory networks: Stability properties and Experimental validation
Authors: Fabio Bonassi, Alessio La Bella, Giulio Panzani, Marcello Farina, Riccardo Scattolini
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2304.02975
Pdf link: https://arxiv.org/pdf/2304.02975
Abstract The aim of this work is to investigate the use of Incrementally Input-to-State Stable ($\delta$ISS) deep Long Short Term Memory networks (LSTMs) for the identification of nonlinear dynamical systems. We show that suitable sufficient conditions on the weights of the network can be leveraged to set up a training procedure able to learn provably $\delta$ISS LSTM models from data. The proposed approach is tested on a real brake-by-wire apparatus to identify a model of the system from experimentally collected input-output data. Results show satisfactory modeling performance.
Distributed Model Predictive Control for Periodic Cooperation of Multi-Agent Systems
Authors: Matthias Köhler, Matthias A. Müller, Frank Allgöwer
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2304.03002
Pdf link: https://arxiv.org/pdf/2304.03002
Abstract We consider multi-agent systems with heterogeneous, nonlinear agents subject to individual constraints that want to achieve a periodic, dynamic cooperative control goal which can be characterised by a set and a suitable cost. We propose a sequential distributed model predictive control (MPC) scheme in which agents sequentially solve an individual optimisation problem to track an artificial periodic output trajectory. The optimisation problems are coupled through these artificial periodic output trajectories, which are communicated and penalised using the cost that characterises the cooperative goal. The agents communicate only their artificial trajectories and only once per time step. We show that under suitable assumptions, the agents can incrementally move their artificial output trajectories towards the cooperative goal, and, hence, their closed-loop output trajectories asymptotically achieve it. We illustrate the scheme with a simulation example.
IoT Federated Blockchain Learning at the Edge
Authors: James Calo, Benny Lo
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2304.03006
Pdf link: https://arxiv.org/pdf/2304.03006
Abstract IoT devices are sorely underutilized in the medical field, especially within machine learning for medicine, yet they offer unrivaled benefits. IoT devices are low-cost, energy-efficient, small and intelligent devices. In this paper, we propose a distributed federated learning framework for IoT devices, more specifically for IoMT (Internet of Medical Things), using blockchain to allow for a decentralized scheme improving privacy and efficiency over a centralized system; this allows us to move from the cloud-based architectures, that are prevalent, to the edge. The system is designed for three paradigms: 1) Training neural networks on IoT devices to allow for collaborative training of a shared model whilst decoupling the learning from the dataset to ensure privacy. Training is performed in an online manner simultaneously amongst all participants, allowing for the training of actual data that may not have been present in a dataset collected in the traditional way and dynamically adapt the system whilst it is being trained. 2) Training of an IoMT system in a fully private manner such as to mitigate the issue with confidentiality of medical data and to build robust, and potentially bespoke, models where not much, if any, data exists. 3) Distribution of the actual network training, something federated learning itself does not do, to allow hospitals, for example, to utilize their spare computing resources to train network models.
Data-driven HVAC Control Using Symbolic Regression: Design and Implementation
Authors: Yuki Ozawa, Dafang Zhao, Daichi Watari, Ittetsu Taniguchi, Toshihiro Suzuki, Yoshiyuki Shimoda, Takao Onoye
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2304.03078
Pdf link: https://arxiv.org/pdf/2304.03078
Abstract The large amount of data collected in buildings makes energy management smarter and more energy efficient. This study proposes a design and implementation methodology of data-driven heating, ventilation, and air conditioning (HVAC) control. Building thermodynamics is modeled using a symbolic regression model (SRM) built from the collected data. Additionally, an HVAC system model is also developed with a data-driven approach. A model predictive control (MPC) based HVAC scheduling is formulated with the developed models to minimize energy consumption and peak power demand and maximize thermal comfort. The performance of the proposed framework is demonstrated in a workspace of an actual campus building. The HVAC system using the proposed framework reduces the peak power by 16.1% compared to the widely used thermostat controller.
Inductive Graph Unlearning
Authors: Cheng-Long Wang, Mengdi Huai, Di Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2304.03093
Pdf link: https://arxiv.org/pdf/2304.03093
Abstract As a way to implement the "right to be forgotten" in machine learning, machine unlearning aims to completely remove the contributions and information of the samples to be deleted from a trained model without affecting the contributions of other samples. Recently, many frameworks for machine unlearning have been proposed, and most of them focus on image and text data. To extend machine unlearning to graph data, GraphEraser has been proposed. However, a critical issue is that GraphEraser is specifically designed for the transductive graph setting, where the graph is static and attributes and edges of test nodes are visible during training. It is unsuitable for the inductive setting, where the graph could be dynamic and the test graph information is invisible in advance. Such inductive capability is essential for production machine learning systems with evolving graphs like social media and transaction networks. To fill this gap, we propose the GUided InDuctivE Graph Unlearning framework (GUIDE). GUIDE consists of three components: guided graph partitioning with fairness and balance, efficient subgraph repair, and similarity-based aggregation. Empirically, we evaluate our method on several inductive benchmarks and evolving transaction graphs. Generally speaking, GUIDE can be efficiently implemented on inductive graph learning tasks thanks to its low graph partition cost, in terms of both computation and structure information. The code will be available here: https://github.com/Happy2Git/GUIDE.
Constrained Exploration in Reinforcement Learning with Optimality Preservation
Authors: Peter C. Y. Chen
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2304.03104
Pdf link: https://arxiv.org/pdf/2304.03104
Abstract We consider a class of reinforcement-learning systems in which the agent follows a behavior policy to explore a discrete state-action space to find an optimal policy while adhering to some restriction on its behavior. Such restriction may prevent the agent from visiting some state-action pairs, possibly leading to the agent finding only a sub-optimal policy. To address this problem we introduce the concept of constrained exploration with optimality preservation, whereby the exploration behavior of the agent is constrained to meet a specification while the optimality of the (original) unconstrained learning process is preserved. We first establish a feedback-control structure that models the dynamics of the unconstrained learning process. We then extend this structure by adding a supervisor to ensure that the behavior of the agent meets the specification, and establish (for a class of reinforcement-learning problems with a known deterministic environment) a necessary and sufficient condition under which optimality is preserved. This work demonstrates the utility and the prospect of studying reinforcement-learning problems in the context of the theories of discrete-event systems, automata and formal languages.
A self-organizing robotic aggregate using solid and liquid-like collective states
Authors: Baudouin Saintyves, Matthew Spenko, Heinrich M. Jaeger
Subjects: Robotics (cs.RO); Soft Condensed Matter (cond-mat.soft); Adaptation and Self-Organizing Systems (nlin.AO)
Arxiv link: https://arxiv.org/abs/2304.03125
Pdf link: https://arxiv.org/pdf/2304.03125
Abstract Designing robotic systems that can change their physical form factor as well as their compliance to adapt to environmental constraints remains a major conceptual and technical challenge. To address this, we introduce the Granulobot, a modular system that blurs the distinction between soft, modular, and swarm robotics. The system consists of gear-like units that each contain a single actuator such that units can self-assemble into larger, granular aggregates using magnetic coupling. These aggregates can reconfigure dynamically and also split up into subsystems that might later recombine. Aggregates can self-organize into collective states with solid- and liquid-like properties, thus displaying widely differing compliances. These states can be perturbed locally via actuators or externally via mechanical feedback from the environment to produce adaptive shape shifting in a decentralized manner. This in turn can generate locomotion strategies adapted to different conditions. Aggregates can move over obstacles without using external sensors or coordinate to maintain a steady gait over different surfaces without electronic communication among units. The modular design highlights a physical, morphological form of control that advances the development of resilient robotic systems with the ability to morph and adapt to different functions and conditions.
From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection
Authors: Changsheng Lu, Hao Zhu, Piotr Koniusz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03140
Pdf link: https://arxiv.org/pdf/2304.03140
Abstract Unlike current deep keypoint detectors that are trained to recognize a limited number of body parts, few-shot keypoint detection (FSKD) attempts to localize any keypoints, including novel or base keypoints, depending on the reference samples. FSKD requires semantically meaningful relations for keypoint similarity learning to overcome the ubiquitous noise and ambiguous local patterns. One rescue comes with the vision transformer (ViT), as it captures long-range relations well. However, ViT may model irrelevant features outside of the region of interest due to the global attention matrix, thus degrading similarity learning between support and query features. In this paper, we present a novel saliency-guided vision transformer, dubbed SalViT, for few-shot keypoint detection. Our SalViT enjoys a uniquely designed masked self-attention and a morphology learner, where the former introduces the saliency map as a soft mask to constrain the self-attention on foregrounds, while the latter leverages the so-called power normalization to adjust the morphology of the saliency map, realizing a "dynamically changing receptive field". Moreover, as saliency detectors add computations, we show that attentive masks of the DINO transformer can replace saliency. On top of SalViT, we also investigate i) transductive FSKD that enhances keypoint representations with unlabelled data and ii) FSKD under occlusions. We show that our model performs well on five public datasets and achieves ~10% PCK higher than the normally trained model under severe occlusions.
Instant-NVR: Instant Neural Volumetric Rendering for Human-object Interactions from Monocular RGBD Stream
Authors: Yuheng Jiang, Kaixin Yao, Zhuo Su, Zhehao Shen, Haimin Luo, Lan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03184
Pdf link: https://arxiv.org/pdf/2304.03184
Abstract Convenient 4D modeling of human-object interactions is essential for numerous applications. However, monocular tracking and rendering of complex interaction scenarios remain challenging. In this paper, we propose Instant-NVR, a neural approach for instant volumetric human-object tracking and rendering using a single RGBD camera. It bridges traditional non-rigid tracking with recent instant radiance field techniques via a multi-thread tracking-rendering mechanism. In the tracking front-end, we adopt a robust human-object capture scheme to provide sufficient motion priors. We further introduce a separated instant neural representation with a novel hybrid deformation module for the interacting scene. We also provide an on-the-fly reconstruction scheme of the dynamic/static radiance fields via efficient motion-prior searching. Moreover, we introduce an online key frame selection scheme and a rendering-aware refinement strategy to significantly improve the appearance details for online novel-view synthesis. Extensive experiments demonstrate the effectiveness and efficiency of our approach for the instant generation of human-object radiance fields on the fly, notably achieving real-time photo-realistic novel view synthesis under complex human-object interactions.
LANe: Lighting-Aware Neural Fields for Compositional Scene Synthesis
Authors: Akshay Krishnan, Amit Raj, Xianling Zhang, Alexandra Carlson, Nathan Tseng, Sandhya Sridhar, Nikita Jaipuria, James Hays
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03280
Pdf link: https://arxiv.org/pdf/2304.03280
Abstract Neural fields have recently enjoyed great success in representing and rendering 3D scenes. However, most state-of-the-art implicit representations model static or dynamic scenes as a whole, with minor variations. Existing work on learning disentangled world and object neural fields does not consider the problem of composing objects into different world neural fields in a lighting-aware manner. We present Lighting-Aware Neural Field (LANe) for the compositional synthesis of driving scenes in a physically consistent manner. Specifically, we learn a scene representation that disentangles the static background and transient elements into a world-NeRF and class-specific object-NeRFs to allow compositional synthesis of multiple objects in the scene. Furthermore, we explicitly designed both the world and object models to handle lighting variation, which allows us to compose objects into scenes with spatially varying lighting. This is achieved by constructing a light field of the scene and using it in conjunction with a learned shader to modulate the appearance of the object NeRFs. We demonstrate the performance of our model on a synthetic dataset of diverse lighting conditions rendered with the CARLA simulator, as well as a novel real-world dataset of cars collected at different times of the day. Our approach outperforms state-of-the-art compositional scene synthesis on the challenging dataset setup by composing object-NeRFs learned from one scene into an entirely different scene whilst still respecting the lighting variations in the novel scene. For more results, please visit our project website https://lane-composition.github.io/.
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
Authors: Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2304.03282
Pdf link: https://arxiv.org/pdf/2304.03282
Abstract Humans possess a versatile mechanism for extracting structured representations of our visual world. When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them. To mimic such capability, we propose Visual Dependency Transformers (DependencyViT) that can induce visual dependencies without any labels. We achieve that with a novel neural operator called reversed attention that can naturally capture long-range visual dependencies between image patches. Specifically, we formulate it as a dependency graph where a child token in reversed attention is trained to attend to its parent tokens and send information following a normalized probability distribution rather than gathering information as in conventional self-attention. With such a design, hierarchies naturally emerge from reversed attention layers, and a dependency tree is progressively induced from leaf nodes to the root node in an unsupervised manner. DependencyViT offers several appealing benefits. (i) Entities and their parts in an image are represented by different subtrees, enabling part partitioning from dependencies; (ii) Dynamic visual pooling is made possible. The leaf nodes which rarely send messages can be pruned without hindering the model performance, based on which we propose the lightweight DependencyViT-Lite to reduce the computational and memory footprints; (iii) DependencyViT works well on both self- and weakly-supervised pretraining paradigms on ImageNet, and demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised part and saliency segmentation, recognition, and detection.

New submissions for Fri, 7 Apr 23 #26

Keyword: efficient

Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks

nD-PDPA: nDimensional Probability Density Profile Analysis

A Certified Radius-Guided Attack Framework to Image Segmentation Models

Recovering Continuous Scene Dynamics from A Single Blurry Image with Events

Agnostic proper learning of monotone functions: beyond the black-box correction barrier

A Unified Taxonomy for Automated Vehicles: Individual, Cooperative, Collaborative, On-Road, and Off-Road

Efficient OCR for Building a Diverse Digital History

Sejarah dan Perkembangan Teknik Natural Language Processing (NLP) Bahasa Indonesia: Tinjauan tentang sejarah, perkembangan teknologi, dan aplikasi NLP dalam bahasa Indonesia

Robust, privacy-preserving, transparent, and auditable on-device blocklisting

GIF: A General Graph Unlearning Strategy via Influence Function

Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets

Towards an Effective and Efficient Transformer for Rain-by-snow Weather Removal

VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection

Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach

Affect as a proxy for literary mood

LSketch: A Label-Enabled Graph Stream Sketch Toward Time-Sensitive Queries

InterFormer: Real-time Interactive Image Segmentation

When approximate design for fast homomorphic computation provides differential privacy guarantees

A Fast and Lightweight Network for Low-Light Image Enhancement

IoT Federated Blockchain Learning at the Edge

PointCAT: Cross-Attention Transformer for point cloud

Tensor Slicing and Optimization for Multicore NPUs

A computation of D(9) using FPGA Supercomputing

Data-driven HVAC Control Using Symbolic Regression: Design and Implementation

Offline Uncertainty Sampling in Data-driven Stochastic MPC

Inductive Graph Unlearning

FABRID: Flexible Attestation-Based Routing for Inter-Domain Networks

Simplifying Content-Based Neural News Recommendation: On User Modeling and Training Objectives

Zero-shot Generative Model Adaptation via Image-specific Prompt Learning

BotTriNet: A Unified and Efficient Embedding for Social Bots Detection via Metric Learning

Parameterized Approximation Schemes for Clustering with General Norm Objectives

Spectral Toolkit of Algorithms for Graphs: Technical Report (1)

Instant-NVR: Instant Neural Volumetric Rendering for Human-object Interactions from Monocular RGBD Stream

Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

Hierarchical Graph Neural Network with Cross-Attention for Cross-Device User Matching

FedBot: Enhancing Privacy in Chatbots with Federated Learning

DiffMimic: Efficient Motion Mimicking with Differentiable Physics

Keyword: faster

DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model

Convolutional neural networks for crack detection on flexible road pavements

Boundary-Denoising for Video Activity Localization

Training a Two Layer ReLU Network Analytically

Patch-wise Features for Blur Image Classification

DiffMimic: Efficient Motion Mimicking with Differentiable Physics

Keyword: mobile

Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks

Adaptive Headway Motion Control and Motion Prediction for Safe Unicycle Motion Design

Evaluating Customization of Remote Tele-operation Interfaces for Assistive Robots

Gotta Assess `Em All: A Risk Analysis of Criminal Offenses Facilitated through PokemonGO

SwarmGear: Heterogeneous Swarm of Drones with Reconfigurable Leader Drone and Virtual Impedance Links for Multi-Robot Inspection

Spritz-PS: Validation of Synthetic Face Images Using a Large Dataset of Printed Documents

Keyword: pruning

To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency

NTK-SAP: Improving neural network pruning by aligning training dynamics

Learning to Learn with Indispensable Connections

Keyword: voxel

VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection

Keyword: lidar

VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection

Geometric-aware Pretraining for Vision-centric 3D Object Detection

SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation

Keyword: diffusion

DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model

Benchmarking Robustness to Text-Guided Corruptions

DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

Zero-shot Generative Model Adaptation via Image-specific Prompt Learning

SketchFFusion: Sketch-guided image editing with diffusion model

Face Animation with an Attribute-Guided Diffusion Model

Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Diffusion Models as Masked Autoencoders

Keyword: dynamic

Abstraction-based Probabilistic Stability Analysis of Polyhedral Probabilistic Hybrid Systems

Emergent Coordination through Game-Induced Nonlinear Opinion Dynamics

Going Further: Flatness at the Rescue of Early Stopping for Adversarial Example Transferability

ACTION++: Improving Semi-supervised Medical Image Segmentation with Adaptive Anatomical Contrast

Recovering Continuous Scene Dynamics from A Single Blurry Image with Events

Efficient and Accurate Automatic Python Bindings with cppyy & Cling