Special Session: Approximation and Fault Resiliency of DNN Accelerators
Authors: Mohammad Hasan Ahmadilivani, Mario Barbareschi, Salvatore Barone, Alberto Bosio, Masoud Daneshtalab, Salvatore Della Torca, Gabriele Gavarini, Maksim Jenihhin, Jaan Raik, Annachiara Ruospo, Ernesto Sanchez, Mahdi Taheri
Abstract
Deep Learning, and in particular, Deep Neural Network (DNN) is nowadays widely used in many scenarios, including safety-critical applications such as autonomous driving. In this context, besides energy efficiency and performance, reliability plays a crucial role since a system failure can jeopardize human life. As with any other device, the reliability of hardware architectures running DNNs has to be evaluated, usually through costly fault injection campaigns. This paper explores the approximation and fault resiliency of DNN accelerators. We propose to use approximate (AxC) arithmetic circuits to agilely emulate errors in hardware without performing fault injection on the DNN. To allow fast evaluation of AxC DNN, we developed an efficient GPU-based simulation framework. Further, we propose a fine-grain analysis of fault resiliency by examining fault propagation and masking in networks
From Data to Action: Exploring AI and IoT-driven Solutions for Smarter Cities
Authors: Tiago Dias, Tiago Fonseca, João Vitorino, Andreia Martins, Sofia Malpique, Isabel Praça
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Abstract
The emergence of smart cities demands harnessing advanced technologies like the Internet of Things (IoT) and Artificial Intelligence (AI) and promises to unlock cities' potential to become more sustainable, efficient, and ultimately livable for their inhabitants. This work introduces an intelligent city management system that provides a data-driven approach to three use cases: (i) analyze traffic information to reduce the risk of traffic collisions and improve driver and pedestrian safety, (ii) identify when and where energy consumption can be reduced to improve cost savings, and (iii) detect maintenance issues like potholes in the city's roads and sidewalks, as well as the beginning of hazards like floods and fires. A case study in Aveiro City demonstrates the system's effectiveness in generating actionable insights that enhance security, energy efficiency, and sustainability, while highlighting the potential of AI and IoT-driven solutions for smart city development.
Prompter: Zero-shot Adaptive Prefixes for Dialogue State Tracking Domain Adaptation
Abstract
A challenge in the Dialogue State Tracking (DST) field is adapting models to new domains without using any supervised data, zero-shot domain adaptation. Parameter-Efficient Transfer Learning (PETL) has the potential to address this problem due to its robustness. However, it has yet to be applied to the zero-shot scenarios, as it is not clear how to apply it unsupervisedly. Our method, Prompter, uses descriptions of target domain slots to generate dynamic prefixes that are concatenated to the key and values at each layer's self-attention mechanism. This allows for the use of prefix-tuning in zero-shot. Prompter outperforms previous methods on both the MultiWOZ and SGD benchmarks. In generating prefixes, our analyses find that Prompter not only utilizes the semantics of slot descriptions but also how often the slots appear together in conversation. Moreover, Prompter's gains are due to its improved ability to distinguish "none"-valued dialogue slots, compared against baselines.
Online Multi-Contact Receding Horizon Planning via Value Function Approximation
Authors: Jiayi Wang, Sanghyun Kim, Teguh Santoso Lembono, Wenqian Du, Jaehyun Shim, Saeid Samadi, Ke Wang, Vladimir Ivan, Sylvain Calinon, Sethu Vijayakumar, Steve Tonneau
Abstract
Planning multi-contact motions in a receding horizon fashion requires a value function to guide the planning with respect to the future, e.g., building momentum to traverse large obstacles. Traditionally, the value function is approximated by computing trajectories in a prediction horizon (never executed) that foresees the future beyond the execution horizon. However, given the non-convex dynamics of multi-contact motions, this approach is computationally expensive. To enable online Receding Horizon Planning (RHP) of multi-contact motions, we find efficient approximations of the value function. Specifically, we propose a trajectory-based and a learning-based approach. In the former, namely RHP with Multiple Levels of Model Fidelity, we approximate the value function by computing the prediction horizon with a convex relaxed model. In the latter, namely Locally-Guided RHP, we learn an oracle to predict local objectives for locomotion tasks, and we use these local objectives to construct local value functions for guiding a short-horizon RHP. We evaluate both approaches in simulation by planning centroidal trajectories of a humanoid robot walking on moderate slopes, and on large slopes where the robot cannot maintain static balance. Our results show that locally-guided RHP achieves the best computation efficiency (95\%-98.6\% cycles converge online). This computation advantage enables us to demonstrate online receding horizon planning of our real-world humanoid robot Talos walking in dynamic environments that change on-the-fly.
Optimal Wheeler Language Recognition
Authors: Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Alberto Policriti, Nicola Prezza
Subjects: Formal Languages and Automata Theory (cs.FL)
Abstract
A Wheeler automaton is a finite state automaton whose states admit a total Wheeler order, reflecting the co-lexicographic order of the strings labeling source-to-node paths. A Wheeler language is a regular language admitting an accepting Wheeler automaton. Wheeler languages admit efficient and elegant solutions to hard problems such as automata compression and regular expression matching, therefore deciding whether a regular language is Wheeler is relevant in applications requiring efficient solutions to those problems. In this paper, we show that it is possible to decide whether a DFA with n states and m transitions recognizes a Wheeler language in $O(mn)$ time. This is a significant improvement over the running time $O(n^{13} + m\log n)$ of the previous polynomial-time algorithm (Alanko et al., Information and Computation 2021). A proof-of-concept implementation of this algorithm is available in a public repository. We complement this upper bound with a conditional matching lower bound stating that, unless the strong exponential time hypothesis (SETH) fails, the problem cannot be solved in strongly subquadratic time. The same problem is known to be PSPACE-complete when the input is an NFA (D'Agostino et al., Theoretical Computer Science 2023). Together with that result, our paper essentially closes the algorithmic problem of Wheeler language recognition.
3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels
Authors: Zhenzhen Weng, Alexander S. Gorban, Jingwei Ji, Mahyar Najibi, Yin Zhou, Dragomir Anguelov
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Training a 3D human keypoint detector from point clouds in a supervised manner requires large volumes of high quality labels. While it is relatively easy to capture large amounts of human point clouds, annotating 3D keypoints is expensive, subjective, error prone and especially difficult for long-tail cases (pedestrians with rare poses, scooterists, etc.). In this work, we propose GC-KPL - Geometry Consistency inspired Key Point Leaning, an approach for learning 3D human joint locations from point clouds without human labels. We achieve this by our novel unsupervised loss formulations that account for the structure and movement of the human body. We show that by training on a large training set from Waymo Open Dataset without any human annotated keypoints, we are able to achieve reasonable performance as compared to the fully supervised approach. Further, the backbone benefits from the unsupervised training and is useful in downstream fewshot learning of keypoints, where fine-tuning on only 10 percent of the labeled training data gives comparable performance to fine-tuning on the entire set. We demonstrated that GC-KPL outperforms by a large margin over SoTA when trained on entire dataset and efficiently leverages large volumes of unlabeled data.
SKG: A Versatile Information Retrieval and Analysis Framework for Academic Papers with Semantic Knowledge Graphs
Abstract
The number of published research papers has experienced exponential growth in recent years, which makes it crucial to develop new methods for efficient and versatile information extraction and knowledge discovery. To address this need, we propose a Semantic Knowledge Graph (SKG) that integrates semantic concepts from abstracts and other meta-information to represent the corpus. The SKG can support various semantic queries in academic literature thanks to the high diversity and rich information content stored within. To extract knowledge from unstructured text, we develop a Knowledge Extraction Module that includes a semi-supervised pipeline for entity extraction and entity normalization. We also create an ontology to integrate the concepts with other meta information, enabling us to build the SKG. Furthermore, we design and develop a dataflow system that demonstrates how to conduct various semantic queries flexibly and interactively over the SKG. To demonstrate the effectiveness of our approach, we conduct the research based on the visualization literature and provide real-world use cases to show the usefulness of the SKG. The dataset and codes for this work are available at https://osf.io/aqv8p/?view_only=2c26b36e3e3941ce999df47e4616207f.
Autonomous Capability Assessment of Black-Box Sequential Decision-Making Systems
Abstract
It is essential for users to understand what their AI systems can and can't do in order to use them safely. However, the problem of enabling users to assess AI systems with evolving sequential decision making (SDM) capabilities is relatively understudied. This paper presents a new approach for modeling the capabilities of black-box AI systems that can plan and act, along with the possible effects and requirements for executing those capabilities in stochastic settings. We present an active-learning approach that can effectively interact with a black-box SDM system and learn an interpretable probabilistic model describing its capabilities. Theoretical analysis of the approach identifies the conditions under which the learning process is guaranteed to converge to the correct model of the agent; empirical evaluations on different agents and simulated scenarios show that this approach is few-shot generalizable and can effectively describe the capabilities of arbitrary black-box SDM agents in a sample-efficient manner.
Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers
Abstract
The rapid growth of scientific publications, particularly during the COVID-19 pandemic, emphasizes the need for tools to help researchers efficiently comprehend the latest advancements. One essential part of understanding scientific literature is research aspect classification, which categorizes sentences in abstracts to Background, Purpose, Method, and Finding. In this study, we investigate the impact of different datasets on model performance for the crowd-annotated CODA-19 research aspect classification task. Specifically, we explore the potential benefits of using the large, automatically curated PubMed 200K RCT dataset and evaluate the effectiveness of large language models (LLMs), such as LLaMA, GPT-3, ChatGPT, and GPT-4. Our results indicate that using the PubMed 200K RCT dataset does not improve performance for the CODA-19 task. We also observe that while GPT-4 performs well, it does not outperform the SciBERT model fine-tuned on the CODA-19 dataset, emphasizing the importance of a dedicated and task-aligned datasets dataset for the target task. Our code is available at https://github.com/Crowd-AI-Lab/CODA-19-exp.
Extended Neighboring Extremal Optimal Control with State and Preview Perturbations
Authors: Amin Vahidi-Moghaddam, Kaixiang Zhang, Zhaojian Li, Xunyuan Yin, Ziyou Song, Yan Wang
Abstract
Optimal control schemes have achieved remarkable performance in numerous engineering applications. However, they typically require high computational cost, which has limited their use in real-world engineering systems with fast dynamics and/or limited computation power. To address this challenge, Neighboring Extremal (NE) has been developed as an efficient optimal adaption strategy to adapt a pre-computed nominal control solution to perturbations from the nominal trajectory. The resulting control law is a time-varying feedback gain that can be pre-computed along with the original optimal control problem, and it takes negligible online computation. However, existing NE frameworks only deal with state perturbations while in modern applications, optimal controllers (e.g., predictive controllers) frequently incorporate preview information. Therefore, a new NE framework is needed to adapt to such preview perturbations. In this work, an extended NE (ENE) framework is developed to systematically adapt the nominal control to both state and preview perturbations. We show that the derived ENE law is two time-varying feedback gains on the state perturbation and the preview perturbation. We also develop schemes to handle nominal non-optimal solutions and large perturbations to retain optimal performance and constraint satisfaction. Case study on nonlinear model predictive control is presented due to its popularity but it can be easily extended to other optimal control schemes. Promising simulation results on the cart inverted pendulum problem demonstrate the efficacy of the ENE algorithm.
Improving Vietnamese Legal Question--Answering System based on Automatic Data Enrichment
Abstract
Question answering (QA) in law is a challenging problem because legal documents are much more complicated than normal texts in terms of terminology, structure, and temporal and logical relationships. It is even more difficult to perform legal QA for low-resource languages like Vietnamese where labeled data are rare and pre-trained language models are still limited. In this paper, we try to overcome these limitations by implementing a Vietnamese article-level retrieval-based legal QA system and introduce a novel method to improve the performance of language models by improving data quality through weak labeling. Our hypothesis is that in contexts where labeled data are limited, efficient data enrichment can help increase overall performance. Our experiments are designed to test multiple aspects, which demonstrate the effectiveness of the proposed technique.
InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding
Authors: Hanrong Ye, Dan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model. Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models' ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. And then, we design a transformer decoder to establish spatial and cross-task interaction globally, and a novel UP-Transformer block is devised to increase the resolutions of multi-task features gradually and establish cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate our proposal's effectiveness, establishing significant state-of-the-art performances.
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Authors: Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra
Abstract
Weight-sharing supernet has become a vital component for performance estimation in the state-of-the-art (SOTA) neural architecture search (NAS) frameworks. Although supernet can directly generate different subnetworks without retraining, there is no guarantee for the quality of these subnetworks because of weight sharing. In NLP tasks such as machine translation and pre-trained language modeling, we observe that given the same model architecture, there is a large performance gap between supernet and training from scratch. Hence, supernet cannot be directly used and retraining is necessary after finding the optimal architectures. In this work, we propose mixture-of-supernets, a generalized supernet formulation where mixture-of-experts (MoE) is adopted to enhance the expressive power of the supernet model, with negligible training overhead. In this way, different subnetworks do not share the model weights directly, but through an architecture-based routing mechanism. As a result, model weights of different subnetworks are customized towards their specific architectures and the weight generation is learned by gradient descent. Compared to existing weight-sharing supernet for NLP, our method can minimize the retraining time, greatly improving training efficiency. In addition, the proposed method achieves the SOTA performance in NAS for building fast machine translation models, yielding better latency-BLEU tradeoff compared to HAT, state-of-the-art NAS for MT. We also achieve the SOTA performance in NAS for building memory-efficient task-agnostic BERT models, outperforming NAS-BERT and AutoDistil in various model sizes.
Abstract
Due to the significant increase in the size of spatial data, it is essential to use distributed parallel processing systems to efficiently analyze spatial data. In this paper, we first study learned spatial data partitioning, which effectively assigns groups of big spatial data to computers based on locations of data by using machine learning techniques. We formalize spatial data partitioning in the context of reinforcement learning and develop a novel deep reinforcement learning algorithm. Our learning algorithm leverages features of spatial data partitioning and prunes ineffective learning processes to find optimal partitions efficiently. Our experimental study, which uses Apache Sedona and real-world spatial data, demonstrates that our method efficiently finds partitions for accelerating distance join queries and reduces the workload run time by up to 59.4%.
Longest Common Prefix Arrays for Succinct k-Spectra
Authors: Jarno N. Alanko, Elena Biagi, Simon J. Puglisi
Abstract
The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. K-spectra have many applications in bioinformatics including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) has been recently introduced as an algorithmic tool to efficiently represent and query these objects. The longest common prefix (LCP) array for a k-spectrum is an array of length n that stores the length of the longest common prefix of adjacent k-mers as they occur in lexicographical order. The LCP array has at least two important applications, namely to accelerate pseudoalignment algorithms using the SBWT and to allow simulation of variable-order de Bruijn graphs within the SBWT framework. In this paper we explore algorithms to compute the LCP array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk) time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the LCP array can be computed in optimal O(n) time, where n is the length of the SBWT of the spectrum. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but is often outperformed on smaller values of k by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and LCP array construction algorithms for suffix arrays.
The Hybrid Extended Bicycle: A Simple Model for High Dynamic Vehicle Trajectory Planning
Authors: Agapius Bou Ghosn, Philip Polack, Arnaud de La Fortelle
Abstract
While highly automated driving relies most of the time on a smooth driving assumption, the possibility of a vehicle performing harsh maneuvers with high dynamic driving to face unexpected events is very likely. The modeling of the behavior of the vehicle in these events is crucial to proper planning and controlling; the used model should present accurate and computationally efficient properties. In this article, we propose an LSTM-based hybrid extended bicycle model able to present an accurate description of the state of the vehicle for both normal and aggressive situations. The introduced model is used in an MPPI framework for planning trajectories in high-dynamic scenarios where other simple models fail.
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Authors: Clemens JS Schaefer, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou, Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, Yu Emma Wang
Abstract
Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an effective ordering of layer sensitivities. We address this by augmenting the estimated Hessian with additional information to capture inter-layer dependencies. We demonstrate that this consistently improves PTQ performance along the accuracy-latency Pareto frontier across multiple models. Our method combines second-order information and inter-layer dependencies to guide a bisection search, finding quantization configurations within a user-configurable model accuracy degradation range. We evaluate the effectiveness of our method on the ResNet50, MobileNetV2, and BERT models. Our experiments demonstrate latency reductions compared to a 16-bit baseline of $25.48\%$, $21.69\%$, and $33.28\%$ respectively, while maintaining model accuracy to within $99.99\%$ of the baseline model.
From Bad to Worse: Using Private Data to Propagate Disinformation on Online Platforms with a Greater Efficiency
Authors: Protik Bose Pranto, Waqar Hassan Khan, Sahar Abdelnabi, Rebecca Weil, Mario Fritz, Rakibul Hasan
Abstract
We outline a planned experiment to investigate if personal data (e.g., demographics and behavioral patterns) can be used to selectively expose individuals to disinformation such that an adversary can spread disinformation more efficiently compared to broadcasting the same information to everyone. This mechanism, if effective, will have devastating consequences as modern technologies collect and infer a plethora of private data that can be abused to target with disinformation. We believe this research will inform designing policies and regulations for online platforms.
Big-data-driven and AI-based framework to enable personalization in wireless networks
Authors: Rawan Alkurd, Ibrahim Abualhaol, Halim Yanikomeroglu
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract
Current communication networks use design methodologies that prevent the realization of maximum network efficiency. In the first place, while users' perception of satisfactory service diverges widely, current networks are designed to be a "universal fit," where they are generally over-engineered to deliver services appealing to all types of users. Also, current networks lack user-level data cognitive intelligence that would enable fast personalized network decisions and actions through automation. Thus, in this article, we propose the utilization of AI, big data analytics, and real-time non-intrusive user feedback in order to enable the personalization of wireless networks. Based on each user's actual QoS requirements and context, a multi-objective formulation enables the network to micro-manage and optimize the provided QoS and user satisfaction levels simultaneously. Moreover, in order to enable user feedback tracking and measurement, we propose a user satisfaction model based on the zone of tolerance concept. Furthermore, we propose a big-data-driven and AI-based personalization framework to integrate personalization into wireless networks. Finally, we implement a personalized network prototype to demonstrate the proposed personalization concept and its potential benefits through a case study. The case study shows how personalization can be realized to enable the efficient optimization of network resources such that certain requirement levels of user satisfaction and revenue in the form of saved resources are achieved.
Sensing-based Beamforming Design for Joint Performance Enhancement of RIS-Aided ISAC Systems
Abstract
Reconfigurable intelligent surface (RIS) has shown its great potential in facilitating device-based integrated sensing and communication (ISAC), where sensing and communication tasks are mostly conducted on different time-frequency resources. While the more challenging scenarios of simultaneous sensing and communication (SSC) have so far drawn little attention. In this paper, we propose a novel RIS-aided ISAC framework where the inherent location information in the received communication signals from a blind-zone user equipment is exploited to enable SSC. We first design a two-phase ISAC transmission protocol. In the first phase, communication and coarse-grained location sensing are performed concurrently by exploiting the very limited channel state information, while in the second phase, by using the coarse-grained sensing information obtained from the first phase, simple-yet-efficient sensing-based beamforming designs are proposed to realize both higher-rate communication and fine-grained location sensing. We demonstrate that our proposed framework can achieve almost the same performance as the communication-only frameworks, while providing up to millimeter-level positioning accuracy. In addition, we show how the communication and sensing performance can be simultaneously boosted through our proposed sensing-based beamforming designs. The results presented in this work provide valuable insights into the design and implementation of other ISAC systems considering SSC.
Flow-based Network Intrusion Detection Based on BERT Masked Language Model
Abstract
A Network Intrusion Detection System (NIDS) is an important tool that identifies potential threats to a network. Recently, different flow-based NIDS designs utilizing Machine Learning (ML) algorithms have been proposed as potential solutions to detect intrusions efficiently. However, conventional ML-based classifiers have not seen widespread adoption in the real-world due to their poor domain adaptation capability. In this research, our goal is to explore the possibility of improve the domain adaptation capability of NIDS. Our proposal employs Natural Language Processing (NLP) techniques and Bidirectional Encoder Representations from Transformers (BERT) framework. The proposed method achieved positive results when tested on data from different domains.
Prefer to Classify: Improving Text Classifiers via Auxiliary Preference Learning
Authors: Jaehyung Kim, Jinwoo Shin, Dongyeop Kang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
The development of largely human-annotated benchmarks has driven the success of deep neural networks in various NLP tasks. To enhance the effectiveness of existing benchmarks, collecting new additional input-output pairs is often too costly and challenging, particularly considering their marginal impact on improving the current model accuracy. Instead, additional or complementary annotations on the existing input texts in the benchmarks can be preferable as an efficient way to pay the additional human cost. In this paper, we investigate task-specific preferences between pairs of input texts as a new alternative way for such auxiliary data annotation. From 'pair-wise' comparisons with respect to the task, the auxiliary preference learning enables the model to learn an additional informative training signal that cannot be captured with 'instance-wise' task labels. To this end, we propose a novel multi-task learning framework, called prefer-to-classify (P2C), which can enjoy the cooperative effect of learning both the given classification task and the auxiliary preferences. Here, we provide three different ways to collect preference signals in practice: (a) implicitly extracting from annotation records (for free, but often unavailable), (b) collecting explicitly from crowd workers (high paid), or (c) pre-trained large language models such as GPT-3 (low paid). Given existing classification NLP benchmarks, we demonstrate that the proposed auxiliary preference learning via P2C on them is effective in improving text classifiers. Our codes are publicly available.
An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection
Authors: Ziye Chen, Kate Smith-Miles, Bo Du, Guoqi Qian, Mingming Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Accurately detecting lane lines in 3D space is crucial for autonomous driving. Existing methods usually first transform image-view features into bird-eye-view (BEV) by aid of inverse perspective mapping (IPM), and then detect lane lines based on the BEV features. However, IPM ignores the changes in road height, leading to inaccurate view transformations. Additionally, the two separate stages of the process can cause cumulative errors and increased complexity. To address these limitations, we propose an efficient transformer for 3D lane detection. Different from the vanilla transformer, our model contains a decomposed cross-attention mechanism to simultaneously learn lane and BEV representations. The mechanism decomposes the cross-attention between image-view and BEV features into the one between image-view and lane features, and the one between lane and BEV features, both of which are supervised with ground-truth lane lines. Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively. This allows for a more accurate view transformation than IPM-based methods, as the view transformation is learned from data with a supervised cross-attention. Additionally, the cross-attention between lane and BEV features enables them to adjust to each other, resulting in more accurate lane detection than the two separate stages. Finally, the decomposed cross-attention is more efficient than the original one. Experimental results on OpenLane and ONCE-3DLanes demonstrate the state-of-the-art performance of our method.
InfoPrompt: Information-Theoretic Soft Prompt Tuning for Natural Language Understanding
Abstract
Soft prompt tuning achieves superior performances across a wide range of few-shot tasks. However, the performances of prompt tuning can be highly sensitive to the initialization of the prompts. We also empirically observe that conventional prompt tuning methods cannot encode and learn sufficient task-relevant information from prompt tokens. In this work, we develop an information-theoretic framework that formulates soft prompt tuning as maximizing mutual information between prompts and other model parameters (or encoded representations). This novel view helps us to develop a more efficient, accurate and robust soft prompt tuning method InfoPrompt. With this framework, we develop two novel mutual information based loss functions, to (i) discover proper prompt initialization for the downstream tasks and learn sufficient task-relevant information from prompt tokens and (ii) encourage the output representation from the pretrained language model to be more aware of the task-relevant information captured in the learnt prompt. Extensive experiments validate that InfoPrompt can significantly accelerate the convergence of the prompt tuning and outperform traditional prompt tuning methods. Finally, we provide a formal theoretical result for showing to show that gradient descent type algorithm can be used to train our mutual information loss.
A modified model for topic detection from a corpus and a new metric evaluating the understandability of topics
Abstract
This paper presents a modified neural model for topic detection from a corpus and proposes a new metric to evaluate the detected topics. The new model builds upon the embedded topic model incorporating some modifications such as document clustering. Numerical experiments suggest that the new model performs favourably regardless of the document's length. The new metric, which can be computed more efficiently than widely-used metrics such as topic coherence, provides variable information regarding the understandability of the detected topics.
Robust Learning with Progressive Data Expansion Against Spurious Correlation
Abstract
While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable spurious features rather than the core features that are genuinely correlated to the true label. In this paper, beyond existing analyses of linear models, we theoretically examine the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. In light of this, we propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance. PDE begins with a group-balanced subset of training data and progressively expands it to facilitate the learning of the core features. Experiments on synthetic and real-world benchmark datasets confirm the superior performance of our method on models such as ResNets and Transformers. On average, our method achieves a 2.8% improvement in worst-group accuracy compared with the state-of-the-art method, while enjoying up to 10x faster training efficiency.
Motion Planning for Aerial Pick-and-Place based on Geometric Feasibility Constraints
Abstract
This paper studies the motion planning problem of the pick-and-place of an aerial manipulator that consists of a quadcopter flying base and a Delta arm. We propose a novel partially decoupled motion planning framework to solve this problem. Compared to the state-of-the-art approaches, the proposed one has two novel features. First, it does not suffer from increased computation in high-dimensional configuration spaces. That is because it calculates the trajectories of the quadcopter base and the end-effector separately in the Cartesian space based on proposed geometric feasibility constraints. The geometric feasibility constraints can ensure the resulting trajectories satisfy the aerial manipulator's geometry. Second, collision avoidance for the Delta arm is achieved through an iterative approach based on a pinhole mapping method, so that the feasible trajectory can be found in an efficient manner. The proposed approach is verified by three experiments on a real aerial manipulation platform. The experimental results show the effectiveness of the proposed method for the aerial pick-and-place task.
A Melting Pot of Evolution and Learning
Authors: Moshe Sipper, Achiya Elyasaf, Tomer Halperin, Zvika Haramaty, Raz Lapid, Eyal Segal, Itai Tzruia, Snir Vitrack Tamam
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Abstract
We survey eight recent works by our group, involving the successful blending of evolutionary algorithms with machine learning and deep learning: 1. Binary and Multinomial Classification through Evolutionary Symbolic Regression, 2. Classy Ensemble: A Novel Ensemble Algorithm for Classification, 3. EC-KitY: Evolutionary Computation Tool Kit in Python, 4. Evolution of Activation Functions for Deep Learning-Based Image Classification, 5. Adaptive Combination of a Genetic Algorithm and Novelty Search for Deep Neuroevolution, 6. An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Networks, 7. Foiling Explanations in Deep Neural Networks, 8. Patch of Invisibility: Naturalistic Black-Box Adversarial Attacks on Object Detectors.
StreetSurf: Extending Multi-view Implicit Surface Reconstruction to Street Views
Abstract
We present a novel multi-view implicit surface reconstruction technique, termed StreetSurf, that is readily applicable to street view images in widely-used autonomous driving datasets, such as Waymo-perception sequences, without necessarily requiring LiDAR data. As neural rendering research expands rapidly, its integration into street views has started to draw interests. Existing approaches on street views either mainly focus on novel view synthesis with little exploration of the scene geometry, or rely heavily on dense LiDAR data when investigating reconstruction. Neither of them investigates multi-view implicit surface reconstruction, especially under settings without LiDAR data. Our method extends prior object-centric neural surface reconstruction techniques to address the unique challenges posed by the unbounded street views that are captured with non-object-centric, long and narrow camera trajectories. We delimit the unbounded space into three parts, close-range, distant-view and sky, with aligned cuboid boundaries, and adapt cuboid/hyper-cuboid hash-grids along with road-surface initialization scheme for finer and disentangled representation. To further address the geometric errors arising from textureless regions and insufficient viewing angles, we adopt geometric priors that are estimated using general purpose monocular models. Coupled with our implementation of efficient and fine-grained multi-stage ray marching strategy, we achieve state of the art reconstruction quality in both geometry and appearance within only one to two hours of training time with a single RTX3090 GPU for each street view sequence. Furthermore, we demonstrate that the reconstructed implicit surfaces have rich potential for various downstream tasks, including ray tracing and LiDAR simulation.
Numerical computation of the half Laplacian by means of a fast convolution algorithm
Authors: Carlota M. Cuesta, Francisco de la Hoz, Ivan Girona
Abstract
In this paper, we develop a fast and accurate pseudospectral method to approximate numerically the fractional Laplacian $(-\Delta)^{\alpha/2}$ of a function on $\mathbb{R}$ for $\alpha=1$; this case, commonly referred to as the half Laplacian, is equivalent to the Hilbert transform of the derivative of the function. The main ideas are as follows. Given a twice continuously differentiable bounded function $u\in \mathit{C}_b^2(\mathbb{R})$, we apply the change of variable $x=L\cot(s)$, with $L>0$ and $s\in[0,\pi]$, which maps $\mathbb{R}$ into $[0,\pi]$, and denote $(-\Delta)_s^{1/2}u(x(s)) \equiv (-\Delta)^{1/2}u(x)$. Therefore, by performing a Fourier series expansion of $u(x(s))$, the problem is reduced to computing $(-\Delta)_s^{1/2}e^{iks} \equiv (-\Delta)^{1/2}(x + i)^k/(1+x^2)^{k/2}$. On a previous work, we considered the case with $k$ even and $\alpha\in(0,2)$, so we focus now on the case with $k$ odd. More precisely, we express $(-\Delta)_s^{1/2}e^{iks}$ for $k$ odd in terms of the Gaussian hypergeometric function ${}_2F1$, and also as a well-conditioned finite sum. Then, we use a fast convolution result, that enable us to compute very efficiently $\sum{l = 0}^Ma_l(-\Delta)_s^{1/2}e^{i(2l+1)s}$, for extremely large values of $M$. This enables us to approximate $(-\Delta)_s^{1/2}u(x(s))$ in a fast and accurate way, especially when $u(x(s))$ is not periodic of period $\pi$.
Progression Cognition Reinforcement Learning with Prioritized Experience for Multi-Vehicle Pursuit
Authors: Xinhang Li, Yiying Yang, Zheng Yuan, Zhe Wang, Qinwen Wang, Chen Xu, Lei Li, Jianhua He, Lin Zhang
Abstract
Multi-vehicle pursuit (MVP) such as autonomous police vehicles pursuing suspects is important but very challenging due to its mission and safety critical nature. While multi-agent reinforcement learning (MARL) algorithms have been proposed for MVP problem in structured grid-pattern roads, the existing algorithms use randomly training samples in centralized learning, which leads to homogeneous agents showing low collaboration performance. For the more challenging problem of pursuing multiple evading vehicles, these algorithms typically select a fixed target evading vehicle for pursuing vehicles without considering dynamic traffic situation, which significantly reduces pursuing success rate. To address the above problems, this paper proposes a Progression Cognition Reinforcement Learning with Prioritized Experience for MVP (PEPCRL-MVP) in urban multi-intersection dynamic traffic scenes. PEPCRL-MVP uses a prioritization network to assess the transitions in the global experience replay buffer according to the parameters of each MARL agent. With the personalized and prioritized experience set selected via the prioritization network, diversity is introduced to the learning process of MARL, which can improve collaboration and task related performance. Furthermore, PEPCRL-MVP employs an attention module to extract critical features from complex urban traffic environments. These features are used to develop progression cognition method to adaptively group pursuing vehicles. Each group efficiently target one evading vehicle in dynamic driving environments. Extensive experiments conducted with a simulator over unstructured roads of an urban area show that PEPCRL-MVP is superior to other state-of-the-art methods. Specifically, PEPCRL-MVP improves pursuing efficiency by 3.95% over TD3-DMAP and its success rate is 34.78% higher than that of MADDPG. Codes are open sourced.
Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Abstract
Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with a customised hardware towards each layer, achieving ultra high throughput and low latency. The deployment of neural networks to such dataflow architecture accelerators is usually hindered by the available on-chip memory as it is desirable to preload the weights of neural networks on-chip to maximise the system performance. To address this, networks are usually compressed before the deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x throughput per DSP to state-of-the-art CNNs. Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD
Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
Authors: Ruijie Zhang, Qiaozhe Zhang, Yingzhuang Liu, Hao Xin, Yan Liu, Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Whole slide image (WSI) refers to a type of high-resolution scanned tissue image, which is extensively employed in computer-assisted diagnosis (CAD). The extremely high resolution and limited availability of region-level annotations make it challenging to employ deep learning methods for WSI-based digital diagnosis. Multiple instance learning (MIL) is a powerful tool to address the weak annotation problem, while Transformer has shown great success in the field of visual tasks. The combination of both should provide new insights for deep learning based image diagnosis. However, due to the limitations of single-level MIL and the attention mechanism's constraints on sequence length, directly applying Transformer to WSI-based MIL tasks is not practical. To tackle this issue, we propose a Multi-level MIL with Transformer (MMIL-Transformer) approach. By introducing a hierarchical structure to MIL, this approach enables efficient handling of MIL tasks that involve a large number of instances. To validate its effectiveness, we conducted a set of experiments on WSIs classification task, where MMIL-Transformer demonstrate superior performance compared to existing state-of-the-art methods. Our proposed approach achieves test AUC 94.74% and test accuracy 93.41% on CAMELYON16 dataset, test AUC 99.04% and test accuracy 94.37% on TCGA-NSCLC dataset, respectively. All code and pre-trained models are available at: https://github.com/hustvl/MMIL-Transformer
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
Authors: Hyeonjeong Ha, Minseon Kim, Sung Ju Hwang
Abstract
Recent neural architecture search (NAS) frameworks have been successful in finding optimal architectures for given conditions (e.g., performance or latency). However, they search for optimal architectures in terms of their performance on clean images only, while robustness against various types of perturbations or corruptions is crucial in practice. Although there exist several robust NAS frameworks that tackle this issue by integrating adversarial training into one-shot NAS, however, they are limited in that they only consider robustness against adversarial attacks and require significant computational resources to discover optimal architectures for a single task, which makes them impractical in real-world scenarios. To address these challenges, we propose a novel lightweight robust zero-cost proxy that considers the consistency across features, parameters, and gradients of both clean and perturbed images at the initialization state. Our approach facilitates an efficient and rapid search for neural architectures capable of learning generalizable features that exhibit robustness across diverse perturbations. The experimental results demonstrate that our proxy can rapidly and efficiently search for neural architectures that are consistently robust against various perturbations on multiple benchmark datasets and diverse search spaces, largely outperforming existing clean zero-shot NAS and robust NAS with reduced search cost.
Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Cost-Efficient Question Answering
Abstract
Large language models (LLMs), such as ChatGPT and GPT-4, are gaining wide-spread real world use. Yet, the two LLMs are closed source, and little is known about the LLMs' performance in real-world use cases. In academia, LLM performance is often measured on benchmarks which may have leaked into ChatGPT's and GPT-4's training data. In this paper, we apply and evaluate ChatGPT and GPT-4 for the real-world task of cost-efficient extractive question answering over a text corpus that was published after the two LLMs completed training. More specifically, we extract research challenges for researchers in the field of HCI from the proceedings of the 2023 Conference on Human Factors in Computing Systems (CHI). We critically evaluate the LLMs on this practical task and conclude that the combination of ChatGPT and GPT-4 makes an excellent cost-efficient means for analyzing a text corpus at scale. Cost-efficiency is key for prototyping research ideas and analyzing text corpora from different perspectives, with implications for applying LLMs in academia and practice. For researchers in HCI, we contribute an interactive visualization of 4392 research challenges in over 90 research topics. We share this visualization and the dataset in the spirit of open science.
Magnitude Attention-based Dynamic Pruning
Authors: Jihye Back, Namhyuk Ahn, Jangho Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Existing pruning methods utilize the importance of each weight based on specified criteria only when searching for a sparse structure but do not utilize it during training. In this work, we propose a novel approach - \textbf{M}agnitude \textbf{A}ttention-based Dynamic \textbf{P}runing (MAP) method, which applies the importance of weights throughout both the forward and backward paths to explore sparse model structures dynamically. Magnitude attention is defined based on the magnitude of weights as continuous real-valued numbers enabling a seamless transition from a redundant to an effective sparse network by promoting efficient exploration. Additionally, the attention mechanism ensures more effective updates for important layers within the sparse network. In later stages of training, our approach shifts from exploration to exploitation, exclusively updating the sparse model composed of crucial weights based on the explored structure, resulting in pruned models that not only achieve performance comparable to dense models but also outperform previous pruning methods on CIFAR-10/100 and ImageNet.
A Dynamic Feature Interaction Framework for Multi-task Visual Perception
Authors: Yuling Xi, Hao Chen, Ning Wang, Peng Wang, Yanning Zhang, Chunhua Shen, Yifan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multi-task visual perception has a wide range of applications in scene understanding such as autonomous driving. In this work, we devise an efficient unified framework to solve multiple common perception tasks, including instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Simply sharing the same visual feature representations for these tasks impairs the performance of tasks, while independent task-specific feature extractors lead to parameter redundancy and latency. Thus, we design two feature-merge branches to learn feature basis, which can be useful to, and thus shared by, multiple perception tasks. Then, each task takes the corresponding feature basis as the input of the prediction task head to fulfill a specific task. In particular, one feature merge branch is designed for instance-level recognition the other for dense predictions. To enhance inter-branch communication, the instance branch passes pixel-wise spatial information of each instance to the dense branch using efficient dynamic convolution weighting. Moreover, a simple but effective dynamic routing mechanism is proposed to isolate task-specific features and leverage common properties among tasks. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception. In addition, as tasks benefit from co-training with each other, our solution achieves on par results on partially labeled settings on nuScenes and outperforms previous works for 3D detection and depth estimation on the Cityscapes dataset with full supervision.
Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization
Abstract
Large language models (LLMs)have achieved great success in general domains of natural language processing. In this paper, we bring LLMs to the realm of geoscience, with the objective of advancing research and applications in this field. To this end, we present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience. For instance, we have curated the first geoscience instruction tuning dataset, GeoSignal, which aims to align LLM responses to geoscience-related user queries. Additionally, we have established the first geoscience benchmark, GeoBenchmark, to evaluate LLMs in the context of geoscience. In this work, we experiment with a complete recipe to adapt a pretrained general-domain LLM to the geoscience domain. Specifically, we further train the LLaMA-7B model on over 1 million pieces of geoscience literature and utilize GeoSignal's supervised data to fine-tune the model. Moreover, we share a protocol that can efficiently gather domain-specific data and construct domain-supervised data, even in situations where manpower is scarce. Experiments conducted on the GeoBenchmark demonstrate the the effectiveness of our approach and datasets.
Abstract
As society transitions towards an AI-based decision-making infrastructure, an ever-increasing number of decisions once under control of humans are now delegated to automated systems. Even though such developments make various parts of society more efficient, a large body of evidence suggests that a great deal of care needs to be taken to make such automated decision-making systems fair and equitable, namely, taking into account sensitive attributes such as gender, race, and religion. In this paper, we study a specific decision-making task called outcome control in which an automated system aims to optimize an outcome variable $Y$ while being fair and equitable. The interest in such a setting ranges from interventions related to criminal justice and welfare, all the way to clinical decision-making and public health. In this paper, we first analyze through causal lenses the notion of benefit, which captures how much a specific individual would benefit from a positive decision, counterfactually speaking, when contrasted with an alternative, negative one. We introduce the notion of benefit fairness, which can be seen as the minimal fairness requirement in decision-making, and develop an algorithm for satisfying it. We then note that the benefit itself may be influenced by the protected attribute, and propose causal tools which can be used to analyze this. Finally, if some of the variations of the protected attribute in the benefit are considered as discriminatory, the notion of benefit fairness may need to be strengthened, which leads us to articulating a notion of causal benefit fairness. Using this notion, we develop a new optimization procedure capable of maximizing $Y$ while ascertaining causal fairness in the decision process.
AutoCharge: Autonomous Charging for Perpetual Quadrotor Missions
Authors: Alessandro Saviolo, Jeffrey Mao, Roshan Balu T M B, Vivek Radhakrishnan, Giuseppe Loianno
Abstract
Battery endurance represents a key challenge for long-term autonomy and long-range operations, especially in the case of aerial robots. In this paper, we propose AutoCharge, an autonomous charging solution for quadrotors that combines a portable ground station with a flexible, lightweight charging tether and is capable of universal, highly efficient, and robust charging. We design and manufacture a pair of circular magnetic connectors to ensure a precise orientation-agnostic electrical connection between the ground station and the charging tether. Moreover, we supply the ground station with an electromagnet that largely increases the tolerance to localization and control errors during the docking maneuver, while still guaranteeing smooth un-docking once the charging process is completed. We demonstrate AutoCharge on a perpetual 10 hours quadrotor flight experiment and show that the docking and un-docking performance is solidly repeatable, enabling perpetual quadrotor flight missions.
Controllable Multi-Objective Re-ranking with Policy Hypernetworks
Authors: Sirui Chen, Yuan Wang, Zijing Wen, Zhiyu Li, Changshuo Zhang, Quan Lin, Cheng Zhu, Jun Xu
Abstract
Multi-stage ranking pipelines have become widely used strategies in modern recommender systems, where the final stage aims to return a ranked list of items that balances a number of requirements such as user preference, diversity, novelty etc. Linear scalarization is arguably the most widely used technique to merge multiple requirements into one optimization objective, by summing up the requirements with certain preference weights. Existing final-stage ranking methods often adopt a static model where the preference weights are determined during offline training and kept unchanged during online serving. Whenever a modification of the preference weights is needed, the model has to be re-trained, which is time and resources inefficient. Meanwhile, the most appropriate weights may vary greatly for different groups of targeting users or at different time periods (e.g., during holiday promotions). In this paper, we propose a framework called controllable multi-objective re-ranking (CMR) which incorporates a hypernetwork to generate parameters for a re-ranking model according to different preference weights. In this way, CMR is enabled to adapt the preference weights according to the environment changes in an online manner, without retraining the models. Moreover, we classify practical business-oriented tasks into four main categories and seamlessly incorporate them in a new proposed re-ranking model based on an Actor-Evaluator framework, which serves as a reliable real-world testbed for CMR. Offline experiments based on the dataset collected from Taobao App showed that CMR improved several popular re-ranking models by using them as underlying models. Online A/B tests also demonstrated the effectiveness and trustworthiness of CMR.
A Two-dimensional Spatial Optimization Framework for Vehicle Powertrain Systems
Authors: Jorn van Kampen, Mauro Salazar, Theo Hofman
Abstract
This paper presents a modeling framework to optimize the two-dimensional placement of powertrain elements inside the vehicle, explicitly accounting for the rotation, relative placement and alignment. Specifically, we first capture the multi-level nature of the system mathematically, and construct a model that captures different powertrain component orientations. Second, we include the relative element placement as variables in the model and derive alignment constraints for both child components and parent subsystems to automatically connect mechanical ports. Finally, we showcase our framework on a four-wheel driven electric vehicle. Our results demonstrate that our framework is capable of efficiently generating system design solutions in a fully automated manner, only using basic component properties.
Variable Radiance Field for Real-Life Category-Specifc Reconstruction from Single Image
Authors: Kun Wang, Zhiqiang Yan, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Reconstructing category-specific objects from a single image is a challenging task that requires inferring the geometry and appearance of an object from a limited viewpoint. Existing methods typically rely on local feature retrieval based on re-projection with known camera intrinsic, which are slow and prone to distortion at viewpoints distant from the input image. In this paper, we present Variable Radiance Field (VRF), a novel framework that can efficiently reconstruct category-specific objects from a single image without known camera parameters. Our key contributions are: (1) We parameterize the geometry and appearance of the object using a multi-scale global feature extractor, which avoids frequent point-wise feature retrieval and camera dependency. We also propose a contrastive learning-based pretraining strategy to improve the feature extractor. (2) We reduce the geometric complexity of the object by learning a category template, and use hypernetworks to generate a small neural radiance field for fast and instance-specific rendering. (3) We align each training instance to the template space using a learned similarity transformation, which enables semantic-consistent learning across different objects. We evaluate our method on the CO3D dataset and show that it outperforms existing methods in terms of quality and speed. We also demonstrate its applicability to shape interpolation and object placement tasks.
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems
Authors: Herbert Woisetschläger, Alexander Isenko, Ruben Mayer, Hans-Arno Jacobsen
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Federated Machine Learning (FL) has received considerable attention in recent years. FL benchmarks are predominantly explored in either simulated systems or data center environments, neglecting the setups of real-world systems, which are often closely linked to edge computing. We close this research gap by introducing FLEdge, a benchmark targeting FL workloads in edge computing systems. We systematically study hardware heterogeneity, energy efficiency during training, and the effect of various differential privacy levels on training in FL systems. To make this benchmark applicable to real-world scenarios, we evaluate the impact of client dropouts on state-of-the-art FL strategies with failure rates as high as 50%. FLEdge provides new insights, such as that training state-of-the-art FL workloads on older GPU-accelerated embedded devices is up to 3x more energy efficient than on modern server-grade GPUs.
Boosting Adversarial Transferability by Achieving Flat Local Maxima
Abstract
Transfer-based attack adopts the adversarial examples generated on the surrogate model to attack various models, making it applicable in the physical world and attracting increasing interest. Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives. In this work, inspired by the fact that flat local minima are correlated with good generalization, we assume and empirically validate that adversarial examples at a flat local region tend to have good transferability by introducing a penalized gradient norm to the original loss function. Since directly optimizing the gradient regularization norm is computationally expensive and intractable for generating adversarial examples, we propose an approximation optimization method to simplify the gradient update of the objective function. Specifically, we randomly sample an example and adopt the first-order gradient to approximate the second-order Hessian matrix, which makes computing more efficient by interpolating two Jacobian matrices. Meanwhile, in order to obtain a more stable gradient direction, we randomly sample multiple examples and average the gradients of these examples to reduce the variance due to random sampling during the iterative process. Extensive experimental results on the ImageNet-compatible dataset show that the proposed method can generate adversarial examples at flat local regions, and significantly improve the adversarial transferability on either normally trained models or adversarially trained models than the state-of-the-art attacks.
SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth
Abstract
Exploring robust and efficient association methods has always been an important issue in multiple-object tracking (MOT). Although existing tracking methods have achieved impressive performance, congestion and frequent occlusions still pose challenging problems in multi-object tracking. We reveal that performing sparse decomposition on dense scenes is a crucial step to enhance the performance of associating occluded targets. To this end, we propose a pseudo-depth estimation method for obtaining the relative depth of targets from 2D images. Secondly, we design a depth cascading matching (DCM) algorithm, which can use the obtained depth information to convert a dense target set into multiple sparse target subsets and perform data association on these sparse target subsets in order from near to far. By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack. SparseTrack provides a new perspective for solving the challenging crowded scene MOT problem. Only using IoU matching, SparseTrack achieves comparable performance with the state-of-the-art (SOTA) methods on the MOT17 and MOT20 benchmarks. Code and models are publicly available at \url{https://github.com/hustvl/SparseTrack}.
Efficient Multi-Task Scene Analysis with RGB-D Transformers
Authors: Söhnke Benedikt Fischedick, Daniel Seichter, Robin Schmidt, Leonard Rabes, Horst-Michael Gross
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated in a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimization for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models
Authors: Tianzhe Chu, Shengbang Tong, Tianjiao Ding, Xili Dai, Benjamin David Haeffele, Rene Vidal, Yi Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, as a fundamental and classic machine learning problem, still lacks effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representation of large pre-trained models such as CLIP and cluster images effectively and efficiently at scale. We show that the pre-trained features are significantly more structured by further optimizing the rate reduction objective. The resulting features may significantly improve the clustering accuracy, e.g., from 57\% to 66\% on ImageNet-1k. Furthermore, by leveraging CLIP's image-text binding, we show how the new clustering method leads to a simple yet effective self-labeling algorithm that successfully works on unlabeled large datasets such as MS-COCO and LAION-Aesthetics. We will release the code in https://github.com/LeslieTrue/CPP.
Simple and Controllable Music Generation
Authors: Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez
Abstract
We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft.
Bayesian Optimisation of Functions on Graphs
Authors: Xingchen Wan, Pierre Osselin, Henry Kenlay, Binxin Ru, Michael A. Osborne, Xiaowen Dong
Abstract
The increasing availability of graph-structured data motivates the task of optimising over functions defined on the node set of graphs. Traditional graph search algorithms can be applied in this case, but they may be sample-inefficient and do not make use of information about the function values; on the other hand, Bayesian optimisation is a class of promising black-box solvers with superior sample efficiency, but it has been scarcely been applied to such novel setups. To fill this gap, we propose a novel Bayesian optimisation framework that optimises over functions defined on generic, large-scale and potentially unknown graphs. Through the learning of suitable kernels on graphs, our framework has the advantage of adapting to the behaviour of the target function. The local modelling approach further guarantees the efficiency of our method. Extensive experiments on both synthetic and real-world graphs demonstrate the effectiveness of the proposed optimisation framework.
Real-time whole-heart electromechanical simulations using Latent Neural Ordinary Differential Equations
Authors: Matteo Salvador, Marina Strocchi, Francesco Regazzoni, Luca Dede', Steven Niederer, Alfio Quarteroni
Abstract
Cardiac digital twins provide a physics and physiology informed framework to deliver predictive and personalized medicine. However, high-fidelity multi-scale cardiac models remain a barrier to adoption due to their extensive computational costs and the high number of model evaluations needed for patient-specific personalization. Artificial Intelligence-based methods can make the creation of fast and accurate whole-heart digital twins feasible. In this work, we use Latent Neural Ordinary Differential Equations (LNODEs) to learn the temporal pressure-volume dynamics of a heart failure patient. Our surrogate model based on LNODEs is trained from 400 3D-0D whole-heart closed-loop electromechanical simulations while accounting for 43 model parameters, describing single cell through to whole organ and cardiovascular hemodynamics. The trained LNODEs provides a compact and efficient representation of the 3D-0D model in a latent space by means of a feedforward fully-connected Artificial Neural Network that retains 3 hidden layers with 13 neurons per layer and allows for 300x real-time numerical simulations of the cardiac function on a single processor of a standard laptop. This surrogate model is employed to perform global sensitivity analysis and robust parameter estimation with uncertainty quantification in 3 hours of computations, still on a single processor. We match pressure and volume time traces unseen by the LNODEs during the training phase and we calibrate 4 to 11 model parameters while also providing their posterior distribution. This paper introduces the most advanced surrogate model of cardiac function available in the literature and opens new important venues for parameter calibration in cardiac digital twins.
Federated Learning under Covariate Shifts with Generalization Guarantees
Authors: Ali Ramezani-Kebrya, Fanghui Liu, Thomas Pethick, Grigorios Chrysos, Volkan Cevher
Abstract
This paper addresses intra-client and inter-client covariate shifts in federated learning (FL) with a focus on the overall generalization performance. To handle covariate shifts, we formulate a new global model training paradigm and propose Federated Importance-Weighted Empirical Risk Minimization (FTW-ERM) along with improving density ratio matching methods without requiring perfect knowledge of the supremum over true ratios. We also propose the communication-efficient variant FITW-ERM with the same level of privacy guarantees as those of classical ERM in FL. We theoretically show that FTW-ERM achieves smaller generalization error than classical ERM under certain settings. Experimental results demonstrate the superiority of FTW-ERM over existing FL baselines in challenging imbalanced federated settings in terms of data distribution shifts across clients.
Actively learning a Bayesian matrix fusion model with deep side information
Authors: Yangyang Yu, Jordan W. Suchow
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract
High-dimensional deep neural network representations of images and concepts can be aligned to predict human annotations of diverse stimuli. However, such alignment requires the costly collection of behavioral responses, such that, in practice, the deep-feature spaces are only ever sparsely sampled. Here, we propose an active learning approach to adaptively sampling experimental stimuli to efficiently learn a Bayesian matrix factorization model with deep side information. We observe a significant efficiency gain over a passive baseline. Furthermore, with a sequential batched sampling strategy, the algorithm is applicable not only to small datasets collected from traditional laboratory experiments but also to settings where large-scale crowdsourced data collection is needed to accurately align the high-dimensional deep feature representations derived from pre-trained networks.
PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
Authors: Tiantian Feng, Shrikanth Narayanan
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
Many recent studies have focused on fine-tuning pre-trained models for speech emotion recognition (SER), resulting in promising performance compared to traditional methods that rely largely on low-level, knowledge-inspired acoustic features. These pre-trained speech models learn general-purpose speech representations using self-supervised or weakly-supervised learning objectives from large-scale datasets. Despite the significant advances made in SER through the use of pre-trained architecture, fine-tuning these large pre-trained models for different datasets requires saving copies of entire weight parameters, rendering them impractical to deploy in real-world settings. As an alternative, this work explores parameter-efficient fine-tuning (PEFT) approaches for adapting pre-trained speech models for emotion recognition. Specifically, we evaluate the efficacy of adapter tuning, embedding prompt tuning, and LoRa (Low-rank approximation) on four popular SER testbeds. Our results reveal that LoRa achieves the best fine-tuning performance in emotion recognition while enhancing fairness and requiring only a minimal extra amount of weight parameters. Furthermore, our findings offer novel insights into future research directions in SER, distinct from existing approaches focusing on directly fine-tuning the model architecture. Our code is publicly available under: https://github.com/usc-sail/peft-ser.
Negotiated Reasoning: On Provably Addressing Relative Over-Generalization
Authors: Junjie Sheng, Wenhao Li, Bo Jin, Hongyuan Zha, Jun Wang, Xiangfeng Wang
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract
Over-generalization is a thorny issue in cognitive science, where people may become overly cautious due to past experiences. Agents in multi-agent reinforcement learning (MARL) also have been found to suffer relative over-generalization (RO) as people do and stuck to sub-optimal cooperation. Recent methods have shown that assigning reasoning ability to agents can mitigate RO algorithmically and empirically, but there has been a lack of theoretical understanding of RO, let alone designing provably RO-free methods. This paper first proves that RO can be avoided when the MARL method satisfies a consistent reasoning requirement under certain conditions. Then we introduce a novel reasoning framework, called negotiated reasoning, that first builds the connection between reasoning and RO with theoretical justifications. After that, we propose an instantiated algorithm, Stein variational negotiated reasoning (SVNR), which uses Stein variational gradient descent to derive a negotiation policy that provably avoids RO in MARL under maximum entropy policy iteration. The method is further parameterized with neural networks for amortized learning, making computation efficient. Numerical experiments on many RO-challenged environments demonstrate the superiority and efficiency of SVNR compared to state-of-the-art methods in addressing RO.
Automatic Image Blending Algorithm Based on SAM and DINO
Abstract
The field of image blending has gained significant popularity in recent years due to its ability to create visually stunning content. The main objective of image blending is to merge an object from one image onto another seamlessly, with minor masking adjustments. With the recent development of SAM, which can detect and segment targets in images automatically. Our approach (1) combines semantic object detection and segmentation with corresponding mask generation to automatically fuse images and (2) introduces the use of PAN for further quality enhancement during the fusion process. Our approach surpasses many classical visual fusion models in various performance indicators such as PSNR, SSIM, and Realism. Notably, our process is highly efficient and speedy, making it widely applicable in industrial settings. This new process has the potential to revolutionize visual content creation and improve productivity across various industries.
A shape derivative approach to domain simplification
Abstract
The objective of this study is to address the difficulty of simplifying the geometric model in which a differential problem is formulated, also called defeaturing, while simultaneously ensuring that the accuracy of the solution is maintained under control. This enables faster and more efficient simulations, without sacrificing accuracy. More precisely, we consider an isogeometric discretisation of an elliptic model problem defined on a two-dimensional hierarchical B-spline computational domain with a complex boundary. Starting with an oversimplification of the geometry, we build a goal-oriented adaptive strategy that adaptively reintroduces continuous geometrical features in regions where the analysis suggests a large impact on the quantity of interest. This strategy is driven by an a posteriori estimator of the defeaturing error based on first-order shape sensitivity analysis, and it profits from the local refinement properties of hierarchical B-splines. The adaptive algorithm is described together with a procedure to generate (partially) simplified hierarchical B-spline geometrical domains. Numerical experiments are presented to illustrate the proposed strategy and its limitations.
Matting Anything
Authors: Jiachen Li, Jitesh Jain, Humphrey Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image with flexible and interactive visual or linguistic user prompt guidance. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM is capable of dealing with various types of image matting, including semantic, instance, and referring image matting with only a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha matte through iterative refinement, which has only 2.7 million trainable parameters. (iii) By incorporating SAM, MAM simplifies the user intervention required for the interactive use of image matting from the trimap to the box, point, or text prompt. We evaluate the performance of MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves comparable performance to the state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at https://github.com/SHI-Labs/Matting-Anything.
Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories
Abstract
Pre-trained language models (PLMs) demonstrate excellent abilities to understand texts in the generic domain while struggling in a specific domain. Although continued pre-training on a large domain-specific corpus is effective, it is costly to tune all the parameters on the domain. In this paper, we investigate whether we can adapt PLMs both effectively and efficiently by only tuning a few parameters. Specifically, we decouple the feed-forward networks (FFNs) of the Transformer architecture into two parts: the original pre-trained FFNs to maintain the old-domain knowledge and our novel domain-specific adapters to inject domain-specific knowledge in parallel. Then we adopt a mixture-of-adapters gate to fuse the knowledge from different domain adapters dynamically. Our proposed Mixture-of-Domain-Adapters (MixDA) employs a two-stage adapter-tuning strategy that leverages both unlabeled data and labeled data to help the domain adaptation: i) domain-specific adapter on unlabeled data; followed by ii) the task-specific adapter on labeled data. MixDA can be seamlessly plugged into the pretraining-finetuning paradigm and our experiments demonstrate that MixDA achieves superior performance on in-domain tasks (GLUE), out-of-domain tasks (ChemProt, RCT, IMDB, Amazon), and knowledge-intensive tasks (KILT). Further analyses demonstrate the reliability, scalability, and efficiency of our method. The code is available at https://github.com/Amano-Aki/Mixture-of-Domain-Adapters.
SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding
Authors: Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, Simon Lynen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Semantic 2D maps are commonly used by humans and machines for navigation purposes, whether it's walking or driving. However, these maps have limitations: they lack detail, often contain inaccuracies, and are difficult to create and maintain, especially in an automated fashion. Can we use raw imagery to automatically create better maps that can be easily interpreted by both humans and machines? We introduce SNAP, a deep network that learns rich neural 2D maps from ground-level and overhead images. We train our model to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images. SNAP can resolve the location of challenging image queries beyond the reach of traditional methods, outperforming the state of the art in localization by a large margin. Moreover, our neural maps encode not only geometry and appearance but also high-level semantics, discovered without explicit supervision. This enables effective pre-training for data-efficient semantic scene understanding, with the potential to unlock cost-efficient creation of more detailed maps.
R-MAE: Regions Meet Masked Autoencoders
Authors: Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Vision-specific concepts such as "region" have played a key role in extending general machine learning frameworks to tasks like object detection. Given the success of region-based detectors for supervised learning and the progress of intra-image methods for contrastive learning, we explore the use of regions for reconstructive pre-training. Starting from Masked Autoencoding (MAE) both as a baseline and an inspiration, we propose a parallel pre-text task tailored to address the one-to-many mapping between images and regions. Since such regions can be generated in an unsupervised way, our approach (R-MAE) inherits the wide applicability from MAE, while being more "region-aware". We conduct thorough analyses during the development of R-MAE, and converge on a variant that is both effective and efficient (1.3% overhead over MAE). Moreover, it shows consistent quantitative improvements when generalized to various pre-training data and downstream detection and segmentation benchmarks. Finally, we provide extensive qualitative visualizations to enhance the understanding of R-MAE's behaviour and potential. Code will be made available at https://github.com/facebookresearch/r-mae.
Improving Negative-Prompt Inversion via Proximal Guidance
Abstract
DDIM inversion has revealed the remarkable potential of real image editing within diffusion-based methods. However, the accuracy of DDIM reconstruction degrades as larger classifier-free guidance (CFG) scales being used for enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align the reconstruction and inversion trajectories with larger CFG scales, enabling real image editing with cross-attention control. Negative-prompt inversion (NPI) further offers a training-free closed-form solution of NTI. However, it may introduce artifacts and is still constrained by DDIM reconstruction quality. To overcome these limitations, we propose Proximal Negative-Prompt Inversion (ProxNPI), extending the concepts of NTI and NPI. We enhance NPI with a regularization term and reconstruction guidance, which reduces artifacts while capitalizing on its training-free nature. Our method provides an efficient and straightforward approach, effectively addressing real image editing tasks with minimal computational overhead.
Scaling Spherical CNNs
Authors: Carlos Esteves, Jean-Jacques Slotine, Ameesh Makadia
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Spherical CNNs generalize CNNs to functions on the sphere, by using spherical convolutions as the main linear operation. The most accurate and efficient way to compute spherical convolutions is in the spectral domain (via the convolution theorem), which is still costlier than the usual planar convolutions. For this reason, applications of spherical CNNs have so far been limited to small problems that can be approached with low model capacity. In this work, we show how spherical CNNs can be scaled for much larger problems. To achieve this, we make critical improvements including novel variants of common model components, an implementation of core operations to exploit hardware accelerator characteristics, and application-specific input representations that exploit the properties of our model. Experiments show our larger spherical CNNs reach state-of-the-art on several targets of the QM9 molecular benchmark, which was previously dominated by equivariant graph neural networks, and achieve competitive performance on multiple weather forecasting tasks. Our code is available at https://github.com/google-research/spherical-cnn.
Keyword: faster
Faster Approximation Algorithms for Parameterized Graph Clustering and Edge Labeling
Authors: Vedangi Bengali, Nate Veldt
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract
Graph clustering is a fundamental task in network analysis where the goal is to detect sets of nodes that are well-connected to each other but sparsely connected to the rest of the graph. We present faster approximation algorithms for an NP-hard parameterized clustering framework called LambdaCC, which is governed by a tunable resolution parameter and generalizes many other clustering objectives such as modularity, sparsest cut, and cluster deletion. Previous LambdaCC algorithms are either heuristics with no approximation guarantees, or computationally expensive approximation algorithms. We provide fast new approximation algorithms that can be made purely combinatorial. These rely on a new parameterized edge labeling problem we introduce that generalizes previous edge labeling problems that are based on the principle of strong triadic closure and are of independent interest in social network analysis. Our methods are orders of magnitude more scalable than previous approximation algorithms and our lower bounds allow us to obtain a posteriori approximation guarantees for previous heuristics that have no approximation guarantees of their own.
Efficient and Equivariant Graph Networks for Predicting Quantum Hamiltonian
Abstract
We consider the prediction of the Hamiltonian matrix, which finds use in quantum chemistry and condensed matter physics. Efficiency and equivariance are two important, but conflicting factors. In this work, we propose a SE(3)-equivariant network, named QHNet, that achieves efficiency and equivariance. Our key advance lies at the innovative design of QHNet architecture, which not only obeys the underlying symmetries, but also enables the reduction of number of tensor products by 92\%. In addition, QHNet prevents the exponential growth of channel dimension when more atom types are involved. We perform experiments on MD17 datasets, including four molecular systems. Experimental results show that our QHNet can achieve comparable performance to the state of the art methods at a significantly faster speed. Besides, our QHNet consumes 50\% less memory due to its streamlined architecture. Our code is publicly available as part of the AIRS library (\url{https://github.com/divelab/AIRS}).
Robust Learning with Progressive Data Expansion Against Spurious Correlation
Abstract
While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable spurious features rather than the core features that are genuinely correlated to the true label. In this paper, beyond existing analyses of linear models, we theoretically examine the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. In light of this, we propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance. PDE begins with a group-balanced subset of training data and progressively expands it to facilitate the learning of the core features. Experiments on synthetic and real-world benchmark datasets confirm the superior performance of our method on models such as ResNets and Transformers. On average, our method achieves a 2.8% improvement in worst-group accuracy compared with the state-of-the-art method, while enjoying up to 10x faster training efficiency.
RE-Matching: A Fine-Grained Semantic Matching Method for Zero-Shot Relation Extraction
Authors: Jun Zhao, Wenyu Zhan, Xin Zhao, Qi Zhang, Tao Gui, Zhongyu Wei, Junzhe Wang, Minlong Peng, Mingming Sun
Abstract
Semantic matching is a mainstream paradigm of zero-shot relation extraction, which matches a given input with a corresponding label description. The entities in the input should exactly match their hypernyms in the description, while the irrelevant contexts should be ignored when matching. However, general matching methods lack explicit modeling of the above matching pattern. In this work, we propose a fine-grained semantic matching method tailored for zero-shot relation extraction. Following the above matching pattern, we decompose the sentence-level similarity score into entity and context matching scores. Due to the lack of explicit annotations of the redundant components, we design a feature distillation module to adaptively identify the relation-irrelevant features and reduce their negative impact on context matching. Experimental results show that our method achieves higher matching $F_1$ score and has an inference speed 10 times faster, when compared with the state-of-the-art methods.
Scalable and Adaptive Log-based Anomaly Detection with Expert in the Loop
Authors: Jinyang Liu, Junjie Huang, Yintong Huo, Zhihan Jiang, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Minzhi Yan, Michael R. Lyu
Abstract
System logs play a critical role in maintaining the reliability of software systems. Fruitful studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions face limitations due to high resource consumption and lack of adaptability to evolving logs. In this paper, we present an accurate, lightweight, and adaptive log-based anomaly detection framework, referred to as SeaLog. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically-growing trie structure for real-time anomaly detection. To enhance TDA's accuracy in response to evolving log data, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to human experts, which can potentially reduce manual verification efforts. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in terms of effectiveness, runs 2X to 10X faster and only consumes 5% to 41% of the memory resource.
Real-Time Rendering of Glinty Appearances using Distributed Binomial Laws on Anisotropic Grids
Abstract
In this work, we render in real-time glittery materials caused by discrete flakes on the surface. To achieve this, one has to count the number of flakes reflecting the light towards the camera within every texel covered by a given pixel footprint. To do so, we derive a counting method for arbitrary footprints that, unlike previous work, outputs the correct statistics. We combine this counting method with an anisotropic parameterization of the texture space that reduces the number of texels falling under a pixel footprint. This allows our method to run with both stable performance and 1.5X to 5X faster than the state-of-the-art.
A Meta-Generation framework for Industrial System Generation
Abstract
Generative design is an increasingly important tool in the industrial world. It allows the designers and engineers to easily explore vast ranges of design options, providing a cheaper and faster alternative to the trial and failure approaches. Thanks to the flexibility they offer, Deep Generative Models are gaining popularity amongst Generative Design technologies. However, developing and evaluating these models can be challenging. The field lacks accessible benchmarks, in order to evaluate and compare objectively different Deep Generative Models architectures. Moreover, vanilla Deep Generative Models appear to be unable to accurately generate multi-components industrial systems that are controlled by latent design constraints. To address these challenges, we propose an industry-inspired use case that incorporates actual industrial system characteristics. This use case can be quickly generated and used as a benchmark. We propose a Meta-VAE capable of producing multi-component industrial systems and showcase its application on the proposed use case.
Time-Optimal Path Tracking with ISO Safety Guarantees
Abstract
One way of ensuring operator's safety during human-robot collaboration is through Speed and Separation Monitoring (SSM), as defined in ISO standard ISO/TS 15066. In general, it is impossible to avoid all human-robot collisions: consider for instance the case when the robot does not move at all, a human operator can still collide with it by hitting it of her own voluntary motion. In the SSM framework, it is possible however to minimize harm by requiring this: \emph{if} a collision ever occurs, then the robot must be in a \emph{stationary state} (all links have zero velocity) at the time instant of the collision. In this paper, we propose a time-optimal control policy based on Time-Optimal Path Parameterization (TOPP) to guarantee such a behavior. Specifically, we show that: for any robot motion that is strictly faster than the motion recommended by our policy, there exists a human motion that results in a collision with the robot in a non-stationary state. Correlatively, we show, in simulation, that our policy is strictly less conservative than state-of-the-art safe robot control methods. Additionally, we propose a parallelization method to reduce the computation time of our pre-computation phase (down to 0.5 sec, practically), which enables the whole pipeline (including the pre-computation) to be executed at runtime, nearly in real-time. Finally, we demonstrate the application of our method in a scenario: time-optimal, safe control of a 6-dof industrial robot.
Habits of Mind: Reusing Action Sequences for Efficient Planning
Abstract
When we exercise sequences of actions, their execution becomes more fluent and precise. Here, we consider the possibility that exercised action sequences can also be used to make planning faster and more accurate by focusing expansion of the search tree on paths that have been frequently used in the past, and by reducing deep planning problems to shallow ones via multi-step jumps in the tree. To capture such sequences, we use a flexible Bayesian action chunking mechanism which finds and exploits statistically reliable structure at different scales. This gives rise to shorter or longer routines that can be embedded into a Monte-Carlo tree search planner. We show the benefits of this scheme using a physical construction task patterned after tangrams.
Real-time GeoAI for High-resolution Mapping and Segmentation of Arctic Permafrost Features
Abstract
This paper introduces a real-time GeoAI workflow for large-scale image analysis and the segmentation of Arctic permafrost features at a fine-granularity. Very high-resolution (0.5m) commercial imagery is used in this analysis. To achieve real-time prediction, our workflow employs a lightweight, deep learning-based instance segmentation model, SparseInst, which introduces and uses Instance Activation Maps to accurately locate the position of objects within the image scene. Experimental results show that the model can achieve better accuracy of prediction at a much faster inference speed than the popular Mask-RCNN model.
A shape derivative approach to domain simplification
Abstract
The objective of this study is to address the difficulty of simplifying the geometric model in which a differential problem is formulated, also called defeaturing, while simultaneously ensuring that the accuracy of the solution is maintained under control. This enables faster and more efficient simulations, without sacrificing accuracy. More precisely, we consider an isogeometric discretisation of an elliptic model problem defined on a two-dimensional hierarchical B-spline computational domain with a complex boundary. Starting with an oversimplification of the geometry, we build a goal-oriented adaptive strategy that adaptively reintroduces continuous geometrical features in regions where the analysis suggests a large impact on the quantity of interest. This strategy is driven by an a posteriori estimator of the defeaturing error based on first-order shape sensitivity analysis, and it profits from the local refinement properties of hierarchical B-splines. The adaptive algorithm is described together with a procedure to generate (partially) simplified hierarchical B-spline geometrical domains. Numerical experiments are presented to illustrate the proposed strategy and its limitations.
Keyword: mobile
Traction and Stability Control using Fuzzy-based Controller Integration
Abstract
Adverse road conditions can cause vehicle yaw instability and loss of traction. To compensate for the instability under such conditions, corrective actions must be taken. In comparison to a mechanical differential, an electronic differential can independently control the two drive wheels and provide means of generating more effective corrective actions. As a solution for traction and stability issues in automobiles, this paper has developed a controller for a vehicle electronic differential consisting of two program-controlled rear motors. The control algorithm adjusts to changing road conditions. Traction was controlled using a motor reaction torque observer-based slip ratio estimation, and yaw stability was achieved by tracking a reference yaw rate calculated using estimated tyre cornering stiffnesses. A recursive least squares algorithm was used to estimate cornering stiffness. The yaw rate of the vehicle, as well as its longitudinal and lateral accelerations, were measured, and the body slip angle was estimated using an observer. A fuzzy inference system was used to integrate the independently developed traction control and yaw control schemes. The fuzzy inference system modifies the commanded voltage generated by the driver's input to account for the traction and yaw stability controller outputs. A vehicle simulator was used to numerically simulate the integrated controller.
A Systematic Literature Review on Client Selection in Federated Learning
Authors: Carl Smestad (1), Jingyue Li (2) ((1) Norwegian University of Science and Technology, (2) Norwegian University of Science and Technology)
Abstract
With the arising concerns of privacy within machine learning, federated learning (FL) was invented in 2017, in which the clients, such as mobile devices, train a model and send the update to the centralized server. Choosing clients randomly for FL can harm learning performance due to different reasons. Many studies have proposed approaches to address the challenges of client selection of FL. However, no systematic literature review (SLR) on this topic existed. This SLR investigates the state of the art of client selection in FL and answers the challenges, solutions, and metrics to evaluate the solutions. We systematically reviewed 47 primary studies. The main challenges found in client selection are heterogeneity, resource allocation, communication costs, and fairness. The client selection schemes aim to improve the original random selection algorithm by focusing on one or several of the aforementioned challenges. The most common metric used is testing accuracy versus communication rounds, as testing accuracy measures the successfulness of the learning and preferably in as few communication rounds as possible, as they are very expensive. Although several possible improvements can be made with the current state of client selection, the most beneficial ones are evaluating the impact of unsuccessful clients and gaining a more theoretical understanding of the impact of fairness in FL.
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Authors: Clemens JS Schaefer, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou, Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, Yu Emma Wang
Abstract
Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an effective ordering of layer sensitivities. We address this by augmenting the estimated Hessian with additional information to capture inter-layer dependencies. We demonstrate that this consistently improves PTQ performance along the accuracy-latency Pareto frontier across multiple models. Our method combines second-order information and inter-layer dependencies to guide a bisection search, finding quantization configurations within a user-configurable model accuracy degradation range. We evaluate the effectiveness of our method on the ResNet50, MobileNetV2, and BERT models. Our experiments demonstrate latency reductions compared to a 16-bit baseline of $25.48\%$, $21.69\%$, and $33.28\%$ respectively, while maintaining model accuracy to within $99.99\%$ of the baseline model.
Local Map-Based DQN Navigation and a Transferability Metric Using Scene Similarity
Abstract
Autonomous navigation in unknown environments without a global map is a long-standing challenge for mobile robots. While deep reinforcement learning (DRL) has attracted a rapidly growing interest in solving such an autonomous navigation problem for its generalization capability, DRL typically leads to a mediocre navigation performance in practice due to the gap between the training scene and the actual test scene. Most existing work focuses on tuning the algorithm to enhance its transferability, whereas few investigates how to quantify or measure the gap therebetween. This letter presents a local map-based deep Q-network (DQN) navigation algorithm, which uses local maps converted from 2D LiDAR data as observations without a global map. More importantly, this letter proposes a new transferability metric -- the scene similarity calculated from an improved image template matching algorithm to measure the similarity between the training and test scenes. With a wheeled robot as the case study platform, both simulation and real-world experiments are conducted in a total of 20 different scenes. The case study results successfully validate the local map-based navigation algorithm as well as the similarity metric in predicting the transferability or success rate of the algorithm.
Neuro-Symbolic Approaches for Context-Aware Human Activity Recognition
Abstract
Deep Learning models are a standard solution for sensor-based Human Activity Recognition (HAR), but their deployment is often limited by labeled data scarcity and models' opacity. Neuro-Symbolic AI (NeSy) provides an interesting research direction to mitigate these issues by infusing knowledge about context information into HAR deep learning classifiers. However, existing NeSy methods for context-aware HAR require computationally expensive symbolic reasoners during classification, making them less suitable for deployment on resource-constrained devices (e.g., mobile devices). Additionally, NeSy approaches for context-aware HAR have never been evaluated on in-the-wild datasets, and their generalization capabilities in real-world scenarios are questionable. In this work, we propose a novel approach based on a semantic loss function that infuses knowledge constraints in the HAR model during the training phase, avoiding symbolic reasoning during classification. Our results on scripted and in-the-wild datasets show the impact of different semantic loss functions in outperforming a purely data-driven model. We also compare our solution with existing NeSy methods and analyze each approach's strengths and weaknesses. Our semantic loss remains the only NeSy solution that can be deployed as a single DNN without the need for symbolic reasoning modules, reaching recognition rates close (and better in some cases) to existing approaches.
Efficient Multi-Task Scene Analysis with RGB-D Transformers
Authors: Söhnke Benedikt Fischedick, Daniel Seichter, Robin Schmidt, Leonard Rabes, Horst-Michael Gross
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated in a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimization for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
SMUG Planner: A Safe Multi-Goal Planner for Mobile Robots in Challenging Environments
Authors: Changan Chen, Jonas Frey, Philip Arm, Marco Hutter
Abstract
Robotic exploration or monitoring missions require mobile robots to autonomously and safely navigate between multiple target locations in potentially challenging environments. Currently, this type of multi-goal mission often relies on humans designing a set of actions for the robot to follow in the form of a path or waypoints. In this work, we consider the multi-goal problem of visiting a set of pre-defined targets, each of which could be visited from multiple potential locations. To increase autonomy in these missions, we propose a safe multi-goal (SMUG) planner that generates an optimal motion path to visit those targets. To increase safety and efficiency, we propose a hierarchical state validity checking scheme, which leverages robot-specific traversability learned in simulation. We use LazyPRM* with an informed sampler to accelerate collision-free path generation. Our iterative dynamic programming algorithm enables the planner to generate a path visiting more than ten targets within seconds. Moreover, the proposed hierarchical state validity checking scheme reduces the planning time by 30% compared to pure volumetric collision checking and increases safety by avoiding high-risk regions. We deploy the SMUG planner on the quadruped robot ANYmal and show its capability to guide the robot in multi-goal missions fully autonomously on rough terrain.
Keyword: pruning
Muti-Scale And Token Mergence: Make Your ViT More Efficient
Authors: Zhe Bian, Zhe Wang, Wenqiang Han, Kangping Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Since its inception, Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain. Nonetheless, the multi-head self-attention (MHSA) mechanism in ViT is computationally expensive due to its calculation of relationships among all tokens. Although some techniques mitigate computational overhead by discarding tokens, this also results in the loss of potential information from those tokens. To tackle these issues, we propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens, thereby mitigating the impact of pruning on model performance. Crucial and non-crucial tokens are identified by their importance scores and merged based on similarity scores. Furthermore, multi-scale features are exploited to represent images, which are fused prior to token pruning to produce richer feature representations. Importantly, our method can be seamlessly integrated with various ViTs, enhancing their adaptability. Experimental evidence substantiates the efficacy of our approach in reducing the influence of token pruning on model performance. For instance, on the ImageNet dataset, it achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Abstract
Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with a customised hardware towards each layer, achieving ultra high throughput and low latency. The deployment of neural networks to such dataflow architecture accelerators is usually hindered by the available on-chip memory as it is desirable to preload the weights of neural networks on-chip to maximise the system performance. To address this, networks are usually compressed before the deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x throughput per DSP to state-of-the-art CNNs. Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD
Magnitude Attention-based Dynamic Pruning
Authors: Jihye Back, Namhyuk Ahn, Jangho Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Existing pruning methods utilize the importance of each weight based on specified criteria only when searching for a sparse structure but do not utilize it during training. In this work, we propose a novel approach - \textbf{M}agnitude \textbf{A}ttention-based Dynamic \textbf{P}runing (MAP) method, which applies the importance of weights throughout both the forward and backward paths to explore sparse model structures dynamically. Magnitude attention is defined based on the magnitude of weights as continuous real-valued numbers enabling a seamless transition from a redundant to an effective sparse network by promoting efficient exploration. Additionally, the attention mechanism ensures more effective updates for important layers within the sparse network. In later stages of training, our approach shifts from exploration to exploitation, exclusively updating the sparse model composed of crucial weights based on the explored structure, resulting in pruned models that not only achieve performance comparable to dense models but also outperform previous pruning methods on CIFAR-10/100 and ImageNet.
Large-scale Dataset Pruning with Dynamic Uncertainty
Authors: Muyang He, Shuo Yang, Tiejun Huang, Bo Zhao
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As the outcome, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop. We propose a simple yet effective dataset pruning method by exploring both the prediction uncertainty and training dynamics. To our knowledge, this is the first work to study dataset pruning on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves 75% lossless compression ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.
Keyword: diffusion
DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models
Authors: Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, Jiliang Tang
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Recently, Generative Diffusion Models (GDMs) have showcased their remarkable capabilities in learning and generating images. A large community of GDMs has naturally emerged, further promoting the diversified applications of GDMs in various fields. However, this unrestricted proliferation has raised serious concerns about copyright protection. For example, artists including painters and photographers are becoming increasingly concerned that GDMs could effortlessly replicate their unique creative works without authorization. In response to these challenges, we introduce a novel watermarking scheme, DiffusionShield, tailored for GDMs. DiffusionShield protects images from copyright infringement by GDMs through encoding the ownership information into an imperceptible watermark and injecting it into the images. Its watermark can be easily learned by GDMs and will be reproduced in their generated images. By detecting the watermark from generated images, copyright infringement can be exposed with evidence. Benefiting from the uniformity of the watermarks and the joint optimization method, DiffusionShield ensures low distortion of the original image, high watermark detection performance, and the ability to embed lengthy messages. We conduct rigorous and comprehensive experiments to show the effectiveness of DiffusionShield in defending against infringement by GDMs and its superiority over traditional watermarking methods.
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
Authors: George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem
Abstract
We systematically study a wide variety of image-based generative models spanning semantically-diverse datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 16 modern metrics for evaluating the overall performance, fidelity, diversity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization; none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation we release all generated image datasets, human evaluation data, and a modular library to compute 16 common metrics for 8 different encoders at https://github.com/layer6ai-labs/dgm-eval.
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
Authors: Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The rapid advancement of generative models, facilitating the creation of hyper-realistic images from textual descriptions, has concurrently escalated critical societal concerns such as misinformation. Traditional fake detection mechanisms, although providing some mitigation, fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user's unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach, incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates near-perfect attribution accuracy with a minimal impact on output quality. We rigorously scrutinize our method's secrecy under two distinct scenarios: one where a malicious user attempts to detect the fingerprint, and another where a user possesses a comprehensive understanding of our method. We also evaluate the robustness of our approach against various image post-processing manipulations typically executed by end-users. Through extensive evaluation of the Stable Diffusion models, our method presents a promising and novel avenue for accountable model distribution and responsible use.
Interpreting and Improving Diffusion Models Using the Euclidean Distance Function
Authors: Frank Permenter, Chenyang Yuan
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract
Denoising is intuitively related to projection. Indeed, under the manifold hypothesis, adding random noise is approximately equivalent to orthogonal perturbation. Hence, learning to denoise is approximately learning to project. In this paper, we use this observation to reinterpret denoising diffusion models as approximate gradient descent applied to the Euclidean distance function. We then provide straight-forward convergence analysis of the DDIM sampler under simple assumptions on the projection-error of the denoiser. Finally, we propose a new sampler based on two simple modifications to DDIM using insights from our theoretical results. In as few as 5-10 function evaluations, our sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models and can generate high quality samples on latent diffusion models.
City-wide Origin-Destination Matrix Generation via Graph Denoising Diffusion
Authors: Can Rong, Jingtao Ding, Zhicheng Liu, Yong Li
Abstract
The Origin-Destination~(OD) matrix provides an estimation of number of individuals traveling between regions, i.e., mobility flow in the city, which is widely-used in urban planning, transportation, etc. Given various city characteristics of urban regions, generating the city-wide OD matrix without using historical flow information has become increasingly appealing to both researchers and practitioners. However, existing works are limited in independent generation of each element, i.e., flow, in OD matrix, overlooking the element relations within the matrix that can be well formulated as a network. In this paper, we instead propose to generate the city-wide OD matrix from the network perspective, and design a graph denoising diffusion method to learn the conditional joint probability distribution of all elements in the OD matrix given city characteristics at region level. To overcome the learning difficulty of the city-wide OD matrix covering over thousands of regions, we decompose the original one-shot generative modeling of the diffusion model into two cascaded stages, corresponding to the generation of network topology and mobility flow, respectively. To further reproduce important network properties contained in city-wide OD matrices, we design an elaborated graph denoising network structure including a node property augmentation module and a graph transformer backbone. Empirical experiments on data collected in two large US cities have verified that our method can generate OD matrices for new cities with network statistics remarkably similar with the ground truth, further achieving superior outperformance over competitive baselines in terms of the generation realism.
Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning
Authors: Jifeng Hu, Yanchao Sun, Sili Huang, SiYuan Guo, Hechang Chen, Li Shen, Lichao Sun, Yi Chang, Dacheng Tao
Abstract
Recent works have shown the potential of diffusion models in computer vision and natural language processing. Apart from the classical supervised learning fields, diffusion models have also shown strong competitiveness in reinforcement learning (RL) by formulating decision-making as sequential generation. However, incorporating temporal information of sequential data and utilizing it to guide diffusion models to perform better generation is still an open challenge. In this paper, we take one step forward to investigate controllable generation with temporal conditions that are refined from temporal information. We observe the importance of temporal conditions in sequential generation in sufficient explorative scenarios and provide a comprehensive discussion and comparison of different temporal conditions. Based on the observations, we propose an effective temporally-conditional diffusion model coined Temporally-Composable Diffuser (TCD), which extracts temporal information from interaction sequences and explicitly guides generation with temporal conditions. Specifically, we separate the sequences into three parts according to time expansion and identify historical, immediate, and prospective conditions accordingly. Each condition preserves non-overlapping temporal information of sequences, enabling more controllable generation when we jointly use them to guide the diffuser. Finally, we conduct extensive experiments and analysis to reveal the favorable applicability of TCD in offline RL tasks, where our method reaches or matches the best performance compared with prior SOTA baselines.
Abstract
Diffusion models have achieved impressive results in generating diverse and realistic data by employing multi-step denoising processes. However, the need for accommodating significant variations in input noise at each time-step has led to diffusion models requiring a large number of parameters for their denoisers. We have observed that diffusion models effectively act as filters for different frequency ranges at each time-step noise. While some previous works have introduced multi-expert strategies, assigning denoisers to different noise intervals, they overlook the importance of specialized operations for high and low frequencies. For instance, self-attention operations are effective at handling low-frequency components (low-pass filters), while convolutions excel at capturing high-frequency features (high-pass filters). In other words, existing diffusion models employ denoisers with the same architecture, without considering the optimal operations for each time-step noise. To address this limitation, we propose a novel approach called Multi-architecturE Multi-Expert (MEME), which consists of multiple experts with specialized architectures tailored to the operations required at each time-step interval. Through extensive experiments, we demonstrate that MEME outperforms large competitors in terms of both generation performance and computational efficiency.
Non-autoregressive Conditional Diffusion Models for Time Series Prediction
Abstract
Recently, denoising diffusion models have led to significant breakthroughs in the generation of images, audio and text. However, it is still an open question on how to adapt their strong modeling ability to model time series. In this paper, we propose TimeDiff, a non-autoregressive diffusion model that achieves high-quality time series prediction with the introduction of two novel conditioning mechanisms: future mixup and autoregressive initialization. Similar to teacher forcing, future mixup allows parts of the ground-truth future predictions for conditioning, while autoregressive initialization helps better initialize the model with basic time series patterns such as short-term trends. Extensive experiments are performed on nine real-world datasets. Results show that TimeDiff consistently outperforms existing time series diffusion models, and also achieves the best overall performance across a variety of the existing strong baselines (including transformers and FiLM).
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
Authors: Yuseung Lee, Kunho Kim, Hyunjin Kim, Minhyuk Sung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The remarkable capabilities of pretrained image diffusion models have been utilized not only for generating fixed-size images but also for creating panoramas. However, naive stitching of multiple images often results in visible seams. Recent techniques have attempted to address this issue by performing joint diffusions in multiple windows and averaging latent features in overlapping regions. However, these approaches, which focus on seamless montage generation, often yield incoherent outputs by blending different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss. Specifically, we compute the gradient of the perceptual loss using the predicted denoised images at each denoising step, providing meaningful guidance for achieving coherent montages. Our experimental results demonstrate that our method produces significantly more coherent outputs compared to previous methods (66.35% vs. 33.65% in our user study) while still maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score).
Interactive Fashion Content Generation Using LLMs and Latent Diffusion Models
Authors: Krishna Sri Ipsit Mantri, Nevasini Sasikumar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Fashionable image generation aims to synthesize images of diverse fashion prevalent around the globe, helping fashion designers in real-time visualization by giving them a basic customized structure of how a specific design preference would look in real life and what further improvements can be made for enhanced customer satisfaction. Moreover, users can alone interact and generate fashionable images by just giving a few simple prompts. Recently, diffusion models have gained popularity as generative models owing to their flexibility and generation of realistic images from Gaussian noise. Latent diffusion models are a type of generative model that use diffusion processes to model the generation of complex data, such as images, audio, or text. They are called "latent" because they learn a hidden representation, or latent variable, of the data that captures its underlying structure. We propose a method exploiting the equivalence between diffusion models and energy-based models (EBMs) and suggesting ways to compose multiple probability distributions. We describe a pipeline on how our method can be used specifically for new fashionable outfit generation and virtual try-on using LLM-guided text-to-image generation. Our results indicate that using an LLM to refine the prompts to the latent diffusion model assists in generating globally creative and culturally diversified fashion styles and reducing bias.
PriSampler: Mitigating Property Inference of Diffusion Models
Authors: Hailong Hu, Jun Pang
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Diffusion models have been remarkably successful in data synthesis. Such successes have also driven diffusion models to apply to sensitive data, such as human face data, but this might bring about severe privacy concerns. In this work, we systematically present the first privacy study about property inference attacks against diffusion models, in which adversaries aim to extract sensitive global properties of the training set from a diffusion model, such as the proportion of the training data for certain sensitive properties. Specifically, we consider the most practical attack scenario: adversaries are only allowed to obtain synthetic data. Under this realistic scenario, we evaluate the property inference attacks on different types of samplers and diffusion models. A broad range of evaluations shows that various diffusion models and their samplers are all vulnerable to property inference attacks. Furthermore, one case study on off-the-shelf pre-trained diffusion models also demonstrates the effectiveness of the attack in practice. Finally, we propose a new model-agnostic plug-in method PriSampler to mitigate the property inference of diffusion models. PriSampler can be directly applied to well-trained diffusion models and support both stochastic and deterministic sampling. Extensive experiments illustrate the effectiveness of our defense and it makes adversaries infer the proportion of properties as close as random guesses. PriSampler also shows its significantly superior performance to diffusion models trained with differential privacy on both model utility and defense performance.
Anomaly Detection in Satellite Videos using Diffusion Models
Authors: Akash Awasthi, Son Ly, Jaer Nizam, Samira Zare, Videet Mehta, Safwan Ahmed, Keshav Shah, Ramakrishna Nemani, Saurabh Prasad, Hien Van Nguyen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
The definition of anomaly detection is the identification of an unexpected event. Real-time detection of extreme events such as wildfires, cyclones, or floods using satellite data has become crucial for disaster management. Although several earth-observing satellites provide information about disasters, satellites in the geostationary orbit provide data at intervals as frequent as every minute, effectively creating a video from space. There are many techniques that have been proposed to identify anomalies in surveillance videos; however, the available datasets do not have dynamic behavior, so we discuss an anomaly framework that can work on very high-frequency datasets to find very fast-moving anomalies. In this work, we present a diffusion model which does not need any motion component to capture the fast-moving anomalies and outperforms the other baseline methods.
Improving Negative-Prompt Inversion via Proximal Guidance
Abstract
DDIM inversion has revealed the remarkable potential of real image editing within diffusion-based methods. However, the accuracy of DDIM reconstruction degrades as larger classifier-free guidance (CFG) scales being used for enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align the reconstruction and inversion trajectories with larger CFG scales, enabling real image editing with cross-attention control. Negative-prompt inversion (NPI) further offers a training-free closed-form solution of NTI. However, it may introduce artifacts and is still constrained by DDIM reconstruction quality. To overcome these limitations, we propose Proximal Negative-Prompt Inversion (ProxNPI), extending the concepts of NTI and NPI. We enhance NPI with a regularization term and reconstruction guidance, which reduces artifacts while capitalizing on its training-free nature. Our method provides an efficient and straightforward approach, effectively addressing real image editing tasks with minimal computational overhead.
Stochastic Multi-Person 3D Motion Forecasting
Authors: Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
This paper aims to deal with the ignored real-world complexities in prior work on human motion forecasting, emphasizing the social properties of multi-person motion, the diversity of motion and social interactions, and the complexity of articulated motion. To this end, we introduce a novel task of stochastic multi-person 3D motion forecasting. We propose a dual-level generative modeling framework that separately models independent individual motion at the local level and social interactions at the global level. Notably, this dual-level modeling mechanism can be achieved within a shared generative model, through introducing learnable latent codes that represent intents of future motion and switching the codes' modes of operation at different levels. Our framework is general; we instantiate it with different generative models, including generative adversarial networks and diffusion models, and various multi-person forecasting models. Extensive experiments on CMU-Mocap, MuPoTS-3D, and SoMoF benchmarks show that our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
Abstract
Image recognition and generation have long been developed independently of each other. With the recent trend towards general-purpose representation learning, the development of general representations for both recognition and generation tasks is also promoted. However, preliminary attempts mainly focus on generation performance, but are still inferior on recognition tasks. These methods are modeled in the vector-quantized (VQ) space, whereas leading recognition methods use pixels as inputs. Our key insights are twofold: (1) pixels as inputs are crucial for recognition tasks; (2) VQ tokens as reconstruction targets are beneficial for generation tasks. These observations motivate us to propose an Alternating Denoising Diffusion Process (ADDP) that integrates these two spaces within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks. Extensive experiments show that our method achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation. Importantly, our method represents the first successful development of general representations applicable to both generation and dense recognition tasks. Code shall be released.
Grounded Text-to-Image Synthesis with Attention Refocusing
Authors: Quynh Phung, Songwei Ge, Jia-Bin Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when multiple objects, attributes, and spatial compositions are involved in the prompt. In this paper, we identify the potential reasons in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses to refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve their alignment between the generated images and the text prompts.
Abstract
We address the problem of making Conformal Prediction (CP) intervals locally adaptive. Most existing methods focus on approximating the object-conditional validity of the intervals by partitioning or re-weighting the calibration set. Our strategy is new and conceptually different. Instead of re-weighting the calibration data, we redefine the conformity measure through a trainable change of variables, $A \to \phi_X(A)$, that depends explicitly on the object attributes, $X$. Under certain conditions and if $\phi_X$ is monotonic in $A$ for any $X$, the transformations produce prediction intervals that are guaranteed to be marginally valid and have $X$-dependent sizes. We describe how to parameterize and train $\phi_X$ to maximize the interval efficiency. Contrary to other CP-aware training methods, the objective function is smooth and can be minimized through standard gradient methods without approximations.
Language Adaptive Weight Generation for Multi-task Visual Grounding
Authors: Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, Xi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Although the impressive performance in visual grounding, the prevailing approaches usually exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with fixed weights without expression-related hints. The passive perception may lead to mismatches (e.g., redundant and missing), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features since the expressions already provide the blueprint of desired visual features. The active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted from the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Along with a neat multi-task head, VG-LAW can be competent in referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance.
Improving Empathetic Dialogue Generation by Dynamically Infusing Commonsense Knowledge
Abstract
In empathetic conversations, individuals express their empathy towards others. Previous work has mainly focused on generating empathetic responses by utilizing the speaker's emotion. Besides, external commonsense knowledge has been applied to enhance the system's understandings of the speaker's situation. However, given an event, commonsense knowledge base contains various relations, potentially leading to confusion for the dialogue system. Consequently, inconsistencies arise among the emotion, generated response and speaker's contextual information. To this end, we propose a novel approach for empathetic response generation, which incorporates an adaptive module for commonsense knowledge selection to ensure consistency between the generated empathetic responses and the speaker's situation. This selected knowledge is used to refine the commonsense cognition and empathy expression for generated responses. Experimental results show that our approach significantly outperforms baseline models in both automatic and human evaluations, exhibiting the generation of more coherent and empathetic responses. Moreover, case studies highlight the interpretability of knowledge selection in the responses and the effectiveness of adaptive module in our model. Code: https://github.com/Hanscal/DCKS.
Adaptive Frequency Green Light Optimal Speed Advisory based on Hybrid Actor-Critic Reinforcement Learning
Abstract
Green Light Optimal Speed Advisory (GLOSA) system suggests speeds to vehicles to assist them in passing through intersections during green intervals, thus reducing traffic congestion and fuel consumption by minimizing the number of stops and idle times at intersections. However, previous research has focused on optimizing the GLOSA algorithm, neglecting the frequency of speed advisory by the GLOSA system. Specifically, some studies provide speed advisory profile at each decision step, resulting in redundant advisory, while others calculate the optimal speed for the vehicle only once, which cannot adapt to dynamic traffic. In this paper, we propose an Adaptive Frequency GLOSA (AF-GLOSA) model based on Hybrid Proximal Policy Optimization (H-PPO), which employs an actor-critic architecture with a hybrid actor network. The hybrid actor network consists of a discrete actor that outputs advisory frequency and a continuous actor that outputs acceleration profiles. Additionally, we design a novel reward function that considers both travel efficiency and fuel consumption. The AF-GLOSA model is evaluated in comparison to traditional GLOSA and learning-based GLOSA methods in a three-lane intersection with a traffic signal in SUMO, under three different levels of traffic density. The results demonstrate that the AF-GLOSA model performs best in reducing average stop times, fuel consumption and CO2 emissions.
Feature Selection using Sparse Adaptive Bottleneck Centroid-Encoder
Abstract
We introduce a novel nonlinear model, Sparse Adaptive Bottleneck Centroid-Encoder (SABCE), for determining the features that discriminate between two or more classes. The algorithm aims to extract discriminatory features in groups while reconstructing the class centroids in the ambient space and simultaneously use additional penalty terms in the bottleneck layer to decrease within-class scatter and increase the separation of different class centroids. The model has a sparsity-promoting layer (SPL) with a one-to-one connection to the input layer. Along with the primary objective, we minimize the $l_{2,1}$-norm of the sparse layer, which filters out unnecessary features from input data. During training, we update class centroids by taking the Hadamard product of the centroids and weights of the sparse layer, thus ignoring the irrelevant features from the target. Therefore the proposed method learns to reconstruct the critical components of class centroids rather than the whole centroids. The algorithm is applied to various real-world data sets, including high-dimensional biological, image, speech, and accelerometer sensor data. We compared our method to different state-of-the-art feature selection techniques, including supervised Concrete Autoencoders (SCAE), Feature Selection Networks (FsNet), Stochastic Gates (STG), and LassoNet. We empirically showed that SABCE features often produced better classification accuracy than other methods on the sequester test sets, setting new state-of-the-art results.
An adaptive augmented Lagrangian method for training physics and equality constrained artificial neural networks
Abstract
Physics and equality constrained artificial neural networks (PECANN) are grounded in methods of constrained optimization to properly constrain the solution of partial differential equations (PDEs) with their boundary and initial conditions and any high-fidelity data that may be available. To this end, adoption of the augmented Lagrangian method within the PECANN framework is paramount for learning the solution of PDEs without manually balancing the individual loss terms in the objective function used for determining the parameters of the neural network. Generally speaking, ALM combines the merits of the penalty and Lagrange multiplier methods while avoiding the ill conditioning and convergence issues associated singly with these methods . In the present work, we apply our PECANN framework to solve forward and inverse problems that have an expanded and diverse set of constraints. We show that ALM with its conventional formulation to update its penalty parameter and Lagrange multipliers stalls for such challenging problems. To address this issue, we propose an adaptive ALM in which each constraint is assigned a unique penalty parameter that evolve adaptively according to a rule inspired by the adaptive subgradient method. Additionally, we revise our PECANN formulation for improved computational efficiency and savings which allows for mini-batch training. We demonstrate the efficacy of our proposed approach by solving several forward and PDE-constrained inverse problems with noisy data, including simulation of incompressible fluid flows with a primitive-variables formulation of the Navier-Stokes equations up to a Reynolds number of 1000.
Open Set Relation Extraction via Unknown-Aware Training
Authors: Jun Zhao, Xin Zhao, Wenyu Zhan, Qi Zhang, Tao Gui, Zhongyu Wei, Yunwen Chen, Xiang Gao, Xuanjing Huang
Abstract
The existing supervised relation extraction methods have achieved impressive performance in a closed-set setting, where the relations during both training and testing remain the same. In a more realistic open-set setting, unknown relations may appear in the test set. Due to the lack of supervision signals from unknown relations, a well-performing closed-set relation extractor can still confidently misclassify them into known relations. In this paper, we propose an unknown-aware training method, regularizing the model by dynamically synthesizing negative instances. To facilitate a compact decision boundary, ``difficult'' negative instances are necessary. Inspired by text adversarial attacks, we adaptively apply small but critical perturbations to original training instances and thus synthesizing negative instances that are more likely to be mistaken by the model as known relations. Experimental results show that this method achieves SOTA unknown relation detection without compromising the classification of known relations.
RE-Matching: A Fine-Grained Semantic Matching Method for Zero-Shot Relation Extraction
Authors: Jun Zhao, Wenyu Zhan, Xin Zhao, Qi Zhang, Tao Gui, Zhongyu Wei, Junzhe Wang, Minlong Peng, Mingming Sun
Abstract
Semantic matching is a mainstream paradigm of zero-shot relation extraction, which matches a given input with a corresponding label description. The entities in the input should exactly match their hypernyms in the description, while the irrelevant contexts should be ignored when matching. However, general matching methods lack explicit modeling of the above matching pattern. In this work, we propose a fine-grained semantic matching method tailored for zero-shot relation extraction. Following the above matching pattern, we decompose the sentence-level similarity score into entity and context matching scores. Due to the lack of explicit annotations of the redundant components, we design a feature distillation module to adaptively identify the relation-irrelevant features and reduce their negative impact on context matching. Experimental results show that our method achieves higher matching $F_1$ score and has an inference speed 10 times faster, when compared with the state-of-the-art methods.
A Melting Pot of Evolution and Learning
Authors: Moshe Sipper, Achiya Elyasaf, Tomer Halperin, Zvika Haramaty, Raz Lapid, Eyal Segal, Itai Tzruia, Snir Vitrack Tamam
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Abstract
We survey eight recent works by our group, involving the successful blending of evolutionary algorithms with machine learning and deep learning: 1. Binary and Multinomial Classification through Evolutionary Symbolic Regression, 2. Classy Ensemble: A Novel Ensemble Algorithm for Classification, 3. EC-KitY: Evolutionary Computation Tool Kit in Python, 4. Evolution of Activation Functions for Deep Learning-Based Image Classification, 5. Adaptive Combination of a Genetic Algorithm and Novelty Search for Deep Neuroevolution, 6. An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Networks, 7. Foiling Explanations in Deep Neural Networks, 8. Patch of Invisibility: Naturalistic Black-Box Adversarial Attacks on Object Detectors.
G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering
Abstract
As a collaborative paradigm, Federated Learning (FL) empowers clients to engage in collective model training without exchanging their respective local data. Nevertheless, FL remains vulnerable to backdoor attacks in which an attacker compromises malicious clients, and injects poisoned model weights into the aggregation process to yield attacker-chosen predictions for particular samples. Existing countermeasures, mainly based on anomaly detection, may erroneously reject legitimate weights while accepting malicious ones, which is due to inadequacies in quantifying client model similarities. Other defense mechanisms prove effective exclusively when confronted with a restricted number of malicious clients, e.g., less than 10%. To address these vulnerabilities, we present G$^2$uardFL, a protective framework that reframes the detection of malicious clients as an attributed graph clustering problem, thereby safeguarding FL systems. This framework employs a client graph clustering technique to identify malicious clients and incorporates an adaptive method to amplify the disparity between the aggregated model and poisoned client models, thereby eliminating previously embedded backdoors. A theoretical analysis of convergence is also performed to demonstrate that the global model closely approximates the model untouched by any backdoor. Through empirical evaluation compared to cutting-edge defenses and against various backdoor attacks, our experimental results indicate that G$^2$uardFL considerably undermines the effectiveness of backdoor attacks while maintaining a negligible impact on the benign sample performance.
Attention Weighted Mixture of Experts with Contrastive Learning for Personalized Ranking in E-commerce
Authors: Juan Gong, Zhenlin Chen, Chaoyi Ma, Zhuojian Xiao, Haonan Wang, Guoyu Tang, Lin Liu, Sulong Xu, Bo Long, Yunjiang Jiang
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract
Ranking model plays an essential role in e-commerce search and recommendation. An effective ranking model should give a personalized ranking list for each user according to the user preference. Existing algorithms usually extract a user representation vector from the user behavior sequence, then feed the vector into a feed-forward network (FFN) together with other features for feature interactions, and finally produce a personalized ranking score. Despite tremendous progress in the past, there is still room for improvement. Firstly, the personalized patterns of feature interactions for different users are not explicitly modeled. Secondly, most of existing algorithms have poor personalized ranking results for long-tail users with few historical behaviors due to the data sparsity. To overcome the two challenges, we propose Attention Weighted Mixture of Experts (AW-MoE) with contrastive learning for personalized ranking. Firstly, AW-MoE leverages the MoE framework to capture personalized feature interactions for different users. To model the user preference, the user behavior sequence is simultaneously fed into expert networks and the gate network. Within the gate network, one gate unit and one activation unit are designed to adaptively learn the fine-grained activation vector for experts using an attention mechanism. Secondly, a random masking strategy is applied to the user behavior sequence to simulate long-tail users, and an auxiliary contrastive loss is imposed to the output of the gate network to improve the model generalization for these users. This is validated by a higher performance gain on the long-tail user test set. Experiment results on a JD real production dataset and a public dataset demonstrate the effectiveness of AW-MoE, which significantly outperforms state-of-art methods. Notably, AW-MoE has been successfully deployed in the JD e-commerce search engine, ...
Progression Cognition Reinforcement Learning with Prioritized Experience for Multi-Vehicle Pursuit
Authors: Xinhang Li, Yiying Yang, Zheng Yuan, Zhe Wang, Qinwen Wang, Chen Xu, Lei Li, Jianhua He, Lin Zhang
Abstract
Multi-vehicle pursuit (MVP) such as autonomous police vehicles pursuing suspects is important but very challenging due to its mission and safety critical nature. While multi-agent reinforcement learning (MARL) algorithms have been proposed for MVP problem in structured grid-pattern roads, the existing algorithms use randomly training samples in centralized learning, which leads to homogeneous agents showing low collaboration performance. For the more challenging problem of pursuing multiple evading vehicles, these algorithms typically select a fixed target evading vehicle for pursuing vehicles without considering dynamic traffic situation, which significantly reduces pursuing success rate. To address the above problems, this paper proposes a Progression Cognition Reinforcement Learning with Prioritized Experience for MVP (PEPCRL-MVP) in urban multi-intersection dynamic traffic scenes. PEPCRL-MVP uses a prioritization network to assess the transitions in the global experience replay buffer according to the parameters of each MARL agent. With the personalized and prioritized experience set selected via the prioritization network, diversity is introduced to the learning process of MARL, which can improve collaboration and task related performance. Furthermore, PEPCRL-MVP employs an attention module to extract critical features from complex urban traffic environments. These features are used to develop progression cognition method to adaptively group pursuing vehicles. Each group efficiently target one evading vehicle in dynamic driving environments. Extensive experiments conducted with a simulator over unstructured roads of an urban area show that PEPCRL-MVP is superior to other state-of-the-art methods. Specifically, PEPCRL-MVP improves pursuing efficiency by 3.95% over TD3-DMAP and its success rate is 34.78% higher than that of MADDPG. Codes are open sourced.
Scalable and Adaptive Log-based Anomaly Detection with Expert in the Loop
Authors: Jinyang Liu, Junjie Huang, Yintong Huo, Zhihan Jiang, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Minzhi Yan, Michael R. Lyu
Abstract
System logs play a critical role in maintaining the reliability of software systems. Fruitful studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions face limitations due to high resource consumption and lack of adaptability to evolving logs. In this paper, we present an accurate, lightweight, and adaptive log-based anomaly detection framework, referred to as SeaLog. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically-growing trie structure for real-time anomaly detection. To enhance TDA's accuracy in response to evolving log data, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to human experts, which can potentially reduce manual verification efforts. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in terms of effectiveness, runs 2X to 10X faster and only consumes 5% to 41% of the memory resource.
LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Abstract
Misogyny and sexism are growing problems in social media. Advances have been made in online sexism detection but the systems are often uninterpretable. SemEval-2023 Task 10 on Explainable Detection of Online Sexism aims at increasing explainability of the sexism detection, and our team participated in all the proposed subtasks. Our system is based on further domain-adaptive pre-training (Gururangan et al., 2020). Building on the Transformer-based models with the domain adaptation, we compare fine-tuning with multi-task learning and show that each subtask requires a different system configuration. In our experiments, multi-task learning performs on par with standard fine-tuning for sexism detection and noticeably better for coarse-grained sexism classification, while fine-tuning is preferable for fine-grained classification.
Ownership Protection of Generative Adversarial Networks
Authors: Hailong Hu, Jun Pang
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Generative adversarial networks (GANs) have shown remarkable success in image synthesis, making GAN models themselves commercially valuable to legitimate model owners. Therefore, it is critical to technically protect the intellectual property of GANs. Prior works need to tamper with the training set or training process, and they are not robust to emerging model extraction attacks. In this paper, we propose a new ownership protection method based on the common characteristics of a target model and its stolen models. Our method can be directly applicable to all well-trained GANs as it does not require retraining target models. Extensive experimental results show that our new method can achieve the best protection performance, compared to the state-of-the-art methods. Finally, we demonstrate the effectiveness of our method with respect to the number of generations of model extraction attacks, the number of generated samples, different datasets, as well as adaptive attacks.
Hybridizable discontinuous Galerkin methods for the Monge-Ampere equation
Abstract
We introduce two hybridizable discontinuous Galerkin (HDG) methods for numerically solving the Monge-Ampere equation. The first HDG method is devised to solve the nonlinear elliptic Monge-Ampere equation by using Newton's method. The second HDG method is devised to solve a sequence of the Poisson equation until convergence to a fixed-point solution of the Monge-Ampere equation is reached. Numerical examples are presented to demonstrate the convergence and accuracy of the HDG methods. Furthermore, the HDG methods are applied to r-adaptive mesh generation by redistributing a given scalar density function via the optimal transport theory. This r-adaptivity methodology leads to the Monge-Ampere equation with a nonlinear Neumann boundary condition arising from the optimal transport of the density function to conform the resulting high-order mesh to the boundary. Hence, we extend the HDG methods to treat the nonlinear Neumann boundary condition. Numerical experiments are presented to illustrate the generation of r-adaptive high-order meshes on planar and curved domains.
Actively learning a Bayesian matrix fusion model with deep side information
Authors: Yangyang Yu, Jordan W. Suchow
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract
High-dimensional deep neural network representations of images and concepts can be aligned to predict human annotations of diverse stimuli. However, such alignment requires the costly collection of behavioral responses, such that, in practice, the deep-feature spaces are only ever sparsely sampled. Here, we propose an active learning approach to adaptively sampling experimental stimuli to efficiently learn a Bayesian matrix factorization model with deep side information. We observe a significant efficiency gain over a passive baseline. Furthermore, with a sequential batched sampling strategy, the algorithm is applicable not only to small datasets collected from traditional laboratory experiments but also to settings where large-scale crowdsourced data collection is needed to accurately align the high-dimensional deep feature representations derived from pre-trained networks.
A shape derivative approach to domain simplification
Abstract
The objective of this study is to address the difficulty of simplifying the geometric model in which a differential problem is formulated, also called defeaturing, while simultaneously ensuring that the accuracy of the solution is maintained under control. This enables faster and more efficient simulations, without sacrificing accuracy. More precisely, we consider an isogeometric discretisation of an elliptic model problem defined on a two-dimensional hierarchical B-spline computational domain with a complex boundary. Starting with an oversimplification of the geometry, we build a goal-oriented adaptive strategy that adaptively reintroduces continuous geometrical features in regions where the analysis suggests a large impact on the quantity of interest. This strategy is driven by an a posteriori estimator of the defeaturing error based on first-order shape sensitivity analysis, and it profits from the local refinement properties of hierarchical B-splines. The adaptive algorithm is described together with a procedure to generate (partially) simplified hierarchical B-spline geometrical domains. Numerical experiments are presented to illustrate the proposed strategy and its limitations.
Keyword: quantization
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Authors: Clemens JS Schaefer, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou, Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, Yu Emma Wang
Abstract
Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an effective ordering of layer sensitivities. We address this by augmenting the estimated Hessian with additional information to capture inter-layer dependencies. We demonstrate that this consistently improves PTQ performance along the accuracy-latency Pareto frontier across multiple models. Our method combines second-order information and inter-layer dependencies to guide a bisection search, finding quantization configurations within a user-configurable model accuracy degradation range. We evaluate the effectiveness of our method on the ResNet50, MobileNetV2, and BERT models. Our experiments demonstrate latency reductions compared to a 16-bit baseline of $25.48\%$, $21.69\%$, and $33.28\%$ respectively, while maintaining model accuracy to within $99.99\%$ of the baseline model.
Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Abstract
Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture, with a customised hardware towards each layer, achieving ultra high throughput and low latency. The deployment of neural networks to such dataflow architecture accelerators is usually hindered by the available on-chip memory as it is desirable to preload the weights of neural networks on-chip to maximise the system performance. To address this, networks are usually compressed before the deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x throughput per DSP to state-of-the-art CNNs. Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD
Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference
Authors: Matteo Risso, Alessio Burrello, Giuseppe Maria Sarda, Luca Benini, Enrico Macii, Massimo Poncino, Marian Verhelst, Daniele Jahier Pagliari
Abstract
The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.
Keyword: efficient
Special Session: Approximation and Fault Resiliency of DNN Accelerators
From Data to Action: Exploring AI and IoT-driven Solutions for Smarter Cities
Prompter: Zero-shot Adaptive Prefixes for Dialogue State Tracking Domain Adaptation
Online Multi-Contact Receding Horizon Planning via Value Function Approximation
Optimal Wheeler Language Recognition
3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels
SKG: A Versatile Information Retrieval and Analysis Framework for Academic Papers with Semantic Knowledge Graphs
Autonomous Capability Assessment of Black-Box Sequential Decision-Making Systems
Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers
Extended Neighboring Extremal Optimal Control with State and Preview Perturbations
Improving Vietnamese Legal Question--Answering System based on Automatic Data Enrichment
InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Learned spatial data partitioning
Longest Common Prefix Arrays for Succinct k-Spectra
The Hybrid Extended Bicycle: A Simple Model for High Dynamic Vehicle Trajectory Planning
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
From Bad to Worse: Using Private Data to Propagate Disinformation on Online Platforms with a Greater Efficiency
Big-data-driven and AI-based framework to enable personalization in wireless networks
Sensing-based Beamforming Design for Joint Performance Enhancement of RIS-Aided ISAC Systems
Flow-based Network Intrusion Detection Based on BERT Masked Language Model
Prefer to Classify: Improving Text Classifiers via Auxiliary Preference Learning
An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection
InfoPrompt: Information-Theoretic Soft Prompt Tuning for Natural Language Understanding
A modified model for topic detection from a corpus and a new metric evaluating the understandability of topics
Robust Learning with Progressive Data Expansion Against Spurious Correlation
Motion Planning for Aerial Pick-and-Place based on Geometric Feasibility Constraints
A Melting Pot of Evolution and Learning
StreetSurf: Extending Multi-view Implicit Surface Reconstruction to Street Views
Numerical computation of the half Laplacian by means of a fast convolution algorithm
Progression Cognition Reinforcement Learning with Prioritized Experience for Multi-Vehicle Pursuit
Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Cost-Efficient Question Answering
Magnitude Attention-based Dynamic Pruning
A Dynamic Feature Interaction Framework for Multi-task Visual Perception
Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization
Causal Fairness for Outcome Control
AutoCharge: Autonomous Charging for Perpetual Quadrotor Missions
Controllable Multi-Objective Re-ranking with Policy Hypernetworks
A Two-dimensional Spatial Optimization Framework for Vehicle Powertrain Systems
Variable Radiance Field for Real-Life Category-Specifc Reconstruction from Single Image
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems
Boosting Adversarial Transferability by Achieving Flat Local Maxima
SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth
Efficient Multi-Task Scene Analysis with RGB-D Transformers
Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models
Simple and Controllable Music Generation
Bayesian Optimisation of Functions on Graphs
Real-time whole-heart electromechanical simulations using Latent Neural Ordinary Differential Equations
Federated Learning under Covariate Shifts with Generalization Guarantees
Actively learning a Bayesian matrix fusion model with deep side information
PEFT-SER: On the Use of Parameter Efficient Transfer Learning Approaches For Speech Emotion Recognition Using Pre-trained Speech Models
Negotiated Reasoning: On Provably Addressing Relative Over-Generalization
Automatic Image Blending Algorithm Based on SAM and DINO
A shape derivative approach to domain simplification
Matting Anything
Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories
SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding
R-MAE: Regions Meet Masked Autoencoders
Improving Negative-Prompt Inversion via Proximal Guidance
Scaling Spherical CNNs
Keyword: faster
Faster Approximation Algorithms for Parameterized Graph Clustering and Edge Labeling
Efficient and Equivariant Graph Networks for Predicting Quantum Hamiltonian
Robust Learning with Progressive Data Expansion Against Spurious Correlation
RE-Matching: A Fine-Grained Semantic Matching Method for Zero-Shot Relation Extraction
Scalable and Adaptive Log-based Anomaly Detection with Expert in the Loop
Real-Time Rendering of Glinty Appearances using Distributed Binomial Laws on Anisotropic Grids
A Meta-Generation framework for Industrial System Generation
Time-Optimal Path Tracking with ISO Safety Guarantees
Habits of Mind: Reusing Action Sequences for Efficient Planning
Real-time GeoAI for High-resolution Mapping and Segmentation of Arctic Permafrost Features
A shape derivative approach to domain simplification
Keyword: mobile
Traction and Stability Control using Fuzzy-based Controller Integration
A Systematic Literature Review on Client Selection in Federated Learning
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Local Map-Based DQN Navigation and a Transferability Metric Using Scene Similarity
Neuro-Symbolic Approaches for Context-Aware Human Activity Recognition
Efficient Multi-Task Scene Analysis with RGB-D Transformers
SMUG Planner: A Safe Multi-Goal Planner for Mobile Robots in Challenging Environments
Keyword: pruning
Muti-Scale And Token Mergence: Make Your ViT More Efficient
Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Magnitude Attention-based Dynamic Pruning
Large-scale Dataset Pruning with Dynamic Uncertainty
Keyword: diffusion
DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models
Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
Interpreting and Improving Diffusion Models Using the Euclidean Distance Function
City-wide Origin-Destination Matrix Generation via Graph Denoising Diffusion
Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning
Multi-Architecture Multi-Expert Diffusion Models
Non-autoregressive Conditional Diffusion Models for Time Series Prediction
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
Interactive Fashion Content Generation Using LLMs and Latent Diffusion Models
PriSampler: Mitigating Property Inference of Diffusion Models
Anomaly Detection in Satellite Videos using Diffusion Models
Improving Negative-Prompt Inversion via Proximal Guidance
Stochastic Multi-Person 3D Motion Forecasting
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
Grounded Text-to-Image Synthesis with Attention Refocusing
Keyword: adaptive
On training locally adaptive CP
Language Adaptive Weight Generation for Multi-task Visual Grounding
Improving Empathetic Dialogue Generation by Dynamically Infusing Commonsense Knowledge
Adaptive Frequency Green Light Optimal Speed Advisory based on Hybrid Actor-Critic Reinforcement Learning
Feature Selection using Sparse Adaptive Bottleneck Centroid-Encoder
An adaptive augmented Lagrangian method for training physics and equality constrained artificial neural networks
Open Set Relation Extraction via Unknown-Aware Training
RE-Matching: A Fine-Grained Semantic Matching Method for Zero-Shot Relation Extraction
A Melting Pot of Evolution and Learning
G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering
Attention Weighted Mixture of Experts with Contrastive Learning for Personalized Ranking in E-commerce
Progression Cognition Reinforcement Learning with Prioritized Experience for Multi-Vehicle Pursuit
Scalable and Adaptive Log-based Anomaly Detection with Expert in the Loop
LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Ownership Protection of Generative Adversarial Networks
Hybridizable discontinuous Galerkin methods for the Monge-Ampere equation
Actively learning a Bayesian matrix fusion model with deep side information
A shape derivative approach to domain simplification
Keyword: quantization
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference