Abstract
Flexible district heating grids form an important part of future, low-carbon energy systems. We examine probabilistic state estimation in such grids, i.e., we aim to estimate the posterior probability distribution over all grid state variables such as pressures, temperatures, and mass flows conditional on measurements of a subset of these states. Since the posterior state distribution does not belong to a standard class of probability distributions, we use Markov Chain Monte Carlo (MCMC) sampling in the space of network heat exchanges and evaluate the samples in the grid state space to estimate the posterior. Converting the heat exchange samples into grid states by solving the non-linear grid equations makes this approach computationally burdensome. However, we propose to speed it up by employing a deep neural network that is trained to approximate the solution of the exact but slow non-linear solver. This novel approach is shown to deliver highly accurate posterior distributions both for classic tree-shaped as well as meshed heating grids, at significantly reduced computational costs that are acceptable for online control. Our state estimation approach thus enables tightening the safety margins for temperature and pressure control and thereby a more efficient grid operation.
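To make the sampling idea concrete, here is a minimal Python sketch of random-walk Metropolis-Hastings over a single heat exchange value, with each sample converted to a grid state by a fast surrogate. Everything specific in it is an assumption for illustration: the one-dimensional setting, the linear `surrogate_state` function (a hypothetical stand-in for the trained neural network), the Gaussian prior and measurement noise, and all numeric values. It is not the authors' implementation.

```python
import math
import random


def surrogate_state(q):
    """Hypothetical stand-in for the trained network: heat exchange -> temperature."""
    return 60.0 + 0.5 * q


def log_posterior(q, measurement, prior_mean=10.0, prior_std=5.0, noise_std=0.5):
    """Unnormalized log posterior: Gaussian prior on q, Gaussian measurement noise."""
    log_prior = -0.5 * ((q - prior_mean) / prior_std) ** 2
    log_lik = -0.5 * ((surrogate_state(q) - measurement) / noise_std) ** 2
    return log_prior + log_lik


def metropolis_hastings(measurement, n_samples=5000, step=1.0, seed=0):
    """Sample heat exchanges and return the corresponding grid-state samples."""
    rng = random.Random(seed)
    q = 10.0  # start at the (assumed) prior mean
    lp = log_posterior(q, measurement)
    states = []
    for _ in range(n_samples):
        proposal = q + rng.gauss(0.0, step)  # symmetric random-walk proposal
        lp_new = log_posterior(proposal, measurement)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            q, lp = proposal, lp_new
        states.append(surrogate_state(q))  # evaluate the sample in state space
    return states
```

Each iteration calls the surrogate once for the proposal, which is why replacing a slow non-linear solver with a cheap approximation dominates the overall cost of such a sampler.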
Adaptive Data Analysis in a Balanced Adversarial Model
Authors: Kobbi Nissim, Uri Stemmer, Eliad Tsfadia
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Abstract
In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations to a sequence of adaptively chosen statistical queries with respect to $D$. Hardt and Ullman (FOCS 2014) and Steinke and Ullman (COLT 2015) showed that in general, it is computationally hard to answer more than $\Theta(n^2)$ adaptive queries, assuming the existence of one-way functions. However, these negative results strongly rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, as the analyst, who chooses the adaptive queries, also chooses the underlying distribution $D$. This imbalance raises questions with respect to the applicability of the obtained hardness results -- an analyst who has complete knowledge of the underlying distribution $D$ would have little need, if any, to issue statistical queries to a mechanism which only holds a finite number of samples from $D$. We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separate algorithms: the \emph{sampler}, who chooses the distribution and provides the samples to the mechanism, and the \emph{analyst}, who chooses the adaptive queries but does not have prior knowledge of the underlying distribution. We improve the quality of previous lower bounds by revisiting them using an efficient \emph{balanced} adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded \emph{balanced} adversary that has the structure of all known attacks implies the existence of public-key cryptography.
Trends and Challenges Towards an Effective Data-Driven Decision Making in UK SMEs: Case Studies and Lessons Learnt from the Analysis of 85 SMEs
Abstract
The adoption of data science brings vast benefits to Small and Medium-sized Enterprises (SMEs) including business productivity, economic growth, innovation and job creation. Data Science can support SMEs to optimise production processes, anticipate customers' needs, predict machinery failures and deliver efficient smart services. Businesses can also harness the power of Artificial Intelligence (AI) and Big Data and the smart use of digital technologies to enhance productivity and performance, paving the way for innovation. However, integrating data science decisions into an SME requires both skills and IT investments. In most cases, such expenses are beyond the means of SMEs due to limited resources and restricted access to financing. This paper presents trends and challenges towards effective data-driven decision making for organisations based on a case study of 85 SMEs, mostly from the West Midlands region of England. The work is supported as part of a 3-year ERDF (European Regional Development Fund) project in the areas of big data management, analytics and business intelligence. We present two case studies that demonstrate the potential of Digitisation, AI and Machine Learning and use these as examples to unveil challenges and showcase the wealth of currently available opportunities for SMEs.
Foundational Models for Malware Embeddings Using Spatio-Temporal Parallel Convolutional Networks
Authors: Dhruv Nandakumar, Devin Quinn, Elijah Soba, Eunyoung Kim, Christopher Redino, Chris Chan, Kevin Choi, Abdul Rahman, Edward Bowen
Abstract
In today's interconnected digital landscape, the proliferation of malware poses a significant threat to the security and stability of computer networks and systems worldwide. As the complexity of malicious tactics, techniques, and procedures (TTPs) continuously grows to evade detection, so does the need for advanced methods capable of capturing and characterizing malware behavior. The current state of the art in malware classification and detection uses task-specific objectives; however, this method fails to generalize to other downstream tasks involving the same malware class. In this paper, the authors introduce a novel method that combines convolutional neural networks, standard graph embedding techniques, and a metric learning objective to extract meaningful information from network flow data and create strong embeddings characterizing malware behavior. These embeddings enable the development of highly accurate, efficient, and generalizable machine learning models for tasks such as malware strain classification, zero-day threat detection, and closest attack type attribution, as demonstrated in this paper. A shift from task-specific objectives to strong embeddings will not only allow rapid iteration of cyber-threat detection models, but also allow different modalities to be introduced in the development of these models.
On Semantically-Deterministic Automata
Authors: Bader Abu Radi, Orna Kupferman
Subjects: Formal Languages and Automata Theory (cs.FL)
Abstract
A nondeterministic automaton is semantically deterministic (SD) if different nondeterministic choices in the automaton lead to equivalent states. Semantic determinism is interesting as it is a natural relaxation of determinism, and as some applications of deterministic automata in formal methods can actually use automata with some level of nondeterminism, tightly related to semantic determinism. In the context of finite words, semantic determinism coincides with determinism, in the sense that every pruning of an SD automaton to a deterministic one results in an equivalent automaton. We study SD automata on infinite words, focusing on B\"uchi, co-B\"uchi, and weak automata. We show that there, while semantic determinism does not increase the expressive power, the combinatorial and computational properties of SD automata are very different from those of deterministic automata. In particular, SD B\"uchi and co-B\"uchi automata are exponentially more succinct than deterministic ones (in fact, also exponentially more succinct than history-deterministic automata), their complementation involves an exponential blow-up, and decision procedures for them like universality and minimization are PSPACE-complete. For weak automata, we show that while an SD weak automaton need not be pruned to an equivalent deterministic one, it can be determinized to an equivalent deterministic weak automaton with the same state space, implying also efficient complementation and decision procedures for SD weak automata.
Improving selective classification performance of deep neural networks through post-hoc logit normalization and temperature scaling
Abstract
This paper addresses the problem of selective classification for deep neural networks, where a model is allowed to abstain from low-confidence predictions to avoid potential errors. Specifically, we tackle the problem of optimizing the confidence estimator of a fixed classifier, aiming to enhance its misclassification detection performance, i.e., its ability to discriminate between correct and incorrect predictions by assigning higher confidence values to the correct ones. Previous work has found that different classifiers exhibit varying levels of misclassification detection performance, particularly when using the maximum softmax probability (MSP) as a measure of confidence. However, we argue that these findings are mainly due to a sub-optimal confidence estimator being used for each model. To overcome this issue, we propose a simple and efficient post-hoc confidence estimator, named $p$-NormSoftmax, which consists of transforming the logits through $p$-norm normalization and temperature scaling, followed by taking the MSP, where $p$ and the temperature are optimized based on a hold-out set. This estimator can be easily applied on top of an already trained model and, in many cases, can significantly improve its selective classification performance. When applied to 84 pretrained Imagenet classifiers, our method yields an average improvement of 16% in the area under the risk-coverage curve (AURC), exceeding 40% for some models. Furthermore, after applying $p$-NormSoftmax, we observe that these models exhibit approximately the same level of misclassification detection performance, implying that a model's selective classification performance is almost entirely determined by its accuracy at full coverage.
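As a rough illustration of the estimator described above, the following minimal Python sketch divides the logits by their $p$-norm, applies a temperature, and takes the maximum softmax probability. The function names, the absence of any centering step, and the default values of `p` and `temperature` are assumptions made for illustration; the abstract states only that $p$ and the temperature are tuned on a hold-out set, and the paper's exact formulation may differ.

```python
import math


def p_norm_softmax_confidence(logits, p=2.0, temperature=1.0):
    """Max softmax probability of p-norm-normalized, temperature-scaled logits."""
    norm = sum(abs(z) ** p for z in logits) ** (1.0 / p)
    if norm == 0.0:
        # All-zero logits carry no preference: return the uniform probability.
        return 1.0 / len(logits)
    scaled = [z / (norm * temperature) for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    return max(exps) / sum(exps)


def selective_predict(logits, threshold, p=2.0, temperature=1.0):
    """Return the predicted class index, or None to abstain when confidence is low."""
    conf = p_norm_softmax_confidence(logits, p, temperature)
    if conf < threshold:
        return None
    return max(range(len(logits)), key=lambda i: logits[i])
```

Because the logits are divided by their own norm, the confidence in this sketch is invariant to a positive rescaling of the logit vector, so the temperature alone controls how peaked the softmax is.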
Task-aware Distributed Source Coding under Dynamic Bandwidth
Abstract
Efficient compression of correlated data is essential to minimize communication overload in multi-sensor networks. In such networks, each sensor independently compresses the data and transmits them to a central node due to limited communication bandwidth. A decoder at the central node decompresses and passes the data to a pre-trained machine learning-based task to generate the final output. Thus, it is important to compress the features that are relevant to the task. Additionally, the final performance depends heavily on the total available bandwidth. In practice, it is common to encounter varying availability in bandwidth, and higher bandwidth results in better performance of the task. We design a novel distributed compression framework composed of independent encoders and a joint decoder, which we call neural distributed principal component analysis (NDPCA). NDPCA flexibly compresses data from multiple sources to any available bandwidth with a single model, reducing computing and storage overhead. NDPCA achieves this by learning low-rank task representations and efficiently distributing bandwidth among sensors, thus providing a graceful trade-off between performance and bandwidth. Experiments show that NDPCA improves the success rate of multi-view robotic arm manipulation by 9% and the accuracy of object detection tasks on satellite imagery by 14% compared to an autoencoder with uniform bandwidth allocation.
Post-processing Private Synthetic Data for Improving Utility on Selected Measures
Authors: Hao Wang, Shivchander Sudalairaj, John Henning, Kristjan Greenewald, Akash Srivastava
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB); Information Theory (cs.IT)
Abstract
Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
Hybrid Eigensolvers for Nuclear Configuration Interaction Calculations
Authors: Abdullah Alperen, Metin Aktulga, Pieter Maris, Chao Yang
Abstract
We examine and compare several iterative methods for solving large-scale eigenvalue problems arising from nuclear structure calculations. In particular, we discuss the possibility of using the block Lanczos method, Chebyshev filtering based subspace iteration, and the residual minimization method accelerated by direct inversion of the iterative subspace (RMM-DIIS), and describe how these algorithms compare with the standard Lanczos algorithm and the locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm. Although the RMM-DIIS method does not exhibit rapid convergence when the initial approximations to the desired eigenvectors are not sufficiently accurate, it can be effectively combined with either the block Lanczos or the LOBPCG method to yield a hybrid eigensolver that has several desirable properties. We describe a few practical issues that need to be addressed to make the hybrid solver efficient and robust.
Deep Reinforcement Learning with Plasticity Injection
Authors: Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, André Barreto
Abstract
A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon are hampered by the complex relationship between plasticity, exploration, and performance in RL. This paper introduces plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold. First, it serves as a diagnostic tool: if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent has to re-learn from scratch due to exhausted plasticity or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient.
Non-Parametric Learning of Stochastic Differential Equations with Fast Rates of Convergence
Authors: Riccardo Bonalli, Alessandro Rudi
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract
We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of non-linear stochastic differential equations, which relies upon discrete-time observations of the state. The key idea essentially consists of fitting an RKHS-based approximation of the corresponding Fokker-Planck equation to such observations, yielding theoretical estimates of learning rates which, unlike previous works, become increasingly tight as the regularity of the unknown drift and diffusion coefficients increases. Our method being kernel-based, offline pre-processing may in principle be profitably leveraged to enable efficient numerical implementation.
Lightweight Learner for Shared Knowledge Lifelong Learning
Authors: Yunhao Ge, Yuecheng Li, Di Wu, Ao Xu, Adam M. Jones, Amanda Sofie Rios, Iordanis Fostiropoulos, Shixian Wen, Po-Hsuan Huang, Zachary William Murdock, Gozde Sahin, Shuo Ni, Kiran Lekkala, Sumedh Anand Sontakke, Laurent Itti
Abstract
In Lifelong Learning (LL), agents continually learn as they encounter new conditions and tasks. Most current LL is limited to a single agent that learns tasks sequentially. Dedicated LL machinery is then deployed to mitigate the forgetting of old tasks as new tasks are learned. This is inherently slow. We propose a new Shared Knowledge Lifelong Learning (SKILL) challenge, which deploys a decentralized population of LL agents that each sequentially learn different tasks, with all agents operating independently and in parallel. After learning their respective tasks, agents share and consolidate their knowledge over a decentralized communication network, so that, in the end, all agents can master all tasks. We present one solution to SKILL which uses Lightweight Lifelong Learning (LLL) agents, where the goal is to facilitate efficient sharing by minimizing the fraction of the agent that is specialized for any given task. Each LLL agent thus consists of a common task-agnostic immutable part, where most parameters are, and individual task-specific modules that contain fewer parameters but are adapted to each task. Agents share their task-specific modules, plus summary information ("task anchors") representing their tasks in the common task-agnostic latent space of all agents. Receiving agents register each received task-specific module using the corresponding anchor. Thus, every agent improves its ability to solve new tasks each time new task-specific modules and anchors are received. On a new, very challenging SKILL-102 dataset with 102 image classification tasks (5,033 classes in total, 2,041,225 training, 243,464 validation, and 243,464 test images), we achieve much higher (and SOTA) accuracy over 8 LL baselines, while also achieving near perfect parallelization. Code and data can be found at https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learning
Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning
Abstract
Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of finding a global optimum of an expensive-to-evaluate black-box function efficiently. In general, a probabilistic regression model, e.g., Gaussian processes, random forests, and Bayesian neural networks, is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based Bayesian optimization, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close to and relatively far from a global optimum. Developing this line of research further, a supervised classifier can be employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy tend to be overconfident for a global solution candidate. To solve this overconfidence problem, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning. Finally, we demonstrate the experimental results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool.
GFairHint: Improving Individual Fairness for Graph Neural Networks via Fairness Hint
Authors: Paiheng Xu, Yuhang Zhou, Bang An, Wei Ai, Furong Huang
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Abstract
Given the growing concerns about fairness in machine learning and the impressive performance of Graph Neural Networks (GNNs) on graph data learning, algorithmic fairness in GNNs has attracted significant attention. While many existing studies improve fairness at the group level, only a few works promote individual fairness, which renders similar outcomes for similar individuals. A desirable framework that promotes individual fairness should (1) balance between fairness and performance, (2) accommodate two commonly-used individual similarity measures (externally annotated and computed from input features), (3) generalize across various GNN models, and (4) be computationally efficient. Unfortunately, none of the prior work achieves all the desirables. In this work, we propose a novel method, GFairHint, which promotes individual fairness in GNNs and achieves all aforementioned desirables. GFairHint learns fairness representations through an auxiliary link prediction task, and then concatenates the representations with the learned node embeddings in original GNNs as a "fairness hint". Through extensive experimental investigations on five real-world graph datasets under three prevalent GNN models covering both individual similarity measures above, GFairHint achieves the best fairness results in almost all combinations of datasets with various backbone models, while generating comparable utility results, with much less computational cost compared to the previous state-of-the-art (SoTA) method.
Abstract
The current approach to connected and autonomous driving function development and evaluation uses model-in-the-loop simulation, hardware-in-the-loop simulation, and limited proving ground work followed by public road deployment of beta versions of software and technology. In this approach, the rest of the road users are involuntarily forced into taking part in the development and evaluation of these connected and autonomous driving functions. This is an unsafe, costly and inefficient method. Motivated by these shortcomings, this paper introduces the Vehicle-in-Virtual-Environment (VVE) method of safe, efficient and low-cost connected and autonomous driving function development, evaluation and demonstration. The VVE method is compared to the existing state-of-the-art. Its basic implementation for a path following task is used to explain the method, in which the actual autonomous vehicle operates in a large empty area with its sensor feeds being replaced by realistic sensor feeds corresponding to its location and pose in the virtual environment. It is possible to easily change the development virtual environment and inject rare and difficult events which can be tested very safely. Vehicle-to-Pedestrian (V2P) communication based pedestrian safety is chosen as the application use case for VVE and corresponding experimental results are presented and discussed. It is noted that actual pedestrians and other vulnerable road users can be used very safely in this approach.
How to escape sharp minima
Authors: Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract
Modern machine learning applications have seen a remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this paradigm, this work formulates and studies the algorithmic question of how to find flat minima. As an initial effort, this work adopts the trace of the Hessian of the cost function as the measure of flatness, and formally defines the notion of approximate flat minima. Under this notion, we then design algorithms that find approximate flat minima efficiently. For general cost functions, we present a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.
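One standard way to see why gradients at randomly perturbed iterates carry information about the trace of the Hessian is a Taylor expansion under Gaussian perturbations. The following is a sketch under assumptions not stated in the abstract: $f$ is sufficiently smooth, the perturbation is $u \sim \mathcal{N}(0, I)$ with scale $\sigma$, and the paper's actual perturbation distribution and estimator may differ.

```latex
\begin{align*}
\nabla f(x + \sigma u)
  &= \nabla f(x) + \sigma \nabla^2 f(x)\, u
     + \frac{\sigma^2}{2} \nabla^3 f(x)[u, u] + O(\sigma^3), \\
\mathbb{E}_u\!\left[ \nabla f(x + \sigma u) \right]
  &= \nabla f(x)
     + \frac{\sigma^2}{2} \nabla \operatorname{tr}\!\left( \nabla^2 f(x) \right)
     + O(\sigma^4).
\end{align*}
```

The first-order term vanishes because $\mathbb{E}[u] = 0$, the second-order term reduces to the gradient of the trace because $\mathbb{E}[u u^\top] = I$, and the odd third-order term vanishes by symmetry. At a minimum, where $\nabla f(x) = 0$, the averaged perturbed gradient therefore points along the gradient of the flatness measure, up to higher-order terms.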
Accelerated solutions of convection-dominated partial differential equations using implicit feature tracking and empirical quadrature
Authors: Marzieh Alireza Mirhoseini, Matthew J. Zahr
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
This work introduces an empirical quadrature-based hyperreduction procedure and greedy training algorithm to effectively reduce the computational cost of solving convection-dominated problems with limited training. The proposed approach circumvents the slowly decaying $n$-width limitation of linear model reduction techniques applied to convection-dominated problems by using a nonlinear approximation manifold systematically defined by composing a low-dimensional affine space with bijections of the underlying domain. The reduced-order model is defined as the solution of a residual minimization problem over the nonlinear manifold. An online-efficient method is obtained by using empirical quadrature to approximate the optimality system such that it can be solved with mesh-independent operations. The proposed reduced-order model is trained using a greedy procedure to systematically sample the parameter domain. The effectiveness of the proposed approach is demonstrated on two shock-dominated computational fluid dynamics benchmarks.
Mixture-of-Expert Conformer for Streaming Multilingual ASR
Authors: Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming truly multilingual Conformer incorporating mixture-of-expert (MoE) layers that learn to only activate a subset of parameters in training and inference. The MoE layer consists of a softmax gate which chooses the best two experts among many in forward propagation. The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases. We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline. Compared to an adapter model using ground truth information, our MoE model achieves similar WER and activates a similar number of parameters but without any language information. We further show around 3% relative WER improvement by multilingual shallow fusion.
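The top-two gating described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's layer: the gate is taken to be a linear map followed by a softmax, the two selected gate probabilities are renormalized to sum to one, experts are plain callables that return a vector of the same length as the input, and the names `top2_moe`, `gate_weights`, and `experts` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(x, gate_weights, experts):
    """Route x to the two highest-scoring experts and mix their outputs.

    gate_weights has one row per expert; experts is a list of callables.
    Returns (output vector, indices of the two selected experts).
    """
    gate_logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # renormalize over the selected pair (assumed)
    out = [0.0] * len(x)
    for i in top2:
        y = experts[i](x)  # only the selected experts are evaluated
        weight = probs[i] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, top2
```

Only two experts are ever called per input, so the compute per forward pass stays fixed as more experts are appended to `experts`, which is the property the abstract highlights.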
Abstract
Offline-to-online reinforcement learning (RL), by combining the benefits of offline pretraining and online finetuning, promises enhanced sample efficiency and policy performance. However, existing methods, effective as they are, suffer from suboptimal performance, limited adaptability, and unsatisfactory computational efficiency. We propose a novel framework, PROTO, which overcomes the aforementioned limitations by augmenting the standard RL objective with an iteratively evolving regularization term. Performing a trust-region-style update, PROTO yields stable initial finetuning and optimal final performance by gradually evolving the regularization term to relax the constraint strength. By adjusting only a few lines of code, PROTO can bridge any offline policy pretraining and standard off-policy RL finetuning to form a powerful offline-to-online RL pathway, giving it great adaptability to diverse methods. Simple yet elegant, PROTO imposes minimal additional computation and enables highly efficient online finetuning. Extensive experiments demonstrate that PROTO achieves superior performance over SOTA baselines, offering an adaptable and efficient offline-to-online RL framework.
Asking Before Action: Gather Information in Embodied Decision Making with Language Models
Abstract
With strong capabilities of reasoning and a generic understanding of the world, Large Language Models (LLMs) have shown great potential in building versatile embodied decision making agents capable of performing diverse tasks. However, when deployed to unfamiliar environments, we show that LLM agents face challenges in efficiently gathering necessary information, leading to suboptimal performance. On the other hand, in unfamiliar scenarios, human individuals often seek additional information from their peers before taking action, leveraging external knowledge to avoid unnecessary trial and error. Building upon this intuition, we propose \textit{Asking Before Action} (ABA), a method that empowers the agent to proactively query external sources for pertinent information using natural language during its interactions in the environment. In this way, the agent is able to enhance its efficiency and performance by mitigating wasteful steps and circumventing the difficulties associated with exploration in unfamiliar environments. We empirically evaluate our method on an embodied decision making benchmark, ALFWorld, and demonstrate that despite modest modifications in prompts, our method exceeds baseline LLM agents by more than 40%. Further experiments on two variants of ALFWorld illustrate that by imitation learning, ABA effectively retains and reuses queried and known information in subsequent tasks, mitigating the need for repetitive inquiries. Both qualitative and quantitative results exhibit remarkable performance on tasks that previous methods struggle to solve.
Privacy Protectability: An Information-theoretical Approach
Authors: Siping Shi, Bihai Zhang, Dan Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract
Recently, inference privacy has attracted increasing attention. The inference privacy concern arises most notably in the widely deployed edge-cloud video analytics systems, where the cloud needs the videos captured from the edge. The video data can contain sensitive information and are subject to attack when they are transmitted to the cloud for inference. Many privacy protection schemes have been proposed. Yet, the performance of a scheme needs to be determined by experiments or inferred by analyzing the specific case. In this paper, we propose a new metric, \textit{privacy protectability}, to characterize to what degree a video stream can be protected given a certain video analytics task. Such a metric has strong operational meaning. For example, low protectability means that it may be necessary to set up an overall secure environment. We can also evaluate a privacy protection scheme: for example, assuming the scheme obfuscates the video data, we can ask what level of protection it has achieved after obfuscation. Our definition of privacy protectability is rooted in information theory and we develop efficient algorithms to estimate the metric. We use experiments on real data to validate that our metric is consistent with empirical measurements on how well a video stream can be protected for a video analytics task.
Abstract
Deep neural networks (DNNs) have demonstrated extraordinary capabilities and are an integral part of modern software systems. However, they also suffer from various vulnerabilities such as adversarial attacks and unfairness. Testing deep learning (DL) systems is therefore an important task, to detect and mitigate those vulnerabilities. Motivated by the success of traditional software testing, which often employs diversity heuristics, various diversity measures on DNNs have been proposed to help efficiently expose the buggy behavior of DNNs. In this work, we argue that many DNN testing tasks should be treated as directed testing problems rather than general-purpose testing tasks, because these tasks are specific and well-defined. Hence, the diversity-based approach is less effective. Following our argument based on the semantics of DNNs and the testing goal, we derive six metrics that can be used for DNN testing and carefully analyze their application scopes. We empirically show their efficacy in exposing bugs in DNNs compared to recent diversity-based metrics. Moreover, we also notice discrepancies between the practices of the software engineering (SE) community and the DL community. We point out some of these gaps, in the hope that this can help bridge SE practice and DL findings.
Efficient Neural Music Generation
Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang
Abstract
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
Enhancing the Ranking Context of Dense Retrieval Methods through Reciprocal Nearest Neighbors
Authors: George Zerveas, Navid Rekabsaz, Carsten Eickhoff
Abstract
Sparse annotation poses persistent challenges to training dense retrieval models, such as the problem of false negatives, i.e. unlabeled relevant documents that are spuriously used as negatives in contrastive learning, distorting the training signal. To alleviate this problem, we introduce evidence-based label smoothing, a computationally efficient method that prevents penalizing the model for assigning high relevance to false negatives. To compute the target relevance distribution over candidate documents within the ranking context of a given query, candidates most similar to the ground truth are assigned a non-zero relevance probability based on the degree of their similarity to the ground-truth document(s). As a relevance estimate we leverage an improved similarity metric based on reciprocal nearest neighbors, which can also be used independently to rerank candidates in post-processing. Through extensive experiments on two large-scale ad hoc text retrieval datasets we demonstrate that both methods can improve the ranking effectiveness of dense retrieval models.
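The idea of assigning non-zero relevance only to candidates that are reciprocal nearest neighbors of the ground truth can be illustrated with a small sketch. This is a simplified illustration, not the authors' exact formulation: the similarity matrix `sim`, the neighborhood size `k`, the scaling factor `weight`, and the function names are all hypothetical, and raw similarity is used in place of the improved metric mentioned in the abstract.

```python
import numpy as np

def top_k_neighbors(sim, i, k):
    """Return the set of the k items most similar to item i, excluding i itself."""
    order = np.argsort(-sim[i])
    order = order[order != i]
    return set(order[:k].tolist())

def smoothed_targets(sim, gt, k=2, weight=0.5):
    """Build a target relevance distribution over candidates.

    The ground-truth item gets mass 1; a candidate j gets extra mass only if
    j is among gt's top-k neighbors AND gt is among j's top-k neighbors.
    The result is normalized to sum to 1.
    """
    target = np.zeros(sim.shape[0])
    target[gt] = 1.0
    for j in top_k_neighbors(sim, gt, k):
        if gt in top_k_neighbors(sim, j, k):  # reciprocity check
            target[j] = weight * sim[gt, j]
    return target / target.sum()

# Toy similarity matrix: items 0 and 1 are close, items 2 and 3 are close.
sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
targets = smoothed_targets(sim, gt=0, k=2)
print(targets)
```

In this toy case item 2 falls inside the top-2 neighborhood of the ground truth but does not list the ground truth among its own top-2 neighbors, so it receives no relevance mass; only item 1 does.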
A Tutorial on Holographic MIMO Communications--Part I: Channel Modeling and Channel Estimation
Authors: Jiancheng An, Chau Yuen, Chongwen Huang, Merouane Debbah, H. Vincent Poor, Lajos Hanzo
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
By integrating a nearly infinite number of reconfigurable elements into a finite space, a spatially continuous array aperture is formed for holographic multiple-input multiple-output (HMIMO) communications. This three-part tutorial aims to provide an overview of the latest advances in HMIMO communications. As Part I of the tutorial, this letter first introduces the fundamental concept of HMIMO and reviews the recent progress in HMIMO channel modeling, followed by a suite of efficient channel estimation approaches. Finally, numerical results are provided to demonstrate the statistical consistency of the advocated HMIMO channel model with conventional ones and to evaluate the performance of the channel estimators. Parts II and III of the tutorial will delve into the performance analysis and holographic beamforming, and detail the interplay of HMIMO with emerging technologies.
TransWorldNG: Traffic Simulation via Foundation Model
Authors: Ding Wang, Xuhong Wang, Liang Chen, Shengyue Yao, Ming Jing, Honghai Li, Li Li, Shiqiang Bao, Fei-Yue Wang, Yilun Lin
Abstract
Traffic simulation is a crucial tool for transportation decision-making and policy development. However, achieving realistic simulations in the face of the high dimensionality and heterogeneity of traffic environments is a longstanding challenge. In this paper, we present TransWorldNG, a traffic simulator that uses Data-driven algorithms and Graph Computing techniques to learn traffic dynamics from real data. The functionality and structure of TransWorldNG are introduced, which utilize a foundation model for transportation management and control. The results demonstrate that TransWorldNG can generate more realistic traffic patterns compared to traditional simulators. Additionally, TransWorldNG exhibits better scalability, as it shows linear growth in computation time as the scenario scale increases. To the best of our knowledge, this is the first traffic simulator that can automatically learn traffic patterns from real-world data and efficiently generate accurate and realistic traffic environments.
Robust Ante-hoc Graph Explainer using Bilevel Optimization
Authors: Mert Kosan, Arlei Silva, Ambuj Singh
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract
Explaining the decisions made by machine learning models for high-stakes applications is critical for increasing transparency and guiding improvements to these decisions. This is particularly true in the case of models for graphs, where decisions often depend on complex patterns combining rich structural and attribute data. While recent work has focused on designing so-called post-hoc explainers, the question of what constitutes a good explanation remains open. One intuitive property is that explanations should be sufficiently informative to enable humans to approximately reproduce the predictions given the data. However, we show that post-hoc explanations do not achieve this goal as their explanations are highly dependent on fixed model parameters (e.g., learned GNN weights). To address this challenge, this paper proposes RAGE (Robust Ante-hoc Graph Explainer), a novel and flexible ante-hoc explainer designed to discover explanations for a broad class of graph neural networks using bilevel optimization. RAGE is able to efficiently identify explanations that contain the full information needed for prediction while still enabling humans to rank these explanations based on their influence. Our experiments, based on graph classification and regression, show that RAGE explanations are more robust than existing post-hoc and ante-hoc approaches and often achieve similar or better accuracy than state-of-the-art models.
T2TD: Text-3D Generation Model based on Prior Knowledge Guidance
Authors: Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, Nicu Sebe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, 3D models have been utilized in many applications, such as auto-driver, 3D reconstruction, VR, and AR. However, the scarcity of 3D model data does not meet its practical demands. Thus, generating high-quality 3D models efficiently from textual descriptions is a promising but challenging way to solve this problem. In this paper, inspired by the ability of human beings to complement visual information details from ambiguous descriptions based on their own experience, we propose a novel text-3D generation model (T2TD), which introduces the related shapes or textual information as the prior knowledge to improve the performance of the 3D generation model. In this process, we first introduce the text-3D knowledge graph to save the relationship between 3D models and textual semantic information, which can provide the related shapes to guide the target 3D model generation. Second, we integrate an effective causal inference model to select useful feature information from these related shapes, which removes the unrelated shape information and only maintains feature information that is strongly relevant to the textual description. Meanwhile, to effectively integrate multi-modal prior knowledge into textual information, we adopt a novel multi-layer transformer structure to progressively fuse related shape and textual information, which can effectively compensate for the lack of structural information in the text and enhance the final performance of the 3D generation model. The final experimental results demonstrate that our approach significantly improves 3D model generation quality and outperforms the SOTA methods on the text2shape datasets.
High-Similarity-Pass Attention for Single Image Super-Resolution
Authors: Jian-Nan Su, Min Gan, Guang-Yong Chen, Wenzhong Guo, C. L. Philip Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent developments in the field of non-local attention (NLA) have led to a renewed interest in self-similarity-based single image super-resolution (SISR). Researchers usually used the NLA to explore non-local self-similarity (NSS) in SISR and achieve satisfactory reconstruction results. However, a surprising phenomenon that the reconstruction performance of the standard NLA is similar to the NLA with randomly selected regions stimulated our interest to revisit NLA. In this paper, we first analyzed the attention map of the standard NLA from different perspectives and discovered that the resulting probability distribution always has full support for every local feature, which implies a statistical waste of assigning values to irrelevant non-local features, especially for SISR which needs to model long-range dependence with a large number of redundant non-local features. Based on these findings, we introduced a concise yet effective soft thresholding operation to obtain high-similarity-pass attention (HSPA), which is beneficial for generating a more compact and interpretable distribution. Furthermore, we derived some key properties of the soft thresholding operation that enable training our HSPA in an end-to-end manner. The HSPA can be integrated into existing deep SISR models as an efficient general building block. In addition, to demonstrate the effectiveness of the HSPA, we constructed a deep high-similarity-pass attention network (HSPAN) by integrating a few HSPAs in a simple backbone. Extensive experimental results demonstrate that HSPAN outperforms state-of-the-art approaches on both quantitative and qualitative evaluations.
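The contrast between a full-support attention distribution and a soft-thresholded one can be sketched in a few lines. The example below uses sparsemax, a well-known soft-thresholding projection onto the probability simplex; treating it as a stand-in for the HSPA operation is an assumption, since the abstract does not spell out the exact thresholding rule. The score vector `z` is a made-up example.

```python
import numpy as np

def softmax(z):
    """Standard softmax: every entry receives strictly positive probability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Soft thresholding onto the simplex: max(z - tau, 0), with tau chosen
    so that the result sums to 1. Low-similarity entries become exactly 0."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * z_sorted > cumsum   # which sorted entries stay non-zero
    k = ks[support][-1]                    # size of the support
    tau = (cumsum[k - 1] - 1) / k          # threshold
    return np.maximum(z - tau, 0.0)

z = np.array([1.0, 0.8, -1.0])   # similarity scores of one query to three keys
dense = softmax(z)
sparse = sparsemax(z)
print(dense)
print(sparse)
```

Softmax assigns some weight to the clearly irrelevant third key, while the thresholded version removes it entirely and redistributes the mass over the two high-similarity keys.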
Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification
Authors: Saisai Ding, Juncheng Li, Jun Wang, Shihui Ying, Jun Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The multi-scale information among the whole slide images (WSIs) is essential for cancer diagnosis. Although the existing multi-scale vision Transformer has shown its effectiveness for learning multi-scale image representation, it still cannot work well on the gigapixel WSIs due to their extremely large image sizes. To this end, we propose a novel Multi-scale Efficient Graph-Transformer (MEGT) framework for WSI classification. The key idea of MEGT is to adopt two independent Efficient Graph-based Transformer (EGT) branches to process the low-resolution and high-resolution patch embeddings (i.e., tokens in a Transformer) of WSIs, respectively, and then fuse these tokens via a multi-scale feature fusion module (MFFM). Specifically, we design an EGT to efficiently learn the local-global information of patch tokens, which integrates the graph representation into Transformer to capture spatial-related information of WSIs. Meanwhile, we propose a novel MFFM to alleviate the semantic gap among different resolution patches during feature fusion, which creates a non-patch token for each branch as an agent to exchange information with another branch by cross-attention. In addition, to expedite network training, a novel token pruning module is developed in EGT to reduce the redundant tokens. Extensive experiments on TCGA-RCC and CAMELYON16 datasets demonstrate the effectiveness of the proposed MEGT.
Abstract
Weakly supervised learning aims to empower machine learning when the perfect supervision is unavailable, which has drawn great attention from researchers. Among various types of weak supervision, one of the most challenging cases is to learn from multiple unlabeled (U) datasets with only a little knowledge of the class priors, or U$^m$ learning for short. In this paper, we study the problem of building an AUC (area under ROC curve) optimization model from multiple unlabeled datasets, which maximizes the pairwise ranking ability of the classifier. We propose U$^m$-AUC, an AUC optimization approach that converts the U$^m$ data into a multi-label AUC optimization problem, and can be trained efficiently. We show that the proposed U$^m$-AUC is effective theoretically and empirically.
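As a reminder of what "pairwise ranking ability" means, AUC equals the fraction of (positive, negative) pairs that the classifier orders correctly, with ties counted as one half. The sketch below computes this quantity in the ordinary fully labeled setting; it does not implement the U$^m$-AUC method, and the function name and sample data are illustrative only.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative score gaps
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```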
Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents
Abstract
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative On-call system empowered by the Large Language Model for automating RCA of cloud incidents. RCACopilot matches incoming incidents to corresponding handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from serviceX in companyX. Our evaluation demonstrates that RCACopilot achieves RCA accuracy up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been successfully in use at companyX for over four years.
On Architectural Compression of Text-to-Image Diffusion Models
Authors: Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
Abstract
Exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To resolve this issue, recent research on efficient SDMs has prioritized reducing the number of sampling steps and utilizing network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of parameters, MACs per sampling step, and latency. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge and achieve competitive results against larger multi-billion parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models in personalized generation with DreamBooth finetuning.
Lucy-SKG: Learning to Play Rocket League Efficiently Using Deep Reinforcement Learning
Authors: Vasileios Moschopoulos, Pantelis Kyriakidis, Aristotelis Lazaridis, Ioannis Vlahavas
Abstract
A successful tactic that is followed by the scientific community for advancing AI is to treat games as problems, which has been proven to lead to various breakthroughs. We adapt this strategy in order to study Rocket League, a widely popular but rather under-explored 3D multiplayer video game with a distinct physics engine and complex dynamics that pose a significant challenge in developing efficient and high-performance game-playing agents. In this paper, we present Lucy-SKG, a Reinforcement Learning-based model that learned how to play Rocket League in a sample-efficient manner, outperforming by a notable margin the two highest-ranking bots in this game, namely Necto (2022 bot champion) and its successor Nexto, thus becoming a state-of-the-art agent. Our contributions include: a) the development of a reward analysis and visualization library, b) novel parameterizable reward shape functions that capture the utility of complex reward types via our proposed Kinesthetic Reward Combination (KRC) technique, and c) design of auxiliary neural architectures for training on reward prediction and state representation tasks in an on-policy fashion for enhanced efficiency in learning speed and performance. By performing thorough ablation studies for each component of Lucy-SKG, we showed their independent effectiveness in overall performance. In doing so, we demonstrate the prospects and challenges of using sample-efficient Reinforcement Learning techniques for controlling complex dynamical systems under competitive team-based multiplayer conditions.
A Burton-Miller-type boundary element method based on a hybrid integral representation and its application to cavity scattering
Abstract
This study builds on a recent paper by Lai et al [Appl. Comput. Harmon. Anal., 2018] in which a novel boundary integral formulation is presented for scalar wave scattering analysis in two-dimensional layered and half-spaces. The seminal paper proposes a hybrid integral representation that combines the Sommerfeld integral and layer potential to efficiently deal with the boundaries of infinite length. In this work, we modify the integral formulation to eliminate the fictitious eigenvalues by employing Burton-Miller's approach. We also discuss reasonable parameter settings for the hybrid integral equation to ensure efficient and accurate numerical solutions. Furthermore, we extend the modified formulation for the scattering from a cavity in a half-space whose boundary is locally perturbed. To address the cavity scattering, we introduce a virtual boundary enclosing the cavity and couple the integral equation on it with the hybrid equation. The effectiveness of the proposed method is demonstrated through numerical examples.
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Authors: Nicola Messina, Jan Sedmidubsky, Fabrizio Falchi, Tomáš Rebok
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
Flexible Spectrum Orchestration of Carrier Aggregation for 5G-Advanced
Authors: Xianghui Han, Chunli Liang, Ruiqi Liu, Xingguang Wei, Mengzhu Chen, Yu-Ngok Ruyue Li, Shi Jin
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP); Systems and Control (eess.SY)
Abstract
With increasing availability of spectrum in the market due to new spectrum allocation and re-farming bands from previous cellular generation networks, a more flexible, efficient and green usage of the spectrum becomes an important topic in 5G-Advanced. In this article, we provide an overview on the 3rd Generation Partnership Project (3GPP) work on flexible spectrum orchestration for carrier aggregation (CA). The configuration settings, requirements and potential specification impacts are analyzed. Some involved Release 18 techniques, such as multi-cell scheduling, transmitter switching and network energy saving, are also presented. Evaluation results show that clear performance gain can be achieved by these techniques.
MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation
Authors: Sebastian Vincent, Robert Flynn, Carolina Scarton
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Efficient utilisation of both intra- and extra-textual context remains one of the critical gaps between machine and human translation. Existing research has primarily focused on providing individual, well-defined types of context in translation, such as the surrounding text or discrete external variables like the speaker's gender. This work introduces MTCue, a novel neural machine translation (NMT) framework that interprets all context (including discrete variables) as text. MTCue learns an abstract representation of context, enabling transferability across different data settings and leveraging similar attributes in low-resource scenarios. With a focus on a dialogue domain with access to document and metadata context, we extensively evaluate MTCue in four language pairs in both translation directions. Our framework demonstrates significant improvements in translation quality over a parameter-matched non-contextual baseline, as measured by BLEU (+0.88) and Comet (+1.58). Moreover, MTCue significantly outperforms a "tagging" baseline at translating English text. Analysis reveals that the context encoder of MTCue learns a representation space that organises context based on specific attributes, such as formality, enabling effective zero-shot control. Pre-training on context embeddings also improves MTCue's few-shot performance compared to the "tagging" baseline. Finally, an ablation study conducted on model components and contextual variables further supports the robustness of MTCue for context-based NMT.
MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization
Abstract
Memes are a powerful tool for communication over social media. Their affinity for evolving across politics, history, and sociocultural phenomena makes them an ideal communication vehicle. To comprehend the subtle message conveyed within a meme, one must understand the background that facilitates its holistic assimilation. Besides digital archiving of memes and their metadata by a few websites like knowyourmeme.com, currently, there is no efficient way to deduce a meme's context dynamically. In this work, we propose a novel task, MEMEX: given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme. First, we develop MCC (Meme Context Corpus), a novel dataset for MEMEX. Further, to benchmark MCC, we propose MIME (MultImodal Meme Explainer), a multimodal neural framework that uses common sense enriched meme representation and a layered approach to capture the cross-modal semantic dependencies between the meme and the context. MIME surpasses several unimodal and multimodal systems and yields an absolute improvement of ~4% F1-score over the best baseline. Lastly, we conduct detailed analyses of MIME's performance, highlighting the aspects that could lead to optimal modeling of cross-modal contextual associations.
Sample and Predict Your Latent: Modality-free Sequential Disentanglement via Contrastive Estimation
Abstract
Unsupervised disentanglement is a long-standing challenge in representation learning. Recently, self-supervised techniques achieved impressive results in the sequential setting, where data is time-dependent. However, the latter methods employ modality-based data augmentations and random sampling or solve auxiliary tasks. In this work, we propose to avoid that by generating, sampling, and comparing empirical distributions from the underlying variational model. Unlike existing work, we introduce a self-supervised sequential disentanglement framework based on contrastive estimation with no external signals, while using common batch sizes and samples from the latent space itself. In practice, we propose a unified, efficient, and easy-to-code sampling strategy for semantically similar and dissimilar views of the data. We evaluate our approach on video, audio, and time series benchmarks. Our method presents state-of-the-art results in comparison to existing techniques. The code is available at https://github.com/azencot-group/SPYL.
Mask Attack Detection Using Vascular-weighted Motion-robust rPPG Signals
Abstract
Detecting 3D mask attacks to a face recognition system is challenging. Although genuine faces and 3D face masks show significantly different remote photoplethysmography (rPPG) signals, rPPG-based face anti-spoofing methods often suffer from performance degradation due to unstable face alignment in the video sequence and weak rPPG signals. To enhance the rPPG signal in a motion-robust way, a landmark-anchored face stitching method is proposed to align the faces robustly and precisely at the pixel-wise level by using both SIFT keypoints and facial landmarks. To better encode the rPPG signal, a weighted spatial-temporal representation is proposed, which emphasizes the face regions with rich blood vessels. In addition, characteristics of rPPG signals in different color spaces are jointly utilized. To improve the generalization capability, a lightweight EfficientNet with a Gated Recurrent Unit (GRU) is designed to extract both spatial and temporal features from the rPPG spatial-temporal representation for classification. The proposed method is compared with the state-of-the-art methods on five benchmark datasets under both intra-dataset and cross-dataset evaluations. The proposed method shows a significant and consistent improvement in performance over other state-of-the-art rPPG-based methods for face spoofing detection.
How to Turn Your Knowledge Graph Embeddings into Generative Models via Probabilistic Circuits
Authors: Lorenzo Loconte, Nicola Di Mauro, Robert Peharz, Antonio Vergari
Abstract
Some of the most successful knowledge graph embedding (KGE) models for link prediction -- CP, RESCAL, TuckER, ComplEx -- can be interpreted as energy-based models. Under this perspective they are not amenable to exact maximum-likelihood estimation (MLE) or sampling, and they struggle to integrate logical constraints. This work re-interprets the score functions of these KGEs as circuits -- constrained computational graphs allowing efficient marginalisation. Then, we design two recipes to obtain efficient generative circuit models by either restricting their activations to be non-negative or squaring their outputs. Our interpretation comes with little or no loss of performance for link prediction, while the circuits framework unlocks exact learning by MLE, efficient sampling of new triples, and guarantees that logical constraints are satisfied by design. Furthermore, our models scale more gracefully than the original KGEs on graphs with millions of entities.
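Why non-negative activations make marginalisation efficient can be seen on a CP-style score. With non-negative embeddings the score of a triple is itself an unnormalized probability, and the normalizing constant factorizes into per-mode sums instead of requiring a loop over all triples. The sketch below is a minimal illustration under that assumption, not the authors' implementation; the embedding tables, sizes, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, rank = 5, 3, 4

# Non-negative embeddings (uniform in [0, 1)) for subjects, relations, objects.
E_subj = rng.random((n_entities, rank))
R = rng.random((n_relations, rank))
E_obj = rng.random((n_entities, rank))

def score(s, r, o):
    """CP score: sum over rank of the elementwise product of the three embeddings."""
    return float(np.sum(E_subj[s] * R[r] * E_obj[o]))

def partition_function():
    """Sum of scores over ALL triples, computed without enumerating them:
    the sum over (s, r, o) pushes inside the product for each rank component."""
    return float(np.sum(E_subj.sum(axis=0) * R.sum(axis=0) * E_obj.sum(axis=0)))

def brute_force_partition():
    """Reference value obtained by enumerating every triple."""
    return sum(score(s, r, o)
               for s in range(n_entities)
               for r in range(n_relations)
               for o in range(n_entities))

def prob(s, r, o):
    """Exact probability of a triple under the normalized model."""
    return score(s, r, o) / partition_function()

print(partition_function(), brute_force_partition())
print(prob(0, 1, 2))
```

The factorized computation costs time linear in the number of entities and relations, whereas the brute-force sum grows with the number of possible triples.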
Online learning of long-range dependencies
Authors: Nicolas Zucchet, Robert Meier, Simon Schug, Asier Mujika, João Sacramento
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract
Online learning holds the promise of enabling efficient long-term credit assignment in recurrent neural networks. However, current algorithms fall short of offline backpropagation by either not being scalable or failing to learn long-range dependencies. Here we present a high-performance online learning algorithm that merely doubles the memory and computational requirements of a single inference pass. We achieve this by leveraging independent recurrent modules in multi-layer networks, an architectural motif that has recently been shown to be particularly powerful. Experiments on synthetic memory problems and on the challenging long-range arena benchmark suite reveal that our algorithm performs competitively, establishing a new standard for what can be achieved through online learning. This ability to learn long-range dependencies offers a new perspective on learning in the brain and opens a promising avenue in neuromorphic computing.
Online and Streaming Algorithms for Constrained $k$-Submodular Maximization
Authors: Fabian Spaeh, Alina Ene, Huy L. Nguyen
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Abstract
Constrained $k$-submodular maximization is a general framework that captures many discrete optimization problems such as ad allocation, influence maximization, personalized recommendation, and many others. In many of these applications, datasets are large or decisions need to be made in an online manner, which motivates the development of efficient streaming and online algorithms. In this work, we develop single-pass streaming and online algorithms for constrained $k$-submodular maximization with both monotone and general (possibly non-monotone) objectives subject to cardinality and knapsack constraints. Our algorithms achieve provable constant-factor approximation guarantees which improve upon the state of the art in almost all settings. Moreover, they are combinatorial and very efficient, and have optimal space and running time. We experimentally evaluate our algorithms on instances for ad allocation and other applications, where we observe that our algorithms are efficient and scalable, and construct solutions that are comparable in value to offline greedy algorithms.
GenerateCT: Text-Guided 3D Chest CT Generation
Authors: Ibrahim Ethem Hamamci, Sezgin Er, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Furkan Almas, Sevval Nil Esirgun, Hadrien Reynaud, Sarthak Pati, Christian Bluethgen, Bjoern Menze
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Generative modeling has experienced substantial progress in recent years, particularly in text-to-image and text-to-video synthesis. However, the medical field has not yet fully exploited the potential of large-scale foundational models for synthetic data generation. In this paper, we introduce GenerateCT, the first method for text-conditional computed tomography (CT) generation, addressing the limitations in 3D medical imaging research and making our entire framework open-source. GenerateCT consists of a pre-trained large language model, a transformer-based text-conditional 3D chest CT generation architecture, and a text-conditional spatial super-resolution diffusion model. We also propose CT-ViT, which efficiently compresses CT volumes while preserving auto-regressiveness in-depth, enabling the generation of 3D CT volumes with variable numbers of axial slices. Our experiments demonstrate that GenerateCT can produce realistic, high-resolution, and high-fidelity 3D chest CT volumes consistent with medical language text prompts. We further investigate the potential of GenerateCT by training a model using generated CT volumes for multi-abnormality classification of chest CT volumes. Our contributions provide a valuable foundation for future research in text-conditional 3D medical image generation and have the potential to accelerate advancements in medical imaging research. Our code, pre-trained models, and generated data are available at https://github.com/ibrahimethemhamamci/GenerateCT.
Local Randomized Neural Networks with Discontinuous Galerkin Methods for Diffusive-Viscous Wave Equation
Abstract
The diffusive-viscous wave equation is an advancement in wave equation theory, as it accounts for both diffusion and viscosity effects. It has a wide range of applications in geophysics, such as the attenuation of seismic waves in fluid-saturated solids and frequency-dependent phenomena in porous media. Therefore, the development of an efficient numerical method for the equation is of both theoretical and practical importance. Recently, local randomized neural networks with discontinuous Galerkin (LRNN-DG) methods have been introduced in \cite{Sun2022lrnndg} to solve elliptic and parabolic equations. Numerical examples suggest that LRNN-DG can achieve high accuracy, and can handle time-dependent problems naturally and efficiently by using a space-time framework. In this paper, we develop LRNN-DG methods for solving the diffusive-viscous wave equation and present numerical experiments with several cases. The numerical results show that the proposed methods can solve the diffusive-viscous wave equation more accurately and at lower computational cost than traditional methods.
Leveraging Human Feedback to Evolve and Discover Novel Emergent Behaviors in Robot Swarms
Authors: Connor Mattson, Daniel S. Brown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract
Robot swarms often exhibit emergent behaviors that are fascinating to observe; however, it is often difficult to predict what swarm behaviors can emerge under a given set of agent capabilities. We seek to efficiently leverage human input to automatically discover a taxonomy of collective behaviors that can emerge from a particular multi-agent system, without requiring the human to know beforehand what behaviors are interesting or even possible. Our proposed approach adapts to user preferences by learning a similarity space over swarm collective behaviors using self-supervised learning and human-in-the-loop queries. We combine our learned similarity metric with novelty search and clustering to explore and categorize the space of possible swarm behaviors. We also propose several general-purpose heuristics that improve the efficiency of our novelty search by prioritizing robot controllers that are likely to lead to interesting emergent behaviors. We test our approach in simulation on two robot capability models and show that our methods consistently discover a richer set of emergent behaviors than prior work. Code, videos, and datasets are available at https://sites.google.com/view/evolving-novel-swarms.
Understanding the Capabilities of Large Language Models for Automated Planning
Authors: Vishal Pallagani, Bharath Muppasani, Keerthiram Murugesan, Francesca Rossi, Biplav Srivastava, Lior Horesh, Francesco Fabiano, Andrea Loreggia
Abstract
Automated planning is concerned with developing efficient algorithms to generate plans or sequences of actions to achieve a specific goal in a given environment. Emerging Large Language Models (LLMs) can answer questions, write high-quality programming code, and predict protein folding, showcasing their versatility in solving various tasks beyond language-based problems. In this paper, we aim to explore how LLMs can also be used for automated planning. To do so, we seek to answer four key questions. Firstly, we want to understand the extent to which LLMs can be used for plan generation. Secondly, we aim to identify which pre-training data is most effective in facilitating plan generation. Thirdly, we investigate whether fine-tuning or prompting is a more effective approach for plan generation. Finally, we explore whether LLMs are capable of plan generalization. By answering these questions, the study seeks to shed light on the capabilities of LLMs in solving complex planning problems and provide insights into the most effective approaches for using LLMs in this context.
A New Era of Mobility: Exploring Digital Twin Applications in Autonomous Vehicular Systems
Authors: S M Mostaq Hossain, Sohag Kumar Saha, Shampa Banik, Trapa Banik
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Abstract
Digital Twins (DTs) are virtual representations of physical objects or processes that can collect information from the real environment to represent, validate, and replicate the physical twin's present and future behavior. DTs are becoming increasingly prevalent in a variety of fields, including manufacturing, automobiles, medicine, smart cities, and other related areas. In this paper, we present a systematic review of DTs in the autonomous vehicular industry. We address DTs and their essential characteristics, emphasizing accurate data collection, real-time analytics, and efficient simulation capabilities, while highlighting their role in enhancing performance and reliability. Next, we explore the technical challenges and central technologies of DTs. We present a comparative analysis of different methodologies that have been used for autonomous vehicles in smart cities. Finally, we address the application challenges and limitations of DTs in the autonomous vehicular industry.
On Computing Universal Plans for Partially Observable Multi-Agent Path Finding
Authors: Fengming Zhu, Fangzhen Lin
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Abstract
Multi-agent routing problems have drawn significant attention nowadays due to their broad industrial applications in, e.g., warehouse robots, logistics automation, and traffic control. Conventionally, they are modelled as classical planning problems. In this paper, we argue that it is beneficial to formulate them as universal planning problems. We therefore propose universal plans, also known as policies, as the solution concepts, and implement a system called ASP-MAUPF (Answer Set Programming for Multi-Agent Universal Plan Finding) for computing them. Given an arbitrary two-dimensional map and a profile of goals for the agents, the system finds a feasible universal plan for each agent that ensures no collision with others. We use the system to conduct some experiments, and make some observations on the types of goal profiles and environments that will have feasible policies, and how they may depend on agents' sensors. We also demonstrate how users can customize action preferences to compute more efficient policies, even (near-)optimal ones.
C-MCTS: Safe Planning with Monte Carlo Tree Search
Authors: Dinesh Parthasarathy, Georgios Kontes, Axel Plinge, Christopher Mutschler
Abstract
Many real-world decision-making tasks, such as safety-critical scenarios, cannot be fully described in a single-objective setting using the Markov Decision Process (MDP) framework, as they include hard constraints. These can instead be modeled with additional cost functions within the Constrained Markov Decision Process (CMDP) framework. Even though CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches use Monte Carlo cost estimates to avoid constraint violations. However, these suffer from high variance which results in conservative performance with respect to costs. We propose Constrained MCTS (C-MCTS), an algorithm that estimates cost using a safety critic. The safety critic training is based on Temporal Difference learning in an offline phase prior to agent deployment. This critic limits the exploration of the search tree and removes unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards compared to previous work. As a nice byproduct, the planner is more efficient requiring fewer planning steps. Most importantly, we show that under model mismatch between the planner and the real world, our approach is less susceptible to cost violations than previous work.
Persistent Laplacian-enhanced Algorithm for Scarcely Labeled Data Classification
Abstract
The success of many machine learning (ML) methods depends crucially on having large amounts of labeled data. However, obtaining enough labeled data can be expensive, time-consuming, and subject to ethical constraints for many applications. One approach that has shown tremendous value in addressing this challenge is semi-supervised learning (SSL); this technique utilizes both labeled and unlabeled data during training, often with much less labeled data than unlabeled data, since the latter is often relatively easy and inexpensive to obtain. In fact, SSL methods are particularly useful in applications where labeling data is especially expensive, such as medical analysis, natural language processing (NLP), or speech recognition. A subset of SSL methods that have achieved great success in various domains involves algorithms that integrate graph-based techniques. These procedures are popular due to the vast amount of information provided by the graphical framework and the versatility of their applications. In this work, we propose an algebraic topology-based semi-supervised method called persistent Laplacian-enhanced graph MBO (PL-MBO) by integrating persistent spectral graph theory with the classical Merriman-Bence-Osher (MBO) scheme. Specifically, we use a filtration procedure to generate a sequence of chain complexes and associated families of simplicial complexes, from which we construct a family of persistent Laplacians. Overall, it is a very efficient procedure that requires much less labeled data to perform well compared to many ML techniques, and it can be adapted for both small and large datasets. We evaluate the performance of the proposed method on data classification, and the results indicate that the proposed technique outperforms other existing semi-supervised algorithms.
Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
Abstract
Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.
UDPM: Upsampling Diffusion Probabilistic Models
Authors: Shady Abu-Hussein, Raja Giryes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have attracted significant attention. By composing a Markovian process that starts in the data domain and then gradually adds noise until reaching pure white noise, they achieve superior performance in learning data distributions. Yet, these models require a large number of diffusion steps to produce aesthetically pleasing samples, which is inefficient. In addition, unlike common generative adversarial networks, the latent space of diffusion models is not interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we reduce the latent variable dimension in addition to the traditional noise level addition. As a result, we are able to sample images of size $256\times 256$ with only 7 diffusion steps, which is less than two orders of magnitude compared to standard DDPMs. We formally develop the Markovian diffusion processes of the UDPM, and demonstrate its generation capabilities on the popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable property of UDPM is that it is very easy to interpolate its latent space, which is not the case with standard diffusion models. Our code is available online at https://github.com/shadyabh/UDPM.
DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method
Authors: Ahmed Khaled, Konstantin Mishchenko, Chi Jin
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract
This paper proposes a new easy-to-implement parameter-free gradient-based optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is efficient -- matching the convergence rate of optimally tuned gradient descent in convex optimization up to a logarithmic factor without tuning any parameters, and universal -- automatically adapting to both smooth and nonsmooth problems. While popular algorithms such as AdaGrad, Adam, or DoG compute a running average of the squared gradients, DoWG maintains a new distance-based weighted version of the running average, which is crucial to achieve the desired properties. To our best knowledge, DoWG is the first parameter-free, efficient, and universal algorithm that does not require backtracking search procedures. It is also the first parameter-free AdaGrad style algorithm that adapts to smooth optimization. To complement our theory, we also show empirically that DoWG trains at the edge of stability, and validate its effectiveness on practical machine learning tasks. This paper further uncovers the underlying principle behind the success of the AdaGrad family of algorithms by presenting a novel analysis of Normalized Gradient Descent (NGD), that shows NGD adapts to smoothness when it exists, with no change to the stepsize. This establishes the universality of NGD and partially explains the empirical observation that it trains at the edge of stability in a much more general setup compared to standard gradient descent. The latter might be of independent interest to the community.
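As a small illustration of the Normalized Gradient Descent (NGD) idea mentioned in the abstract above, the following is a minimal sketch of the textbook NGD update, in which every step moves a fixed distance along the negative gradient direction. It is not the DoWG update and is not taken from the paper; the function name, the toy quadratic objective, and the step size are all assumptions chosen for illustration.

```python
import numpy as np


def normalized_gradient_descent(grad, x0, step, num_steps):
    """Run NGD: each step moves a fixed distance `step` along -grad / ||grad||."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_steps):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            # Exactly stationary: there is no direction to follow.
            break
        x = x - step * g / norm
    return x


# Hypothetical toy objective: f(x) = 0.5 * ||x - target||^2, so grad f(x) = x - target.
target = np.array([3.0, -4.0])
x_final = normalized_gradient_descent(
    lambda x: x - target, x0=np.zeros(2), step=0.1, num_steps=200
)
```

On this toy problem the iterate travels straight toward `target` and then oscillates within one step length of it, which is why plain NGD with a constant step is usually paired with a decaying or adaptive step size in practice.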
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Authors: Chia-Wen Kuo, Zsolt Kira
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous analyses to demonstrate the importance of each part of our design.
Fine-Grained Complexity Analysis of Multi-Agent Path Finding on 2D Grids
Abstract
Multi-Agent Path Finding (MAPF) is a fundamental motion coordination problem arising in multi-agent systems with a wide range of applications. The problem's intractability has led to extensive research on improving the scalability of solvers for it. Since optimal solvers can struggle to scale, a major challenge that arises is understanding what makes MAPF hard. We tackle this challenge through a fine-grained complexity analysis of time-optimal MAPF on 2D grids, thereby closing two gaps and identifying a new tractability frontier. First, we show that 2-colored MAPF, i.e., where the agents are divided into two teams, each with its own set of targets, remains NP-hard. Second, for the flowtime objective (also called sum-of-costs), we show that it remains NP-hard to find a solution in which agents have an individually optimal cost, which we call an individually optimal solution. The previously tightest results for these MAPF variants are for (non-grid) planar graphs. We use a single hardness construction that replaces, strengthens, and unifies previous proofs. We believe that it is also simpler than previous proofs for the planar case as it employs minimal gadgets that enable its full visualization in one figure. Finally, for the flowtime objective, we establish a tractability frontier based on the number of directions agents can move in. Namely, we complement our hardness result, which holds for three directions, with an efficient algorithm for finding an individually optimal solution if only two directions are allowed. This result sheds new light on the structure of optimal solutions, which may help guide algorithm design for the general problem.
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Authors: Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
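The two-stage scheme described in the abstract above (fast vector-distance pruning followed by a more expensive re-ranking of the survivors) can be sketched generically as follows. This is an illustrative outline only, not the authors' model: the function name, the cosine-similarity first stage, and the lookup-table `rerank_score` standing in for a dual-encoder scorer are all assumptions.

```python
import numpy as np


def two_stage_retrieve(query_vec, corpus_vecs, rerank_score, top_k):
    """Stage 1: keep the top_k candidates by cosine similarity.
    Stage 2: re-order only those candidates with a costlier scorer."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    shortlist = np.argsort(-sims)[:top_k]
    # rerank_score(i) is a placeholder for an expensive query-candidate model.
    return sorted(shortlist.tolist(), key=rerank_score, reverse=True)


# Hypothetical data: four precomputed candidate embeddings and one query embedding.
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
expensive_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.0}  # stand-in for stage-2 outputs

ranking = two_stage_retrieve(query, corpus, expensive_scores.get, top_k=2)
```

Here candidate 2 has the highest stage-2 score but is never scored, because stage 1 already pruned it. That is the trade-off of the design: the expensive model runs on `top_k` items instead of the whole corpus, at the risk of losing items the first stage ranks poorly.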
Keyword: faster
Exploring Automatically Perturbed Natural Language Explanations in Relation Extraction
Abstract
Previous research has demonstrated that natural language explanations provide valuable inductive biases that guide models, thereby improving the generalization ability and data efficiency. In this paper, we undertake a systematic examination of the effectiveness of these explanations. Remarkably, we find that corrupted explanations with diminished inductive biases can achieve competitive or superior performance compared to the original explanations. Our findings furnish novel insights into the characteristics of natural language explanations in the following ways: (1) the impact of explanations varies across different training styles and datasets, with previously believed improvements primarily observed in frozen language models. (2) While previous research has attributed the effect of explanations solely to their inductive biases, our study shows that the effect persists even when the explanations are completely corrupted. We propose that the main effect is due to the provision of additional context space. (3) Utilizing the proposed automatic perturbed context, we were able to attain comparable results to annotated explanations, but with a significant increase in computational efficiency, 20-30 times faster.
How to escape sharp minima
Authors: Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract
Modern machine learning applications have seen a remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this paradigm, this work formulates and studies the algorithmic question of how to find flat minima. As an initial effort, this work adopts the trace of the Hessian of the cost function as the measure of flatness, and formally defines the notion of approximate flat minima. Under this notion, we then design algorithms that find approximate flat minima efficiently. For general cost functions, we present a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.
PRIMP: PRobabilistically-Informed Motion Primitives for Efficient Affordance Learning from Demonstration
Abstract
This paper proposes a learning-from-demonstration method using probability densities on the workspaces of robot manipulators. The method, named "PRobabilistically-Informed Motion Primitives (PRIMP)", learns the probability distribution of the end effector trajectories in the 6D workspace that includes both positions and orientations. It is able to adapt to new situations such as novel via poses with uncertainty and a change of viewing frame. The method itself is robot-agnostic, in which the learned distribution can be transferred to another robot with the adaptation to its workspace density. The learned trajectory distribution is then used to guide an optimization-based motion planning algorithm to further help the robot avoid novel obstacles that are unseen during the demonstration process. The proposed methods are evaluated by several sets of benchmark experiments. PRIMP runs more than 5 times faster while generalizing trajectories more than twice as close to both the demonstrations and novel desired poses. It is then combined with our robot imagination method that learns object affordances, illustrating the applicability of PRIMP to learn tool use through physical experiments.
Extracting Text Representations for Terms and Phrases in Technical Domains
Authors: Francesco Fusco, Diego Antognini
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly-technical fields. Dense representations are used as features for downstream components and have multiple applications ranging from ranking results in search to summarization. Common approaches to create dense representations include training domain-specific embeddings with self-supervised setups or using sentence encoder models trained over similarity tasks. In contrast to static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach can not only match the quality of sentence encoders in technical domains, but are 5 times smaller and up to 10 times faster, even on high-end GPUs.
Neural Characteristic Activation Value Analysis for Improved ReLU Network Feature Learning
Abstract
We examine the characteristic activation values of individual ReLU units in neural networks. We refer to the corresponding set for such characteristic activation values in the input space as the characteristic activation set of a ReLU unit. We draw an explicit connection between the characteristic activation set and learned features in ReLU networks. This connection leads to new insights into why various neural network normalization techniques used in modern deep learning architectures regularize and stabilize SGD optimization. Utilizing these insights, we propose a geometric approach to parameterize ReLU networks for improved feature learning. We empirically verify its usefulness with less carefully chosen initialization schemes and larger learning rates. We report improved optimization stability, faster convergence speed, and better generalization performance.
A Fast Algorithm for Consistency Checking Partially Ordered Time
Abstract
Partially ordered models of time occur naturally in applications where agents or processes cannot perfectly communicate with each other, and can be traced back to the seminal work of Lamport. In this paper we consider the problem of deciding if a (likely incomplete) description of a system of events is consistent, the network consistency problem for the point algebra of partially ordered time (POT). While the classical complexity of this problem has been fully settled, comparably little is known of the fine-grained complexity of POT except that it can be solved in $O^*((0.368n)^n)$ time by enumerating ordered partitions. We construct a much faster algorithm with a run-time bounded by $O^*((0.26n)^n)$. This is achieved by a sophisticated enumeration of structures similar to total orders, which are then greedily expanded toward a solution. While similar ideas have been explored earlier for related problems, it turns out that the analysis for POT is non-trivial and requires significant new ideas.
Improved Algorithms for Allen's Interval Algebra by Dynamic Programming with Sublinear Partitioning
Abstract
Allen's interval algebra is one of the most well-known calculi in qualitative temporal reasoning with numerous applications in artificial intelligence. Recently, there has been a surge of improvements in the fine-grained complexity of NP-hard reasoning tasks, improving the running time from the naive $2^{O(n^2)}$ to $O^*((1.0615n)^{n})$, with even faster algorithms for unit intervals and a bounded number of overlapping intervals (the $O^*(\cdot)$ notation suppresses polynomial factors). Despite these improvements, the best known lower bound is still only $2^{o(n)}$ (under the exponential-time hypothesis) and major improvements in either direction seemingly require fundamental advances in computational complexity. In this paper we propose a novel framework for solving NP-hard qualitative reasoning problems which we refer to as dynamic programming with sublinear partitioning. Using this technique we obtain a major improvement of $O^*((\frac{cn}{\log{n}})^{n})$ for Allen's interval algebra. To demonstrate that the technique is applicable to more domains we apply it to a problem in qualitative spatial reasoning, the cardinal direction point algebra, and solve it in $O^*((\frac{cn}{\log{n}})^{2n/3})$ time. Hence, not only do we significantly advance the state-of-the-art for NP-hard qualitative reasoning problems, but we also obtain a novel algorithmic technique that is likely applicable to many problems where $2^{O(n)}$ time algorithms are unlikely.
Dynamic Inter-treatment Information Sharing for Heterogeneous Treatment Effects Estimation
Authors: Vinod Kumar Chauhan, Jiandong Zhou, Soheila Molaei, Ghadeer Ghosheh, David A. Clifton
Abstract
Existing heterogeneous treatment effects learners, also known as conditional average treatment effects (CATE) learners, lack a general mechanism for end-to-end inter-treatment information sharing, and data have to be split among potential outcome functions to train CATE learners, which can lead to biased estimates with limited observational datasets. To address this issue, we propose a novel deep learning-based framework to train CATE learners that facilitates dynamic end-to-end information sharing among treatment groups. The framework is based on soft weight sharing of hypernetworks, which offers advantages such as parameter efficiency, faster training, and improved results. The proposed framework complements existing CATE learners and introduces a new class of uncertainty-aware CATE learners that we refer to as HyperCATE. We develop HyperCATE versions of commonly used CATE learners and evaluate them on IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework improves the CATE estimation error via counterfactual inference, with increasing effectiveness for smaller datasets.
Abstract
In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate some toy sources, demonstrating that even if modern neural networks considerably enhance the compression performance of SQ with nonlinear transform, there is still an insurmountable chasm between SQ and VQ. Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC
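To make the phrase "multi-stage quantization" in the abstract above more concrete, here is a minimal sketch of one common reading of it: residual vector quantization, where each stage quantizes what the previous stages left over. Whether NVTC uses exactly this scheme is an assumption; the codebooks, function names, and input vector below are hypothetical and hand-picked for illustration.

```python
import numpy as np


def nearest_codeword(x, codebook):
    """Return the index of the codeword closest to x in squared Euclidean distance."""
    dists = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(dists))


def multistage_vq(x, codebooks):
    """Quantize x stage by stage; each stage encodes the residual of the previous ones."""
    residual = np.asarray(x, dtype=float)
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        idx = nearest_codeword(residual, cb)
        indices.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]
    return indices, recon


# Hypothetical codebooks: a coarse stage and a 4x finer stage with the same layout.
base = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
codebooks = [base, 0.25 * base]

indices, recon = multistage_vq([0.8, 1.3], codebooks)
```

Two stages of 5 codewords each address 25 reconstruction points while each lookup searches only 5 entries, which hints at how staging can tame the complexity growth of a single large vector codebook.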
Gaussian Processes with State-Dependent Noise for Stochastic Control
Authors: Marcel Menner, Karl Berntorp
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Abstract
This paper considers a stochastic control framework, in which the residual model uncertainty of the dynamical system is learned using a Gaussian Process (GP). In the proposed formulation, the residual model uncertainty consists of a nonlinear function and state-dependent noise. The proposed formulation uses a posterior-GP to approximate the residual model uncertainty and a prior-GP to account for state-dependent noise. The two GPs are interdependent and are thus learned jointly using an iterative algorithm. Theoretical properties of the iterative algorithm are established. Advantages of the proposed state-dependent formulation include (i) faster convergence of the GP estimate to the unknown function as the GP learns which data samples are more trustworthy and (ii) an accurate estimate of state-dependent noise, which can, e.g., be useful for a controller or decision-maker to determine the uncertainty of an action. Simulation studies highlight these two advantages.
Distributed TD(0) with Almost No Communication
Authors: Rui Liu, Alex Olshevsky
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract
We provide a new non-asymptotic analysis of distributed temporal difference learning with linear function approximation. Our approach relies on ``one-shot averaging,'' where $N$ agents run identical local copies of the TD(0) method and average the outcomes only once at the very end. We demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). This is the first result proving benefits from parallelism for temporal difference methods.
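The structure of one-shot averaging can be sketched on a toy problem. The example below is an illustration only, not the paper's setting or analysis: the 2-state chain, one-hot features, step size, and function names are all assumptions. Each agent runs TD(0) independently with its own random seed, and the parameter vectors are averaged once at the end.

```python
import random


def td0(seed, steps=5000, alpha=0.05, gamma=0.9):
    """TD(0) with linear function approximation on a toy 2-state chain."""
    rng = random.Random(seed)
    features = [(1.0, 0.0), (0.0, 1.0)]  # hypothetical one-hot features
    rewards = [1.0, 0.0]                 # reward for leaving each state
    theta = [0.0, 0.0]
    s = 0
    for _ in range(steps):
        s_next = rng.randrange(2)        # uniform random transitions
        v = sum(t * f for t, f in zip(theta, features[s]))
        v_next = sum(t * f for t, f in zip(theta, features[s_next]))
        delta = rewards[s] + gamma * v_next - v  # TD error
        theta = [t + alpha * delta * f for t, f in zip(theta, features[s])]
        s = s_next
    return theta


def one_shot_average(num_agents):
    """Run independent local copies, then average only once at the end."""
    runs = [td0(seed) for seed in range(num_agents)]
    return [sum(column) / num_agents for column in zip(*runs)]


estimate = one_shot_average(8)
print(estimate)  # true values for this toy chain are V = (5.5, 4.5)
```

No communication happens during the runs; the only exchange is the final average, which is the property the title refers to. The toy run does not by itself demonstrate the linear speedup proved in the paper.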
Voyager: An Open-Ended Embodied Agent with Large Language Models
Abstract
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at https://voyager.minedojo.org/.
Keyword: mobile
Drivers of Mobile Payment Acceptance: The Impact of Network Externalities
Authors: Qasim Ajao
Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Abstract
Mobile payment has become increasingly popular due to the widespread use of smartphones and their applications. However, its adoption in African countries has been limited, despite its potential to simplify our lives. This study aims to enhance our understanding of the factors that affect the acceptance of mobile payment in Nigeria. To achieve this, the paper explores the impact of "network externalities" in addition to traditional technology acceptance factors. The study hypothesizes that the key drivers of mobile payment acceptance are performance expectancy, effort expectancy, social influence, trust, and network externality. The research findings suggest that while traditional drivers still play a role in customers' willingness to adopt mobile payment, network externalities have the strongest impact. Although the results did not support the influence of effort expectancy, the paper provides recommendations for future research.
Automatic off-line design of robot swarms: exploring the transferability of control software and design methods across different platforms
Authors: Miquel Kegeleirs, David Garzón Ramos, Lorenzo Garattoni, Gianpiero Francesca, Mauro Birattari
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Abstract
Automatic off-line design is an attractive approach to implementing robot swarms. In this approach, a designer specifies a mission for the swarm, and an optimization process generates suitable control software for the individual robots through computer-based simulations. Most relevant literature has focused on effectively transferring control software from simulation to physical robots. For the first time, we investigate (i) whether control software generated via automatic design is transferable across robot platforms and (ii) whether the design methods that generate such control software are themselves transferable. We experiment with two ground mobile platforms with equivalent capabilities. Our measure of transferability is based on the performance drop observed when control software and/or design methods are ported from one platform to another. Results indicate that while the control software generated via automatic design is transferable in some cases, better performance can be achieved when a transferable method is directly applied to the new platform.
A New Era of Mobility: Exploring Digital Twin Applications in Autonomous Vehicular Systems
Authors: S M Mostaq Hossain, Sohag Kumar Saha, Shampa Banik, Trapa Banik
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Abstract
Digital Twins (DTs) are virtual representations of physical objects or processes that can collect information from the real environment to represent, validate, and replicate the physical twin's present and future behavior. DTs are becoming increasingly prevalent in a variety of fields, including manufacturing, automobiles, medicine, smart cities, and other related areas. In this paper, we present a systematic review of DTs in the autonomous vehicular industry. We address DTs and their essential characteristics, emphasizing accurate data collection, real-time analytics, and efficient simulation capabilities, while highlighting their role in enhancing performance and reliability. Next, we explore the technical challenges and central technologies of DTs. We present a comparative analysis of different methodologies that have been used for autonomous vehicles in smart cities. Finally, we address the application challenges and limitations of DTs in the autonomous vehicular industry.
Keyword: pruning
On Semantically-Deterministic Automata
Authors: Bader Abu Radi, Orna Kupferman
Subjects: Formal Languages and Automata Theory (cs.FL)
Abstract
A nondeterministic automaton is semantically deterministic (SD) if different nondeterministic choices in the automaton lead to equivalent states. Semantic determinism is interesting as it is a natural relaxation of determinism, and as some applications of deterministic automata in formal methods can actually use automata with some level of nondeterminism, tightly related to semantic determinism. In the context of finite words, semantic determinism coincides with determinism, in the sense that every pruning of an SD automaton to a deterministic one results in an equivalent automaton. We study SD automata on infinite words, focusing on B\"uchi, co-B\"uchi, and weak automata. We show that there, while semantic determinism does not increase the expressive power, the combinatorial and computational properties of SD automata are very different from those of deterministic automata. In particular, SD B\"uchi and co-B\"uchi automata are exponentially more succinct than deterministic ones (in fact, also exponentially more succinct than history-deterministic automata), their complementation involves an exponential blow-up, and decision procedures for them like universality and minimization are PSPACE-complete. For weak automata, we show that while an SD weak automaton need not be pruned to an equivalent deterministic one, it can be determinized to an equivalent deterministic weak automaton with the same state space, implying also efficient complementation and decision procedures for SD weak automata.
Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification
Authors: Saisai Ding, Juncheng Li, Jun Wang, Shihui Ying, Jun Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The multi-scale information among the whole slide images (WSIs) is essential for cancer diagnosis. Although the existing multi-scale vision Transformer has shown its effectiveness for learning multi-scale image representation, it still cannot work well on the gigapixel WSIs due to their extremely large image sizes. To this end, we propose a novel Multi-scale Efficient Graph-Transformer (MEGT) framework for WSI classification. The key idea of MEGT is to adopt two independent Efficient Graph-based Transformer (EGT) branches to process the low-resolution and high-resolution patch embeddings (i.e., tokens in a Transformer) of WSIs, respectively, and then fuse these tokens via a multi-scale feature fusion module (MFFM). Specifically, we design an EGT to efficiently learn the local-global information of patch tokens, which integrates the graph representation into Transformer to capture spatial-related information of WSIs. Meanwhile, we propose a novel MFFM to alleviate the semantic gap among different resolution patches during feature fusion, which creates a non-patch token for each branch as an agent to exchange information with another branch by cross-attention. In addition, to expedite network training, a novel token pruning module is developed in EGT to reduce the redundant tokens. Extensive experiments on TCGA-RCC and CAMELYON16 datasets demonstrate the effectiveness of the proposed MEGT.
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Authors: Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hoffmann
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point across the generation process. By doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to $2\times$ increase in inference throughput and even greater memory savings.
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Authors: Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
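The two-stage scheme described above can be sketched as a cheap prune followed by an expensive re-rank. This is a minimal illustration, not the authors' model: the embeddings, the `joint_scores` table standing in for a dual-encoder scorer, and all names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(query_vec, corpus, rerank_score, k):
    """Stage 1 keeps the top-k items by dot product; stage 2 re-ranks them."""
    shortlist = sorted(
        corpus, key=lambda item: dot(query_vec, item["emb"]), reverse=True
    )[:k]
    return sorted(shortlist, key=rerank_score, reverse=True)


# Hypothetical pre-computed candidate embeddings.
corpus = [
    {"id": "a", "emb": (1.0, 0.0)},
    {"id": "b", "emb": (0.9, 0.1)},
    {"id": "c", "emb": (0.0, 1.0)},
    {"id": "d", "emb": (0.8, 0.3)},
]

# Stand-in for an expensive scorer over reference-text-candidate triplets.
joint_scores = {"a": 0.2, "b": 0.9, "c": 1.0, "d": 0.5}

ranked = two_stage_retrieve(
    (1.0, 0.2), corpus, lambda item: joint_scores[item["id"]], k=3
)
ranked_ids = [item["id"] for item in ranked]
print(ranked_ids)
```

The expensive scorer is called only on the k shortlisted items, which is what keeps the cost bounded. The trade-off is visible in the toy data: item "c" has the highest joint score but is never re-ranked because stage 1 pruned it.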
Keyword: diffusion
Non-Parametric Learning of Stochastic Differential Equations with Fast Rates of Convergence
Authors: Riccardo Bonalli, Alessandro Rudi
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract
We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of non-linear stochastic differential equations, which relies upon discrete-time observations of the state. The key idea essentially consists of fitting a RKHS-based approximation of the corresponding Fokker-Planck equation to such observations, yielding theoretical estimates of learning rates which, unlike previous works, become increasingly tighter when the regularity of the unknown drift and diffusion coefficients becomes higher. Our method being kernel-based, offline pre-processing may in principle be profitably leveraged to enable efficient numerical implementation.
Differentially Private Synthetic Data via Foundation Model APIs 1: Images
Abstract
Generating differentially private (DP) synthetic data that closely resembles the original private data without leaking sensitive user information is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are accessible via their inference APIs while the model weights are unreleased. However, this comes with greater challenges due to strictly more restrictive model access and the additional need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID<=7.9 with privacy cost epsilon=0.67, significantly improving the previous SOTA from epsilon=32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images.
Unsupervised Semantic Correspondence Using Stable Diffusion
Authors: Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences -- locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly or unsupervised method on PF-Willow, CUB-200 and SPair-71k datasets.
Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps
Abstract
Denoising Diffusion Probabilistic Models (DDPM) have shown remarkable efficacy in the synthesis of high-quality images. However, their inference process characteristically requires numerous, potentially hundreds, of iterative steps, which could lead to the problem of exposure bias due to the accumulation of prediction errors over iterations. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates the retraining of the DDPM. In this work, we conduct a systematic study of exposure bias in diffusion models and, intriguingly, we find that the exposure bias could be alleviated with a new sampling method, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step $t$ and corresponding state $\hat{x}_t$, there might exist another time step $t_s$ which exhibits superior coupling with $\hat{x}_t$. Based on this finding, we introduce an inference method named Time-Shift Sampler. Our framework can be seamlessly integrated with existing sampling algorithms, such as DDIM or DDPM, inducing merely minimal additional computations. Experimental results show that our proposed framework can effectively enhance the quality of images generated by existing sampling algorithms.
Manifold Diffusion Fields
Authors: Ahmed A. Elhag, Joshua M. Susskind, Miguel Angel Bautista
Abstract
We present Manifold Diffusion Fields (MDF), an approach to learn generative models of continuous functions defined over Riemannian manifolds. Leveraging insights from spectral geometry analysis, we define an intrinsic coordinate system on the manifold via the eigen-functions of the Laplace-Beltrami Operator. MDF represents functions using an explicit parametrization formed by a set of multiple input-output pairs. Our approach allows sampling continuous functions on manifolds and is invariant with respect to rigid and isometric transformations of the manifold. Empirical results on several datasets and manifolds show that MDF can capture distributions of such functions with better diversity and fidelity than previous approaches.
Reversible and irreversible bracket-based dynamics for deep graph neural networks
Authors: Anthony Gruber, Kookjin Lee, Nathaniel Trask
Abstract
Recent works have shown that physics-inspired architectures allow the training of deep graph neural networks (GNNs) without oversmoothing. The role of these physics is unclear, however, with successful examples of both reversible (e.g., Hamiltonian) and irreversible (e.g., diffusion) phenomena producing comparable results despite diametrically opposed mechanisms, and further complications arising due to empirical departures from mathematical theory. This work presents a series of novel GNN architectures based upon structure-preserving bracket-based dynamical systems, which are provably guaranteed to either conserve energy or generate positive dissipation with increasing depth. It is shown that the theoretically principled framework employed here allows for inherently explainable constructions, which contextualize departures from theory in current architectures and better elucidate the roles of reversibility and irreversibility in network performance.
Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models
Authors: Zhong Yi Wan, Ricardo Baptista, Yi-fan Chen, John Anderson, Anudhyan Boral, Fei Sha, Leonardo Zepeda-Núñez
Abstract
We introduce a two-stage probabilistic framework for statistical downscaling between unpaired data. Statistical downscaling seeks a probabilistic map to transform low-resolution data from a (possibly biased) coarse-grained numerical scheme to high-resolution data that is consistent with a high-fidelity scheme. Our framework tackles the problem by composing two transformations: a debiasing step that is performed by an optimal transport map, and an upsampling step that is achieved by a probabilistic diffusion model with \textit{a posteriori} conditional sampling. This approach characterizes a conditional distribution without the need for paired data, and faithfully recovers relevant physical statistics from biased samples. We demonstrate the utility of the proposed approach on one- and two-dimensional fluid flow problems, which are representative of the core difficulties present in numerical simulations of weather and climate. Our method produces realistic high-resolution outputs from low-resolution inputs, by upsampling resolutions of $8\times$ and $16\times$. Moreover, our procedure correctly matches the statistics of physical quantities, even when the low-frequency content of the inputs and outputs does not match, a crucial but difficult-to-satisfy assumption needed by current state-of-the-art alternatives.
Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion
Authors: Dai Shi, Zhiqi Shao, Yi Guo, Qibin Zhao, Junbin Gao
Abstract
This work presents a comprehensive theoretical analysis of graph p-Laplacian based framelet network (pL-UFG) to establish a solid understanding of its properties. We begin by conducting a convergence analysis of the p-Laplacian based implicit layer integrated after the framelet convolution, providing insights into the asymptotic behavior of pL-UFG. By exploring the generalized Dirichlet energy of pL-UFG, we demonstrate that the Dirichlet energy remains non-zero, ensuring the avoidance of over-smoothing issues in pL-UFG as it approaches convergence. Furthermore, we elucidate the dynamic energy perspective through which the implicit layer in pL-UFG synergizes with graph framelets, enhancing the model's adaptability to both homophilic and heterophilic data. Remarkably, we establish that the implicit layer can be interpreted as a generalized non-linear diffusion process, enabling training using diverse schemes. These multifaceted analyses lead to unified conclusions that provide novel insights for understanding and implementing pL-UFG, contributing to advancements in the field of graph-based deep learning.
Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition
Authors: Dongnan Gui, Kai Chen, Haisong Ding, Qiang Huo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
There are more than 80,000 character categories in Chinese, while most of them are rarely used. To build a high-performance handwritten Chinese character recognition (HCCR) system supporting the full character set with a traditional approach, many training samples need to be collected for each character category, which is both time-consuming and expensive. In this paper, we propose a novel approach to transforming Chinese character glyph images generated from font libraries to handwritten ones with a denoising diffusion probabilistic model (DDPM). Trained on handwritten samples of a small character set, the DDPM is capable of mapping printed strokes to handwritten ones, which makes it possible to generate photo-realistic and diverse-style handwritten samples of unseen character categories. Combining DDPM-synthesized samples of unseen categories with real samples of other categories, we can build an HCCR system to support the full character set. Experimental results on the CASIA-HWDB dataset with 3,755 character categories show that the HCCR systems trained with synthetic samples perform similarly to the one trained with real samples in terms of recognition accuracy. The proposed method has the potential to address HCCR with a larger vocabulary.
Abstract
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model. To address this, we propose to denoise student features using a diffusion model trained by teacher features. This allows us to perform better distillation between the refined clean feature and teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code will be available at https://github.com/hunto/DiffKD.
Efficient Neural Music Generation
Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang
Abstract
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modeling. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6%, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
Authors: Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts. Recent research has extended these models to support text-guided image editing. While text guidance is an intuitive editing interface for users, it often fails to ensure the precise concept conveyed by users. To address this issue, we propose Custom-Edit, in which we (i) customize a diffusion model with a few reference images and then (ii) perform text-guided editing. Our key discovery is that customizing only language-relevant parameters with augmented prompts improves reference similarity significantly while maintaining source similarity. Moreover, we provide our recipe for each customization and editing process. We compare popular customization methods and validate our findings on two editing methods using various datasets.
On Architectural Compression of Text-to-Image Diffusion Models
Authors: Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
Abstract
Exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To resolve this issue, recent research on efficient SDMs has prioritized reducing the number of sampling steps and utilizing network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of parameters, MACs per sampling step, and latency. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge and achieve competitive results against larger multi-billion parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models in personalized generation with DreamBooth finetuning.
PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion
Abstract
The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely ``transport equation''. Building upon this, we propose a general framework that introduces adaptive distributional diffusion into transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as PDE+ (\textbf{PDE} with \textbf{A}daptive \textbf{D}istributional \textbf{D}iffusion) which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated in extensive settings, including clean samples and various corruptions, demonstrating its superior performance compared to SOTA methods.
Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
Authors: Tsu-Ching Hsiao, Hao-Wei Chen, Hsuan-Kung Yang, Chun-Yi Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Addressing accuracy limitations and pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge, particularly due to object symmetries or occlusions. In response, we introduce a novel score-based diffusion method applied to the $SE(3)$ group, marking the first application of diffusion models to $SE(3)$ within the image domain, specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity, mitigating perspective-induced ambiguity, and showcasing the robustness of our surrogate Stein score formulation on $SE(3)$. This formulation not only improves the convergence of Langevin dynamics but also enhances computational efficiency. Thus, we pioneer a promising strategy for 6D object pose estimation.
Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7
Authors: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D.Plumbley, Wenwu Wang
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract
Foley sound generation aims to synthesise the background sound for multimedia content, which involves computationally modelling sound effects with specialized techniques. In this work, we propose a diffusion-based generative model for DCASE 2023 Challenge Task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, a diffusion-based text-to-audio generation model. To alleviate the data scarcity of the Task 7 training set, our model is first trained on large-scale datasets and then adapted to this DCASE task via transfer learning. We have observed that the features extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by leveraging the input label together with related text embedding features obtained from a large language model, i.e., contrastive language-audio pretraining (CLAP). In addition, we use a filtering strategy to further refine the output, i.e., we select the best results from the generated candidate clips according to the similarity score between the sound and the target labels. The overall system achieves a Fréchet audio distance (FAD) score of 4.765 on average across all seven classes, substantially outperforming the baseline system, which achieves a FAD score of 9.7.
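The filtering step described in this abstract can be pictured as ranking candidate clips by how close their embeddings are to the embedding of the target label. The sketch below is only an illustration under assumptions: the abstract does not say which similarity measure is used, so cosine similarity is assumed here, and the names `cosine_similarity` and `select_best` as well as the toy two-dimensional embeddings are hypothetical stand-ins for real audio and text embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_best(candidate_embeddings, label_embedding, keep):
    """Return the indices of the `keep` candidates most similar to the label embedding.

    Assumption: each candidate clip and the target label have already been
    mapped into a shared embedding space by some encoder (not shown here).
    """
    ranked = sorted(
        range(len(candidate_embeddings)),
        key=lambda i: cosine_similarity(candidate_embeddings[i], label_embedding),
        reverse=True,
    )
    return ranked[:keep]


# Toy usage with made-up 2-D embeddings for three candidate clips.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
label = [1.0, 0.1]
best = select_best(candidates, label, keep=2)
```

Ranking and truncating keeps the selection independent of the absolute scale of the similarity scores, which may differ between classes.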
Anomaly Detection with Conditioned Denoising Diffusion Models
Authors: Arian Mousakhan, Thomas Brox, Jawad Tayyub
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Reconstruction-based methods have struggled to achieve competitive performance on anomaly detection. In this paper, we introduce Denoising Diffusion Anomaly Detection (DDAD). We propose a novel denoising process for image reconstruction conditioned on a target image. This results in a coherent restoration that closely resembles the target image. Subsequently, our anomaly detection framework leverages this conditioning where the target image is set as the input image to guide the denoising process, leading to defectless reconstruction while maintaining nominal patterns. We localise anomalies via a pixel-wise and feature-wise comparison of the input and reconstructed image. Finally, to enhance the effectiveness of feature comparison, we introduce a domain adaptation method that utilises generated examples from our conditioned denoising process to fine-tune the feature extractor. The veracity of the approach is demonstrated on various datasets including MVTec and VisA benchmarks, achieving state-of-the-art results of 99.5% and 99.3% image-level AUROC respectively.
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
Abstract
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2\% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6\% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score
Authors: Shuhai Zhang, Feng Liu, Jiahao Yang, Yifan Yang, Changsheng Li, Bo Han, Mingkui Tan
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions. Unfortunately, estimating or comparing two data distributions is extremely difficult, especially in high-dimension spaces. Recently, the gradient of log probability density (a.k.a., score) w.r.t. the sample is used as an alternative statistic to compute. However, we find that the score is sensitive in identifying adversarial samples due to insufficient information with one sample only. In this paper, we propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations. Specifically, to obtain adequate information regarding one sample, we perturb it by adding various noises to capture its multi-view observations. We theoretically prove that EPS is a proper statistic to compute the discrepancy between two samples under mild conditions. In practice, we can use a pre-trained diffusion model to estimate EPS for each sample. Last, we propose an EPS-based adversarial detection (EPS-AD) method, in which we develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples. We also prove that the EPS-based MMD between natural and adversarial samples is larger than that among natural samples. Extensive experiments show the superior adversarial detection performance of our EPS-AD.
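To make the idea of an expected perturbation score concrete, the sketch below averages the score over several noise levels and noise draws for a one-dimensional Gaussian, where the score of the noised density is known in closed form. This is a toy illustration under assumptions, not the EPS-AD method: the abstract estimates the score with a pre-trained diffusion model, whereas here the closed-form Gaussian score replaces that model, the perturbation is plain additive Gaussian noise, and the function names and noise levels are hypothetical.

```python
import random


def perturbed_score(x_t, mean, var, noise_std):
    """Score (gradient of the log density) of N(mean, var) after adding N(0, noise_std^2) noise."""
    return -(x_t - mean) / (var + noise_std ** 2)


def expected_perturbation_score(x, mean, var, noise_levels, draws_per_level, rng):
    """Monte Carlo average of the score over several perturbations of the sample x."""
    total = 0.0
    count = 0
    for noise_std in noise_levels:
        for _ in range(draws_per_level):
            x_t = x + noise_std * rng.gauss(0.0, 1.0)  # one perturbed view of x
            total += perturbed_score(x_t, mean, var, noise_std)
            count += 1
    return total / count


rng = random.Random(0)
levels = [0.1, 0.5, 1.0]
eps_near = expected_perturbation_score(0.1, 0.0, 1.0, levels, 200, rng)
eps_far = expected_perturbation_score(5.0, 0.0, 1.0, levels, 200, rng)
```

In this toy setting a sample far from the data mean yields a statistic of much larger magnitude than a sample near it, which is the kind of separation a detection statistic needs.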
GenerateCT: Text-Guided 3D Chest CT Generation
Authors: Ibrahim Ethem Hamamci, Sezgin Er, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Furkan Almas, Sevval Nil Esirgun, Hadrien Reynaud, Sarthak Pati, Christian Bluethgen, Bjoern Menze
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Generative modeling has experienced substantial progress in recent years, particularly in text-to-image and text-to-video synthesis. However, the medical field has not yet fully exploited the potential of large-scale foundational models for synthetic data generation. In this paper, we introduce GenerateCT, the first method for text-conditional computed tomography (CT) generation, addressing the limitations in 3D medical imaging research and making our entire framework open-source. GenerateCT consists of a pre-trained large language model, a transformer-based text-conditional 3D chest CT generation architecture, and a text-conditional spatial super-resolution diffusion model. We also propose CT-ViT, which efficiently compresses CT volumes while preserving auto-regressiveness in-depth, enabling the generation of 3D CT volumes with variable numbers of axial slices. Our experiments demonstrate that GenerateCT can produce realistic, high-resolution, and high-fidelity 3D chest CT volumes consistent with medical language text prompts. We further investigate the potential of GenerateCT by training a model using generated CT volumes for multi-abnormality classification of chest CT volumes. Our contributions provide a valuable foundation for future research in text-conditional 3D medical image generation and have the potential to accelerate advancements in medical imaging research. Our code, pre-trained models, and generated data are available at https://github.com/ibrahimethemhamamci/GenerateCT.
Local Randomized Neural Networks with Discontinuous Galerkin Methods for Diffusive-Viscous Wave Equation
Abstract
The diffusive-viscous wave equation is an advancement in wave equation theory, as it accounts for both diffusion and viscosity effects. This has a wide range of applications in geophysics, such as the attenuation of seismic waves in fluid-saturated solids and frequency-dependent phenomena in porous media. Therefore, the development of an efficient numerical method for the equation is of both theoretical and practical importance. Recently, local randomized neural networks with discontinuous Galerkin (LRNN-DG) methods have been introduced in \cite{Sun2022lrnndg} to solve elliptic and parabolic equations. Numerical examples suggest that LRNN-DG can achieve high accuracy, and can handle time-dependent problems naturally and efficiently by using a space-time framework. In this paper, we develop LRNN-DG methods for solving the diffusive-viscous wave equation and present numerical experiments with several cases. The numerical results show that the proposed methods can solve the diffusive-viscous wave equation more accurately with less computing costs than traditional methods.
CACTUS: A Computational Framework for Generating Realistic White Matter Microstructure Substrates
Authors: Juan Luis Villarreal-Haro, Remy Gardier, Erick J Canales-Rodriguez, Elda Fischi Gomez, Gabriel Girard, Jean-Philippe Thiran, Jonathan Rafael-Patino
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Monte-Carlo diffusion simulations are a powerful tool for validating tissue microstructure models by generating synthetic diffusion-weighted magnetic resonance images (DW-MRI) in controlled environments. This is fundamental for understanding the link between micrometre-scale tissue properties and DW-MRI signals measured at the millimetre-scale, optimising acquisition protocols to target microstructure properties of interest, and exploring the robustness and accuracy of estimation methods. However, accurate simulations require substrates that reflect the main microstructural features of the studied tissue. To address this challenge, we introduce a novel computational workflow, CACTUS (Computational Axonal Configurator for Tailored and Ultradense Substrates), for generating synthetic white matter substrates. Our approach allows constructing substrates with higher packing density than existing methods, up to 95% intra-axonal volume fraction, and larger voxel sizes of up to $(500\,\mu\mathrm{m})^3$ with rich fibre complexity. CACTUS generates bundles with angular dispersion, bundle crossings, and variations along the fibres of their inner and outer radii and g-ratio. We achieve this by introducing a novel global cost function and a fibre radial growth approach that allows substrates to match predefined targeted characteristics and mirror those reported in histological studies. CACTUS improves the development of complex synthetic substrates, paving the way for future applications in microstructure imaging.
Unifying GANs and Score-Based Diffusion as Generative Particle Models
Authors: Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, Alain Rakotomamonjy
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Abstract
Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions by differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper, we challenge this interpretation and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.
ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
Authors: Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Project page: https://ml.cs.tsinghua.edu.cn/prolificdreamer/
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models
Abstract
Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet, one pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results is more art than science. Moreover, as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating more additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at https://github.com/SHI-Labs/Prompt-Free-Diffusion.
ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation
Authors: Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Oliver Deussen, Changsheng Xu
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes like material, style, layout, etc. remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop Prompt Spectrum Space P, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P and ProSpect offer stronger disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image/text-guided material/style/layout transfer/editing, achieving previously unattainable results with a single image input without fine-tuning the diffusion models.
UDPM: Upsampling Diffusion Probabilistic Models
Authors: Shady Abu-Hussein, Raja Giryes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have attracted significant attention. By composing a Markovian process that starts in the data domain and then gradually adds noise until reaching pure white noise, they achieve superior performance in learning data distributions. Yet, these models require a large number of diffusion steps to produce aesthetically pleasing samples, which is inefficient. In addition, unlike common generative adversarial networks, the latent space of diffusion models is not interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we reduce the latent variable dimension in addition to the traditional noise level addition. As a result, we are able to sample images of size $256\times 256$ with only 7 diffusion steps, about two orders of magnitude fewer than standard DDPMs. We formally develop the Markovian diffusion processes of the UDPM, and demonstrate its generation capabilities on the popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable property of UDPM is that it is very easy to interpolate its latent space, which is not the case with standard diffusion models. Our code is available online at \url{https://github.com/shadyabh/UDPM}.
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs
Authors: Guangyao Zhai, Evin Pinar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, Benjamin Busam
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to lacking a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Codes and the dataset will be released upon acceptance.
Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos
Authors: Matthew Chang, Aditya Prakash, Saurabh Gupta
Abstract
The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
Break-A-Scene: Extracting Multiple Concepts from a Single Image
Abstract
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: https://omriavrahami.com/break-a-scene/
Abstract
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D objects, compositions, or scenes, there remains a lack of focus on capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure whose distribution will affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality, and experiments demonstrate our high performance in articulated object generation. We also demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
Parallel Sampling of Diffusion Models
Authors: Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari
Abstract
Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.
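The Picard-iteration idea in this abstract can be illustrated on a toy scalar ODE: guess the whole trajectory, evaluate the drift at every guessed point (these evaluations are independent and could run in parallel), rebuild the trajectory from a prefix sum, and repeat until it stops changing. The sketch below is a minimal illustration under assumptions, not ParaDiGMS itself: it uses a hand-written drift instead of a denoising network, and the function names, step size, and tolerance are hypothetical.

```python
def sequential_solve(x0, drift, step, num_steps):
    """Reference solver: x[t+1] = x[t] + step * drift(x[t]), one step at a time."""
    xs = [x0]
    for _ in range(num_steps):
        xs.append(xs[-1] + step * drift(xs[-1]))
    return xs


def picard_solve(x0, drift, step, num_steps, tol=1e-12):
    """Picard iteration: refine a guess for the whole trajectory in sweeps.

    After k sweeps the first k steps match the sequential solver, so at most
    num_steps sweeps are needed; convergence is often reached earlier.
    """
    xs = [x0] * (num_steps + 1)  # initial guess: a constant trajectory
    for sweep in range(1, num_steps + 1):
        # All drift evaluations in a sweep are independent (parallelizable).
        drifts = [drift(x) for x in xs[:-1]]
        new_xs = [x0]
        running = 0.0
        for d in drifts:  # cheap prefix sum over the drift values
            running += d
            new_xs.append(x0 + step * running)
        change = max(abs(a - b) for a, b in zip(new_xs, xs))
        xs = new_xs
        if change <= tol:
            return xs, sweep
    return xs, num_steps


def decay(x):
    """Toy drift dx/dt = -x, standing in for a learned denoiser."""
    return -x


reference = sequential_solve(1.0, decay, 0.1, 20)
parallel, sweeps = picard_solve(1.0, decay, 0.1, 20)
```

Each sweep costs as many drift evaluations as a full sequential pass, so the approach only pays off when those evaluations really run concurrently and the number of sweeps is smaller than the number of steps; that is the compute-for-speed trade the abstract describes.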
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Authors: Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract
Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at \url{https://github.com/ShihaoZhaoZSH/Uni-ControlNet}.
Keyword: dynamic
Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation
Abstract
Translating natural language sentences to first-order logic (NL-FOL translation) is a longstanding challenge in the NLP and formal logic literature. This paper introduces LogicLLaMA, a LLaMA-7B model fine-tuned for NL-FOL translation using LoRA on a single GPU. LogicLLaMA is capable of directly translating natural language into FOL rules, which outperforms GPT-3.5. LogicLLaMA is also equipped to correct FOL rules predicted by GPT-3.5, and can achieve performance similar to that of GPT-4 at a fraction of the cost. This correction ability was achieved by a novel supervised fine-tuning (SFT) + reinforcement learning with human feedback (RLHF) framework, which initially trains on synthetically perturbed NL-FOL pairs to encourage chain-of-thought reasoning and then fine-tunes with RLHF on GPT-3.5 outputs using a FOL verifier as the reward model. To train LogicLLaMA, we present MALLS (large language $\textbf{M}$odel gener$\textbf{A}$ted N$\textbf{L}$-FO$\textbf{L}$ pair$\textbf{S}$), a dataset of 34K high-quality and diverse sentence-level NL-FOL pairs collected from GPT-4. The dataset was created by implementing a pipeline that prompts GPT-4 for pairs, dynamically adjusts the prompts to ensure the collection of pairs with rich and diverse contexts at different levels of complexity, and verifies the validity of the generated FOL rules. Codes, weights, and data are available at https://github.com/gblackout/LogicLLaMA.
Deep Reinforcement Learning with Plasticity Injection
Authors: Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, André Barreto
Abstract
A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, and performance in RL. This paper introduces plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool: if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent has to re-learn from scratch due to exhausted plasticity or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance compared to alternative methods while being computationally efficient.
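The abstract states two properties of the intervention (the number of trainable parameters stays the same and the predictions are not biased) without spelling out the mechanism. One construction that has both properties, assumed here purely for illustration, is to freeze the current head, add a freshly initialized trainable head, and subtract a frozen copy of that fresh head so the two new terms cancel at the moment of injection. The classes `LinearHead` and `InjectedHead` below are hypothetical and use plain linear maps instead of deep networks.

```python
import copy
import random


class LinearHead:
    """A tiny linear prediction head: output = weights . features + bias."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def __call__(self, features):
        return sum(w * f for w, f in zip(self.weights, features)) + self.bias


class InjectedHead:
    """Frozen old head + trainable fresh head - frozen copy of the fresh head.

    Assumed construction: only `trainable_new` would receive gradient updates,
    and it has the same number of parameters as the head it replaces.
    """

    def __init__(self, old_head, rng):
        size = len(old_head.weights)
        self.frozen_old = old_head
        self.trainable_new = LinearHead([rng.uniform(-1.0, 1.0) for _ in range(size)], 0.0)
        self.frozen_copy = copy.deepcopy(self.trainable_new)

    def __call__(self, features):
        return (self.frozen_old(features)
                + self.trainable_new(features)
                - self.frozen_copy(features))


old = LinearHead([0.5, -1.0], 0.2)
injected = InjectedHead(old, random.Random(0))
x = [1.0, 2.0]
before = injected(x)                # equals old(x) up to rounding
injected.trainable_new.bias += 1.0  # stand-in for a later gradient update
after = injected(x)
```

Because the fresh head and its frozen copy start out identical, the output right after injection matches the old head, and any later change to the output comes only from updates to the fresh head.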
Semi-global Exponential Stability for Dual Quaternion Based Rigid-Body Tracking Control
Authors: Vrushabh Zinage, S P Arjun Ram, Maruthi R. Akella, Efstathios Bakolas
Abstract
Semi-Global Exponential Stability (SGES) is proved for the combined attitude and position rigid body motion tracking problem, which was previously only known to be uniformly asymptotically stable. Dual quaternions are used to jointly represent the rotational and translation tracking error dynamics of the rigid body. A novel nonlinear feedback tracking controller is proposed and a Lyapunov based analysis is provided to prove the semi-global exponential stability of the closed-loop dynamics. Our analysis does not place any restrictions on the reference trajectory or the feedback gains. This stronger SGES result aids in further analyzing the robustness of the rigid body system by establishing Input-to-State Stability (ISS) in the presence of time-varying additive and bounded external disturbances. Motivated by the fact that in many aerospace applications, stringent adherence to safety constraints such as approach path constraint and input constraints is critical for overall mission success, we present a framework for safe control of spacecraft that combines the proposed feedback controller with Control Barrier Functions. Numerical simulations are provided for the Mars Cube One (MarCO) mission, Apollo transposition and docking problem, and the rendezvous of SpaceX Dragon 2 with the International Space Station to verify the SGES and ISS results as well as the efficacy of the proposed nonlinear feedback controller.
Automated Driving Architecture and Operation of a Light Commercial Vehicle
Authors: Murat Gozu, Mumin Tolga Emirler, Ismail Meric Can Uygan, Tevfik Ali Boke, Levent Guvenc, Bilin Aksun-Guvenc
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Abstract
This paper is on the automated driving architecture and operation of a light commercial vehicle. Simple longitudinal and lateral dynamic models of the vehicle and a more detailed CarSim model are developed and used in simulations and controller design and evaluation. Experimental validation is used to make sure that the models used represent the actual response of the vehicle as closely as possible. The vehicle is made drive-by-wire by interfacing with the existing throttle-by-wire, by adding an active vacuum booster for brake-by-wire and by adding a steering actuator for steer-by-wire operation. Vehicle localization is achieved by using a GPS sensor integrated with a six-axis IMU with a built-in INS algorithm and a digital compass for heading information. Front-looking radar, lidar and camera are used for environmental sensing. Communication with the road infrastructure and other vehicles is made possible by a vehicle-to-vehicle communication modem. A dedicated computer under real-time Linux is used to collect, process and distribute sensor information. A dSPACE MicroAutoBox is used for drive-by-wire controls. CACC-based longitudinal control and path tracking of a map of GPS waypoints are used to present the operation of this automated driving vehicle.
Learning Lagrangian Fluid Mechanics with E($3$)-Equivariant Graph Neural Networks
Authors: Artur P. Toshev, Gianluca Galletti, Johannes Brandstetter, Stefan Adami, Nikolaus A. Adams
Abstract
We contribute to the vastly growing field of machine learning for engineering systems by demonstrating that equivariant graph neural networks have the potential to learn more accurate dynamic-interaction models than their non-equivariant counterparts. We benchmark two well-studied fluid-flow systems, namely 3D decaying Taylor-Green vortex and 3D reverse Poiseuille flow, and evaluate the models based on different performance measures, such as kinetic energy or Sinkhorn distance. In addition, we investigate different embedding methods of physical-information histories for equivariant models. We find that while currently being rather slow to train and evaluate, equivariant models with our proposed history embeddings learn more accurate physical interactions.
Reversible and irreversible bracket-based dynamics for deep graph neural networks
Authors: Anthony Gruber, Kookjin Lee, Nathaniel Trask
Abstract
Recent works have shown that physics-inspired architectures allow the training of deep graph neural networks (GNNs) without oversmoothing. The role of these physics is unclear, however, with successful examples of both reversible (e.g., Hamiltonian) and irreversible (e.g., diffusion) phenomena producing comparable results despite diametrically opposed mechanisms, and further complications arising due to empirical departures from mathematical theory. This work presents a series of novel GNN architectures based upon structure-preserving bracket-based dynamical systems, which are provably guaranteed to either conserve energy or generate positive dissipation with increasing depth. It is shown that the theoretically principled framework employed here allows for inherently explainable constructions, which contextualize departures from theory in current architectures and better elucidate the roles of reversibility and irreversibility in network performance.
Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion
Authors: Dai Shi, Zhiqi Shao, Yi Guo, Qibin Zhao, Junbin Gao
Abstract
This work presents a comprehensive theoretical analysis of graph p-Laplacian based framelet network (pL-UFG) to establish a solid understanding of its properties. We begin by conducting a convergence analysis of the p-Laplacian based implicit layer integrated after the framelet convolution, providing insights into the asymptotic behavior of pL-UFG. By exploring the generalized Dirichlet energy of pL-UFG, we demonstrate that the Dirichlet energy remains non-zero, ensuring the avoidance of over-smoothing issues in pL-UFG as it approaches convergence. Furthermore, we elucidate the dynamic energy perspective through which the implicit layer in pL-UFG synergizes with graph framelets, enhancing the model's adaptability to both homophilic and heterophilic data. Remarkably, we establish that the implicit layer can be interpreted as a generalized non-linear diffusion process, enabling training using diverse schemes. These multifaceted analyses lead to unified conclusions that provide novel insights for understanding and implementing pL-UFG, contributing to advancements in the field of graph-based deep learning.
Accelerated solutions of convection-dominated partial differential equations using implicit feature tracking and empirical quadrature
Authors: Marzieh Alireza Mirhoseini, Matthew J. Zahr
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
This work introduces an empirical quadrature-based hyperreduction procedure and greedy training algorithm to effectively reduce the computational cost of solving convection-dominated problems with limited training. The proposed approach circumvents the slowly decaying $n$-width limitation of linear model reduction techniques applied to convection-dominated problems by using a nonlinear approximation manifold systematically defined by composing a low-dimensional affine space with bijections of the underlying domain. The reduced-order model is defined as the solution of a residual minimization problem over the nonlinear manifold. An online-efficient method is obtained by using empirical quadrature to approximate the optimality system such that it can be solved with mesh-independent operations. The proposed reduced-order model is trained using a greedy procedure to systematically sample the parameter domain. The effectiveness of the proposed approach is demonstrated on two shock-dominated computational fluid dynamics benchmarks.
FedHC: A Scalable Federated Learning Framework for Heterogeneous and Resource-Constrained Clients
Authors: Min Zhang, Fuxun Yu, Yongbo Yu, Minjia Zhang, Ang Li, Xiang Chen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Federated Learning (FL) is a distributed learning paradigm that empowers edge devices to collaboratively learn a global model leveraging local data. Simulating FL on GPU is essential to expedite FL algorithm prototyping and evaluations. However, current FL frameworks overlook the disparity between algorithm simulation and real-world deployment, which arises from heterogeneous computing capabilities and imbalanced workloads, thus misleading evaluations of new algorithms. Additionally, they lack flexibility and scalability to accommodate resource-constrained clients. In this paper, we present FedHC, a scalable federated learning framework for heterogeneous and resource-constrained clients. FedHC realizes system heterogeneity by allocating a dedicated and constrained GPU resource budget to each client, and also simulates workload heterogeneity in terms of framework-provided runtime. Furthermore, we enhance GPU resource utilization for scalable clients by introducing a dynamic client scheduler, process manager, and resource-sharing mechanism. Our experiments demonstrate that FedHC has the capability to capture the influence of various factors on client execution time. Moreover, despite resource constraints for each client, FedHC achieves state-of-the-art efficiency compared to existing frameworks without limits. When subjecting existing frameworks to the same resource constraints, FedHC achieves a 2.75x speedup. Code has been released on https://github.com/if-lab-repository/FedHC.
Accelerated K-Serial Stable Coalition for Dynamic Capture and Resource Defense
Abstract
Coalition formation is an important means for multi-robot systems to collaborate on common tasks. An effective and adaptive coalition strategy is essential for online performance in dynamic and unknown environments. In this work, the problem of territory defense by large-scale heterogeneous robotic teams is considered. The tasks include surveillance, capture of dynamic targets, and perimeter defense over valuable resources. Since each robot can choose among many tasks, it remains a challenging problem to jointly coordinate these robots such that the overall utility is maximized. This work proposes a generic coalition strategy called K-serial stable coalition algorithm (KS-COAL). Different from centralized approaches, it is distributed and anytime, meaning that only local communication is required and a K-serial Nash-stable solution is ensured. Furthermore, to accelerate adaptation to dynamic targets and resource distribution that are only perceived online, a heterogeneous graph attention network (HGAN)-based heuristic is learned to select more appropriate parameters and promising initial solutions during local optimization. Compared with manual heuristics or end-to-end predictors, it is shown to both improve online adaptability and retain the quality guarantee. The proposed methods are validated rigorously via large-scale simulations with hundreds of robots, against several strong baselines including GreedyNE and FastMaxSum.
Abstract
Most MPC (Model Predictive Control) algorithms used in industries and studied in the control academia use a two-term QP (quadratic programming), where the first term is the weighted norm of the output errors, and the second term is that of the input increments. In this work, a DMC (Dynamic Matrix Control) algorithm that uses three-term QP is studied, where the third term is the weighted norm of the output increments. In the analysis, a relationship between the three-term DMC and the two-term DMC is established; based on that, the closed-loop response curves are derived. Based on the analysis, two controller tuning procedures are developed for the three-term DMC, one for closed-loop step response and one for disturbance reduction. Finally, it will be proven that the three-term DMC can achieve a higher performance and robustness than the two-term DMC can. Simulation studies are used to demonstrate the findings and the tuning methods.
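As a rough sketch of the objective described in this abstract, the three-term cost can be written as below. The symbols are assumptions introduced only for illustration and are not taken from the paper: $r$ is the reference trajectory, $\hat{y}$ the predicted outputs, $\Delta u$ the input increments, $\Delta \hat{y}$ the predicted output increments, and $Q$, $R$, $S$ the weighting matrices.

```latex
\min_{\Delta u} \; J
  = \lVert r - \hat{y} \rVert_{Q}^{2}
  + \lVert \Delta u \rVert_{R}^{2}
  + \lVert \Delta \hat{y} \rVert_{S}^{2}
```

Under this notation, setting $S = 0$ recovers the usual two-term QP.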
TransWorldNG: Traffic Simulation via Foundation Model
Authors: Ding Wang, Xuhong Wang, Liang Chen, Shengyue Yao, Ming Jing, Honghai Li, Li Li, Shiqiang Bao, Fei-Yue Wang, Yilun Lin
Abstract
Traffic simulation is a crucial tool for transportation decision-making and policy development. However, achieving realistic simulations in the face of the high dimensionality and heterogeneity of traffic environments is a longstanding challenge. In this paper, we present TransWorldNG, a traffic simulator that uses Data-driven algorithms and Graph Computing techniques to learn traffic dynamics from real data. The functionality and structure of TransWorldNG are introduced, which utilize a foundation model for transportation management and control. The results demonstrate that TransWorldNG can generate more realistic traffic patterns compared to traditional simulators. Additionally, TransWorldNG exhibits better scalability, as it shows linear growth in computation time as the scenario scale increases. To the best of our knowledge, this is the first traffic simulator that can automatically learn traffic patterns from real-world data and efficiently generate accurate and realistic traffic environments.
Dynamic Enhancement Network for Partial Multi-modality Person Re-identification
Authors: Aihua Zheng, Ziling He, Zi Wang, Chenglong Li, Jin Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Many existing multi-modality studies are based on the assumption of modality integrity. However, the problem of missing arbitrary modalities is very common in real life, and although it is important for multi-modality person re-identification (Re-ID), it has received little study. To this end, we design a novel dynamic enhancement network (DENet), which allows missing arbitrary modalities while maintaining the representation ability of multiple modalities, for partial multi-modality person Re-ID. To be specific, the multi-modal representation of the RGB, near-infrared (NIR) and thermal-infrared (TIR) images is learned by three branches, in which the information of missing modalities is recovered by the feature transformation module. Since the missing state might be changeable, we design a dynamic enhancement module, which dynamically enhances modality features according to the missing state in an adaptive manner, to improve the multi-modality representation. Extensive experiments on the multi-modality person Re-ID dataset RGBNT201 and the vehicle Re-ID dataset RGBNT100, compared with state-of-the-art methods, verify the effectiveness of our method in complex and changeable environments.
Residual Dynamics Learning for Trajectory Tracking for Multi-rotor Aerial Vehicles
Authors: Geesara Kulathunga, Hany Hamed, Alexandr Klimchik
Abstract
This paper presents a technique to cope with the gap between high-level planning, e.g., reference trajectory tracking, and low-level control, using a learning-based method in the plan-based control paradigm. The technique improves the smoothness of maneuvering through cluttered environments, especially targeting low-speed velocity profiles. In such a profile, external aerodynamic effects that are applied on the quadrotor can be neglected. Hence, we used a simplified motion model to represent the motion of the quadrotor when formulating the Nonlinear Model Predictive Control (NMPC)-based local planner. However, the simplified motion model causes residual dynamics between the high-level planner and the low-level controller. A Sparse Gaussian Process Regression-based technique is proposed to reduce these residual dynamics. The proposed technique is compared with Data-Driven MPC. The comparison shows that an augmented residual dynamics model-based planner helps to reduce the nominal model error by a factor of 2 on average. Further, we compared the proposed complete framework with four other approaches. The proposed approach outperformed the others in tracking the reference trajectory without colliding with obstacles, with less flight time and without losing computational efficiency.
Lucy-SKG: Learning to Play Rocket League Efficiently Using Deep Reinforcement Learning
Authors: Vasileios Moschopoulos, Pantelis Kyriakidis, Aristotelis Lazaridis, Ioannis Vlahavas
Abstract
A successful tactic that is followed by the scientific community for advancing AI is to treat games as problems, which has been proven to lead to various breakthroughs. We adopt this strategy in order to study Rocket League, a widely popular but rather under-explored 3D multiplayer video game with a distinct physics engine and complex dynamics that pose a significant challenge in developing efficient and high-performance game-playing agents. In this paper, we present Lucy-SKG, a Reinforcement Learning-based model that learned how to play Rocket League in a sample-efficient manner, outperforming by a notable margin the two highest-ranking bots in this game, namely Necto (2022 bot champion) and its successor Nexto, thus becoming a state-of-the-art agent. Our contributions include: a) the development of a reward analysis and visualization library, b) novel parameterizable reward shape functions that capture the utility of complex reward types via our proposed Kinesthetic Reward Combination (KRC) technique, and c) design of auxiliary neural architectures for training on reward prediction and state representation tasks in an on-policy fashion for enhanced efficiency in learning speed and performance. By performing thorough ablation studies for each component of Lucy-SKG, we showed their independent effectiveness in overall performance. In doing so, we demonstrate the prospects and challenges of using sample-efficient Reinforcement Learning techniques for controlling complex dynamical systems under competitive team-based multiplayer conditions.
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Authors: Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hoffmann
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point across the generation process. By doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to $2\times$ increase in inference throughput and even greater memory savings.
A continuum and computational framework for viscoelastodynamics: II. Strain-driven and energy-momentum consistent schemes
Abstract
We continue our investigation of finite deformation linear viscoelastodynamics by focusing on constructing accurate and reliable numerical schemes. The concrete thermomechanical foundation developed in the previous study paves the way for pursuing discrete formulations with critical physical and mathematical structures preserved. Energy stability, momentum conservation, and temporal accuracy constitute the primary factors in our algorithm design. For inelastic materials, the directionality condition, a property for the stress to be energy consistent, is extended with the dissipation effect taken into account. Moreover, the integration of the constitutive relations calls for an algorithm design of the internal state variables and their conjugate variables. A directionality condition for the conjugate variables is introduced as an indispensable ingredient for ensuring physically correct numerical dissipation. By leveraging the particular structure of the configurational free energy, a set of update formulas for the internal state variables is obtained. Detailed analysis reveals that the overall discrete schemes are energy-momentum consistent and achieve first- and second-order accuracy in time, respectively. Numerical examples are provided to justify the appealing features of the proposed methodology.
Stochastic Modified Equations and Dynamics of Dropout Algorithm
Authors: Zhongwang Zhang, Yuqing Li, Tao Luo, Zhi-Qin John Xu
Abstract
Dropout is a widely utilized regularization technique in the training of neural networks; nevertheless, its underlying mechanism and its impact on achieving good generalization abilities remain poorly understood. In this work, we derive the stochastic modified equations for analyzing the dynamics of dropout, where its discrete iteration process is approximated by a class of stochastic differential equations. In order to investigate the underlying mechanism by which dropout facilitates the identification of flatter minima, we study the noise structure of the derived stochastic modified equation for dropout. By drawing upon the structural resemblance between the Hessian and covariance through several intuitive approximations, we empirically demonstrate the universal presence of the inverse variance-flatness relation and the Hessian-variance relation throughout the training process of dropout. These theoretical and empirical findings make a substantial contribution to our understanding of the inherent tendency of dropout to locate flatter minima.
Robust asymptotic observer of motion states with nonlinear friction
Abstract
This paper revisits the previously proposed linear asymptotic observer of the motion state variables with nonlinear friction and provides a robust design suitable for both the transient presliding and the steady-state sliding phases of the relative motion. The class of motion systems in which the output displacement is the only measurable quantity is considered. The reduced-order Luenberger-type observer is designed based on the obtained simplified state-space representation with a time-varying system matrix. The resulting observation error dynamics proves to be robust and appropriate for all variations of the system matrix, which are due to the nonlinear spatially varying friction. A specially designed tribological setup to accurately monitor the relative motion between two contacting friction surfaces is used to collect the experimental data of the deceleration trajectories when excited by a series of impulses. The performance of the state estimation using the proposed observer is shown based on the collected experimental data.
Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
Authors: Tsu-Ching Hsiao, Hao-Wei Chen, Hsuan-Kung Yang, Chun-Yi Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Addressing accuracy limitations and pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge, particularly due to object symmetries or occlusions. In response, we introduce a novel score-based diffusion method applied to the $SE(3)$ group, marking the first application of diffusion models to $SE(3)$ within the image domain, specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity, mitigating perspective-induced ambiguity, and showcasing the robustness of our surrogate Stein score formulation on $SE(3)$. This formulation not only improves the convergence of Langevin dynamics but also enhances computational efficiency. Thus, we pioneer a promising strategy for 6D object pose estimation.
Camera-Incremental Object Re-Identification with Identity Knowledge Evolution
Authors: Hantao Yao, Lu Yu, Jifei Luo, Changsheng Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Object Re-identification (ReID) aims to retrieve the probe object from many gallery images with the ReID model inferred based on a stationary camera-free dataset by associating and collecting the identities across all camera views. When deploying the ReID algorithm in real-world scenarios, storage and privacy constraints and dynamic changes of cameras can degrade its generalizability and applicability. Treating each camera's data independently, we introduce a novel ReID task named Camera-Incremental Object Re-identification (CIOR) by continually optimizing the ReID model from the incoming stream of the camera dataset. Since the identities under different camera views might describe the same object, associating and distilling the knowledge of common identities would boost discrimination and help alleviate catastrophic forgetting. In this paper, we propose a novel Identity Knowledge Evolution (IKE) framework for CIOR, consisting of the Identity Knowledge Association (IKA), Identity Knowledge Distillation (IKD), and Identity Knowledge Update (IKU). IKA is proposed to discover the common identities between the current identity and historical identities. IKD is applied to distill historical identity knowledge from common identities and quickly adapt the historical model to the current camera view. After each camera has been trained, IKU is applied to continually expand the identity knowledge by combining the historical and current identity memories. The evaluation on Market-CL and Veri-CL shows the effectiveness of IKE for CIOR. Code: https://github.com/htyao89/Camera-Incremental-Object-ReID
MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization
Abstract
Memes are a powerful tool for communication over social media. Their affinity for evolving across politics, history, and sociocultural phenomena makes them an ideal communication vehicle. To comprehend the subtle message conveyed within a meme, one must understand the background that facilitates its holistic assimilation. Besides digital archiving of memes and their metadata by a few websites like knowyourmeme.com, currently, there is no efficient way to deduce a meme's context dynamically. In this work, we propose a novel task, MEMEX - given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme. At first, we develop MCC (Meme Context Corpus), a novel dataset for MEMEX. Further, to benchmark MCC, we propose MIME (MultImodal Meme Explainer), a multimodal neural framework that uses common sense enriched meme representation and a layered approach to capture the cross-modal semantic dependencies between the meme and the context. MIME surpasses several unimodal and multimodal systems and yields an absolute improvement of ~ 4% F1-score over the best baseline. Lastly, we conduct detailed analyses of MIME's performance, highlighting the aspects that could lead to optimal modeling of cross-modal contextual associations.
Improved Algorithms for Allen's Interval Algebra by Dynamic Programming with Sublinear Partitioning
Abstract
Allen's interval algebra is one of the most well-known calculi in qualitative temporal reasoning with numerous applications in artificial intelligence. Recently, there has been a surge of improvements in the fine-grained complexity of NP-hard reasoning tasks, improving the running time from the naive $2^{O(n^2)}$ to $O^*((1.0615n)^{n})$, with even faster algorithms for unit intervals and for a bounded number of overlapping intervals (the $O^*(\cdot)$ notation suppresses polynomial factors). Despite these improvements, the best known lower bound is still only $2^{o(n)}$ (under the exponential-time hypothesis), and major improvements in either direction seemingly require fundamental advances in computational complexity. In this paper we propose a novel framework for solving NP-hard qualitative reasoning problems which we refer to as dynamic programming with sublinear partitioning. Using this technique we obtain a major improvement of $O^*((\frac{cn}{\log{n}})^{n})$ for Allen's interval algebra. To demonstrate that the technique is applicable to more domains we apply it to a problem in qualitative spatial reasoning, the cardinal direction point algebra, and solve it in $O^*((\frac{cn}{\log{n}})^{2n/3})$ time. Hence, not only do we significantly advance the state-of-the-art for NP-hard qualitative reasoning problems, but we also obtain a novel algorithmic technique that is likely applicable to many problems where $2^{O(n)}$ time algorithms are unlikely.
Dynamic Inter-treatment Information Sharing for Heterogeneous Treatment Effects Estimation
Authors: Vinod Kumar Chauhan, Jiandong Zhou, Soheila Molaei, Ghadeer Ghosheh, David A. Clifton
Abstract
Existing heterogeneous treatment effects learners, also known as conditional average treatment effects (CATE) learners, lack a general mechanism for end-to-end inter-treatment information sharing, and data have to be split among potential outcome functions to train CATE learners which can lead to biased estimates with limited observational datasets. To address this issue, we propose a novel deep learning-based framework to train CATE learners that facilitates dynamic end-to-end information sharing among treatment groups. The framework is based on \textit{soft weight sharing} of \textit{hypernetworks}, which offers advantages such as parameter efficiency, faster training, and improved results. The proposed framework complements existing CATE learners and introduces a new class of uncertainty-aware CATE learners that we refer to as \textit{HyperCATE}. We develop HyperCATE versions of commonly used CATE learners and evaluate them on IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework improves the CATE estimation error via counterfactual inference, with increasing effectiveness for smaller datasets.
Illustrative Motion Smoothing for Attention Guidance in Dynamic Visualizations
Authors: Johannes Eschner, Peter Mindek, Manuela Waldner
Abstract
3D animations are an effective method to learn about complex dynamic phenomena, such as mesoscale biological processes. The animators' goals are to convey a sense of the scene's overall complexity while, at the same time, visually guiding the user through a story of subsequent events embedded in the chaotic environment. Animators use a variety of visual emphasis techniques to guide the observers' attention through the story, such as highlighting, halos -- or by manipulating motion parameters of the scene. In this paper, we investigate the effect of smoothing the motion of contextual scene elements to attract attention to focus elements of the story exhibiting high-frequency motion. We conducted a crowdsourced study with 108 participants observing short animations with two illustrative motion smoothing strategies: geometric smoothing through noise reduction of contextual motion trajectories and visual smoothing through motion blur of context items. We investigated the observers' ability to follow the story as well as the effect of the techniques on speed perception in a molecular scene. Our results show that moderate motion blur significantly improves users' ability to follow the story. Geometric motion smoothing is less effective but increases the visual appeal of the animation. However, both techniques also slow down the perceived speed of the animation. We discuss the implications of these results and derive design guidelines for animators of complex dynamic visualizations.
Exploiting Noise as a Resource for Computation and Learning in Spiking Neural Networks
Abstract
Networks of spiking neurons underpin the extraordinary information-processing capabilities of the brain and have emerged as pillar models in neuromorphic intelligence. Despite extensive research on spiking neural networks (SNNs), most are established on deterministic models. Integrating noise into SNNs leads to biophysically more realistic neural dynamics and may benefit model performance. This work presents the noisy spiking neural network (NSNN) and the noise-driven learning rule (NDL) by introducing a spiking neuron model incorporating noisy neuronal dynamics. Our approach shows how noise may act as a resource for computation and learning and theoretically provides a framework for general SNNs. Moreover, NDL provides an insightful rationale for surrogate gradients. By incorporating various SNN architectures and algorithms, we show that our approach exhibits competitive performance and improved robustness against challenging perturbations compared with deterministic SNNs. Additionally, we demonstrate the utility of the NSNN model for neural coding studies. Overall, NSNN offers a powerful, flexible, and easy-to-use tool for machine learning practitioners and computational neuroscience researchers.
Demystifying Oversmoothing in Attention-Based Graph Neural Networks
Authors: Xinyi Wu, Amir Ajorlou, Zihui Wu, Ali Jadbabaie
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Abstract
Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations. While previous work has established that Graph Convolutional Networks (GCNs) exponentially lose expressive power, it remains controversial whether the graph attention mechanism can mitigate oversmoothing. In this work, we provide a definitive answer to this question through a rigorous mathematical analysis, by viewing attention-based GNNs as nonlinear time-varying dynamical systems and incorporating tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius. We establish that, contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The proposed framework extends the existing results on oversmoothing for symmetric GCNs to a significantly broader class of GNN models. In particular, our analysis accounts for asymmetric, state-dependent and time-varying aggregation operators and a wide range of common nonlinear activation functions, such as ReLU, LeakyReLU, GELU and SiLU.
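The oversmoothing effect described above can be illustrated with a toy sketch. It is not the authors' analysis: the graph (a 6-node ring with self-loops), the scalar features, and the distance-based attention score are all assumptions chosen for illustration, and the layer has no learned weights or nonlinearity. It only shows that repeated state-dependent, row-stochastic aggregation drives node features toward a common value.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(x, neighbors):
    """One state-dependent aggregation step on scalar node features."""
    new_x = []
    for i, nbrs in enumerate(neighbors):
        # Hypothetical attention score: negative squared feature distance.
        weights = softmax([-(x[i] - x[j]) ** 2 for j in nbrs])
        new_x.append(sum(w * x[j] for w, j in zip(weights, nbrs)))
    return new_x


def spread(x):
    """Gap between the largest and smallest node feature."""
    return max(x) - min(x)


# Assumed toy graph: ring of 6 nodes, each attending to itself and both neighbours.
n = 6
neighbors = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(n)]

spreads = [spread(x)]
for _ in range(200):
    x = attention_layer(x, neighbors)
    spreads.append(spread(x))

print(f"initial spread: {spreads[0]:.3f}, spread after 200 layers: {spreads[-1]:.3e}")
```

Because every layer replaces each feature with a convex combination of its neighbours' features, the spread cannot grow, and on a connected graph with self-loops it shrinks toward zero as depth increases.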
Learning Safety Constraints from Demonstrations with Unknown Rewards
Authors: David Lindner, Xin Chen, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause
Abstract
We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in tabular environments and a continuous driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior and that can be transferred to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.
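The safety of a convex set built from safe demonstrations can be motivated by a one-line argument. The abstract gives no notation, so this is a sketch under stated assumptions: the symbols $f(\pi)$ (feature expectations of a policy), $c$, $d$ (an unknown constraint of the form $c^\top f(\pi) \le d$), and $\lambda$ (mixture weights) are introduced here for illustration. It further assumes the constraint is linear in the feature expectations and that the mixed feature expectation can be realised by some policy.

```latex
% Assumption: each demonstration \pi_i is safe, i.e. c^\top f(\pi_i) \le d for i = 1, \dots, k.
% For any weights \lambda_i \ge 0 with \sum_{i=1}^{k} \lambda_i = 1:
\[
  c^\top \Big( \sum_{i=1}^{k} \lambda_i f(\pi_i) \Big)
  = \sum_{i=1}^{k} \lambda_i \, c^\top f(\pi_i)
  \le \sum_{i=1}^{k} \lambda_i \, d
  = d .
\]
```

Under these assumptions, every point in the convex hull of the demonstrations' feature expectations satisfies the same constraint, without the constraint or the demonstrators' rewards being known.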
From Latent Graph to Latent Topology Inference: Differentiable Cell Complex Module
Authors: Claudio Battiloro, Indro Spinelli, Lev Telyatnikov, Michael Bronstein, Simone Scardapane, Paolo Di Lorenzo
Abstract
Latent Graph Inference (LGI) relaxed the reliance of Graph Neural Networks (GNNs) on a given graph topology by dynamically learning it. However, most LGI methods assume a (noisy, incomplete, improvable, ...) input graph to rewire and can solely learn regular graph topologies. In the wake of the success of Topological Deep Learning (TDL), we study Latent Topology Inference (LTI) for learning higher-order cell complexes (with sparse and not regular topology) describing multi-way interactions between data points. To this aim, we introduce the Differentiable Cell Complex Module (DCM), a novel learnable function that computes cell probabilities in the complex to improve the downstream task. We show how to integrate DCM with cell complex message passing network layers and train it in an end-to-end fashion, thanks to a two-step inference procedure that avoids an exhaustive search across all possible cells in the input, thus maintaining scalability. Our model is tested on several homophilic and heterophilic graph datasets and is shown to outperform other state-of-the-art techniques, offering significant improvements especially in cases where an input graph is not provided.
Koopman Kernel Regression
Authors: Petar Bevanda, Max Beier, Armin Lederer, Stefan Sosnowski, Eyke Hüllermeier, Sandra Hirche
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Abstract
Many machine learning approaches for decision making, such as reinforcement learning, rely on simulators or predictive models to forecast the time-evolution of quantities of interest, e.g., the state of an agent or the reward of a policy. Forecasts of such complex phenomena are commonly described by highly nonlinear dynamical systems, making their use in optimization-based decision-making challenging. Koopman operator theory offers a beneficial paradigm for addressing this problem by characterizing forecasts via linear dynamical systems. This makes system analysis and long-term predictions simple -- involving only matrix multiplications. However, the transformation to a linear system is generally non-trivial and unknown, requiring learning-based approaches. While there exists a variety of approaches, they usually lack crucial learning-theoretic guarantees, such that the behavior of the obtained models with increasing data and dimensionality is often unclear. We address the aforementioned by deriving a novel reproducing kernel Hilbert space (RKHS) that solely spans transformations into linear dynamical systems. The resulting Koopman Kernel Regression (KKR) framework enables the use of statistical learning tools from function approximation for novel convergence results and generalization risk bounds under weaker assumptions than existing work. Our numerical experiments indicate advantages over state-of-the-art statistical learning approaches for Koopman-based predictors.
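The claim that long-term prediction reduces to matrix multiplications can be made concrete with a short worked equation. The abstract gives no notation, so the symbols are assumptions: $g$ is a lifting map, $A$ a linear operator on the lifted state, and $C$ a linear read-out.

```latex
% Assumed lifted linear model: z_t = g(x_t), \quad z_{t+1} = A z_t, \quad x_t = C z_t.
% Unrolling the linear recursion H steps gives the H-step forecast:
\[
  z_{t+H} = A^{H} z_t
  \quad\Longrightarrow\quad
  x_{t+H} = C A^{H} g(x_t).
\]
```

Once such a lifting is available, an $H$-step forecast therefore costs one evaluation of $g$ followed by matrix products, instead of $H$ evaluations of a nonlinear simulator.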
Abstract
This study focuses on the topic of offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions. Instead, the agent is provided with pre-existing offline trajectories and human preferences between pairs of trajectories to extract the dynamics and task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would involve using preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires the separate learning of a scalar reward function, which is assumed to be an information bottleneck. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO further integrates a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available at https://github.com/bkkgbkjb/OPPO .
Gaussian Processes with State-Dependent Noise for Stochastic Control
Authors: Marcel Menner, Karl Berntorp
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Abstract
This paper considers a stochastic control framework, in which the residual model uncertainty of the dynamical system is learned using a Gaussian Process (GP). In the proposed formulation, the residual model uncertainty consists of a nonlinear function and state-dependent noise. The proposed formulation uses a posterior-GP to approximate the residual model uncertainty and a prior-GP to account for state-dependent noise. The two GPs are interdependent and are thus learned jointly using an iterative algorithm. Theoretical properties of the iterative algorithm are established. Advantages of the proposed state-dependent formulation include (i) faster convergence of the GP estimate to the unknown function as the GP learns which data samples are more trustworthy and (ii) an accurate estimate of state-dependent noise, which can, e.g., be useful for a controller or decision-maker to determine the uncertainty of an action. Simulation studies highlight these two advantages.
Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
Abstract
Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.
Understanding Idea Creation in Collaborative Discourse through Networks: The Joint Attention-Interaction-Creation (AIC) Framework
Abstract
In Computer-Supported Collaborative Learning, ideas generated through collaborative discourse are informative indicators of students' learning and collaboration. Idea creation is a product of emergent and interactive socio-cognitive endeavors. Therefore, analyzing ideas requires capturing contextual information in addition to the ideas themselves. In this paper, we propose the Joint Attention-Interaction-Creation (AIC) framework, which captures important dynamics in collaborative discourse, from attention and interaction to creation. The framework was developed from the networked lens, informed by natural language processing techniques, and inspired by socio-semantic network analysis. A case study was included to exemplify the framework's application in classrooms and to illustrate its potential in broader contexts.
Keyword: adaptive
Adaptive Data Analysis in a Balanced Adversarial Model
Authors: Kobbi Nissim, Uri Stemmer, Eliad Tsfadia
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Abstract
In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations to a sequence of adaptively chosen statistical queries with respect to $D$. Hardt and Ullman (FOCS 2014) and Steinke and Ullman (COLT 2015) showed that in general, it is computationally hard to answer more than $\Theta(n^2)$ adaptive queries, assuming the existence of one-way functions. However, these negative results strongly rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, as the analyst, who chooses the adaptive queries, also chooses the underlying distribution $D$. This imbalance raises questions with respect to the applicability of the obtained hardness results -- an analyst who has complete knowledge of the underlying distribution $D$ would have little need, if at all, to issue statistical queries to a mechanism which only holds a finite number of samples from $D$. We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separated algorithms: The \emph{sampler} who is the entity that chooses the distribution and provides the samples to the mechanism, and the \emph{analyst} who chooses the adaptive queries, but does not have a prior knowledge of the underlying distribution. We improve the quality of previous lower bounds by revisiting them using an efficient \emph{balanced} adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded \emph{balanced} adversary that has the structure of all known attacks, implies the existence of public-key cryptography.
Adaptive observer of state variables of a nonlinear time varying system with unknown constant parameters
Authors: Olga Kozachek, Alexey Bobtsov, Nikolay Nikolaev
Abstract
The paper proposes an adaptive observer of the state vector of a nonlinear time varying system based on measurements of the output variable. The problem is solved under the assumption that the control matrix (vector) and the nonlinear component of the equation of state of the system contain unknown constant parameters. The adaptive observer is developed using the generalized parameter estimation based observer (GPEBO) method proposed in [1]. During the synthesis of the observer, a preliminary parametrization of the original nonlinear system is carried out. Then the resulting system is reduced to a linear regression model. At the next stage, the unknown constant regression parameters are estimated using the least squares method with a forgetting factor [2, 3]. The article develops the result proposed by the authors in [4], where a linear non-stationary system containing unknown parameters in a control matrix (vector) was considered. The present result extends [4] to the case when the equation of state of the system contains a partially unknown nonlinearity.
Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
Authors: Xiang Ji, Gen Li
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Abstract
A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
Meta Adaptive Task Sampling for Few-Domain Generalization
Authors: Zheyan Shen, Han Yu, Peng Cui, Jiashuo Liu, Xingxuan Zhang, Linjun Zhou, Furui Liu
Abstract
To ensure out-of-distribution (OOD) generalization performance, traditional domain generalization (DG) methods resort to training on data from multiple sources with different underlying distributions. The success of those DG methods largely depends on the availability of diverse training distributions. However, obtaining enough heterogeneous data usually takes great effort because of high expenses, privacy issues, or the scarcity of data. Thus an interesting yet seldom investigated problem arises: how to improve the OOD generalization performance when the perceived heterogeneity is limited. In this paper, we instantiate a new framework called few-domain generalization (FDG), which aims to learn a generalizable model from very few domains of novel tasks with the knowledge acquired from previous learning experiences on base tasks. Moreover, we propose a Meta Adaptive Task Sampling (MATS) procedure to differentiate base tasks according to their semantic and domain-shift similarity to the novel task. Empirically, we show that the newly introduced FDG framework can substantially improve the OOD generalization performance on the novel task, and that further combining MATS with episodic training can outperform several state-of-the-art DG baselines on widely used benchmarks like PACS and DomainNet.
CUEING: A pioneer work of encoding human gaze for autonomous driving
Authors: Linfeng Liang, Yiran Wang, Yao Deng, Jianchao Lu, Chen Wang, Xi Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent analysis of incidents involving Autonomous Driving Systems (ADS) has shown that the decision-making process of ADS can be significantly different from that of human drivers. To improve the performance of ADS, it may be helpful to incorporate the human decision-making process, particularly the signals provided by the human gaze. There are many existing works to create human gaze datasets and predict the human gaze using deep learning models. However, current datasets of human gaze are noisy and include irrelevant objects that can hinder model training. Additionally, existing CNN-based models for predicting human gaze lack generalizability across different datasets and driving conditions, and many models have a centre bias in their prediction such that the gaze tends to be generated in the centre of the gaze map. To address these gaps, we propose an adaptive method for cleansing existing human gaze datasets and a robust convolutional self-attention gaze prediction model. Our quantitative metrics show that our cleansing method improves models' performance by up to 7.38% and generalizability by up to 8.24% compared to those trained on the original datasets. Furthermore, our model demonstrates an improvement of up to 12.13% in terms of generalizability compared to the state-of-the-art (SOTA) models. Notably, it achieves these gains while conserving up to 98.12% of computation resources.
Accelerated K-Serial Stable Coalition for Dynamic Capture and Resource Defense
Abstract
Coalitions are an important means for multi-robot systems to collaborate on common tasks. An effective and adaptive coalition strategy is essential for the online performance in dynamic and unknown environments. In this work, the problem of territory defense by large-scale heterogeneous robotic teams is considered. The tasks include surveillance, capture of dynamic targets, and perimeter defense over valuable resources. Since each robot can choose among many tasks, it remains a challenging problem to jointly coordinate these robots such that the overall utility is maximized. This work proposes a generic coalition strategy called K-serial stable coalition algorithm (KS-COAL). Unlike centralized approaches, it is distributed and anytime, meaning that only local communication is required and a K-serial Nash-stable solution is ensured. Furthermore, to accelerate adaptation to dynamic targets and resource distribution that are only perceived online, a heterogeneous graph attention network (HGAN)-based heuristic is learned to select more appropriate parameters and promising initial solutions during local optimization. Compared with manual heuristics or end-to-end predictors, it is shown to both improve online adaptability and retain the quality guarantee. The proposed methods are validated rigorously via large-scale simulations with hundreds of robots, against several strong baselines including GreedyNE and FastMaxSum.
Healing Unsafe Dialogue Responses with Weak Supervision Signals
Authors: Zi Liang, Pinghui Wang, Ruofei Zhang, Shuo Zhang, Xiaofan Ye, Yi Huang, Junlan Feng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Recent years have seen increasing concerns about the unsafe response generation of large-scale dialogue systems, where agents learn offensive or biased behaviors from real-world corpora. Some methods have been proposed to address this issue by detecting and replacing unsafe training examples in a pipeline style. Though effective, they suffer from a high annotation cost and adapt poorly to unseen scenarios as well as adversarial attacks. Besides, neglecting to provide safe responses (e.g. simply replacing with templates) causes the information-missing problem of dialogues. To address these issues, we propose an unsupervised pseudo-label sampling method, TEMP, that can automatically assign potential safe responses. Specifically, our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and distributed in the tail. Extensive experiments in chitchat and task-oriented dialogues show that our TEMP outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
Dynamic Enhancement Network for Partial Multi-modality Person Re-identification
Authors: Aihua Zheng, Ziling He, Zi Wang, Chenglong Li, Jin Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Many existing multi-modality studies are based on the assumption of modality integrity. However, the problem of missing arbitrary modalities is very common in real life, and it is understudied yet important in the task of multi-modality person re-identification (Re-ID). To this end, we design a novel dynamic enhancement network (DENet), which allows missing arbitrary modalities while maintaining the representation ability of multiple modalities, for partial multi-modality person Re-ID. To be specific, the multi-modal representation of the RGB, near-infrared (NIR) and thermal-infrared (TIR) images is learned by three branches, in which the information of missing modalities is recovered by the feature transformation module. Since the missing state might be changeable, we design a dynamic enhancement module, which dynamically enhances modality features according to the missing state in an adaptive manner, to improve the multi-modality representation. Extensive experiments on the multi-modality person Re-ID dataset RGBNT201 and the vehicle Re-ID dataset RGBNT100, compared with state-of-the-art methods, verify the effectiveness of our method in complex and changeable environments.
Multi-query Vehicle Re-identification: Viewpoint-conditioned Network, Unified Dataset and New Metric
Abstract
Existing vehicle re-identification methods mainly rely on a single query, which has limited information for vehicle representation and thus significantly hinders the performance of vehicle Re-ID in complicated surveillance networks. In this paper, we propose a more realistic and easily accessible task, called multi-query vehicle Re-ID, which leverages multiple queries to overcome the viewpoint limitation of a single one. Based on this task, we make three major contributions. First, we design a novel viewpoint-conditioned network (VCNet), which adaptively combines the complementary information from different vehicle viewpoints, for multi-query vehicle Re-ID. Moreover, to deal with the problem of missing vehicle viewpoints, we propose a cross-view feature recovery module which recovers the features of the missing viewpoints by learning the correlation between the features of available and missing viewpoints. Second, we create a unified benchmark dataset, taken by 6142 cameras from a real-life transportation surveillance system, with comprehensive viewpoints and a large number of crossed scenes of each vehicle for multi-query vehicle Re-ID evaluation. Finally, we design a new evaluation metric, called mean cross-scene precision (mCSP), which measures the ability of cross-scene recognition by suppressing the positive samples with similar viewpoints from the same camera. Comprehensive experiments validate the superiority of the proposed method against other methods, as well as the effectiveness of the designed metric in the evaluation of multi-query vehicle Re-ID.
PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion
Abstract
The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely the "transport equation". Building upon this, we propose a general framework that introduces adaptive distributional diffusion into the transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as PDE+ (PDE with Adaptive Distributional Diffusion), which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated in extensive settings, including clean samples and various corruptions, demonstrating its superior performance compared to SOTA methods.
Abstract
Continuous level Monte Carlo is an unbiased, continuous version of the celebrated multilevel Monte Carlo method. The approximation level is assumed to be continuous resulting in a stochastic process describing the quantity of interest. Continuous level Monte Carlo methods allow naturally for samplewise adaptive mesh refinements, which are indicated by goal-oriented error estimators. The samplewise refinement levels are drawn in the estimator from an exponentially-distributed random variable. Unfortunately in practical examples this results in higher costs due to high variance in the samples. In this paper we propose a variant of continuous level Monte Carlo, where a quasi Monte Carlo sequence is utilized to "sample" the exponential random variable. We provide a complexity theorem for this novel estimator and show that this results theoretically and practically in a variance reduction of the whole estimator.
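The idea of using a quasi Monte Carlo sequence to "sample" the exponential random variable can be sketched as follows. The abstract does not say which low-discrepancy sequence or which rate is used, so the base-2 van der Corput sequence, the rate value, and the function names here are assumptions for illustration. The sketch covers only the level-sampling step, not the full continuous level Monte Carlo estimator.

```python
import math


def van_der_corput(index, base=2):
    """Return the index-th point (index >= 1) of the van der Corput sequence."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def exponential_from_uniform(u, rate):
    """Inverse-CDF transform mapping u in [0, 1) to a quantile of Exp(rate)."""
    return -math.log(1.0 - u) / rate


rate = 2.0  # hypothetical rate of the exponential level distribution
levels = [exponential_from_uniform(van_der_corput(i), rate) for i in range(1, 1025)]
qmc_mean = sum(levels) / len(levels)

print(f"QMC estimate of the mean level: {qmc_mean:.4f} (exact value {1.0 / rate})")
```

Because the low-discrepancy points cover the unit interval evenly, the transformed values cover the quantiles of the exponential distribution evenly as well, which is the mechanism behind the variance reduction the abstract refers to.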
Abstract
In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate some toy sources, demonstrating that even if modern neural networks considerably enhance the compression performance of SQ with nonlinear transform, there is still an insurmountable chasm between SQ and VQ. Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC
L1 Adaptive Resonance Ratio Control for Series Elastic Actuator with Guaranteed Transient Performance
Abstract
To eliminate the static error, overshoot, and vibration of the series elastic actuator (SEA) position control, the resonance ratio control (RRC) algorithm is improved based on the L1 adaptive control (L1AC) method. Based on the analysis of the factors affecting the control performance of SEA, the algorithm schema is proposed, the stability is proved, and the main control parameters are analyzed. The algorithm schema is further improved with gravity compensation, and the predicted error and reference error are reduced to guarantee transient performance. Finally, the effectiveness of the algorithm is validated by simulation and platform experiments. The simulation and experiment results show that the algorithm has good adaptability, can improve transient control performance, and can effectively handle the static error, overshoot, and vibration. In addition, when a link-side collision occurs, the algorithm automatically reduces the link speed and limits the motor current, thus protecting the humans and the SEA itself, owing to the low-pass filter characteristic of L1AC with respect to disturbances.
Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement
Abstract
Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance of different domains makes the cross-domain performance drop severely, preventing gaze estimation deployment in real-world applications. Among all the factors, ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, while collecting large ranges of data is expensive. This work proposes an effective model training pipeline consisting of a training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages the single-image 3D reconstruction to expand the range of the head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, we show that the model only using monocular-reconstructed synthetic training data can perform comparably to real data with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at https://github.com/ut-vision/AdaptiveGaze.
SocialLight: Distributed Cooperation Learning towards Network-Wide Traffic Signal Control
Abstract
Many recent works have turned to multi-agent reinforcement learning (MARL) for adaptive traffic signal control to optimize the travel time of vehicles over large urban networks. However, achieving effective and scalable cooperation among junctions (agents) remains an open challenge, as existing methods often rely on extensive, non-generalizable reward shaping or on non-scalable centralized learning. To address these problems, we propose a new MARL method for traffic signal control, SocialLight, which learns cooperative traffic control policies by distributedly estimating the individual marginal contribution of agents on their local neighborhood. SocialLight relies on the Asynchronous Actor Critic (A3C) framework, and makes learning scalable by learning a locally-centralized critic conditioned over the states and actions of neighboring agents, used by agents to estimate individual contributions by counterfactual reasoning. We further introduce important modifications to the advantage calculation that help stabilize policy updates. These modifications decouple the impact of the neighbors' actions on the computed advantages, thereby reducing the variance in the gradient updates. We benchmark our trained network against state-of-the-art traffic signal control methods on standard benchmarks in two traffic simulators, SUMO and CityFlow. Our results show that SocialLight exhibits improved scalability to larger road networks and better performance across usual traffic metrics.
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Authors: Chia-Wen Kuo, Zsolt Kira
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how can we efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous analyses to demonstrate the importance of each part of our design.
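Editor's illustration (not from the paper): the two-stage aggregation described above, first over tokens inside each view and then over the views, can be sketched with plain dot-product attention. The functions `attend` and `hierarchical_aggregate`, the query vector, and the toy features are all hypothetical; HAAV's actual decoder is a learned Transformer module and will differ in detail.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, vectors):
    # Softmax-weighted average of `vectors`, scored by dot product with `query`.
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def hierarchical_aggregate(query, views):
    # Stage 1: token-level pooling inside each view.
    summaries = [attend(query, tokens)[0] for tokens in views]
    # Stage 2: view-level pooling across the per-view summaries.
    return attend(query, summaries)


# Two toy views, each a list of 2-D token features.
query = [1.0, 0.0]
views = [
    [[2.0, 0.0], [0.0, 2.0]],  # view 0 contains a token aligned with the query
    [[0.0, 1.0], [0.0, 3.0]],  # view 1 contains no aligned token
]
output, view_weights = hierarchical_aggregate(query, views)
print(output, view_weights)
```

In this toy setup the view whose tokens match the query receives the larger view-level weight, which is the adaptive weighting behavior the abstract describes.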
Keyword: quantization
On Architectural Compression of Text-to-Image Diffusion Models
Authors: Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
Abstract
Exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To resolve this issue, recent research on efficient SDMs has prioritized reducing the number of sampling steps and utilizing network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of parameters, MACs per sampling step, and latency. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge and achieve competitive results against larger multi-billion parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models in personalized generation with DreamBooth finetuning.
NVTC: Nonlinear Vector Transform Coding
Abstract
In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate some toy sources, demonstrating that even if modern neural networks considerably enhance the compression performance of SQ with nonlinear transform, there is still an insurmountable chasm between SQ and VQ. Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC
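Editor's illustration (not from the paper): the gap between SQ and VQ at equal rate can be seen on a toy correlated 2-D Gaussian source. The sketch below compares a fixed 1-bit-per-dimension scalar quantizer (a 2x2 product grid) with a 4-codeword vector quantizer fitted by Lloyd's algorithm starting from that same grid. The source, the grid levels, and all function names are assumptions made for the example; this is neither NVTC nor a nonlinear transform coder.

```python
import random


def correlated_source(n, rho=0.9, seed=0):
    # Draw n samples from a zero-mean 2-D Gaussian with correlation rho.
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        points.append((x, y))
    return points


def sq_dist(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2


def nearest(p, codebook):
    # Index of the closest codeword under squared Euclidean distance.
    return min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))


def mse(points, codebook):
    return sum(sq_dist(p, codebook[nearest(p, codebook)]) for p in points) / len(points)


def lloyd(points, codebook, iterations=20):
    # Lloyd's algorithm: alternate nearest-codeword assignment and centroid update.
    codebook = list(codebook)
    for _ in range(iterations):
        sums = [[0.0, 0.0, 0] for _ in codebook]
        for p in points:
            k = nearest(p, codebook)
            sums[k][0] += p[0]
            sums[k][1] += p[1]
            sums[k][2] += 1
        # Keep the old codeword if a cell received no samples.
        codebook = [(s[0] / s[2], s[1] / s[2]) if s[2] else c
                    for s, c in zip(sums, codebook)]
    return codebook


points = correlated_source(2000)
# Scalar quantization: each coordinate independently mapped to -0.8 or +0.8.
sq_codebook = [(a, b) for a in (-0.8, 0.8) for b in (-0.8, 0.8)]
# Vector quantization: same number of codewords, placed jointly in 2-D.
vq_codebook = lloyd(points, sq_codebook)
sq_mse = mse(points, sq_codebook)
vq_mse = mse(points, vq_codebook)
print(f"SQ MSE: {sq_mse:.4f}  VQ MSE: {vq_mse:.4f}")
```

Because Lloyd's iterations never increase the training distortion and start from the scalar grid, the fitted VQ codebook is at least as good as the SQ grid on these samples at the same 2-bit rate.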
Keyword: efficient
Deep Learning-enabled MCMC for Probabilistic State Estimation in District Heating Grids
Adaptive Data Analysis in a Balanced Adversarial Model
Trends and Challenges Towards an Effective Data-Driven Decision Making in UK SMEs: Case Studies and Lessons Learnt from the Analysis of 85 SMEs
Foundational Models for Malware Embeddings Using Spatio-Temporal Parallel Convolutional Networks
On Semantically-Deterministic Automata
Improving selective classification performance of deep neural networks through post-hoc logit normalization and temperature scaling
Task-aware Distributed Source Coding under Dynamic Bandwidth
Post-processing Private Synthetic Data for Improving Utility on Selected Measures
Hybrid Eigensolvers for Nuclear Configuration Interaction Calculations
Deep Reinforcement Learning with Plasticity Injection
Non-Parametric Learning of Stochastic Differential Equations with Fast Rates of Convergence
Lightweight Learner for Shared Knowledge Lifelong Learning
Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning
GFairHint: Improving Individual Fairness for Graph Neural Networks via Fairness Hint
Vehicle-in-Virtual-Environment (VVE)
How to escape sharp minima
Accelerated solutions of convection-dominated partial differential equations using implicit feature tracking and empirical quadrature
Mixture-of-Expert Conformer for Streaming Multilingual ASR
PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning
Asking Before Action: Gather Information in Embodied Decision Making with Language Models
Privacy Protectability: An Information-theoretical Approach
Rethink Diversity in Deep Learning Testing
Efficient Neural Music Generation
Enhancing the Ranking Context of Dense Retrieval Methods through Reciprocal Nearest Neighbors
A Tutorial on Holographic MIMO Communications--Part I: Channel Modeling and Channel Estimation
TransWorldNG: Traffic Simulation via Foundation Model
Robust Ante-hoc Graph Explainer using Bilevel Optimization
T2TD: Text-3D Generation Model based on Prior Knowledge Guidance
High-Similarity-Pass Attention for Single Image Super-Resolution
Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification
AUC Optimization from Multiple Unlabeled Datasets
Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents
On Architectural Compression of Text-to-Image Diffusion Models
Lucy-SKG: Learning to Play Rocket League Efficiently Using Deep Reinforcement Learning
A Burton-Miller-type boundary element method based on a hybrid integral representation and its application to cavity scattering
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Flexible Spectrum Orchestration of Carrier Aggregation for 5G-Advanced
MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation
MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization
Sample and Predict Your Latent: Modality-free Sequential Disentanglement via Contrastive Estimation
Mask Attack Detection Using Vascular-weighted Motion-robust rPPG Signals
How to Turn Your Knowledge Graph Embeddings into Generative Models via Probabilistic Circuits
Online learning of long-range dependencies
Online and Streaming Algorithms for Constrained $k$-Submodular Maximization
GenerateCT: Text-Guided 3D Chest CT Generation
Local Randomized Neural Networks with Discontinuous Galerkin Methods for Diffusive-Viscous Wave Equation
Leveraging Human Feedback to Evolve and Discover Novel Emergent Behaviors in Robot Swarms
Understanding the Capabilities of Large Language Models for Automated Planning
A New Era of Mobility: Exploring Digital Twin Applications in Autonomous Vehicular Systems
On Computing Universal Plans for Partially Observable Multi-Agent Path Finding
C-MCTS: Safe Planning with Monte Carlo Tree Search
Persistent Laplacian-enhanced Algorithm for Scarcely Labeled Data Classification
Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
UDPM: Upsampling Diffusion Probabilistic Models
DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Fine-Grained Complexity Analysis of Multi-Agent Path Finding on 2D Grids
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Keyword: faster
Exploring Automatically Perturbed Natural Language Explanations in Relation Extraction
How to escape sharp minima
PRIMP: PRobabilistically-Informed Motion Primitives for Efficient Affordance Learning from Demonstration
Extracting Text Representations for Terms and Phrases in Technical Domains
Neural Characteristic Activation Value Analysis for Improved ReLU Network Feature Learning
A Fast Algorithm for Consistency Checking Partially Ordered Time
Improved Algorithms for Allen's Interval Algebra by Dynamic Programming with Sublinear Partitioning
Dynamic Inter-treatment Information Sharing for Heterogeneous Treatment Effects Estimation
NVTC: Nonlinear Vector Transform Coding
Gaussian Processes with State-Dependent Noise for Stochastic Control
Distributed TD(0) with Almost No Communication
Voyager: An Open-Ended Embodied Agent with Large Language Models
Keyword: mobile
Drivers of Mobile Payment Acceptance: The Impact of Network Externalities
Automatic off-line design of robot swarms: exploring the transferability of control software and design methods across different platforms
A New Era of Mobility: Exploring Digital Twin Applications in Autonomous Vehicular Systems
Keyword: pruning
On Semantically-Deterministic Automata
Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Keyword: diffusion
Non-Parametric Learning of Stochastic Differential Equations with Fast Rates of Convergence
Differentially Private Synthetic Data via Foundation Model APIs 1: Images
Unsupervised Semantic Correspondence Using Stable Diffusion
Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps
Manifold Diffusion Fields
Reversible and irreversible bracket-based dynamics for deep graph neural networks
Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models
Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion
Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition
Knowledge Diffusion for Distillation
Efficient Neural Music Generation
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
On Architectural Compression of Text-to-Image Diffusion Models
PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion
Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7
Anomaly Detection with Conditioned Denoising Diffusion Models
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score
GenerateCT: Text-Guided 3D Chest CT Generation
Local Randomized Neural Networks with Discontinuous Galerkin Methods for Diffusive-Viscous Wave Equation
CACTUS: A Computational Framework for Generating Realistic White Matter Microstructure Substrates
Unifying GANs and Score-Based Diffusion as Generative Particle Models
ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models
ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation
UDPM: Upsampling Diffusion Probabilistic Models
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs
Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos
Break-A-Scene: Extracting Multiple Concepts from a Single Image
NAP: Neural 3D Articulation Prior
Parallel Sampling of Diffusion Models
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Keyword: dynamic
Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation
Deep Reinforcement Learning with Plasticity Injection
Semi-global Exponential Stability for Dual Quaternion Based Rigid-Body Tracking Control
Automated Driving Architecture and Operation of a Light Commercial Vehicle
Learning Lagrangian Fluid Mechanics with E($3$)-Equivariant Graph Neural Networks
Reversible and irreversible bracket-based dynamics for deep graph neural networks
Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and Training with Non-Linear Diffusion
Accelerated solutions of convection-dominated partial differential equations using implicit feature tracking and empirical quadrature
FedHC: A Scalable Federated Learning Framework for Heterogeneous and Resource-Constrained Clients
Accelerated K-Serial Stable Coalition for Dynamic Capture and Resource Defense
Analysis and tuning of a three-term DMC
TransWorldNG: Traffic Simulation via Foundation Model
Dynamic Enhancement Network for Partial Multi-modality Person Re-identification
Residual Dynamics Learning for Trajectory Tracking for Multi-rotor Aerial Vehicles
Lucy-SKG: Learning to Play Rocket League Efficiently Using Deep Reinforcement Learning
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
A continuum and computational framework for viscoelastodynamics: II. Strain-driven and energy-momentum consistent schemes
Stochastic Modified Equations and Dynamics of Dropout Algorithm
Robust asymptotic observer of motion states with nonlinear friction
Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
Camera-Incremental Object Re-Identification with Identity Knowledge Evolution
MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization
Improved Algorithms for Allen's Interval Algebra by Dynamic Programming with Sublinear Partitioning
Dynamic Inter-treatment Information Sharing for Heterogeneous Treatment Effects Estimation
Illustrative Motion Smoothing for Attention Guidance in Dynamic Visualizations
Exploiting Noise as a Resource for Computation and Learning in Spiking Neural Networks
Demystifying Oversmoothing in Attention-Based Graph Neural Networks
Learning Safety Constraints from Demonstrations with Unknown Rewards
From Latent Graph to Latent Topology Inference: Differentiable Cell Complex Module
Koopman Kernel Regression
Beyond Reward: Offline Preference-guided Policy Optimization
Gaussian Processes with State-Dependent Noise for Stochastic Control
Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
Understanding Idea Creation in Collaborative Discourse through Networks: The Joint Attention-Interaction-Creation (AIC) Framework
Keyword: adaptive
Adaptive Data Analysis in a Balanced Adversarial Model
Adaptive observer of state variables of a nonlinear time varying system with unknown constant parameters
Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
Meta Adaptive Task Sampling for Few-Domain Generalization
CUEING: A pioneer work of encoding human gaze for autonomous driving
Accelerated K-Serial Stable Coalition for Dynamic Capture and Resource Defense
Healing Unsafe Dialogue Responses with Weak Supervision Signals
Dynamic Enhancement Network for Partial Multi-modality Person Re-identification
Multi-query Vehicle Re-identification: Viewpoint-conditioned Network, Unified Dataset and New Metric
PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion
Quasi continuous level Monte Carlo
NVTC: Nonlinear Vector Transform Coding
L1 Adaptive Resonance Ratio Control for Series Elastic Actuator with Guaranteed Transient Performance
Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement
SocialLight: Distributed Cooperation Learning towards Network-Wide Traffic Signal Control
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Keyword: quantization
On Architectural Compression of Text-to-Image Diffusion Models
NVTC: Nonlinear Vector Transform Coding