Abstract
With the electrification of transportation, the rising uptake of electric vehicles (EVs) might stress distribution networks significantly, degrading their performance and jeopardizing their stability. To accommodate these new loads cost-effectively, modern power grids require coordinated or ``smart'' charging strategies capable of optimizing EV charging scheduling in a scalable and efficient fashion. With this in view, the present work focuses on reservation management programs for large-scale, networked EV charging stations. We formulate a time-coupled binary optimization problem that maximizes EV users' total welfare gain while accounting for the network's available power capacity and stations' occupancy limits. To tackle the problem at scale while retaining high solution quality, a data-driven optimization framework combining techniques from the fields of Deep Learning and Approximation Algorithms is introduced. The framework's key ingredient is a novel input-output processing scheme for neural networks that allows direct extrapolation to problem sizes substantially larger than those included in the training set. Extensive numerical simulations based on synthetic and real-world data traces verify the effectiveness and superiority of the presented approach over two representative scheduling algorithms. Lastly, we round out the contributions by listing several immediate extensions to the proposed framework and outlining prospects for further exploration.
PDP: Parameter-free Differentiable Pruning is All You Need
Abstract
DNN pruning is a popular way to reduce the size of a model, improve inference latency, and minimize power consumption on DNN accelerators. However, existing approaches can be too complex, expensive, or ineffective to apply across a variety of vision/language tasks and DNN architectures, or to honor structured pruning constraints. In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art trade-offs among model size, accuracy, and training cost. PDP uses a dynamic function of weights during training to generate soft pruning masks for the weights in a parameter-free manner for a given pruning target. While differentiable, the simplicity and efficiency of PDP make it universal enough to deliver state-of-the-art random/structured/channel pruning results on various vision and natural language tasks. For example, for MobileNet-v1, PDP achieves 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 1.7% higher accuracy than the state-of-the-art algorithms. PDP also yields over 83.1% accuracy on Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the next best existing technique reaches 81.5% accuracy. In addition, PDP can be applied to structured pruning, such as N:M pruning and channel pruning. For 1:4 structured pruning of ResNet18, PDP improves top-1 ImageNet1k accuracy by over 3.6% over the state-of-the-art. For channel pruning of ResNet50, PDP reduces top-1 ImageNet1k accuracy by 0.6% from the state-of-the-art.
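The soft-mask mechanism above can be sketched in a few lines; this is an illustrative toy only (the quantile threshold, temperature, and sigmoid surrogate are our assumptions for exposition, not PDP's exact formulation):

```python
import math

def soft_prune_masks(weights, target_sparsity, temp=1e-2):
    """Parameter-free soft pruning masks: weights whose squared magnitude
    falls below the threshold implied by the sparsity target get masks
    near 0, the rest get masks near 1. Toy sketch, not PDP's exact form."""
    sq = sorted(w * w for w in weights)
    k = int(target_sparsity * len(sq))
    t = sq[min(k, len(sq) - 1)]  # threshold at the target-sparsity quantile
    # sigmoid((w^2 - t) / temp): a differentiable surrogate for w^2 >= t
    return [1.0 / (1.0 + math.exp(-(w * w - t) / temp)) for w in weights]
```

During training the threshold would be recomputed from the current weights at every step, so the mask is a dynamic function of the weights with no extra trainable parameters.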
Learning Agent Interactions from Density Evolution in 3D Regions With Obstacles
Authors: Amoolya Tirumalai, Christos N. Mavridis, John S. Baras
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract
In this work, we study the inverse problem of identifying complex flocking dynamics in a domain cluttered with obstacles. We draw inspiration from animal flocks, which move in complex ways with capabilities far beyond what current robots can do. Owing to the difficulty of observing and recovering the trajectories of individual agents, we focus on the dynamics of their probability densities, which are governed by partial differential equations (PDEs), namely compressible Euler equations subject to non-local forces. We formulate the inverse problem of learning interactions as a PDE-constrained optimization problem of minimizing the squared Hellinger distance between the histogram of the flock and the distribution associated with our PDEs. The numerical methods used to efficiently solve the PDE-constrained optimization problem are described. Realistic flocking data are simulated using the Boids model of flocking agents, which differs in nature from the reconstruction models used in our PDEs. Our analysis and simulated experiments show that the behavior of cohesive flocks can be recovered accurately with approximate PDE solutions.
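The objective named above is easy to state concretely for histograms; a minimal sketch (bin construction and normalization are assumed to happen elsewhere):

```python
import math

def squared_hellinger(p, q):
    """Squared Hellinger distance between two discrete distributions on the
    same bins, e.g. a flock histogram vs. a density from a PDE model."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
```

It is bounded in [0, 1] and vanishes exactly when the two distributions agree, which makes it a well-behaved mismatch term inside a PDE-constrained objective.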
Efficient Vertical Federated Learning with Secure Aggregation
Authors: Xinchi Qiu, Heng Pan, Wanru Zhao, Chenyang Ma, Pedro Porto Buarque de Gusmão, Nicholas D. Lane
Abstract
The majority of work in privacy-preserving federated learning (FL) has focused on horizontally partitioned datasets, where clients share the same sets of features and can train complete models independently. However, in many interesting problems, such as financial fraud detection and disease detection, individual data points are scattered across different clients/organizations, a setting known as vertical federated learning. Solutions for this type of FL require the exchange of gradients between participants and rarely consider privacy and security concerns, posing a potential risk of privacy leakage. In this work, we present a novel design for training vertical FL securely and efficiently using state-of-the-art security modules for secure aggregation. We demonstrate empirically that our method does not impact training performance whilst obtaining a 9.1e2 to 3.8e4 times speedup compared to homomorphic encryption (HE).
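To illustrate what secure aggregation buys over plain gradient exchange, here is a toy pairwise-masking sketch (the classic SecAgg idea; the paper's actual security modules, key agreement, and dropout handling are more involved):

```python
import random

def masked_updates(updates, seed=0):
    """Pairwise additive masking: for each client pair i<j, a shared random
    mask is added by i and subtracted by j, so each individual update is
    hidden while the server-side sum is (numerically) unchanged. Toy sketch;
    the shared RNG stands in for pairwise-agreed secrets."""
    rng = random.Random(seed)
    masked = [list(u) for u in updates]
    n, d = len(updates), len(updates[0])
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(d):
                m = rng.uniform(-1.0, 1.0)  # shared mask for pair (i, j)
                masked[i][k] += m
                masked[j][k] -= m
    return masked
```

The server only ever sees the masked vectors, yet summing them recovers the aggregate, which is why this route avoids the cost of homomorphic encryption.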
A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model
Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Abstract
In this work, we explore Parameter-Efficient-Learning (PEL) techniques to repurpose a General-Purpose-Speech (GSM) model for Arabic dialect identification (ADI). Specifically, we investigate different setups to incorporate trainable features into a multi-layer encoder-decoder GSM formulation under frozen pre-trained settings. Our architecture includes residual adapters and model reprogramming (input prompting). We design a token-level label mapping to condition the GSM for ADI, which is challenging due to the high variation in vocabulary and pronunciation among the numerous regional dialects. We achieve new state-of-the-art accuracy on the ADI-17 dataset with vanilla fine-tuning. We further reduce the training budget with the PEL method, which performs within 1.86% accuracy of fine-tuning using only 2.5% of (extra) network trainable parameters. Our study demonstrates how to identify Arabic dialects using a small dataset and limited computation, with open-source code and pre-trained models.
Constrained Environment Optimization for Prioritized Multi-Agent Navigation
Authors: Zhan Gao, Amanda Prorok
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract
Traditional approaches to the design of multi-agent navigation algorithms consider the environment as a fixed constraint, despite the influence of spatial constraints on agents' performance. Yet hand-designing conducive environment layouts is inefficient and potentially expensive. The goal of this paper is to consider the environment as a decision variable in a system-level optimization problem, where both agent performance and environment cost are incorporated. Towards this end, we propose the novel problems of unprioritized and prioritized environment optimization, where the former treats all agents equally and the latter accounts for agent priorities. We show, through formal proofs, under which conditions the environment can change while guaranteeing completeness (i.e., all agents reach their goals), and analyze the role of agent priorities in environment optimization. We proceed to impose real-world constraints on the environment optimization and formulate it mathematically as a constrained stochastic optimization problem. Since the relation between agents, environment, and performance is challenging to model, we leverage reinforcement learning to develop a model-free solution and a primal-dual mechanism to handle constraints. Distinct information-processing architectures are integrated for various implementation scenarios, including online/offline optimization and discrete/continuous environments. Numerical results corroborate the theory and demonstrate the validity and adaptability of our approach.
On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation
Abstract
In this paper, we study the statistical efficiency of Reinforcement Learning in Mean-Field Control (MFC) and Mean-Field Games (MFG) with general function approximation. We introduce a new concept called Mean-Field Model-Based Eluder Dimension (MBED), which subsumes a rich family of Mean-Field RL problems. Additionally, we propose algorithms based on Optimistic Maximal Likelihood Estimation, which can return an $\epsilon$-optimal policy for MFC or an $\epsilon$-Nash Equilibrium policy for MFG, with sample complexity polynomial in the relevant parameters and independent of the number of states, the number of actions, and the number of agents. Notably, our results only require a mild Lipschitz-continuity assumption on the transition dynamics and avoid the strong structural assumptions of previous work. Finally, in the tabular setting, given access to a generative model, we establish an exponential lower bound for the MFC setting, while providing a novel sample-efficient model-elimination algorithm to approximate equilibrium in the MFG setting. Our results reveal a fundamental separation between RL for single-agent problems, MFC, and MFG from the sample-efficiency perspective.
BELLA: Black box model Explanations by Local Linear Approximations
Authors: Nedeljko Radulovic, Albert Bifet, Fabian Suchanek
Abstract
In recent years, understanding the decision-making process of black-box models has become not only a legal requirement but also an additional way to assess their performance. However, state-of-the-art post-hoc interpretation approaches rely on synthetic data generation, which introduces uncertainty and can hurt the reliability of the interpretations. Furthermore, they tend to produce explanations that apply to only very few data points, making the explanations brittle and limited in scope. Finally, they provide scores that have no directly verifiable meaning. In this paper, we present BELLA, a deterministic, model-agnostic post-hoc approach for explaining the individual predictions of regression black-box models. BELLA provides explanations in the form of a linear model trained in the feature space, so its coefficients can be used directly to compute the predicted value from the feature values. Furthermore, BELLA maximizes the size of the neighborhood to which the linear model applies, so that the explanations are accurate, simple, general, and robust. BELLA can produce both factual and counterfactual explanations. Our user study confirms the importance of the desiderata we optimize, and our experiments show that BELLA outperforms the state-of-the-art approaches on these desiderata.
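The kind of explanation BELLA produces can be illustrated with a one-feature toy: fit an ordinary least-squares line on the black box's outputs over a neighborhood of the query point (BELLA's neighborhood-size maximization and multi-feature handling are omitted; the function and parameter names here are ours, not the paper's API):

```python
def local_linear_explanation(x0, X, blackbox, k=20):
    """Explain blackbox(x0) with a linear surrogate fit on the k points of X
    nearest to x0 (single feature, plain OLS). The returned coefficients
    apply directly to feature values, as BELLA's explanations do."""
    neigh = sorted(X, key=lambda x: abs(x - x0))[:k]
    ys = [blackbox(x) for x in neigh]
    n = len(neigh)
    mx, my = sum(neigh) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in neigh)
    slope = sum((x - mx) * (y - my) for x, y in zip(neigh, ys)) / sxx
    return my - slope * mx, slope  # (intercept, slope): pred = intercept + slope * x
```

When the black box is locally linear, the surrogate recovers its coefficients exactly, and the user can read off how each feature value contributes to the prediction.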
Engineering an algorithm for constructing low-stretch geometric graphs with near-greedy average-degrees
Abstract
We design and engineer Fast-Sparse-Spanner, a simple and practical (fast and memory-efficient) algorithm for constructing sparse, low stretch-factor geometric graphs on large pointsets in the plane. To our knowledge, this is the first practical algorithm that quickly constructs low stretch-factor graphs on large pointsets with average-degrees (and hence edge counts) competitive with those of greedy-spanners, the sparsest known class of Euclidean geometric spanners. To evaluate our implementation in terms of computation speed, memory usage, and quality of output, we performed extensive experiments with synthetic and real-world pointsets, comparing it to our closest competitor Bucketing, the fastest known greedy-spanner algorithm for pointsets in the plane, devised by Alewijnse et al. (Algorithmica, 2017). We always found that Fast-Sparse-Spanner generated near-greedy t-spanners while being fast and memory-efficient. Our experiment constructing a 1.1-spanner on a large synthetic pointset with 128K points uniformly distributed within a square shows more than a 41-fold speedup over Bucketing with roughly a third of its memory usage, at the cost of only a 3% increase in the average-degree of the resulting graph. In terms of diameter, the graphs generated by Fast-Sparse-Spanner beat greedy-spanners in most cases (they have substantially lower diameter) while maintaining near-greedy average-degree. As a byproduct of our research, we design and engineer Fast-Stretch-Factor, a practical, parallelizable algorithm that can measure the stretch-factor of any graph generated by Fast-Sparse-Spanner. Our experiments show that it is much faster than the naive Dijkstra-based stretch-factor measurement algorithm.
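For reference, the naive Dijkstra-based stretch-factor measurement mentioned at the end works as follows (an all-pairs baseline sketch; Fast-Stretch-Factor is a faster, parallelizable alternative):

```python
import heapq
import math

def stretch_factor(points, edges):
    """Stretch factor of a geometric graph: the maximum, over vertex pairs,
    of graph shortest-path length divided by Euclidean distance. Runs one
    Dijkstra per source vertex, hence the slow baseline."""
    n = len(points)
    adj = [[] for _ in range(n)]
    for u, v in edges:
        w = math.dist(points[u], points[v])
        adj[u].append((v, w))
        adj[v].append((u, w))
    t = 0.0
    for s in range(n):
        d = [math.inf] * n
        d[s] = 0.0
        pq = [(0.0, s)]
        while pq:
            du, u = heapq.heappop(pq)
            if du > d[u]:
                continue  # stale queue entry
            for v, w in adj[u]:
                if du + w < d[v]:
                    d[v] = du + w
                    heapq.heappush(pq, (d[v], v))
        for v in range(n):
            if v != s:
                t = max(t, d[v] / math.dist(points[s], points[v]))
    return t
```

A graph is a t-spanner precisely when this quantity is at most t, e.g. the 1.1-spanners in the experiment above.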
Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
Abstract
This paper presents a parameter-efficient learning (PEL) approach to develop low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2\% to 0.8\% of the original trainable parameters to achieve competitive performance in voice synthesis. Motivated by the theoretical foundations of optimal transport (OT), this study carries out PEL for TTS by introducing an auxiliary unsupervised loss based on OT, in addition to the supervised training loss, to maximize a difference between the pre-trained source domain and the (unseen) target domain. Further, we leverage this unsupervised loss refinement to boost system performance via either the sliced Wasserstein distance or the maximum mean discrepancy. The merit of this work is demonstrated by fulfilling PEL solutions based on residual adapter learning and model reprogramming, evaluated on Mandarin accent adaptation. Experimental results show that the proposed methods achieve competitive naturalness with parameter-efficient decoder fine-tuning, and that the auxiliary unsupervised loss improves model performance empirically.
A Cyberattack Detection-Isolation Scheme For CAV Under Changing Driving Environment
Authors: Sanchita Ghosh, Nutan Saha, Tanushree Roy
Abstract
Under a changing driving environment, a Connected Autonomous Vehicle (CAV) platoon relies strongly on the acquisition of accurate traffic information from neighboring vehicles as well as reliable commands from a centralized supervisory controller through the communication network. Even though such modalities are imperative to ensure the safe and efficient driving performance of CAVs, they lead to multiple security challenges: a cyberattack on this network can corrupt vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication, which can lead to unsafe or undesired driving scenarios. Hence, in this paper, we propose a unified V2V and V2I cyberattack detection scheme along with a V2I isolation scheme for CAVs under changing driving conditions. The proposed scheme is constructed using a bank of residual generators with Lyapunov function-based performance guarantees, such as disturbance-to-state stability, robustness, and sensitivity. Finally, we showcase the efficacy of our proposed algorithm through two simulation studies, under highway driving and urban driving. The results show that the proposed scheme can enhance the cybersecurity of CAVs by detecting and isolating cyberattacks on CAV platoons.
Faster Parallel Exact Density Peaks Clustering
Authors: Yihao Huang, Shangdi Yu, Julian Shun
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Clustering multidimensional points is a fundamental data mining task, with applications in many fields, such as astronomy, neuroscience, bioinformatics, and computer vision. The goal of clustering algorithms is to group similar objects together. Density-based clustering is a clustering approach that defines clusters as dense regions of points. It has the advantage of being able to detect clusters of arbitrary shapes, rendering it useful in many applications. In this paper, we propose fast parallel algorithms for Density Peaks Clustering (DPC), a popular version of density-based clustering. Existing exact DPC algorithms suffer from low parallelism both in theory and in practice, which limits their application to large-scale data sets. Our most performant algorithm, which is based on priority search kd-trees, achieves $O(\log n\log\log n)$ span (parallel time complexity) for a data set of $n$ points. Our algorithm is also work-efficient, achieving a work complexity matching the best existing sequential exact DPC algorithm. In addition, we present another DPC algorithm based on a Fenwick tree that makes fewer assumptions for its average-case complexity to hold. We provide optimized implementations of our algorithms and evaluate their performance via extensive experiments. On a 30-core machine with two-way hyperthreading, we find that our best algorithm achieves a 10.8--13169x speedup over the previous best parallel exact DPC algorithm. Compared to the state-of-the-art parallel approximate DPC algorithm, our best algorithm achieves a 1.5--4206x speedup, while being exact.
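The sequential computation that the parallel algorithms above speed up can be sketched in quadratic time (the cutoff radius dc and center threshold delta_min are user parameters; tie-breaking among equal densities is simplified here):

```python
import math

def density_peaks(points, dc, delta_min):
    """O(n^2) Density Peaks Clustering sketch: a point's density is its
    number of neighbors within dc; its dependent point is the nearest
    point of higher density. Points far (> delta_min) from any denser
    point become cluster centers; every other point joins its dependent
    point's cluster."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]
    order = sorted(range(n), key=lambda i: -rho[i])  # decreasing density
    dep, delta = [-1] * n, [math.inf] * n
    for rank, i in enumerate(order):
        for j in order[:rank]:  # points of higher (or tied, earlier) density
            if d[i][j] < delta[i]:
                delta[i], dep[i] = d[i][j], j
    label, next_label = [-1] * n, 0
    for i in order:  # dependent points are always labeled first
        if delta[i] > delta_min:
            label[i], next_label = next_label, next_label + 1
        else:
            label[i] = label[dep[i]]
    return label
```

The dependent-point computation is the expensive step this paper parallelizes with priority search kd-trees while producing the same (exact) clustering.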
Data Redaction from Conditional Generative Models
Authors: Zhifeng Kong, Kamalika Chaudhuri
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Deep generative models are known to produce undesirable samples such as harmful content. Traditional mitigation methods include re-training from scratch, filtering, or editing; however, these are either computationally expensive or can be circumvented by third parties. In this paper, we take a different approach and study how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that will, with high probability, lead to undesirable content. This is done by distilling the conditioning network in the models, giving a solution that is effective, efficient, controllable, and universal for a class of deep generative models. We conduct experiments on redacting prompts in text-to-image models and redacting voices in text-to-speech models. Our method is computationally light and leads to better redaction quality and robustness than baseline methods, while still retaining high generation quality.
Differentially Private Adapters for Parameter Efficient Acoustic Modeling
Authors: Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract
In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into the adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-trained acoustic model, attaining performance superior to DP-based stochastic gradient descent (DPSGD). Next, we insert residual adapters (RAs) between layers of the frozen pre-trained acoustic model. The RAs reduce training cost and time significantly with a negligible performance drop. Evaluated on the open-access Multilingual Spoken Words (MLSW) dataset, our solution reduces the number of trainable parameters by 97.5% using the RAs, with only a 4% performance drop with respect to fine-tuning the cross-lingual speech classifier, while preserving DP guarantees.
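A residual adapter itself is just a small bottleneck added to a frozen layer's output; a minimal numeric sketch (dimensions, activation, and initialization here are illustrative assumptions):

```python
def residual_adapter(h, w_down, w_up):
    """Residual adapter sketch: h + up(relu(down(h))) for a hidden vector h.
    Only w_down ([r][d]) and w_up ([d][r]) would be trained; the backbone
    producing h stays frozen, which is what makes the scheme cheap."""
    z = [max(0.0, sum(row[i] * h[i] for i in range(len(h)))) for row in w_down]
    delta = [sum(w_up[i][j] * z[j] for j in range(len(z))) for i in range(len(h))]
    return [h[i] + delta[i] for i in range(len(h))]
```

Initializing w_up at zero makes the adapter start as the identity, so inserting it between frozen layers leaves the pre-trained behavior intact at step zero.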
ALT: An Automatic System for Long Tail Scenario Modeling
Authors: Ya-Lin Zhang, Jun Zhou, Yankun Ren, Yue Zhang, Xinxing Yang, Meng Li, Qitao Shi, Longfei Li
Abstract
In this paper, we consider the problem of long-tail scenario modeling under budget limitations, i.e., insufficient human resources for the model training stage and limited time and computing resources for the model inference stage. This problem is widely encountered in various applications, yet has received little attention so far. We present an automatic system named ALT to deal with this problem. Several efforts are made to improve the algorithms used in our system, such as employing various automatic machine learning techniques, adopting the meta-learning philosophy, and proposing an essential budget-limited neural architecture search method. Moreover, to build the system, many optimizations are performed from a systems perspective, and essential modules are incorporated, making the system more feasible and efficient. We perform abundant experiments to validate the effectiveness of our system and demonstrate the usefulness of its critical modules. Moreover, online results are provided, which fully verify the efficacy of our system.
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Abstract
Transformers achieve promising performance in document understanding because of their high effectiveness, but they still suffer from quadratic computational complexity in the sequence length. General efficient transformers are challenging to adapt directly to document modeling: they cannot handle the layout representation in documents (e.g., word, line, and paragraph) at different granularity levels, and they struggle to achieve a good trade-off between efficiency and performance. To tackle these concerns, we propose Fast-StrucTexT, an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture for visual document understanding. Specifically, we design a modality-guided dynamic token merging block that lets the model learn multi-granularity representations and prune redundant tokens. Additionally, we present a multi-modal interaction module called Symmetry Cross-Attention (SCA) to handle multi-modal fusion and efficiently guide the token merging. The SCA allows one modality input to serve as the query and calculate cross-attention with another modality in a dual phase. Extensive experiments on the FUNSD, SROIE, and CORD datasets demonstrate that our model achieves state-of-the-art performance with almost 1.9x faster inference time than state-of-the-art methods.
Efficient Mixed Transformer for Single Image Super-Resolution
Abstract
Recently, Transformer-based methods have achieved impressive results in single image super-resolution (SISR). However, the lack of a locality mechanism and high complexity limit their application in the field of super-resolution (SR). To solve these problems, we propose a new method, the Efficient Mixed Transformer (EMT), in this study. Specifically, we propose the Mixed Transformer Block (MTB), consisting of multiple consecutive transformer layers, in some of which the Pixel Mixer (PM) is used to replace Self-Attention (SA). PM enhances local knowledge aggregation through pixel-shifting operations; at the same time, it introduces no additional complexity, as it has no parameters and no floating-point operations. Moreover, we employ striped-window SA (SWSA) to obtain efficient global dependency modelling by exploiting image anisotropy. Experimental results show that EMT outperforms existing methods on benchmark datasets and achieves state-of-the-art performance. The code is available at https://github.com/Fried-Rice-Lab/EMT.git.
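The Pixel Mixer's parameter-free shifting can be sketched directly (the four-direction grouping and zero padding below are illustrative assumptions; EMT's exact grouping may differ):

```python
def pixel_mixer(x):
    """Split the channels of a [C][H][W] feature map into 4 groups and shift
    each group one pixel in a different direction (left, right, up, down),
    zero-padding at the border. No parameters, no arithmetic beyond copies."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    shifts = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # (dh, dw) per channel group
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        dh, dw = shifts[(4 * c) // C]
        for h in range(H):
            for w in range(W):
                sh, sw = h - dh, w - dw
                if 0 <= sh < H and 0 <= sw < W:
                    out[c][h][w] = x[c][sh][sw]
    return out
```

A subsequent pointwise layer then mixes each pixel with its shifted neighbors, injecting locality without attention.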
LATTE: Label-efficient Incident Phenotyping from Longitudinal Electronic Health Records
Authors: Jun Wen, Jue Hou, Clara-Lea Bonzel, Yihan Zhao, Victor M. Castro, Vivian S. Gainer, Dana Weisenfeld, Tianrun Cai, Yuk-Lam Ho, Vidul A. Panickan, Lauren Costa, Chuan Hong, J. Michael Gaziano, Katherine P. Liao, Junwei Lu, Kelly Cho, Tianxi Cai
Abstract
Electronic health record (EHR) data are increasingly used to support real-world evidence (RWE) studies. Yet their ability to generate reliable RWE is limited by the lack of readily available, precise information on the timing of clinical events, such as the onset time of heart failure. We propose a LAbel-efficienT incidenT phEnotyping (LATTE) algorithm to accurately annotate the timing of clinical events from longitudinal EHR data. By leveraging pre-trained semantic embedding vectors from large-scale EHR data as prior knowledge, LATTE selects predictive EHR features in a concept re-weighting module by mining their relationship to the target event, and compresses their information into longitudinal visit embeddings through a visit attention learning network. LATTE employs a recurrent neural network to capture the sequential dependency between the target event and the visit embeddings before/after it. To improve label efficiency, LATTE constructs highly informative longitudinal silver-standard labels from large-scale unlabeled patients to perform unsupervised pre-training and semi-supervised joint training. Finally, LATTE enhances cross-site portability via contrastive representation learning. LATTE is evaluated on three analyses: the onset of type-2 diabetes, the onset of heart failure, and the onset and relapses of multiple sclerosis. We use various evaluation metrics from the literature, including $ABC_{gain}$, the proportional reduction in the area between the observed event indicator and the predicted cumulative incidence, relative to the prediction based on incident prevalence. LATTE consistently achieves substantial improvement over benchmark methods such as SAMGEP and RETAIN in all settings.
JetSeg: Efficient Real-Time Semantic Segmentation Model for Low-Power GPU-Embedded Systems
Authors: Miguel Lopez-Montiel, Daniel Alejandro Lopez, Oscar Montiel
Abstract
Real-time semantic segmentation is a challenging task that requires high-accuracy models with low inference times. Implementing these models on embedded systems is limited by hardware capability and memory usage, which produces bottlenecks. We propose an efficient model for real-time semantic segmentation called JetSeg, consisting of an encoder called JetNet and an improved RegSeg decoder. JetNet is designed for GPU-embedded systems and includes two main components: JetBlock, a new lightweight, efficient block that reduces the number of parameters, minimizing memory usage and inference time without sacrificing accuracy; and JetConv, a new strategy that combines asymmetric and non-asymmetric convolutions with depthwise-dilated convolutions, a channel shuffle operation, lightweight activation functions, and a number of group convolutions suited to embedded systems. We also introduce an innovative loss function named JetLoss, which integrates the Precision, Recall, and IoUB losses to improve semantic segmentation and reduce computational complexity. Experiments demonstrate that JetSeg is much faster on workstation devices and more suitable for low-power GPU-embedded systems than existing state-of-the-art models for real-time semantic segmentation. Our approach outperforms state-of-the-art real-time encoder-decoder models, reducing the parameter count by 46.70M and GFLOPs by 5.14%, which makes JetSeg up to 2x faster than other models on the NVIDIA Titan RTX GPU and the Jetson Xavier. The JetSeg code is available at https://github.com/mmontielpz/jetseg.
PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction
Authors: Hao Wu, Wei Xion, Fan Xu, Xiao Luo, Chong Chen, Xian-Sheng Hua, Haixin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction, but they often neglect the inherent physical knowledge embedded within videos. Furthermore, their high computational demands can impede their application to high-resolution videos. To address these constraints, we introduce a novel approach called the Physics-assisted Spatio-temporal Network (PastNet) for generating high-quality video predictions. The core of our PastNet lies in incorporating a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features during the processing of complex spatio-temporal signals, thereby reducing computational costs and facilitating efficient high-resolution video prediction. Extensive experiments on various widely-used datasets demonstrate the effectiveness and efficiency of the proposed PastNet compared with state-of-the-art methods, particularly in high-resolution scenarios.
Strix: An End-to-End Streaming Architecture with Two-Level Ciphertext Batching for Fully Homomorphic Encryption with Programmable Bootstrapping
Authors: Adiwena Putra, Prasetiyo, Yi Chen, John Kim, Joo-Young Kim
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Abstract
Homomorphic encryption (HE) enables computations on encrypted data by concealing information under noise for security. However, the process of bootstrapping, which resets the noise level in the ciphertext, is computationally expensive and requires a large bootstrapping key. The TFHE scheme offers a faster and programmable bootstrapping algorithm called PBS, crucial for security-focused applications like machine learning. Nevertheless, the current TFHE scheme lacks support for ciphertext packing, resulting in low throughput. This work thoroughly analyzes TFHE bootstrapping, identifies the bottleneck in GPUs caused by the blind rotation fragmentation problem, and proposes a hardware TFHE accelerator called Strix. Strix introduces a two-level batching approach to enhance the batch size in PBS, utilizes a specialized microarchitecture for efficient streaming data processing, and incorporates a fully-pipelined FFT microarchitecture to improve performance. It achieves significantly higher throughput than state-of-the-art implementations on both CPUs and GPUs, outperforming existing TFHE accelerators by a factor of 7.4.
Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring
Abstract
Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features. Deep neural networks are commonly trained to map fluency-related features to human scores. However, the effectiveness of deep learning-based models is constrained by the limited amount of labeled training samples. To address this, we introduce a self-supervised learning (SSL) approach that takes into account phonetic and prosody awareness for fluency scoring. Specifically, we first pre-train the model using a reconstruction loss function, by jointly masking phones and their durations on a large amount of unlabeled speech and text prompts. We then fine-tune the pre-trained model using human-annotated scoring data. Our experimental results, on the Speechocean762 dataset and our non-native datasets, show that the proposed method outperforms the baseline systems in terms of the Pearson correlation coefficient (PCC). Moreover, we conduct an ablation study to better understand the contributions of the phonetic and prosody factors during the pre-training stage.
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
Abstract
Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, two crucial problems arise: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) aligning the visual and language feature distributions is difficult. To deal with the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide attacks on class-irrelevant image areas. By corrupting these areas, the critical features are captured and the visual distributions of image features are calibrated. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, an upper bound of the EMD is derived. In addition, we propose an augmentation strategy to increase the diversity of the images and the text prompts, which reduces overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior art in few-shot learning. The implementation code will be available at https://github.com/bhrqw/SADA.
Coordinated Frequency-Constrained Stochastic Economic Dispatch for Integrated Transmission and Distribution System via Distributed Optimization
Abstract
When large-scale, uncertain centralized and distributed renewable energy sources are connected to a power system, separate dispatching of the transmission power system (TPS) and the active distribution network (ADN) lowers the network security and frequency security of the system. To address these problems, this paper proposes a coordinated frequency-constrained stochastic economic dispatch (CFC-SED) model for an integrated transmission and distribution (ITD) system. In this model, the dynamic frequency security constraints and network security constraints of the ITD system are constructed, and joint chance constraints are adopted to handle the uncertainty. Then, the control parameters of inverter-based resources, the base point power, and the regulation reserve of all dispatchable resources in the ITD system are jointly optimized for the minimum operating cost. The TPS and ADNs can deliver base point power and provide frequency regulation support bidirectionally, which extends the existing reserve assumption in ITD dispatch and enhances the operational security of the ITD system. Moreover, based on the alternating direction method of multipliers (ADMM), a two-layer distributed optimization framework is proposed to solve the CFC-SED model. Case studies show that the CFC-SED model can fully utilize the potential of multiple regulation resources to improve the security performance of the ITD system, and that the TPS and ADNs can be coordinated efficiently through the proposed distributed optimization framework.
Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models
Authors: Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence of this is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does indeed break the traditional agent-environment interface, we nevertheless maintain that there can be enormous statistical benefits afforded by bringing to bear traditional algorithmic concepts from reinforcement learning. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. In order to illustrate these ideas in a transparent manner, we restrict attention to a simple didactic data generating process and leave for future work extension to systems of practical scale.
Evolutionary Diversity Optimisation in Constructing Satisfying Assignments
Authors: Adel Nikfarjam, Ralf Rothenberger, Frank Neumann, Tobias Friedrich
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Computing diverse solutions for a given problem, in particular via evolutionary diversity optimisation (EDO), is a hot research topic in the evolutionary computation community. This paper studies the Boolean satisfiability problem (SAT) in the context of EDO. SAT is of great importance in computer science and differs from the other problems studied in the EDO literature, such as the knapsack problem (KP) and the travelling salesperson problem (TSP). SAT is heavily constrained, and conventional evolutionary operators are inefficient at generating SAT solutions. Our approach exploits the following characteristics of SAT: 1) the possibility of adding constraints (clauses) to the problem to forbid solutions or to fix variables, and 2) the powerful solvers available in the literature, such as MiniSat. We utilise such a solver to construct a diverse set of solutions. Moreover, maximising diversity provides invaluable information about the solution space of a given SAT problem, such as how large the feasible region is. In this study, we introduce evolutionary algorithms (EAs) that employ a well-known SAT solver to explicitly maximise diversity among a set of SAT solutions. The experimental investigations indicate the introduced algorithms' capability to maximise diversity among SAT solutions.
Generative Sliced MMD Flows with Riesz Kernels
Authors: Johannes Hertrich, Christian Wald, Fabian Altekrüger, Paul Hagemann
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Abstract
Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations. In this paper, we show that MMD flows with Riesz kernels $K(x,y) = - |x-y|^r$, $r \in (0,2)$ have exceptional properties which allow for their efficient computation. First, the MMD of Riesz kernels coincides with the MMD of their sliced version. As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting. Here, for $r=1$, a simple sorting algorithm can be applied to reduce the complexity from $O(MN+N^2)$ to $O((M+N)\log(M+N))$ for two empirical measures with $M$ and $N$ support points. For the implementations we approximate the gradient of the sliced MMD by using only a finite number $P$ of slices. We show that the resulting error has complexity $O(\sqrt{d/P})$, where $d$ is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks even for large scale applications. We demonstrate the efficiency of our model by image generation on MNIST, FashionMNIST and CIFAR10.
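The $r=1$ sorting trick described above is easy to make concrete. The sketch below (an illustration based on the standard empirical-MMD formula, not the authors' implementation) computes the one-dimensional squared MMD for the negative-distance kernel $K(x,y)=-|x-y|$ in $O((M+N)\log(M+N))$: each pairwise-distance sum is evaluated with a single sort, since in sorted order each point's signed multiplicity is known in closed form.

```python
import numpy as np

def pair_dist_sum(v):
    # sum_{i<j} |v_i - v_j| in O(n log n): after sorting ascending,
    # the i-th value (0-indexed) contributes with weight 2*i - (n - 1)
    v = np.sort(v)
    n = len(v)
    return np.dot(v, 2 * np.arange(n) - (n - 1))

def sliced_mmd2_1d(x, y):
    # squared MMD between 1-D samples for the Riesz kernel K(a,b) = -|a-b|
    m, n = len(x), len(y)
    sxx = pair_dist_sum(x)
    syy = pair_dist_sum(y)
    # cross-term via one sort of the pooled sample
    cross = pair_dist_sum(np.concatenate([x, y])) - sxx - syy
    return 2 * cross / (m * n) - 2 * sxx / m**2 - 2 * syy / n**2
```

A sliced $d$-dimensional estimate would then average this quantity over $P$ random projection directions.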
Counterfactual Fairness Filter for Fair-Delay Multi-Robot Navigation
Authors: Hikaru Asano, Ryo Yonetani, Mai Nishimura, Tadashi Kozuno
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract
Multi-robot navigation is the task of finding trajectories for a team of robotic agents to reach their destinations as quickly as possible without collisions. In this work, we introduce a new problem: fair-delay multi-robot navigation, which aims not only to enable such efficient, safe travels but also to equalize the travel delays among agents in terms of actual trajectories as compared to the best possible trajectories. The learning of a navigation policy to achieve this objective requires resolving a nontrivial credit assignment problem with robotic agents having continuous action spaces. Hence, we developed a new algorithm called Navigation with Counterfactual Fairness Filter (NCF2). With NCF2, each agent performs counterfactual inference on whether it can advance toward its goal or should stay still to let other agents go. Doing so allows us to effectively address the aforementioned credit assignment problem and improve fairness regarding travel delays while maintaining high efficiency and safety. Our extensive experimental results in several challenging multi-robot navigation environments demonstrate the greater effectiveness of NCF2 as compared to state-of-the-art fairness-aware multi-agent reinforcement learning methods. Our demo videos and code are available on the project webpage: https://omron-sinicx.github.io/ncf2/
Abstract
Graph Convolutional Networks (GCNs) have long defined the state of the art in skeleton-based action recognition, leveraging their ability to unravel the complex dynamics of human joint topology through the graph's adjacency matrix. However, an inherent flaw has come to light in these cutting-edge models: they tend to optimize the adjacency matrix jointly with the model weights. This process, while seemingly efficient, causes a gradual decay of bone connectivity data, culminating in a model indifferent to the very topology it sought to map. As a remedy, we propose a threefold strategy: (1) We forge an innovative pathway that encodes bone connectivity by harnessing the power of graph distances. This approach preserves the vital topological nuances often lost in conventional GCNs. (2) We highlight an oft-overlooked feature, the temporal mean of a skeletal sequence, which, despite its modest guise, carries highly action-specific information. (3) Our investigation revealed strong variations in joint-to-joint relationships across different actions. This finding exposes the limitations of a single adjacency matrix in capturing the variations of relational configurations emblematic of human movement, which we remedy by proposing an efficient refinement to Graph Convolutions (GC): the BlockGC. This evolution slashes parameters by a substantial margin (above 40%) while elevating performance beyond that of the original GCNs. Our full model, the BlockGCN, establishes new standards in skeleton-based action recognition for small model sizes. Its high accuracy, notably on the large-scale NTU RGB+D 120 dataset, stands as compelling proof of the efficacy of BlockGCN. The source code and model can be found at https://github.com/ZhouYuxuanYX/BlockGCN.
RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration
Authors: Haram Choi, Cheolwoong Na, Jihyeon Oh, Seungjae Lee, Jinseop Kim, Subeen Choe, Jeongmin Lee, Taehoon Kim, Jihoon Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Although many recent works have made advances in the image restoration (IR) field, they often suffer from an excessive number of parameters. Another issue is that most Transformer-based IR methods focus only on either local or global features, leading to limited receptive fields or parameter inefficiency. To address these problems, we propose a lightweight IR network, the Reciprocal Attention Mixing Transformer (RAMiT). It employs our proposed dimensional reciprocal attention mixing Transformer (D-RAMiT) blocks, which compute bi-dimensional (spatial and channel) self-attentions in parallel with different numbers of multi-heads. The bi-dimensional attentions help each other compensate for their counterpart's drawbacks and are then mixed. Additionally, we introduce a hierarchical reciprocal attention mixing (H-RAMi) layer that compensates for pixel-level information losses and utilizes semantic information while maintaining an efficient hierarchical structure. Furthermore, we revisit and modify MobileNet V1 and V2 to attach efficient convolutions to our proposed components. The experimental results demonstrate that RAMiT achieves state-of-the-art performance on multiple lightweight IR tasks, including super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining. Code will be available soon.
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation
Authors: Wenxuan Wang, Jing Liu, Xingjian He, Yisi Zhang, Chen Chen, Jiachen Shen, Yan Zhang, Jiangyun Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Referring image segmentation (RIS) is a fundamental vision-language task that aims to segment a desired object from an image based on a given natural language expression. Due to the essentially distinct data properties of image and text, most existing methods either introduce complex designs for fine-grained vision-language alignment or lack the required dense alignment, resulting in scalability issues or mis-segmentation problems such as over- or under-segmentation. To achieve effective and efficient fine-grained feature alignment in the RIS task, we explore the potential of masked multimodal modeling coupled with self-distillation and propose a novel cross-modality masked self-distillation framework named CM-MaskSD. Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment for better segmentation accuracy. Moreover, our CM-MaskSD framework can considerably boost model performance in a nearly parameter-free manner, since it shares weights between the main segmentation branch and the introduced masked self-distillation branches, and introduces only negligible parameters for coordinating the multimodal features. Comprehensive experiments on three benchmark datasets (i.e., RefCOCO, RefCOCO+, and G-Ref) for the RIS task convincingly demonstrate the superiority of our proposed framework over previous state-of-the-art methods.
Nonconvex Robust High-Order Tensor Completion Using Randomized Low-Rank Approximation
Abstract
Within the tensor singular value decomposition (T-SVD) framework, existing robust low-rank tensor completion approaches have achieved great success in various areas of science and engineering. Nevertheless, these methods rely on T-SVD-based low-rank approximation, which suffers from high computational costs when dealing with large-scale tensor data. Moreover, most of them are applicable only to third-order tensors. To address these issues, in this article, two efficient low-rank tensor approximation approaches fusing randomized techniques are first devised under the order-d (d >= 3) T-SVD framework. On this basis, we further investigate the robust high-order tensor completion (RHTC) problem, for which a doubly nonconvex model along with corresponding fast optimization algorithms with convergence guarantees is developed. To the best of our knowledge, this is the first study to incorporate randomized low-rank approximation into the RHTC problem. Empirical studies on large-scale synthetic and real tensor data illustrate that the proposed method outperforms other state-of-the-art approaches in terms of both computational efficiency and estimation accuracy.
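As a point of reference for the randomized techniques invoked above, the classical matrix-case randomized low-rank approximation (a Halko-Martinsson-Tropp style range finder, shown here as an illustrative NumPy sketch rather than the article's order-d T-SVD variant) replaces one large SVD with a sketched small one:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, rng=None):
    # Sketch the range of A with a Gaussian test matrix, orthonormalize,
    # then take an exact SVD of the small projected matrix.
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for the range
    B = Q.T @ A                           # small (rank+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]
```

The randomized T-SVD approaches in the article apply this kind of sketching tube-fiber-wise in the transform domain; the cost saving comes from never forming a full decomposition.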
A new family of fourth-order energy-preserving integrators
Abstract
For Hamiltonian systems with non-canonical structure matrices, a new family of fourth-order energy-preserving integrators is presented. The integrators take the form of a combination of Runge--Kutta methods and continuous-stage Runge--Kutta methods and feature a set of free parameters that offer greater flexibility and efficiency. Specifically, we demonstrate that, by carefully choosing these free parameters, a simplified Newton iteration applied to the integrators of order four can be made parallelizable. This results in faster and more efficient integrators compared with existing fourth-order energy-preserving integrators.
DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment
Authors: Heyuan Li, Bo Wang, Yu Cheng, Mohan Kankanhalli, Robby T. Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Sensitivity to severe occlusion and large view angles limits the usage scenarios of existing monocular 3D dense face alignment methods. The state-of-the-art 3DMM-based method directly regresses the model's coefficients, underutilizing the low-level 2D spatial and semantic information, which can actually offer cues for face shape and orientation. In this work, we demonstrate how modeling 3D facial geometry jointly in image and model space can solve the occlusion and view angle problems. Instead of predicting the whole face directly, we first regress image space features in the visible facial region by dense prediction. Subsequently, we predict our model's coefficients based on the regressed features of the visible regions, leveraging the prior knowledge of whole-face geometry from the morphable models to complete the invisible regions. We further propose a fusion network that combines the advantages of both the image and model space predictions to achieve high robustness and accuracy in unconstrained scenarios. Thanks to the proposed fusion module, our method is robust not only to occlusion and large pitch and roll view angles, which is the benefit of our image space approach, but also to noise and large yaw angles, which is the benefit of our model space method. Comprehensive evaluations demonstrate the superior performance of our method compared with state-of-the-art methods. On the 3D dense face alignment task, we achieve 3.80% NME on the AFLW2000-3D dataset, outperforming the state-of-the-art method by 5.5%. Code is available at https://github.com/lhyfst/DSFNet.
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
Abstract
Diffusion models have made impressive progress in text-to-image synthesis. However, training such large-scale models (e.g., Stable Diffusion) from scratch requires high computational costs and massive high-quality text-image pairs, which becomes unaffordable in other languages. To handle this challenge, we propose IAP, a simple but effective method to transfer English Stable Diffusion to Chinese. IAP optimizes only a separate Chinese text encoder, with all other parameters fixed, to align the Chinese semantic space with the English one in CLIP. To achieve this, we innovatively treat images as pivots and minimize the distance between attentive features produced from cross-attention between images and each language respectively. In this way, IAP efficiently establishes connections among Chinese, English, and visual semantics in CLIP's embedding space, improving the quality of images generated from direct Chinese prompts. Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%~10% of the training data.
A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations
Abstract
Recent advancements in Text-to-Image (T2I) generative models have yielded impressive results in generating high-fidelity images based on consistent text prompts. However, there is a growing interest in exploring the potential of these models for more diverse reference-based image manipulation tasks that require spatial understanding and visual context. Previous approaches have achieved this by incorporating additional control modules or by fine-tuning the generative models specifically for each task until convergence. In this paper, we take a different perspective. We conjecture that current large-scale T2I generative models already possess the capability to perform these tasks but are not fully activated within the standard generation process. To unlock these capabilities, we introduce a unified Prompt-Guided In-Context inpainting (PGIC) framework, which leverages large-scale T2I models to re-formulate and solve reference-guided image manipulations. In the PGIC framework, the reference and the masked target are stitched together as a new input for the generative models, so that filling the masked regions produces the final results. Furthermore, we demonstrate that the self-attention modules in T2I models are well-suited for establishing spatial correlations and efficiently addressing challenging reference-guided manipulations. These large T2I models can be effectively driven by task-specific prompts with minimal training cost, or even with frozen backbones. We systematically evaluate the effectiveness of the proposed PGIC framework across various tasks, including reference-guided image inpainting, faithful inpainting, outpainting, local super-resolution, and novel view synthesis. Our results show that PGIC achieves significantly better performance while requiring less computation compared with other fine-tuning-based approaches.
Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape
Authors: Yan Sun, Li Shen, Shixiang Chen, Liang Ding, Dacheng Tao
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Abstract
In federated learning (FL), a set of local clients is coordinated by the global server to cooperatively train one model with privacy protection. Due to the multiple local updates and the isolated non-iid datasets, clients are prone to overfit to their own optima, which deviate markedly from the global objective and significantly undermine the performance. Most previous works focus only on enhancing the consistency between the local and global objectives to alleviate this prejudicial client drift from an optimization perspective, and their performance deteriorates prominently under high heterogeneity. In this work, we propose a novel and general algorithm {\ttfamily FedSMOO} that jointly considers the optimization and generalization targets to efficiently improve performance in FL. Concretely, {\ttfamily FedSMOO} adopts a dynamic regularizer to steer the local optima towards the global objective, which is meanwhile revised by the global Sharpness Aware Minimization (SAM) optimizer to search for consistent flat minima. Our theoretical analysis indicates that {\ttfamily FedSMOO} achieves a fast $\mathcal{O}(1/T)$ convergence rate with a low generalization bound. Extensive numerical studies are conducted on real-world datasets to verify its peerless efficiency and excellent generality.
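For readers unfamiliar with the SAM building block referenced here, a minimal single-step sketch (generic SAM, not FedSMOO's global variant; `grad_fn`, `lr`, and `rho` are illustrative names) looks like:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    # One Sharpness-Aware Minimization step on parameters w:
    # 1) ascend to the (linearized) worst-case point within an l2 ball
    #    of radius rho around w,
    # 2) descend using the gradient evaluated at that perturbed point.
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(w + eps)
    return w - lr * g_sharp
```

Minimizing the loss at the perturbed point biases the search toward flat minima, which is what FedSMOO exploits globally to keep clients consistent.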
Flexible and Inherently Comprehensible Knowledge Representation for Data-Efficient Learning and Trustworthy Human-Machine Teaming in Manufacturing Environments
Abstract
Trustworthiness of artificially intelligent agents is vital for the acceptance of human-machine teaming in industrial manufacturing environments. Predictable behaviours and explainable (and understandable) rationale allow humans collaborating with (and building) these agents to understand their motivations and therefore validate the decisions that are made. To that end, we make use of G\"ardenfors's cognitively inspired Conceptual Space framework to represent the agent's knowledge using concepts as convex regions in a space spanned by inherently comprehensible quality dimensions. A simple typicality quantification model is built on top of it to determine fuzzy category membership and classify instances interpretably. We apply it to a use case from the manufacturing domain, using objects' physical properties obtained from cobots' onboard sensors and utilisation properties from crowdsourced commonsense knowledge available in public knowledge bases. Such flexible knowledge representation based on property decomposition allows for data-efficient representation learning of typically highly specialised or specific manufacturing artefacts. In such a setting, traditional data-driven (e.g., computer vision-based) classification approaches would struggle due to training data scarcity. This allows for comprehensibility of an AI agent's acquired knowledge by the human collaborator, thus contributing to trustworthiness. We situate our approach within an existing explainability framework specifying explanation desiderata, and we provide arguments for our system's applicability and appropriateness for different roles of human agents collaborating with the AI system throughout its design, validation, and operation.
Tune-Mode ConvBN Blocks For Efficient Transfer Learning
Abstract
Convolution-BatchNorm (ConvBN) blocks are integral components in various computer vision tasks and other domains. A ConvBN block can operate in three modes: Train, Eval, and Deploy. While the Train mode is indispensable for training models from scratch, the Eval mode is suitable for transfer learning and model validation, and the Deploy mode is designed for the deployment of models. This paper focuses on the trade-off between stability and efficiency in ConvBN blocks: Deploy mode is efficient but suffers from training instability; Eval mode is widely used in transfer learning but lacks efficiency. To solve the dilemma, we theoretically reveal the reason behind the diminished training stability observed in the Deploy mode. Subsequently, we propose a novel Tune mode to bridge the gap between Eval mode and Deploy mode. The proposed Tune mode is as stable as Eval mode for transfer learning, and its computational efficiency closely matches that of the Deploy mode. Through extensive experiments in both object detection and classification tasks, carried out across various datasets and model architectures, we demonstrate that the proposed Tune mode does not hurt the original performance while significantly reducing GPU memory footprint and training time, thereby contributing an efficient solution to transfer learning with convolutional networks.
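The Deploy mode discussed above corresponds to the standard BatchNorm-folding identity: since BN applies a per-channel affine map, it can be absorbed into the convolution's weights and bias. A minimal NumPy sketch of the textbook fold (not the paper's proposed Tune mode) is:

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    # Deploy-mode folding: y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    # becomes a single convolution with rescaled weights and a shifted bias.
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    W_f = W * scale[:, None, None, None]        # W shape: (C_out, C_in, kH, kW)
    b_f = (b - mean) * scale + beta
    return W_f, b_f
```

Eval mode instead keeps the conv and BN as separate ops with frozen statistics; the paper's Tune mode sits between the two, matching Deploy's efficiency while retaining Eval's training stability.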
LLM-Pruner: On the Structural Pruning of Large Language Models
Abstract
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. With LLMs being general-purpose task solvers, we explore their compression in a task-agnostic manner, which aims to preserve the multi-task solving and language generation ability of the original LLM. One challenge to achieving this is the enormous size of the training corpus of an LLM, which makes both data transfer and model post-training overly burdensome. Thus, we tackle the compression of LLMs within the bounds of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, maximally preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered through tuning techniques such as LoRA in merely 3 hours, requiring only 50K samples of data. We validate LLM-Pruner on three LLMs, including LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation. The code is available at: https://github.com/horseee/LLM-Pruner
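While LLM-Pruner's coupled-structure detection is more involved, the gradient-based importance idea it builds on can be sketched with a first-order Taylor score per output channel (a generic illustration; the `keep_ratio` parameter and this exact scoring rule are simplifying assumptions, not the paper's criterion):

```python
import numpy as np

def channel_importance(W, G):
    # First-order Taylor estimate of the loss change from zeroing an
    # output channel: |sum over that channel's weights of w * dL/dw|.
    return np.abs((W * G).sum(axis=1))

def prune_channels(W, G, keep_ratio=0.5):
    # Keep the highest-importance channels; return the pruned weight
    # matrix and the (sorted) indices of the surviving channels.
    scores = channel_importance(W, G)
    k = max(1, int(round(len(scores) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])
    return W[keep], keep
```

In a real structured-pruning pass, every tensor coupled to a removed channel (the next layer's input dimension, residual branches, etc.) must be pruned consistently, which is the part LLM-Pruner automates.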
Goal-Oriented Communications in Federated Learning via Feedback on Risk-Averse Participation
Authors: Shashi Raj Pandey, Van Phuc Bui, Petar Popovski
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract
We treat the problem of client selection in a Federated Learning (FL) setup, where the learning objective and the local incentives of the participants are used to formulate a goal-oriented communication problem. Specifically, we incorporate the risk-averse nature of participants and obtain communication-efficient on-device performance while relying on feedback from the Parameter Server (\texttt{PS}). A client has to decide its transmission plan, i.e., when not to participate in FL, based on its intrinsic incentive: the value of the trained global model upon participation by this client. Poor updates not only degrade the performance of the global model with added communication cost but also propagate the loss in performance to other participating devices. We cast the relevance of local updates as \emph{semantic information} for developing local transmission strategies, i.e., for deciding when ``not to transmit''. The devices use feedback about the state of the \texttt{PS} and evaluate their contributions in training the learning model in each aggregation period, which eventually lowers the number of occupied connections. Simulation results validate the efficacy of our proposed approach, with up to a $1.4\times$ gain in communication link utilization compared with the baselines.
Distributed MIS with Low Energy and Time Complexities
Authors: Mohsen Ghaffari, Julian Portmann
Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
We present randomized distributed algorithms for the maximal independent set problem (MIS) that, while keeping the time complexity nearly matching the best known, reduce the energy complexity substantially. These algorithms work in the standard CONGEST model of distributed message passing with $O(\log n)$ bit messages. The time complexity measures the number of rounds in the algorithm. The energy complexity measures the number of rounds each node is awake; during other rounds, the node sleeps and cannot perform any computation or communications. Our first algorithm has an energy complexity of $O(\log\log n)$ and a time complexity of $O(\log^2 n)$. Our second algorithm is faster but slightly less energy-efficient: it achieves an energy complexity of $O(\log^2 \log n)$ and a time complexity of $O(\log n \cdot \log\log n \cdot \log^* n)$. Thus, this algorithm nearly matches the $O(\log n)$ time complexity of the state-of-the-art MIS algorithms while significantly reducing their energy complexity from $O(\log n)$ to $O(\log^2 \log n)$.
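For context, the classical randomized MIS routine that such algorithms build on (a Luby-style round structure, shown here as a sequential simulation rather than the energy-efficient CONGEST variant itself) can be sketched as:

```python
import random

def luby_mis(adj):
    # adj: dict mapping node -> set of neighbours (undirected, no self-loops).
    # Luby-style rounds: each active node draws a random priority; nodes
    # that are local minima among their active neighbours join the MIS,
    # then they and their neighbours deactivate.
    active = set(adj)
    mis = set()
    while active:
        r = {v: random.random() for v in active}
        winners = {v for v in active
                   if all(r[v] < r[u] for u in adj[v] if u in active)}
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & active
        active -= removed
    return mis
```

Each round corresponds to O(1) communication rounds in CONGEST and the process finishes in O(log n) rounds with high probability; the cited algorithms additionally let nodes sleep through most rounds to cut energy complexity.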
Distribution-Free Matrix Prediction Under Arbitrary Missing Pattern
Abstract
This paper studies the open problem of conformalized entry prediction in a row/column-exchangeable matrix. The matrix setting presents novel and unique challenges, but there exists little work on this interesting topic. We meticulously define the problem, differentiate it from closely related problems, and rigorously delineate the boundary between achievable and impossible goals. We then propose two practical algorithms. The first method provides a fast emulation of full conformal prediction, while the second method leverages the technique of algorithmic stability for acceleration. Both methods are computationally efficient and can effectively safeguard coverage validity in the presence of arbitrary missing patterns. Further, we quantify the impact of missingness on prediction accuracy and establish fundamental limit results. Empirical evidence from synthetic and real-world data sets corroborates the superior performance of our proposed methods.
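As background on the conformal machinery involved, the basic split-conformal recipe for scalar responses (a standard construction shown for illustration; the paper's matrix procedure and its full-conformal emulation are more elaborate) can be sketched as:

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    # Split conformal: the ceil((n+1)(1-alpha))-th smallest calibration
    # residual gives a radius with >= 1-alpha marginal coverage for
    # exchangeable data.
    scores = np.sort(np.abs(cal_true - cal_pred))
    n = len(scores)
    k = min(n - 1, int(np.ceil((n + 1) * (1 - alpha))) - 1)
    q = scores[k]
    return test_pred - q, test_pred + q
```

The matrix setting breaks the usual i.i.d. assumption and replaces it with row/column exchangeability, which is exactly what makes the entry-prediction problem nontrivial.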
The fast reduced QMC matrix-vector product
Authors: Josef Dick, Adrian Ebert, Lukas Herrmann, Peter Kritzer, Marcello Longo
Abstract
We study the approximation of integrals $\int_D f(\boldsymbol{x}^\top A) \,\mathrm{d} \mu(\boldsymbol{x})$, where $A$ is a matrix, by quasi-Monte Carlo (QMC) rules $N^{-1} \sum_{k=0}^{N-1} f(\boldsymbol{x}_k^\top A)$. We are interested in cases where the main cost arises from calculating the products $\boldsymbol{x}_k^\top A$. We design QMC rules for which the computation of $\boldsymbol{x}_k^\top A$, $k = 0, 1, \ldots, N-1$, can be done fast, and for which the error of the QMC rule is similar to the standard QMC error. We do not require that $A$ have any particular structure. For instance, this approach can be used when approximating the expected value of a function of a multivariate normal random variable with a given covariance matrix, or when approximating the expected value of the solution of a PDE with random coefficients. The speed-up of the computation time is sometimes better and sometimes worse than that of the fast QMC matrix-vector product from [Dick, Kuo, Le Gia, and Schwab, Fast QMC Matrix-Vector Multiplication, SIAM J. Sci. Comput. 37 (2015)]. As in that paper, our approach applies to (polynomial) lattice point sets, but also to digital nets (we are currently not aware of any approach which allows one to apply the fast method from the aforementioned paper to digital nets). Our method does not use FFT; instead, we use repeated values in the quadrature points to derive a reduction in the computation time. This arises from the reduced CBC construction of lattice rules and polynomial lattice rules. The reduced CBC construction has been shown to reduce the computation time of the CBC construction. Here we show that it can also be used to reduce the computation time of the QMC rule.
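To make the quadrature concrete: a generic QMC rule of the form above replaces random samples by a low-discrepancy point set. The sketch below uses a Halton sequence for illustration (a standard construction; the paper's reduced lattice/digital-net rules and their fast matrix-vector product are not reproduced here) to approximate $\int_{[0,1]^s} f(\boldsymbol{x}^\top A)\,\mathrm{d}\boldsymbol{x}$:

```python
import numpy as np

def van_der_corput(k, base):
    # radical inverse of the integer k in the given base
    v, denom = 0.0, 1.0
    while k:
        k, rem = divmod(k, base)
        denom *= base
        v += rem / denom
    return v

def halton(n, dim, primes=(2, 3, 5, 7, 11, 13)):
    # first n points of the Halton sequence in [0,1)^dim
    return np.array([[van_der_corput(k + 1, primes[j]) for j in range(dim)]
                     for k in range(n)])

def qmc_estimate(f, A, n=4096):
    # QMC rule: (1/N) * sum_k f(x_k^T A) over low-discrepancy points x_k
    X = halton(n, A.shape[0])
    return np.mean([f(x @ A) for x in X])
```

Here every quadrature point requires a fresh product $\boldsymbol{x}_k^\top A$; the point of the paper's reduced construction is to choose the point set so that many of these products share repeated values and the overall cost drops.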
Enhancing data security against cyberattacks in artificial intelligence based smartgrid systems with crypto agility
Authors: Marcelo Simoes, Mohammed Elmusrati, Tero Vartiainen, Mike Mekkanen, Mazaher Karimi, Sayawu Diaba, Emmanuel Anti, Wilson Lopes
Abstract
Microgrids enable a new paradigm of electricity generation at the distribution level with renewable and alternative sources. The main idea is to deploy microgrids on low- or medium-voltage active distribution networks. They can be advantageous in many ways, such as improving the energy efficiency and reliability of the system and reducing transmission losses and network congestion. Implementing microgrids (MGs) with distributed energy resource (DER) units poses challenges related to power quality and stability issues, voltage and fault-level changes, energy management, low inertia, more complex protection schemes, load and generation forecasting, cyber-attacks, and cyber security. This paper shows the deep utilization of advanced, accurate, and fast methodologies, such as artificial-intelligence-based techniques, that guarantee efficient, optimal, and reliable operation of smart grids that is safe against cyberattacks. AI refers to a computer-based system's ability to perform tasks with the intelligence typically associated with human decision-making; such systems can learn from past experience and solve problems.
Region of Attraction Estimation Using Union Theorem in Sum-of-Squares Optimization
Abstract
Appropriate estimation of the Region of Attraction of a nonlinear dynamical system plays a key role in system analysis and control design. Sum-of-Squares optimization is a powerful tool enabling Region of Attraction estimation for polynomial dynamical systems. Employing a positive definite function called a shape function within the Sum-of-Squares procedure helps to find a richer representation of the Lyapunov function and a larger corresponding Region of Attraction estimate. However, existing Sum-of-Squares optimization techniques yield very conservative results. The main novelty of this paper is the Union theorem, which enables the use of multiple shape functions to create a polynomial Lyapunov function encompassing all the areas generated by the shape functions. The main contribution is a novel, computationally efficient numerical method for Region of Attraction estimation, which remarkably improves estimation performance and overcomes limitations of existing methods, while keeping the resulting Lyapunov function polynomial, thus facilitating control system design and the construction of a control Lyapunov function with an enhanced Region of Attraction using conventional Sum-of-Squares tools. A mathematical proof of the Union theorem, along with its application to the numerical algorithm for Region of Attraction estimation, is provided. The method yields significantly enlarged Region of Attraction estimates, even for systems with non-symmetric or unbounded Regions of Attraction, as demonstrated via simulations of several benchmark examples.
Abstract
BCH codes are an important class of cyclic codes due to their efficient encoding and decoding algorithms; antiprimitive BCH codes have attracted considerable attention in recent years. In this paper, we mainly study a class of BCH codes of length $n=\frac{q^{m}+1}{\lambda}$, where $\lambda\mid (q+1)$ is an integer. We give several classes of BCH codes with good parameters, including many optimal linear codes. We also present the first few largest coset leaders modulo $n$, thereby partially settling two conjectures about BCH codes.
Probabilistic Lexicase Selection
Authors: Li Ding, Edward Pantridge, Lee Spector
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Abstract
Lexicase selection is a widely used parent selection algorithm in genetic programming, known for its success in various task domains such as program synthesis, symbolic regression, and machine learning. Due to its non-parametric and recursive nature, calculating the probability of each individual being selected by lexicase selection has been proven to be an NP-hard problem, which discourages deeper theoretical understanding and practical improvements to the algorithm. In this work, we introduce probabilistic lexicase selection (plexicase selection), a novel parent selection algorithm that efficiently approximates the probability distribution of lexicase selection. Our method not only demonstrates superior problem-solving capabilities as a semantic-aware selection method, but also benefits from having a probabilistic representation of the selection process, enabling enhanced efficiency and flexibility. Experiments are conducted in two prevalent domains of genetic programming, program synthesis and symbolic regression, using standard benchmarks including PSB and SRBench. The empirical results show that plexicase selection achieves state-of-the-art problem-solving performance that is competitive with lexicase selection, while significantly outperforming it in computational efficiency.
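As background, here is a minimal sketch of the standard lexicase selection procedure whose selection probabilities plexicase selection approximates; the toy pool and error matrix are hypothetical:

```python
import random

def lexicase_select(pool, errors, rng=random):
    """Standard lexicase selection. errors[i][c] is candidate i's error on
    test case c; cases are visited in random order, and only candidates that
    are elite on every case seen so far survive."""
    candidates = list(range(len(pool)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)
    for c in cases:
        best = min(errors[i][c] for i in candidates)
        candidates = [i for i in candidates if errors[i][c] == best]
        if len(candidates) == 1:
            break
    return pool[rng.choice(candidates)]

# Candidate 'c' is elite on both cases, so lexicase always selects it.
pool = ['a', 'b', 'c']
errors = [[0, 1], [1, 0], [0, 0]]
winner = lexicase_select(pool, errors)
```

Because the outcome depends on the random case ordering, the exact selection probabilities are expensive to compute in general, which is the NP-hardness the abstract refers to.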
Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Authors: Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, Hongliang Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract
Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still heavily rely on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic workloads and have limited time for answering. For this purpose, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most existing VQA methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (1) surgical object detection models are scarce due to small datasets and a lack of bounding-box annotations; (2) current fusion strategies for heterogeneous modalities like text and image are naive; (3) localized answering is missing, which is crucial in complex surgical scenarios. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during answer prediction. To handle the fusion of heterogeneous modalities, we design a gated vision-language embedding (GVLE) to build input patches for the Language Vision Transformer (LViT) to predict the answer. To obtain localization, we add a detection head in parallel with the prediction head of the LViT. We also integrate a GIoU loss to boost localization performance while preserving the accuracy of the question-answering model. We annotate two VQLA datasets by utilizing publicly available surgical videos from the MICCAI challenges EndoVis-17 and 18. Our validation results suggest that Surgical-VQLA can better understand the surgical scene and localize the specific area related to the question. GVLE presents an efficient language-vision embedding technique, showing superior performance over the existing benchmarks.
RGCVAE: Relational Graph Conditioned Variational Autoencoder for Molecule Design
Abstract
Identifying molecules that exhibit some pre-specified properties is a difficult problem to solve. In the last few years, deep generative models have been used for molecule generation. Deep Graph Variational Autoencoders are among the most powerful machine learning tools with which it is possible to address this problem. However, existing methods struggle in capturing the true data distribution and tend to be computationally expensive. In this work, we propose RGCVAE, an efficient and effective Graph Variational Autoencoder based on: (i) an encoding network exploiting a new powerful Relational Graph Isomorphism Network; (ii) a novel probabilistic decoding component. Compared to several state-of-the-art VAE methods on two widely adopted datasets, RGCVAE shows state-of-the-art molecule generation performance while being significantly faster to train.
Efficient and Deterministic Search Strategy Based on Residual Projections for Point Cloud Registration
Abstract
Estimating the rigid transformation between two LiDAR scans through putative 3D correspondences is a typical point cloud registration paradigm. Current 3D feature matching approaches commonly lead to numerous outlier correspondences, making outlier-robust registration techniques indispensable. Many recent studies have adopted the branch and bound (BnB) optimization framework to solve the correspondence-based point cloud registration problem globally and deterministically. Nonetheless, BnB-based methods are time-consuming when searching the entire 6-dimensional parameter space, since their computational complexity is exponential in the dimension of the solution domain. To enhance algorithm efficiency, existing works attempt to decouple the original 6-degrees-of-freedom (DOF) problem into two 3-DOF sub-problems, thereby reducing the dimension of the parameter space. In contrast, our proposed approach introduces a novel pose decoupling strategy based on residual projections, effectively decomposing the raw problem into three 2-DOF rotation search sub-problems. Subsequently, we employ a novel BnB-based search method to solve these sub-problems, achieving efficient and deterministic registration. Furthermore, our method can be adapted to address the challenging problem of simultaneous pose and correspondence registration (SPCR). Through extensive experiments conducted on synthetic and real-world datasets, we demonstrate that our proposed method outperforms state-of-the-art methods in terms of efficiency, while simultaneously ensuring robustness.
Direction Specific Ambisonics Source Separation with End-To-End Deep Learning
Authors: Francesc Lluís, Nils Meyer-Kahlen, Vasileios Chatziioannou, Alex Hofmann
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
Ambisonics is a scene-based spatial audio format that has several useful features compared to object-based formats, such as efficient whole scene rotation and versatility. However, it does not provide direct access to the individual source signals, so that these have to be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this paper, we explore deep-learning-based source separation on static Ambisonics mixtures. In contrast to most source separation approaches, which separate a fixed number of sources of specific sound types, we focus on separating arbitrary sound from specific directions. Specifically, we propose three operating modes that combine a source separation neural network with SH beamforming: refinement, implicit, and mixed mode. We show that a neural network can implicitly associate conditioning directions with the spatial information contained in the Ambisonics scene to extract specific sources. We evaluate the performance of the three proposed approaches and compare them to SH beamforming on musical mixtures generated with the musdb18 dataset, as well as with mixtures generated with the FUSS dataset for universal source separation, under both anechoic and room conditions. Results show that the proposed approaches offer improved separation performance and spatial selectivity compared to conventional SH beamforming.
A One-Class Classifier for the Detection of GAN Manipulated Multi-Spectral Satellite Images
Authors: Lydia Abady, Giovanna Maria Dimitri, Mauro Barni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
The highly realistic image quality achieved by current image generative models has many academic and industrial applications. To limit the use of such models to benign applications, though, tools that can conclusively detect whether an image has been generated synthetically must be developed. For this reason, several detectors providing excellent performance in computer vision applications have been developed; however, they cannot be applied as they are to multispectral satellite images, and hence new models must be trained. In general, two-class classifiers can achieve very good detection accuracies, but they are not able to generalize to image domains and generative-model architectures different from those used during training. For this reason, in this paper, we propose a one-class classifier based on Vector Quantized Variational Autoencoder 2 (VQ-VAE 2) features to overcome the limitations of two-class classifiers. First, we highlight the generalization problem that binary classifiers suffer from by training and testing an EfficientNet-B4 architecture on multiple multispectral datasets. Then we show that, since the VQ-VAE 2 based classifier is trained only on pristine images, it is able to detect images belonging to different domains and generated by architectures that have not been used during training. Last, we compare the two classifiers head-to-head on the same generated datasets, highlighting the superior generalization capabilities of the VQ-VAE 2-based detector.
Making $\textsf{IP}=\textsf{PSPACE}$ Practical: Efficient Interactive Protocols for BDD Algorithms
Authors: Eszter Couillard, Philipp Czerner, Javier Esparza, Rupak Majumdar
Subjects: Logic in Computer Science (cs.LO); Computational Complexity (cs.CC)
Abstract
We show that interactive protocols between a prover and a verifier, a well-known tool of complexity theory, can be used in practice to certify the correctness of automated reasoning tools. Theoretically, interactive protocols exist for all $\textsf{PSPACE}$ problems. The verifier of a protocol checks the prover's answer to a problem instance in polynomial time, with polynomially many bits of communication, and with exponentially small probability of error. (The prover may need exponential time.) Existing interactive protocols are not used in practice because their provers use naive algorithms, inefficient even for small instances, that are incompatible with practical implementations of automated reasoning. We bridge the gap between theory and practice by means of a novel interactive protocol whose prover uses BDDs. We consider the problem of counting the number of assignments to a QBF instance ($\#\textrm{CP}$), which has a natural BDD-based algorithm. We give an interactive protocol for $\#\textrm{CP}$ whose prover is implemented on top of an extended BDD library. The prover has only a linear overhead in computation time over the natural algorithm. We have implemented our protocol in $\textsf{blic}$, a certifying tool for $\#\textrm{CP}$. Experiments on standard QBF benchmarks show that $\textsf{blic}$ is competitive with state-of-the-art QBF-solvers. The run time of the verifier is negligible. While loss of absolute certainty can be concerning, the error probability in our experiments is at most $10^{-10}$ and reduces to $10^{-10k}$ by repeating the verification $k$ times.
Video Killed the HD-Map: Predicting Driving Behavior Directly From Drone Images
Authors: Yunpeng Liu, Vasileios Lioutas, Jonathan Wilder Lavington, Matthew Niedoba, Justice Sefas, Setareh Dabiri, Dylan Green, Xiaoxuan Liang, Berend Zwartsenberg, Adam Ścibior, Frank Wood
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
The development of algorithms that learn behavioral driving models using human demonstrations has led to increasingly realistic simulations. In general, such models learn to jointly predict trajectories for all controlled agents by exploiting road context information such as drivable lanes obtained from manually annotated high-definition (HD) maps. Recent studies show that these models can greatly benefit from increasing the amount of human data available for training. However, the manual annotation of HD maps which is necessary for every new location puts a bottleneck on efficiently scaling up human traffic datasets. We propose a drone birdview image-based map (DBM) representation that requires minimal annotation and provides rich road context information. We evaluate multi-agent trajectory prediction using the DBM by incorporating it into a differentiable driving simulator as an image-texture-based differentiable rendering module. Our results demonstrate competitive multi-agent trajectory prediction performance when using our DBM representation as compared to models trained with rasterized HD maps.
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs
Authors: Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam
Abstract
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency: poll the LLM multiple times and output the most frequent solution. Existing Self-Consistency techniques always draw a constant number of samples per question, whereas a better approach would be to non-uniformly distribute the available budget based on the amount of agreement in the samples drawn so far. In response, we introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question using a lightweight stopping criterion. Our experiments over 13 datasets and two LLMs demonstrate that Adaptive-Consistency reduces the sample budget by up to 6.0 times with an average accuracy drop of less than 0.1%.
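The idea can be sketched as follows; the simple majority-fraction stopping rule below is a hypothetical stand-in for the paper's lightweight criterion, and `sample_fn` stands for one LLM call:

```python
from collections import Counter

def adaptive_consistency(sample_fn, max_samples=40, threshold=0.95, min_samples=3):
    """Draw answers one at a time and stop early once the empirical majority
    fraction reaches `threshold`, instead of always drawing `max_samples`."""
    answers = []
    for _ in range(max_samples):
        answers.append(sample_fn())
        answer, freq = Counter(answers).most_common(1)[0]
        if len(answers) >= min_samples and freq / len(answers) >= threshold:
            return answer, len(answers)
    return Counter(answers).most_common(1)[0][0], len(answers)

# A model that always agrees with itself stops after min_samples calls.
answer, used = adaptive_consistency(lambda: "42")
```

Easy questions, where samples quickly agree, consume few calls, freeing budget for harder ones; this is the non-uniform allocation the abstract describes.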
Reducing Sequence Length by Predicting Edit Operations with Large Language Models
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance in various tasks and gained significant attention. LLMs are also used for local sequence transduction tasks, including grammatical error correction (GEC) and formality style transfer, where most tokens in a source text are kept unchanged. However, it is inefficient to generate all target tokens, both because a prediction error in one target token may cause a catastrophe in predicting subsequent tokens and because the computational cost grows quadratically with the target sequence length. This paper proposes predicting a set of edit operations on the source text for local sequence transduction tasks. By representing an edit operation as a span of the source text and its changed tokens, we can reduce the length of the target sequence and thus the computational cost of inference. We apply instruction tuning for LLMs on supervision data of edit operations. Experiments show that the proposed method achieves performance comparable to the baseline in four tasks (paraphrasing, formality style transfer, GEC, and text simplification) despite reducing the length of the target text by as little as 21\%. Furthermore, we report that instruction tuning with the proposed method achieves state-of-the-art performance in the four tasks.
Adaptive identification of SISO linear infinite-dimensional systems
Abstract
We propose an adaptive algorithm for identifying the unknown parameter in a linear exponentially stable single-input single-output infinite-dimensional system. We assume that the transfer function of the infinite-dimensional system can be expressed as a ratio of two infinite series in s (the Laplace variable). We also assume that certain identifiability conditions, which include a persistency of excitation condition, hold. For a fixed integer n, we propose an update law driven by real-time input-output data for estimating the first n+1 coefficients in the numerator and the denominator of the transfer function. We show that the estimates for the transfer function coefficients generated by the update law are close to the true values at large times provided n is sufficiently large (the estimates converge to the true values as time and n tend to infinity). The unknown parameter can be reconstructed using the transfer function coefficient estimates obtained with n large and the algebraic expressions relating the transfer function coefficients to the unknown parameter. We also provide a numerical scheme for verifying the identifiability conditions and for choosing n sufficiently large so that the value of the reconstructed parameter is close to the true value. The class of systems to which our approach is applicable includes many partial differential equations with constant/spatially-varying coefficients and distributed/boundary input and output. We illustrate the efficacy of our approach using three examples: a delay system with four unknown scalars, a 1D heat equation with two unknown scalars and a 1D wave equation with an unknown spatially-varying coefficient.
Keyword: faster
Improved and Partially-Tight Lower Bounds for Message-Passing Implementations of Multiplicity Queues
Authors: Anh Tran, Edward Talmage
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Abstract
A multiplicity queue is a concurrently-defined data type which relaxes the conditions of a linearizable FIFO queue to allow concurrent Dequeue instances to return the same value. It would seem that this should allow faster implementations, as processes should not need to wait as long to learn about concurrent operations at remote processes, and previous work has shown that multiplicity queues are computationally less complex than the unrelaxed version. Intriguingly, recent work has shown that there is, in fact, not much speedup possible versus an unrelaxed queue implementation. Seeking to understand this difference between intuition and real behavior, we extend that work, increasing the lower bound for uniform algorithms. Further, we outline a path forward toward building proofs for even higher lower bounds, allowing us to hypothesize that the worst-case time to Dequeue approaches the maximum message delay, which is similar to the time required for an unrelaxed Dequeue. We also give an upper bound for a special case to show that our bounds are tight at that point. To achieve our lower bounds, we use extended shifting arguments, which have been rarely used but allow larger lower bounds than traditional shifting arguments. We use these in a series of inductive indistinguishability proofs which allow us to extend our proofs beyond the usual limitations of shifting arguments. This proof structure is an interesting contribution independently of the main result, as developing new lower bound proof techniques may have many uses in future work.
Engineering an algorithm for constructing low-stretch geometric graphs with near-greedy average-degrees
Abstract
We design and engineer Fast-Sparse-Spanner, a simple and practical (fast and memory-efficient) algorithm for constructing sparse low stretch-factor geometric graphs on large pointsets in the plane. To our knowledge, this is the first practical algorithm to construct fast low stretch-factor graphs on large pointsets with average-degrees (hence, the number of edges) competitive with that of greedy-spanners, the sparsest known class of Euclidean geometric spanners. To evaluate our implementation in terms of computation speed, memory usage, and quality of output, we performed extensive experiments with synthetic and real-world pointsets, and by comparing it to our closest competitor Bucketing, the fastest known greedy-spanner algorithm for pointsets in the plane, devised by Alewijnse et al. (Algorithmica, 2017). We always found that Fast-Sparse-Spanner generated near-greedy t-spanners while being fast and memory-efficient. Our experiment with constructing a 1.1-spanner on a large synthetic pointset with 128K points uniformly distributed within a square shows more than a 41-fold speedup with roughly a third of the memory usage of that of Bucketing, but with only a 3% increase in the average-degree of the resulting graph. In terms of diameter, the graphs generated by Fast-Sparse-Spanner beat greedy-spanners in most cases (have substantially lower diameter) while maintaining near-greedy average-degree. As a byproduct of our research, we design and engineer Fast-Stretch-Factor, a practical parallelizable algorithm that can measure the stretch-factor of any graph generated by Fast-Sparse-Spanner. Our experiments show that it is much faster than the naive Dijkstra-based stretch-factor measurement algorithm.
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Abstract
Transformers achieve promising performance in document understanding because of their high effectiveness, but they still suffer from quadratic computational complexity in the sequence length. General efficient transformers are challenging to adapt directly to document modeling: they are unable to handle the layout representations in documents (e.g., words, lines, and paragraphs) at different granularity levels, and they struggle to achieve a good trade-off between efficiency and performance. To tackle these concerns, we propose Fast-StrucTexT, an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture, for visual document understanding. Specifically, we design a modality-guided dynamic token merging block that makes the model learn multi-granularity representations and prune redundant tokens. Additionally, we present a multi-modal interaction module called Symmetry Cross-Attention (SCA) to enable multi-modal fusion and efficiently guide token merging. SCA allows one modality input, as the query, to calculate cross-attention with another modality in a dual phase. Extensive experiments on the FUNSD, SROIE, and CORD datasets demonstrate that our model achieves state-of-the-art performance and almost 1.9X faster inference time than the state-of-the-art methods.
Practical algorithms and experimentally validated incentives for equilibrium-based fair division (A-CEEI)
Authors: Eric Budish, Ruiquan Gao, Abraham Othman, Aviad Rubinstein, Qianfan Zhang
Subjects: Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
Abstract
Approximate Competitive Equilibrium from Equal Incomes (A-CEEI) is an equilibrium-based solution concept for fair division of discrete items to agents with combinatorial demands. In theory, it is known that in asymptotically large markets: 1. For incentives, the A-CEEI mechanism is Envy-Free-but-for-Tie-Breaking (EF-TB), which implies that it is Strategyproof-in-the-Large (SP-L). 2. From a computational perspective, computing the equilibrium solution is unfortunately a computationally intractable problem (in the worst-case, assuming $\textsf{PPAD}\ne \textsf{FP}$). We develop a new heuristic algorithm that outperforms the previous state-of-the-art by multiple orders of magnitude. This new, faster algorithm lets us perform experiments on real-world inputs for the first time. We discover that with real-world preferences, even in a realistic implementation that satisfies the EF-TB and SP-L properties, agents may have surprisingly simple and plausible deviations from truthful reporting of preferences. To this end, we propose a novel strengthening of EF-TB, which dramatically reduces the potential for strategic deviations from truthful reporting in our experiments. A (variant of) our algorithm is now in production: on real course allocation problems it is much faster, has zero clearing error, and has stronger incentive properties than the prior state-of-the-art implementation.
JetSeg: Efficient Real-Time Semantic Segmentation Model for Low-Power GPU-Embedded Systems
Authors: Miguel Lopez-Montiel, Daniel Alejandro Lopez, Oscar Montiel
Abstract
Real-time semantic segmentation is a challenging task that requires high-accuracy models with low inference times. Implementing these models on embedded systems is limited by hardware capability and memory usage, which produces bottlenecks. We propose an efficient model for real-time semantic segmentation called JetSeg, consisting of an encoder called JetNet and an improved RegSeg decoder. JetNet is designed for GPU-embedded systems and includes the following components: JetBlock, a new lightweight efficient block that reduces the number of parameters, minimizing memory usage and inference time without sacrificing accuracy; JetConv, a new strategy combining asymmetric and non-asymmetric convolutions with depthwise-dilated convolutions, together with a channel shuffle operation, lightweight activation functions, and a number of group convolutions convenient for embedded systems; and an innovative loss function named JetLoss, which integrates the Precision, Recall, and IoUB losses to improve semantic segmentation and reduce computational complexity. Experiments demonstrate that JetSeg is much faster on workstation devices and more suitable for low-power GPU-embedded systems than existing state-of-the-art models for real-time semantic segmentation. Our approach outperforms state-of-the-art real-time encoder-decoder models, using 46.70M fewer parameters and 5.14% fewer GFLOPs, which makes JetSeg up to 2x faster than other models on the NVIDIA Titan RTX GPU and the Jetson Xavier. The JetSeg code is available at https://github.com/mmontielpz/jetseg.
Beyond Exponential Graph: Communication-Efficient Topologies for Decentralized Learning via Finite-time Convergence
Abstract
Decentralized learning has recently been attracting increasing attention for its applications in parallel computation and privacy preservation. Many recent studies have shown that an underlying network topology with a faster consensus rate (a.k.a. spectral gap) leads to a better convergence rate and accuracy for decentralized learning. However, a topology with a fast consensus rate, e.g., the exponential graph, generally has a large maximum degree, which incurs significant communication costs. Thus, seeking topologies with both a fast consensus rate and a small maximum degree is important. In this study, we propose a novel topology combining both a fast consensus rate and a small maximum degree, called the Base-$(k + 1)$ Graph. Unlike the existing topologies, the Base-$(k + 1)$ Graph enables all nodes to reach exact consensus after a finite number of iterations for any number of nodes and any maximum degree $k$. Thanks to this favorable property, the Base-$(k + 1)$ Graph endows Decentralized SGD (DSGD) with both a faster convergence rate and greater communication efficiency than the exponential graph. We conducted experiments with various topologies, demonstrating that the Base-$(k + 1)$ Graph enables various decentralized learning methods to achieve higher accuracy with better communication efficiency than the existing topologies.
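For intuition about finite-time exact consensus, here is the classic power-of-two special case (hypercube averaging), where every node holds the exact mean after $\log_2 n$ rounds; the Base-$(k+1)$ Graph generalizes this behavior to arbitrary numbers of nodes and maximum degrees, which this sketch does not attempt:

```python
import numpy as np

def hypercube_average(x):
    """Exact average consensus in log2(n) rounds for n = 2^m nodes:
    in round t, node i averages its value with node i XOR 2^t."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    assert n & (n - 1) == 0, "this sketch needs n to be a power of two"
    for t in range(n.bit_length() - 1):
        partner = np.arange(n) ^ (1 << t)   # neighbor differing in bit t
        x = 0.5 * (x + x[partner])
    return x

vals = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
out = hypercube_average(vals)               # all entries equal the exact mean
```

Each round uses degree-1 communication, yet consensus is exact rather than asymptotic, which is the property that makes such topologies attractive for DSGD.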
Strix: An End-to-End Streaming Architecture with Two-Level Ciphertext Batching for Fully Homomorphic Encryption with Programmable Bootstrapping
Authors: Adiwena Putra, Prasetiyo, Yi Chen, John Kim, Joo-Young Kim
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Abstract
Homomorphic encryption (HE) enables computations on encrypted data by concealing information under noise for security. However, the process of bootstrapping, which resets the noise level in the ciphertext, is computationally expensive and requires a large bootstrapping key. The TFHE scheme offers a faster and programmable bootstrapping algorithm called PBS, crucial for security-focused applications like machine learning. Nevertheless, the current TFHE scheme lacks support for ciphertext packing, resulting in low throughput. This work thoroughly analyzes TFHE bootstrapping, identifies the bottleneck in GPUs caused by the blind rotation fragmentation problem, and proposes a hardware TFHE accelerator called Strix. Strix introduces a two-level batching approach to enhance the batch size in PBS, utilizes a specialized microarchitecture for efficient streaming data processing, and incorporates a fully-pipelined FFT microarchitecture to improve performance. It achieves significantly higher throughput than state-of-the-art implementations on both CPUs and GPUs, outperforming existing TFHE accelerators by a factor of 7.4.
A new family of fourth-order energy-preserving integrators
Abstract
For Hamiltonian systems with non-canonical structure matrices, a new family of fourth-order energy-preserving integrators is presented. The integrators take the form of a combination of Runge--Kutta and continuous-stage Runge--Kutta methods and feature a set of free parameters that offer greater flexibility and efficiency. Specifically, we demonstrate that, by carefully choosing these free parameters, a simplified Newton iteration applied to the fourth-order integrators can be made parallelizable. This results in faster and more efficient integrators compared with existing fourth-order energy-preserving integrators.
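As background, the classical second-order average vector field (AVF) method illustrates the energy-preserving, continuous-stage construction that this family generalizes. The sketch below is not the paper's fourth-order scheme; it implements AVF for $\dot y = S\,\nabla H(y)$ with Gauss quadrature and fixed-point iteration:

```python
import numpy as np

def avf_step(y, h, S, grad_H, n_quad=5, n_iter=50):
    """One step of the average vector field (AVF) method for y' = S grad_H(y):
        (y1 - y0)/h = S * \int_0^1 grad_H((1 - s) y0 + s y1) ds.
    The integral is approximated by Gauss-Legendre quadrature and the
    implicit equation is solved by fixed-point iteration."""
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    s = 0.5 * (nodes + 1.0)   # map [-1, 1] -> [0, 1]
    w = 0.5 * weights
    y1 = y.copy()
    for _ in range(n_iter):
        avg = sum(wi * grad_H((1 - si) * y + si * y1) for si, wi in zip(s, w))
        y1 = y + h * (S @ avg)
    return y1

# Harmonic oscillator: H(q, p) = (q^2 + p^2)/2 with the canonical structure matrix.
S = np.array([[0.0, 1.0], [-1.0, 0.0]])
grad_H = lambda y: y
y = np.array([1.0, 0.0])
H0 = 0.5 * y @ y
for _ in range(100):
    y = avf_step(y, 0.1, S, grad_H)
# The energy H(y) stays at its initial value up to iteration tolerance.
```

For a quadratic Hamiltonian, AVF reduces to the implicit midpoint rule, so the energy is conserved exactly up to the fixed-point residual; the paper's contribution is pushing such schemes to order four with parallelizable Newton iterations.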
Distributed MIS with Low Energy and Time Complexities
Authors: Mohsen Ghaffari, Julian Portmann
Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
We present randomized distributed algorithms for the maximal independent set problem (MIS) that, while keeping the time complexity nearly matching the best known, reduce the energy complexity substantially. These algorithms work in the standard CONGEST model of distributed message passing with $O(\log n)$ bit messages. The time complexity measures the number of rounds in the algorithm. The energy complexity measures the number of rounds each node is awake; during other rounds, the node sleeps and cannot perform any computation or communications. Our first algorithm has an energy complexity of $O(\log\log n)$ and a time complexity of $O(\log^2 n)$. Our second algorithm is faster but slightly less energy-efficient: it achieves an energy complexity of $O(\log^2 \log n)$ and a time complexity of $O(\log n \cdot \log\log n \cdot \log^* n)$. Thus, this algorithm nearly matches the $O(\log n)$ time complexity of the state-of-the-art MIS algorithms while significantly reducing their energy complexity from $O(\log n)$ to $O(\log^2 \log n)$.
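For reference, the classical randomized MIS routine (Luby's algorithm) that such distributed algorithms refine can be sketched as follows in a centralized simulation. The paper's actual contribution, scheduling nodes to sleep in most rounds to cut energy complexity, is not reproduced here:

```python
import random

def luby_mis(adj, seed=0):
    """Luby's randomized maximal independent set on an undirected graph.
    adj: dict mapping node -> set of neighbor nodes."""
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    while active:
        # Each active node draws a random priority; local minima join the MIS.
        r = {v: rng.random() for v in active}
        winners = {v for v in active
                   if all(r[v] < r[u] for u in adj[v] if u in active)}
        mis |= winners
        # Winners and their neighbors leave the computation.
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & active
        active -= removed
    return mis

# 5-cycle: every maximal independent set has exactly 2 nodes.
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
result = luby_mis(adj)
```

Luby's algorithm finishes in $O(\log n)$ rounds with high probability, but every node is awake in every round; reducing the awake time to $O(\log\log n)$ per node is precisely what the paper's energy-complexity results achieve.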
RGCVAE: Relational Graph Conditioned Variational Autoencoder for Molecule Design
Abstract
Identifying molecules that exhibit some pre-specified properties is a difficult problem to solve. In the last few years, deep generative models have been used for molecule generation. Deep Graph Variational Autoencoders are among the most powerful machine learning tools with which it is possible to address this problem. However, existing methods struggle to capture the true data distribution and tend to be computationally expensive. In this work, we propose RGCVAE, an efficient and effective Graph Variational Autoencoder based on: (i) an encoding network exploiting a new powerful Relational Graph Isomorphism Network; (ii) a novel probabilistic decoding component. Compared to several state-of-the-art VAE methods on two widely adopted datasets, RGCVAE shows state-of-the-art molecule generation performance while being significantly faster to train.
Keyword: mobile
PDP: Parameter-free Differentiable Pruning is All You Need
Abstract
DNN pruning is a popular way to reduce the size of a model, improve the inference latency, and minimize the power consumption on DNN accelerators. However, existing approaches might be too complex, expensive or ineffective to apply to a variety of vision/language tasks, DNN architectures and to honor structured pruning constraints. In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art qualities in model size, accuracy, and training cost. PDP uses a dynamic function of weights during training to generate soft pruning masks for the weights in a parameter-free manner for a given pruning target. While differentiable, the simplicity and efficiency of PDP make it universal enough to deliver state-of-the-art random/structured/channel pruning results on various vision and natural language tasks. For example, for MobileNet-v1, PDP can achieve 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 1.7% higher accuracy than those from the state-of-the-art algorithms. Also, PDP yields over 83.1% accuracy on Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the next best from the existing techniques shows 81.5% accuracy. In addition, PDP can be applied to structured pruning, such as N:M pruning and channel pruning. For 1:4 structured pruning of ResNet18, PDP improved the top-1 ImageNet1k accuracy by over 3.6% over the state-of-the-art. For channel pruning of ResNet50, PDP reduced the top-1 ImageNet1k accuracy by 0.6% from the state-of-the-art.
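A minimal sketch of the kind of parameter-free soft mask PDP describes: the threshold is computed dynamically from the weights themselves for a given pruning target, and the mask is a smooth (hence differentiable) function of the weights. The sigmoid relaxation and the `temperature` constant below are illustrative assumptions; the paper's exact mask function may differ:

```python
import numpy as np

def soft_pruning_mask(w, target_sparsity, temperature=1e-4):
    """Parameter-free soft mask: weights whose squared magnitude falls below
    the threshold implied by the target sparsity get masks near 0, the rest
    near 1. No learnable mask parameters are introduced (sketch only)."""
    w2 = w ** 2
    t = np.quantile(w2, target_sparsity)          # dynamic threshold from weights
    z = np.clip((w2 - t) / temperature, -60.0, 60.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
m = soft_pruning_mask(w, target_sparsity=0.9)
# Roughly 90% of the mask entries are near zero, matching the pruning target.
```

During training, the forward pass would use `w * m`, so gradients flow through both the weights and the mask they induce, without any extra mask parameters to learn.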
Two-Tier UAV-based Low Power Wide Area Networks: A Testbed and Experimentation Study
Authors: Srdjan Sobot, Milan Lukic, Dusan Bortnik, Vladimir Nikic, Brena Lima, Marko Beko, Dejan Vukobratovic
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
In this paper, we propose, design, deploy and demonstrate a two-tier Low Power Wide Area Network (LP WAN) system based on Unmanned Aerial Vehicle (UAV) base stations suitable for dynamic deployment in deep rural environments. The proposed UAV-based LP WAN network augments the existing macro-cellular LP WAN network (Tier 1) with an additional layer of mobile base stations (Tier 2), also based on LP WAN technology. Mobile Tier 2 LP WAN base stations provide connectivity to static or mobile LP WAN user equipment deployed in areas without direct Tier 1 LP WAN network coverage. The proposed two-tier LP WAN network scenario is suitable for various agricultural, forestry and environmental applications such as livestock or wild animal monitoring. In this experimental work, we report on a prototype that was successfully deployed and used in a real-world deep rural environment without Tier 1 LP WAN network coverage.
RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration
Authors: Haram Choi, Cheolwoong Na, Jihyeon Oh, Seungjae Lee, Jinseop Kim, Subeen Choe, Jeongmin Lee, Taehoon Kim, Jihoon Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Although many recent works have made advancements in the image restoration (IR) field, they often suffer from an excessive number of parameters. Another issue is that most Transformer-based IR methods focus only on either local or global features, leading to limited receptive fields or parameter inefficiency. To address these problems, we propose a lightweight IR network, Reciprocal Attention Mixing Transformer (RAMiT). It employs our proposed dimensional reciprocal attention mixing Transformer (D-RAMiT) blocks, which compute bi-dimensional (spatial and channel) self-attentions in parallel with different numbers of attention heads. The two attention types complement each other's drawbacks and are then mixed. Additionally, we introduce a hierarchical reciprocal attention mixing (H-RAMi) layer that compensates for pixel-level information losses and utilizes semantic information while maintaining an efficient hierarchical structure. Furthermore, we revisit and modify MobileNet V1 and V2 to attach efficient convolutions to our proposed components. The experimental results demonstrate that RAMiT achieves state-of-the-art performance on multiple lightweight IR tasks, including super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining. Codes will be available soon.
Terraforming -- Environment Manipulation during Disruptions for Multi-Agent Pickup and Delivery
Authors: David Vainshtein, Yaakov Sherma, Kiril Solovey, Oren Salzman
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract
In automated warehouses, teams of mobile robots fulfill the packaging process by transferring inventory pods to designated workstations while navigating narrow aisles formed by tightly packed pods. This problem is typically modeled as a Multi-Agent Pickup and Delivery (MAPD) problem, which is then solved by repeatedly planning collision-free paths for agents on a fixed graph, as in the Rolling-Horizon Collision Resolution (RHCR) algorithm. However, existing approaches make the limiting assumption that agents are only allowed to move pods that correspond to their current task, while considering the other pods as stationary obstacles (even though all pods are movable). This behavior can result in unnecessarily long paths which could otherwise be avoided by opening additional corridors via pod manipulation. To this end, we explore the implications of allowing agents the flexibility of dynamically relocating pods. We call this new problem Terraforming MAPD (tMAPD) and develop an RHCR-based approach to tackle it. As the extra flexibility of terraforming comes at a significant computational cost, we utilize this capability judiciously by identifying situations where it could make a significant impact on the solution quality. In particular, we invoke terraforming in response to disruptions that often occur in automated warehouses, e.g., when an item is dropped from a pod or when agents malfunction. Empirically, using our approach for tMAPD, where disruptions are modeled via a stochastic process, we improve throughput by over 10% and reduce the maximum service time (the difference between the drop-off time and the pickup time of a pod) by more than 50%, without drastically increasing the runtime, compared to the MAPD setting.
Keyword: pruning
SFP: Spurious Feature-targeted Pruning for Out-of-Distribution Generalization
Authors: Yingchun Wang, Jingcai Guo, Yi Liu, Song Guo, Weizhan Zhang, Xiangyong Cao, Qinghua Zheng
Abstract
Model substructure learning aims to find an invariant network substructure that can achieve better out-of-distribution (OOD) generalization than the original full structure. Existing works usually search for the invariant substructure using modular risk minimization (MRM) with fully exposed out-domain data, which may bring about two drawbacks: 1) unfairness, due to the dependence on full exposure of out-domain data; and 2) sub-optimal OOD generalization, due to feature-untargeted pruning applied equally over the whole data distribution. Based on the idea that in-distribution (ID) data with spurious features may have a lower empirical risk, in this paper we propose a novel Spurious Feature-targeted model Pruning framework, dubbed SFP, to automatically explore invariant substructures without incurring the above drawbacks. Specifically, SFP identifies spurious features within ID instances during training using our theoretically verified task loss, upon which SFP attenuates the corresponding feature projections in model space to achieve the so-called spurious feature-targeted pruning. This is typically done by removing network branches with strong dependencies on identified spurious features, so SFP can push model learning toward invariant features and away from spurious ones, yielding better OOD generalization. Moreover, we conduct a detailed theoretical analysis to provide a rationality guarantee and a proof framework for OOD structures via model sparsity, and, for the first time, reveal how a highly biased data distribution affects the model's OOD generalization. Experiments on various OOD datasets show that SFP can significantly outperform both structure-based and non-structure-based OOD generalization SOTAs, with accuracy improvements of up to 4.72% and 23.35%, respectively.
LLM-Pruner: On the Structural Pruning of Large Language Models
Abstract
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. With LLMs being general-purpose task solvers, we explore their compression in a task-agnostic manner, which aims to preserve the multi-task solving and language generation ability of the original LLM. One challenge to achieving this is the enormous size of the training corpus of an LLM, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, maximally preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered through tuning techniques such as LoRA, in merely 3 hours and requiring only 50K data. We validate LLM-Pruner on three LLMs, including LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation. The code is available at: https://github.com/horseee/LLM-Pruner
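The "coupled structures" idea can be illustrated on a toy two-layer MLP: removing a hidden channel means deleting the matching row of the first weight matrix, its bias entry, and the matching column of the second matrix, all together. Plain magnitude scoring is used below as a stand-in for LLM-Pruner's gradient-based importance criterion:

```python
import numpy as np

def prune_mlp_channels(W1, b1, W2, keep_ratio=0.5):
    """Structurally prune hidden channels of y = W2 @ relu(W1 @ x + b1).

    Coupled structures: channel i owns row i of W1, entry i of b1, and
    column i of W2 -- they must be removed together to keep shapes valid.
    Importance here is simple weight magnitude (a stand-in criterion)."""
    importance = np.abs(W1).sum(axis=1) * np.abs(W2).sum(axis=0)
    k = max(1, int(keep_ratio * W1.shape[0]))
    keep = np.sort(np.argsort(importance)[-k:])  # indices of surviving channels
    return W1[keep], b1[keep], W2[:, keep]

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))   # hidden=8, input=4
b1 = rng.normal(size=8)
W2 = rng.normal(size=(3, 8))   # output=3
W1p, b1p, W2p = prune_mlp_channels(W1, b1, W2, keep_ratio=0.5)
# The pruned network has 4 hidden channels and still composes shape-correctly.
```

In a real LLM the coupled groups span attention heads and MLP blocks across residual connections, and the pruned model would then be fine-tuned (e.g., with LoRA) to recover accuracy.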
Keyword: diffusion
Information-Ordered Bottlenecks for Adaptive Semantic Compression
Authors: Matthew Ho, Xiaosheng Zhao, Benjamin Wandelt
Abstract
We present the information-ordered bottleneck (IOB), a neural layer designed to adaptively compress data into latent variables ordered by likelihood maximization. Without retraining, IOB nodes can be truncated at any bottleneck width, capturing the most crucial information in the first latent variables. Unifying several previous approaches, we show that IOBs achieve near-optimal compression for a given encoding architecture and can assign ordering to latent signals in a manner that is semantically meaningful. IOBs demonstrate a remarkable ability to compress embeddings of image and text data, leveraging the performance of SOTA architectures such as CNNs, transformers, and diffusion models. Moreover, we introduce a novel theory for estimating global intrinsic dimensionality with IOBs and show that they recover SOTA dimensionality estimates for complex synthetic data. Furthermore, we showcase the utility of these models for exploratory analysis through applications on heterogeneous datasets, enabling computer-aided discovery of dataset complexity.
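In the linear setting, the "ordered, truncatable latents" property reduces to PCA: the first components capture the most variance, and truncating the code at any width gives the best reconstruction of that rank. The snippet below is an analogy for intuition, not the IOB layer itself:

```python
import numpy as np

# Anisotropic 3-D data: most variance lies along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])
X -= X.mean(axis=0)

# SVD gives latent directions ordered by explained variance.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

def recon_error(k):
    """Relative reconstruction error when truncating the code to width k."""
    Z = X @ Vt[:k].T              # encode with the first k ordered latents
    return float(np.linalg.norm(X - Z @ Vt[:k]) ** 2 / np.linalg.norm(X) ** 2)

errors = [recon_error(k) for k in (1, 2, 3)]
# Errors shrink monotonically with width, and the first latent alone
# already explains most of the variance -- the behavior IOB enforces
# for general (nonlinear) encoders without retraining per width.
```

IOB's contribution is obtaining this ordering inside a trained neural encoder, so the same truncation-at-any-width property holds for nonlinear embeddings.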
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
Abstract
Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
Authors: Liangchen Song, Liangliang Cao, Hongyu Xu, Kai Kang, Feng Tang, Junsong Yuan, Yang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The techniques for 3D indoor scene capturing are widely used, but the meshes produced leave much to be desired. In this paper, we propose "RoomDreamer", which leverages powerful natural language to synthesize a new room with a different style. Unlike existing image synthesis methods, our work addresses the challenge of synthesizing both geometry and texture aligned to the input scene structure and prompt simultaneously. The key insight is that a scene should be treated as a whole, taking into account both scene texture and geometry. The proposed framework consists of two significant components: Geometry Guided Diffusion and Mesh Optimization. Geometry Guided Diffusion for 3D Scene guarantees the consistency of the scene style by applying the 2D prior to the entire scene simultaneously. Mesh Optimization improves the geometry and texture jointly and eliminates the artifacts in the scanned scene. To validate the proposed method, real indoor scenes scanned with smartphones are used for extensive experiments, through which the effectiveness of our method is demonstrated.
A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model
Authors: Ibrahim Malik, Siddique Latif, Raja Jurdak, Björn Schuller
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
In this paper, we propose to utilise diffusion models for data augmentation in speech emotion recognition (SER). In particular, we present an effective approach to utilise improved denoising diffusion probabilistic models (IDDPM) to generate synthetic emotional data. We condition the IDDPM with the textual embedding from bidirectional encoder representations from transformers (BERT) to generate high-quality synthetic emotional samples in different speakers' voices\footnote{synthetic samples URL: \url{https://emulationai.com/research/diffusion-ser.}}. We implement a series of experiments and show that better quality synthetic data helps improve SER performance. We compare results with generative adversarial networks (GANs) and show that the proposed model generates better-quality synthetic samples that can considerably improve the performance of SER when augmented with synthetic data.
Incomplete Multi-view Clustering via Diffusion Completion
Abstract
Incomplete multi-view clustering is a challenging task in providing effective data analysis for large amounts of unlabeled data in the real world. All incomplete multi-view clustering methods need to address the problem of how to reduce the impact of missing views. To address this issue, we propose diffusion completion to recover the missing views, integrated into an incomplete multi-view clustering framework. Based on the observable views' information, the diffusion model is used to recover the missing views, and then the consistency information of the multi-view data is learned by contrastive learning to improve the performance of multi-view clustering. To the best of our knowledge, this may be the first work to incorporate diffusion models into an incomplete multi-view clustering framework. Experimental results show that the proposed method performs well in recovering the missing views while achieving superior clustering performance compared to state-of-the-art methods.
DiffuSIA: A Spiral Interaction Architecture for Encoder-Decoder Text Diffusion
Abstract
Diffusion models have emerged as the new state-of-the-art family of deep generative models, and their promising potential for text generation has recently attracted increasing attention. Existing studies mostly adopt a single encoder architecture with partially noising processes for conditional text generation, but its degree of flexibility for conditional modeling is limited. In fact, the encoder-decoder architecture is naturally more flexible thanks to its detachable encoder and decoder modules, which is extensible to multilingual and multimodal generation tasks for conditions and target texts. However, the encoding process of conditional texts lacks an understanding of the target texts. To this end, a spiral interaction architecture for encoder-decoder text diffusion (DiffuSIA) is proposed. Concretely, the conditional information from the encoder is designed to be captured by the diffusion decoder, while the target information from the decoder is designed to be captured by the conditional encoder. These two information flows spiral through multiple interaction layers for deep fusion and understanding. DiffuSIA is evaluated on four text generation tasks, including paraphrase, text simplification, question generation, and open-domain dialogue generation. Experimental results show that DiffuSIA achieves competitive performance among previous methods on all four tasks, demonstrating the effectiveness and generalization ability of the proposed method.
Late-Constraint Diffusion Guidance for Controllable Image Synthesis
Authors: Chang Liu, Dong Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Diffusion models, either with or without text condition, have demonstrated impressive capability in synthesizing photorealistic images given a few or even no words. These models may not fully satisfy user needs, as normal users or artists intend to control the synthesized images with specific guidance, like overall layout, color, structure, object shape, and so on. To adapt diffusion models for controllable image synthesis, several methods have been proposed to incorporate the required conditions as regularization upon the intermediate features of the diffusion denoising network. These methods, known as early-constraint ones in this paper, have difficulties in handling multiple conditions with a single solution. They typically require training separate models for each specific condition, which incurs substantial training cost and results in non-generalizable solutions. To address these difficulties, we propose a new approach, namely late-constraint: we leave the diffusion networks unchanged, but constrain their output to be aligned with the required conditions. Specifically, we train a lightweight condition adapter to establish the correlation between external conditions and internal representations of diffusion models. During the iterative denoising process, the conditional guidance is sent into the corresponding condition adapter to manipulate the sampling process with the established correlation. We further equip the introduced late-constraint strategy with a timestep resampling method and an early stopping technique, which boost the quality of the synthesized images while complying with the guidance. Our method outperforms the existing early-constraint methods and generalizes better to unseen conditions.
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
Abstract
Diffusion models have made impressive progress in text-to-image synthesis. However, training such large-scale models (e.g., Stable Diffusion) from scratch requires high computational costs and massive high-quality text-image pairs, which becomes unaffordable in other languages. To handle this challenge, we propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese. IAP optimizes only a separate Chinese text encoder with all other parameters fixed to align the Chinese semantic space to the English one in CLIP. To achieve this, we innovatively treat images as pivots and minimize the distance of attentive features produced from cross-attention between images and each language respectively. In this way, IAP establishes connections of Chinese, English and visual semantics in CLIP's embedding space efficiently, advancing the quality of the generated image with direct Chinese prompts. Experimental results show that our method outperforms several strong Chinese diffusion models with only 5%~10% training data.
Brain Captioning: Decoding human brain activity into images and text
Abstract
Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent breakthroughs in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from human brain activity patterns. In this study, we present an innovative method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning due to its enhanced flexibility as compared to brain decoding into images. Our approach takes advantage of cutting-edge image captioning models and incorporates a unique image reconstruction pipeline that utilizes latent diffusion models and depth estimation. We utilized the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. We employed the Generative Image-to-text Transformer (GIT) as our backbone for captioning and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporated depth maps from the ControlNet model to further guide the reconstruction process. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding, showcasing the enormous potential of integrating vision and language to better understand human cognition. Our approach provides a flexible platform for future research, with potential applications in various fields, including neural art, style transfer, and portable devices.
Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
Abstract
Text-driven 3D scene generation is widely applicable to video gaming, the film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textural and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts.
Abstract
Realistic and diverse 3D shape generation is helpful for a wide variety of applications such as virtual reality, gaming, and animation. Modern generative models, such as GANs and diffusion models, learn from large-scale datasets and generate new samples following similar data distributions. However, when training data is limited, deep neural generative networks overfit and tend to replicate training samples. Prior works focus on few-shot image generation to produce high-quality and diverse results using a few target images. Unfortunately, abundant 3D shape data is typically hard to obtain as well. In this work, we make the first attempt to realize few-shot 3D shape generation by adapting generative models pre-trained on large source domains to target domains using limited data. To relieve overfitting and keep considerable diversity, we propose to maintain the probability distributions of the pairwise relative distances between adapted samples at feature-level and shape-level during domain adaptation. Our approach only needs the silhouettes of few-shot target samples as training data to learn target geometry distributions and achieve generated shapes with diverse topology and textures. Moreover, we introduce several metrics to evaluate the quality and diversity of few-shot 3D shape generation. The effectiveness of our approach is demonstrated qualitatively and quantitatively under a series of few-shot 3D shape adaptation setups.
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
Authors: Zijiao Chen, Jiaxin Qing, Juan Helen Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
Abstract
Reconstructing human vision from brain activities has been an appealing task that helps to understand our cognitive process. Even though recent research has seen great success in reconstructing static images from non-invasive brain recordings, work on recovering continuous visual experiences in the form of videos is limited. In this work, we propose Mind-Video that learns spatiotemporal information from continuous fMRI data of the cerebral cortex progressively through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation. We show that high-quality videos of arbitrary frame rates can be reconstructed with Mind-Video using adversarial guidance. The recovered videos were evaluated with various semantic and pixel-level metrics. We achieved an average accuracy of 85% in semantic classification tasks and 0.19 in structural similarity index (SSIM), outperforming the previous state-of-the-art by 45%. We also show that our model is biologically plausible and interpretable, reflecting established physiological processes.
Abstract
We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.
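The probability flow ODE this guarantee concerns can be checked exactly in one dimension with Gaussian data, where the score is available in closed form. The toy sketch below (a sanity check, not the paper's analysis, and without the corrector step) integrates the deterministic ODE backward and recovers the data distribution:

```python
import numpy as np

# VP diffusion on 1-D Gaussian data x0 ~ N(0, s0^2).
# Forward SDE: dx = -(beta/2) x dt + sqrt(beta) dW, so p_t = N(0, var(t)) with
#   var(t) = s0^2 * exp(-beta t) + 1 - exp(-beta t),  score(x, t) = -x / var(t).
# The probability flow ODE shares these marginals but is deterministic:
#   dx/dt = -(beta/2) x - (beta/2) * score(x, t)

beta, s0, T, steps = 1.0, 2.0, 8.0, 800
dt = T / steps

def var(t):
    e = np.exp(-beta * t)
    return s0 ** 2 * e + 1.0 - e

rng = np.random.default_rng(0)
x = rng.normal(size=20000) * np.sqrt(var(T))  # samples from the wide marginal at T
for i in range(steps):                        # Euler integration backward in time
    t = T - i * dt
    drift = -(beta / 2) * x + (beta / 2) * x / var(t)
    x = x - drift * dt
# x is now approximately distributed as the data: std close to s0 = 2.
```

With an exact score the ODE transports the prior to the data distribution deterministically; the paper's analysis handles the realistic case of an approximate learned score, where the corrector step supplies the missing contractivity.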
Abstract
We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io
Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models
Authors: Byungjun Kim, Patrick Kwon, Kwangho Lee, Myunggi Lee, Sookwan Han, Daesik Kim, Hanbyul Joo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose a 3D generation pipeline that uses diffusion models to generate realistic human digital avatars. Due to the wide variety of human identities, poses, and stochastic details, the generation of 3D human meshes has been a challenging problem. To address this, we decompose the problem into 2D normal map generation and normal map-based 3D reconstruction. Specifically, we first simultaneously generate realistic normal maps for the front and backside of a clothed human, dubbed dual normal maps, using a pose-conditional diffusion model. For 3D reconstruction, we ``carve'' the prior SMPL-X mesh to a detailed 3D mesh according to the normal maps through mesh optimization. To further enhance the high-frequency details, we present a diffusion resampling scheme on both body and facial regions, thus encouraging the generation of realistic digital avatars. We also seamlessly incorporate a recent text-to-image diffusion model to support text-based human identity control. Our method, namely, Chupa, is capable of generating realistic 3D clothed humans with better perceptual quality and identity variety.
Keyword: dynamic
Vanishing Activations: A Symptom of Deep Capsule Networks
Authors: Miles Everett, Mingjun Zhong, Georgios Leontidis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Capsule Networks, an extension to Neural Networks utilizing vector or matrix representations instead of scalars, were initially developed to create a dynamic parse tree where visual concepts evolve from parts to complete objects. Early implementations of Capsule Networks achieved, and in some cases still maintain, state-of-the-art results on various datasets. However, recent studies have revealed shortcomings in the original Capsule Network architecture, notably its failure to construct a parse tree and its susceptibility to vanishing gradients when deployed in deeper networks. This paper extends the investigation to a range of leading Capsule Network architectures, demonstrating that these issues are not confined to the original design. We argue that the majority of Capsule Network research has produced architectures that, while modestly divergent from the original Capsule Network, still retain a fundamentally similar structure. We posit that this inherent design similarity might be impeding the scalability of Capsule Networks. Our study contributes to the broader discussion on improving the robustness and scalability of Capsule Networks.
Learning Agent Interactions from Density Evolution in 3D Regions With Obstacles
Authors: Amoolya Tirumalai, Christos N. Mavridis, John S. Baras
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Abstract
In this work, we study the inverse problem of identifying complex flocking dynamics in a domain cluttered with obstacles. We draw inspiration from animal flocks moving in complex ways with capabilities far beyond what current robots can do. Owing to the difficulty of observing and recovering the trajectories of the agents, we focus on the dynamics of their probability densities, which are governed by partial differential equations (PDEs), namely compressible Euler equations subject to non-local forces. We formulate the inverse problem of learning interactions as a PDE-constrained optimization problem of minimizing the squared Hellinger distance between the histogram of the flock and the distribution associated with our PDEs. The numerical methods used to efficiently solve the PDE-constrained optimization problem are described. Realistic flocking data are simulated using the Boids model of flocking agents, which differs in nature from the reconstruction models used in our PDEs. Our analysis and simulated experiments show that the behavior of cohesive flocks can be recovered accurately with approximate PDE solutions.
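The squared Hellinger distance used as the data-fitting term has the standard form, for densities $p$ and $q$ over the same domain:

```latex
H^2(p, q) = \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x
```

It is bounded in $[0, 1]$ and symmetric, which makes it a convenient mismatch measure between the empirical flock histogram and the PDE-predicted density.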
Two-Tier UAV-based Low Power Wide Area Networks: A Testbed and Experimentation Study
Authors: Srdjan Sobot, Milan Lukic, Dusan Bortnik, Vladimir Nikic, Brena Lima, Marko Beko, Dejan Vukobratovic
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
In this paper, we propose, design, deploy and demonstrate a two-tier Low Power Wide Area Network (LP WAN) system based on Unmanned Aerial Vehicle (UAV) base stations suitable for dynamic deployment in deep rural environments. The proposed UAV-based LP WAN network augments the existing macro-cellular LP WAN network (Tier 1) with an additional layer of mobile base stations (Tier 2) based also on LP WAN technology. Mobile Tier 2 LP WAN base stations provide connectivity to static or mobile LP WAN user equipment deployed in the areas without direct Tier 1 LP WAN network coverage. The proposed two-tier LP WAN network scenario is suitable for various agricultural, forestry and environmental applications such as livestock or wild animal monitoring. In this experimental work, we report the prototype that was successfully deployed and used in a real-world deep rural environment without Tier 1 LP WAN network coverage.
Project-Based Learning for Robot Control Theory: A Robot Operating System (ROS) Based Approach
Authors: Siavash Farzan
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Abstract
Control theory is an important cornerstone of the robotics field and is considered a fundamental subject in an undergraduate and postgraduate robotics curriculum. Furthermore, project-based learning has shown significant benefits in engineering domains, specifically in interdisciplinary fields such as robotics which require hands-on experience to master the discipline adequately. However, designing a project-based learning experience to teach control theory in a hands-on setting can be challenging, due to the rigor of mathematical concepts involved in the subject. Moreover, access to reliable hardware required for a robotics control lab, including the robots, sensors, interfaces, and measurement instruments, may not be feasible in developing countries and even many academic institutions in the US. The current paper presents a set of six project-based assignments for an advanced postgraduate Robot Control course. The assignments leverage the Robot Operating System (ROS), an open-source set of tools, libraries, and software, which is a de facto standard for the development of robotics applications. The use of ROS, along with its physics engine simulation framework, Gazebo, provides a hands-on robotics experience equivalent to working with real hardware. Learning outcomes include: i) theoretical analysis of linear and nonlinear dynamical systems, ii) formulation and implementation of advanced model-based robot control algorithms using classical and modern control theory, and iii) programming and performance evaluation of robotic systems on physics engine robot simulators. Course evaluations and student surveys demonstrate that the proposed project-based assignments successfully bridge the gap between theory and practice, and facilitate learning of control theory concepts and state-of-the-art robotics techniques through a hands-on approach.
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
Abstract
Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation
Abstract
In this paper, we study the statistical efficiency of Reinforcement Learning in Mean-Field Control (MFC) and Mean-Field Game (MFG) with general function approximation. We introduce a new concept called Mean-Field Model-Based Eluder Dimension (MBED), which subsumes a rich family of Mean-Field RL problems. Additionally, we propose algorithms based on Optimistic Maximal Likelihood Estimation, which can return an $\epsilon$-optimal policy for MFC or an $\epsilon$-Nash Equilibrium policy for MFG, with sample complexity polynomial w.r.t. relevant parameters and independent of the number of states, actions, and agents. Notably, our results only require a mild assumption of Lipschitz continuity on transition dynamics and avoid strong structural assumptions in previous work. Finally, in the tabular setting, given access to a generative model, we establish an exponential lower bound for the MFC setting, while providing a novel sample-efficient model elimination algorithm to approximate equilibrium in the MFG setting. Our results reveal a fundamental separation between RL for single-agent, MFC, and MFG from the sample efficiency perspective.
SpikeCP: Delay-Adaptive Reliable Spiking Neural Networks via Conformal Prediction
Authors: Jiechen Chen, Sangwoo Park, Osvaldo Simeone
Abstract
Spiking neural networks (SNNs) process time-series data via internal event-driven neural dynamics whose energy consumption depends on the number of spikes exchanged between neurons over the course of the input presentation. In typical implementations of an SNN classifier, decisions are produced after the entire input sequence has been processed, resulting in latency and energy consumption levels that are fairly uniform across inputs. Recently introduced delay-adaptive SNNs tailor the inference latency -- and, with it, the energy consumption -- to the difficulty of each example, by producing an early decision when the SNN model is sufficiently ``confident''. In this paper, we start by observing that, as an SNN processes input samples, its classification decisions tend to be first under-confident and then over-confident with respect to the decision's ground-truth, unknown, test accuracy. This makes it difficult to determine a stopping time that ensures a desired level of accuracy. To address this problem, we introduce a novel delay-adaptive SNN-based inference methodology that, wrapping around any pre-trained SNN classifier, provides guaranteed reliability for the decisions produced at input-dependent stopping times. The approach entails minimal added complexity as compared to the underlying SNN, requiring only thresholding and counting operations at run time, and it leverages tools from conformal prediction (CP).
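The conformal prediction machinery the paper builds on can be sketched generically. Below is a minimal split-conformal routine for a classifier; the function name, score choice, and interface are illustrative assumptions, and SpikeCP's actual per-timestep scores and stopping rule differ:

```python
import numpy as np

def conformal_sets(cal_conf, cal_labels, test_conf, alpha=0.1):
    """Split conformal prediction for a classifier (generic sketch).

    cal_conf:  (n_cal, n_classes) confidences on held-out calibration data.
    test_conf: (n_test, n_classes) confidences on test inputs.
    Returns a boolean matrix marking prediction-set membership.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the confidence given to the true label.
    scores = 1.0 - cal_conf[np.arange(n), cal_labels]
    # Conservative quantile level guarantees >= 1 - alpha marginal coverage.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # A label enters the set iff its nonconformity does not exceed the threshold.
    return (1.0 - test_conf) <= q
```

In the delay-adaptive setting described above, such set-valued predictions can drive the stopping time: inference halts once the prediction set is small enough, while the conformal threshold preserves the reliability guarantee.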
Determinism of Multirelations
Authors: Hitoshi Furusawa, Walter Guttmann, Georg Struth
Abstract
Binary multirelations can model alternating nondeterminism, for instance, in games or nondeterministically evolving systems interacting with an environment. Such systems can show partial or total functional behaviour at both levels of alternation, so that nondeterministic behaviour may occur only at one level or both levels, or not at all. We study classes of inner and outer partial and total functional multirelations in a multirelational language based on relation algebra and power allegories. While it is known that general multirelations do not form a category, we show that the classes of deterministic multirelations mentioned form categories with respect to Peleg composition from concurrent dynamic logic, and sometimes quantaloids. Some of these are isomorphic to the category of binary relations. We also introduce determinisation maps that approximate multirelations either by binary relations or by deterministic multirelations. Such maps are useful for defining modal operators on multirelations.
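For reference, one standard formulation of Peleg composition for multirelations $R, S \subseteq A \times 2^A$ is the following (the exact variant used in the paper may differ in detail):

```latex
(a, C) \in R * S
\iff
\exists B \subseteq A:\; (a, B) \in R
\;\wedge\;
\exists (C_b)_{b \in B}:\;
\big( \forall b \in B:\, (b, C_b) \in S \big)
\;\wedge\;
C = \bigcup_{b \in B} C_b
```

Intuitively, the outer (angelic) choice picks an intermediate set $B$ through $R$, and the inner (demonic) level continues through $S$ from every element of $B$, collecting the results.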
Modal Algebra of Multirelations
Authors: Hitoshi Furusawa, Walter Guttmann, Georg Struth
Abstract
We formalise the modal operators from the concurrent dynamic logics of Peleg, Nerode and Wijesekera in a multirelational algebraic language based on relation algebra and power allegories, using relational approximation operators on multirelations developed in a companion article. We relate Nerode and Wijesekera's box operator with a relational approximation operator for multirelations and two related operators that approximate multirelations by different kinds of deterministic multirelations. We provide an algebraic soundness proof of Goldblatt's axioms for concurrent dynamic logic as an application.
Understanding the World to Solve Social Dilemmas Using Multi-Agent Reinforcement Learning
Authors: Manuel Rios, Nicanor Quijano, Luis Felipe Giraldo
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract
Social dilemmas are situations where groups of individuals can benefit from mutual cooperation but conflicting interests impede them from doing so. This type of situation resembles many of humanity's most critical challenges, and discovering mechanisms that facilitate the emergence of cooperative behaviors is still an open problem. In this paper, we study the behavior of self-interested rational agents that learn world models in a multi-agent reinforcement learning (RL) setting and that coexist in environments where social dilemmas can arise. Our simulation results show that groups of agents endowed with world models outperform all other tested groups when dealing with scenarios where social dilemmas can arise. We exploit the world model architecture to qualitatively assess the learnt dynamics and confirm that each agent's world model is capable of encoding information about the behavior of the changing environment and the other agents' actions. This is the first work to show that world models facilitate the emergence of complex coordinated behaviors that enable interacting agents to ``understand'' both environmental and social dynamics.
Smart Pressure e-Mat for Human Sleeping Posture and Dynamic Activity Recognition
Authors: Liangqi Yuan, Yuan Wei, Jia Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract
With the emphasis on healthcare, early childhood education, and fitness, non-invasive measurement and recognition methods have received more attention. Pressure sensing has been extensively studied due to its advantages of simple structure, easy access, visualization application, and harmlessness. This paper introduces a smart pressure e-mat (SPeM) system based on a piezoresistive material, Velostat, for human monitoring applications, including sleeping postures, sports, and yoga recognition. After a subsystem scans e-mat readings and processes the signal, it generates a pressure image stream. Deep neural networks (DNNs) are trained on the pressure image stream to recognize the corresponding human behavior. Four sleeping postures and five dynamic activities inspired by Nintendo Switch Ring Fit Adventure (RFA) are used as a preliminary validation of the proposed SPeM system. The SPeM system achieves high accuracy in both applications, demonstrating the recognition performance and generalization ability of the models. Compared with other pressure sensor-based systems, SPeM possesses more flexible applications and commercial application prospects, with reliable, robust, and repeatable properties.
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Abstract
Transformers achieve promising performance in document understanding thanks to their high effectiveness, yet they still suffer from quadratic computational complexity in the sequence length. General efficient transformers are challenging to adapt directly to document modeling: they cannot handle the layout representations in documents, e.g. word, line, and paragraph, at different granularity levels, and they struggle to achieve a good trade-off between efficiency and performance. To tackle these concerns, we propose Fast-StrucTexT, an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture, for visual document understanding. Specifically, we design a modality-guided dynamic token merging block that lets the model learn multi-granularity representations and prune redundant tokens. Additionally, we present a multi-modal interaction module called Symmetry Cross Attention (SCA) to handle multi-modal fusion and efficiently guide token merging. The SCA allows one modality input to act as the query and compute cross attention with another modality in a dual phase. Extensive experiments on the FUNSD, SROIE, and CORD datasets demonstrate that our model achieves state-of-the-art performance and almost 1.9X faster inference time than the state-of-the-art methods.
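Token merging of the kind described can be sketched generically. The greedy cosine-similarity merge below is an illustrative assumption only, not Fast-StrucTexT's modality-guided block:

```python
import numpy as np

def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge adjacent tokens whose cosine similarity exceeds a threshold.

    tokens: (n, d) array of token embeddings.
    Returns a shorter (m, d) array; merged neighbors are averaged.
    """
    merged = [tokens[0]]
    for t in tokens[1:]:
        prev = merged[-1]
        sim = prev @ t / (np.linalg.norm(prev) * np.linalg.norm(t) + 1e-8)
        if sim > threshold:
            # Fuse redundant neighbors into a single representative token.
            merged[-1] = (prev + t) / 2.0
        else:
            merged.append(t)
    return np.stack(merged)
```

Reducing the token count this way shortens the sequence seen by later layers, which is the source of the inference speedups that efficient document transformers target.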
Remembering What Is Important: A Factorised Multi-Head Retrieval and Auxiliary Memory Stabilisation Scheme for Human Motion Prediction
Authors: Tharindu Fernando, Harshala Gammulle, Sridha Sridharan, Simon Denman, Clinton Fookes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Humans exhibit complex motions that vary depending on the task that they are performing, the interactions they engage in, as well as subject-specific preferences. Therefore, forecasting future poses based on the history of the previous motions is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework for the improved modelling of historical knowledge. Specifically, we disentangle subject-specific, task-specific, and other auxiliary information from the observed pose sequences and utilise these factorised features to query the memory. A novel Multi-Head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations captured within the auxiliary memory. Moreover, our proposed dynamic masking strategy makes this feature disentanglement process dynamic. Two novel loss functions are introduced to encourage diversity within the auxiliary memory while ensuring the stability of the memory contents, such that it can locate and store salient information that can aid the long-term prediction of future motion, irrespective of data imbalances or the diversity of the input data distribution. With extensive experiments conducted on two public benchmarks, Human3.6M and CMU-Mocap, we demonstrate that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: $>$ 17\% on the Human3.6M dataset and $>$ 9\% on the CMU-Mocap dataset.
Must the Communication Graph of MPC Protocols be an Expander?
Authors: Elette Boyle, Ran Cohen, Deepesh Data, Pavel Hubáček
Abstract
Secure multiparty computation (MPC) on incomplete communication networks has been studied within two primary models: (1) Where a partial network is fixed a priori, and thus corruptions can occur dependent on its structure, and (2) Where edges in the communication graph are determined dynamically as part of the protocol. Whereas a rich literature has succeeded in mapping out the feasibility and limitations of graph structures supporting secure computation in the fixed-graph model (including strong classical lower bounds), these bounds do not apply in the latter dynamic-graph setting, which has recently seen exciting new results, but remains relatively unexplored. In this work, we initiate a similar foundational study of MPC within the dynamic-graph model. As a first step, we investigate the property of graph expansion. All existing protocols (implicitly or explicitly) yield communication graphs which are expanders, but it is not clear whether this is inherent. Our results consist of two types (for constant fraction of corruptions): Upper bounds: We demonstrate secure protocols whose induced communication graphs are not expander graphs, within a wide range of settings (computational, information theoretic, with low locality, even with low locality and adaptive security), each assuming some form of input-independent setup. Lower bounds: In the plain model (no setup) with adaptive corruptions, we demonstrate that for certain functionalities, no protocol can maintain a non-expanding communication graph against all adversarial strategies. Our lower bound relies only on protocol correctness (not privacy), and requires a surprisingly delicate argument. More generally, we provide a formal framework for analyzing the evolving communication graph of MPC protocols, giving a starting point for studying the relation between secure computation and further, more general graph properties.
Coordinated Frequency-Constrained Stochastic Economic Dispatch for Integrated Transmission and Distribution System via Distributed Optimization
Abstract
When large-scale uncertain centralized and distributed renewable energy sources are connected to a power system, separate dispatching of the transmission power system (TPS) and the active distribution network (ADN) will lower the network security and frequency security of the system. To address these problems, this paper proposes a coordinated frequency-constrained stochastic economic dispatch (CFC-SED) model for an integrated transmission and distribution (ITD) system. In this model, the dynamic frequency security constraints and network security constraints of the ITD system are constructed, and joint chance constraints are adopted to handle the uncertainty. Then, the control parameters of inverter-based resources, the base point power, and the regulation reserve of all dispatchable resources in the ITD system are jointly optimized for the minimum operating cost. TPS and ADNs can deliver base point power bidirectionally and provide frequency regulation support bidirectionally, extending the existing reserve assumption in ITD dispatch and enhancing the operational security of the ITD system. Moreover, based on the alternating direction method of multipliers (ADMM), a two-layer distributed optimization framework is proposed to solve the CFC-SED model. Case studies show that the CFC-SED model can fully utilize the potential of multiple regulation resources to improve the security performance of the ITD system, and that TPS and ADNs can be coordinated efficiently through the proposed distributed optimization framework.
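As a toy illustration of the ADMM-based distributed pattern named here, the following consensus loop solves a scalar sharing problem where each "agent" holds a local quadratic cost; it is only a sketch of the ADMM iteration, not the paper's two-layer TPS/ADN decomposition, and all names are illustrative:

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=100):
    """Consensus ADMM for min_x sum_i 0.5 * (x - a_i)^2.

    Each agent i holds a local target a_i (a stand-in for a local dispatch
    subproblem); the consensus variable z converges to the mean of a.
    """
    n = len(a)
    x = np.zeros(n)   # local primal variables
    u = np.zeros(n)   # scaled dual variables
    z = 0.0           # consensus (coordinator) variable
    for _ in range(iters):
        # Local updates: each agent solves its own subproblem in parallel.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # Global averaging step performed by the coordinator.
        z = np.mean(x + u)
        # Dual update penalizes disagreement with the consensus value.
        u = u + x - z
    return float(z)
```

The appeal of this decomposition for ITD dispatch is that the TPS and each ADN only exchange boundary variables and multipliers with the coordinator, never their internal network data.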
Learning Sequence Descriptor based on Spatiotemporal Attention for Visual Place Recognition
Abstract
Sequence-based visual place recognition (sVPR) aims to match frame sequences with frames stored in a reference map for localization. Existing methods include sequence matching and sequence descriptor-based retrieval. The former is based on the assumption of constant velocity, which is difficult to hold in real scenarios and does not get rid of the intrinsic single-frame descriptor mismatch. The latter solves this problem by extracting a descriptor for the whole sequence, but current sequence descriptors are only constructed by feature aggregation over multiple frames, with no temporal information interaction. In this paper, we propose a sequential descriptor extraction method to fuse spatiotemporal information effectively and generate discriminative descriptors. Specifically, similar features within the same frame attend to each other to learn spatial structure, and the same local regions across different frames learn how local features change over time. We use sliding windows to control the temporal self-attention range and adopt relative position encoding to construct the positional relationships between different features, which allows our descriptor to capture the inherent dynamics in the frame sequence and local feature motion.
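The sliding-window control of the temporal self-attention range can be illustrated with a simple banded mask; this generic helper is an assumption for illustration, not the paper's exact windowing scheme:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean attention mask: position i may attend to j iff |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

Applying such a mask before the softmax restricts each frame's attention to its temporal neighborhood, bounding the cost of self-attention while still allowing information to propagate along the sequence over stacked layers.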
Abstract
Graph Convolutional Networks (GCNs) have long defined the state-of-the-art in skeleton-based action recognition, leveraging their ability to unravel the complex dynamics of human joint topology through the graph's adjacency matrix. However, an inherent flaw has come to light in these cutting-edge models: they tend to optimize the adjacency matrix jointly with the model weights. This process, while seemingly efficient, causes a gradual decay of bone connectivity data, culminating in a model indifferent to the very topology it sought to map. As a remedy, we propose a threefold strategy: (1) We forge an innovative pathway that encodes bone connectivity by harnessing the power of graph distances. This approach preserves the vital topological nuances often lost in conventional GCNs. (2) We highlight an oft-overlooked feature - the temporal mean of a skeletal sequence, which, despite its modest guise, carries highly action-specific information. (3) Our investigation revealed strong variations in joint-to-joint relationships across different actions. This finding exposes the limitations of a single adjacency matrix in capturing the variations of relational configurations emblematic of human movement, which we remedy by proposing an efficient refinement to Graph Convolutions (GC) - the BlockGC. This evolution slashes parameters by a substantial margin (above 40%), while elevating performance beyond that of the original GCNs. Our full model, the BlockGCN, establishes new standards in skeleton-based action recognition for small model sizes. Its high accuracy, notably on the large-scale NTU RGB+D 120 dataset, stands as compelling proof of the efficacy of BlockGCN. The source code and model can be found at https://github.com/ZhouYuxuanYX/BlockGCN.
Learning Diverse Risk Preferences in Population-based Self-play
Authors: Yuhua Jiang, Qihan Liu, Xiaoteng Ma, Chenghao Li, Yiqin Yang, Jun Yang, Bin Liang, Qianchuan Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract
Among the great successes of Reinforcement Learning (RL), self-play algorithms play an essential role in solving competitive games. Current self-play algorithms optimize the agent to maximize expected win-rates against its current or historical copies, often leaving it stuck in a local optimum with a simple, homogeneous strategy style. A possible solution is to improve the diversity of policies, which helps the agent break the stalemate and enhances its robustness when facing different opponents. However, enhancing diversity in self-play algorithms is not trivial. In this paper, we aim to introduce diversity from the perspective that agents could have diverse risk preferences in the face of uncertainty. Specifically, we design a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning and allows for policy learning with desired risk preferences. Seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives with experiences from playing against diverse opponents. Empirical results show that our method achieves comparable or superior performance in competitive games and that diverse modes of behaviors emerge. Our code is public online at \url{https://github.com/Jackory/RPBT}.
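One common way to realize risk preferences over uncertain returns is through tail expectations of sampled outcomes. The CVaR-style helper below is a generic illustration of risk-averse versus risk-seeking evaluation; RPPO's actual interpolation between worst-case and best-case learning is defined in the paper and may differ:

```python
import numpy as np

def cvar(returns, alpha=0.2, worst=True):
    """Conditional Value-at-Risk: mean of the worst (or best) alpha-fraction
    of sampled returns. Optimizing the lower tail yields risk-averse behavior;
    the upper tail yields risk-seeking behavior."""
    k = max(1, int(np.ceil(alpha * len(returns))))
    r = np.sort(np.asarray(returns))  # ascending order
    tail = r[:k] if worst else r[-k:]
    return float(np.mean(tail))
```

Assigning different tail parameters to different members of a self-play population is one simple way to obtain agents with heterogeneous attitudes toward uncertainty.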
Terraforming -- Environment Manipulation during Disruptions for Multi-Agent Pickup and Delivery
Authors: David Vainshtein, Yaakov Sherma, Kiril Solovey, Oren Salzman
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract
In automated warehouses, teams of mobile robots fulfill the packaging process by transferring inventory pods to designated workstations while navigating narrow aisles formed by tightly packed pods. This problem is typically modeled as a Multi-Agent Pickup and Delivery (MAPD) problem, which is then solved by repeatedly planning collision-free paths for agents on a fixed graph, as in the Rolling-Horizon Collision Resolution (RHCR) algorithm. However, existing approaches make the limiting assumption that agents are only allowed to move pods that correspond to their current task, while considering the other pods as stationary obstacles (even though all pods are movable). This behavior can result in unnecessarily long paths which could otherwise be avoided by opening additional corridors via pod manipulation. To this end, we explore the implications of allowing agents the flexibility of dynamically relocating pods. We call this new problem Terraforming MAPD (tMAPD) and develop an RHCR-based approach to tackle it. As the extra flexibility of terraforming comes at a significant computational cost, we utilize this capability judiciously by identifying situations where it could make a significant impact on the solution quality. In particular, we invoke terraforming in response to disruptions that often occur in automated warehouses, e.g., when an item is dropped from a pod or when agents malfunction. Empirically, using our approach for tMAPD, where disruptions are modeled via a stochastic process, we improve throughput by over 10% and reduce the maximum service time (the difference between the drop-off time and the pickup time of a pod) by more than 50% compared to the MAPD setting, without drastically increasing the runtime.
Enhancing Short-Term Wind Speed Forecasting using Graph Attention and Frequency-Enhanced Mechanisms
Abstract
The safe and stable operation of power systems is greatly challenged by the high variability and randomness of wind power in large-scale wind-power-integrated grids. Wind power forecasting is an effective solution to tackle this issue, with wind speed forecasting being an essential aspect. In this paper, a Graph-attentive Frequency-enhanced Spatial-Temporal Wind Speed Forecasting model based on graph attention and frequency-enhanced mechanisms, i.e., GFST-WSF, is proposed to improve the accuracy of short-term wind speed forecasting. The GFST-WSF comprises a Transformer architecture for temporal feature extraction and a Graph Attention Network (GAT) for spatial feature extraction. The GAT is specifically designed to capture the complex spatial dependencies among wind speed stations and to effectively aggregate information from neighboring nodes in the graph, thus enhancing the spatial representation of the data. To model the time lag in wind speed correlation between adjacent wind farms caused by geographical factors, a dynamic complex adjacency matrix is formulated and utilized by the GAT. Benefiting from the effective spatio-temporal feature extraction and the deep architecture of the Transformer, the GFST-WSF outperforms other baselines in case studies of wind speed forecasting for 6-24-hour-ahead forecast horizons.
Abstract
In robotics, designing robust algorithms in the face of estimation uncertainty is a challenging task. Indeed, controllers often do not consider the estimation uncertainty and rely only on the most likely estimated state. Consequently, sudden changes in the environment or the robot's dynamics can lead to catastrophic behaviors. In this work, we present a risk-sensitive Extended Kalman Filter that enables safe output-feedback Model Predictive Control (MPC). The filter adapts its estimation to the control objective: by taking a pessimistic estimate with respect to the value function resulting from the MPC controller, it provides the controller with increased robustness in phases of uncertainty, as compared to a standard Extended Kalman Filter (EKF). Moreover, the filter has the same complexity as an EKF, so it can be used for real-time model-predictive control. The paper evaluates the risk-sensitive behavior of the proposed filter when used in a nonlinear model-predictive control loop on a planar drone and an industrial manipulator in simulation, as well as on an external force estimation task on a real quadruped robot. These experiments demonstrate that the approach significantly improves performance in the face of uncertainties.
Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape
Authors: Yan Sun, Li Shen, Shixiang Chen, Liang Ding, Dacheng Tao
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Abstract
In federated learning (FL), a cluster of local clients is coordinated by a global server, and the clients cooperatively train one model with privacy protection. Due to multiple local updates on isolated non-iid datasets, clients are prone to overfit to their own optima, which deviate sharply from the global objective and significantly undermine performance. Most previous works focus only on enhancing the consistency between the local and global objectives to alleviate this harmful client drift from an optimization perspective, but their performance deteriorates markedly under high heterogeneity. In this work, we propose a novel and general algorithm {\ttfamily FedSMOO}, which jointly considers the optimization and generalization targets to efficiently improve performance in FL. Concretely, {\ttfamily FedSMOO} adopts a dynamic regularizer to steer the local optima toward the global objective, which is meanwhile revised by a global Sharpness Aware Minimization (SAM) optimizer to search for consistent flat minima. Our theoretical analysis indicates that {\ttfamily FedSMOO} achieves a fast $\mathcal{O}(1/T)$ convergence rate with a low generalization bound. Extensive numerical studies on real-world datasets verify its efficiency and generality.
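The SAM optimizer mentioned above can be sketched in a few lines. The snippet below shows only the generic base SAM step on a toy quadratic loss; FedSMOO's global SAM with its dynamic regularizer is more involved, so treat this purely as an illustration of the perturb-then-descend idea.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic Sharpness-Aware Minimization (SAM) step: ascend to
    the (approximate) worst-case point inside a rho-ball around w, then
    descend using the gradient taken at that perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation
    return w - lr * grad_fn(w + eps)             # descend from the sharp point

# Toy loss L(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
print(np.linalg.norm(w))  # approaches the (flat) minimum at the origin
```

Because the descent gradient is evaluated at the perturbed point, minima whose loss rises steeply nearby are penalized, biasing the search toward flat regions.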
DAP: A Dynamic Adversarial Patch for Evading Person Detectors
Authors: Amira Guesmi, Ruitian Ding, Muhammad Abdullah Hanif, Ihsen Alouani, Muhammad Shafique
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we present a novel approach for generating naturalistic adversarial patches without using GANs. Our proposed approach generates a Dynamic Adversarial Patch (DAP) that looks naturalistic while maintaining high attack efficiency and robustness in real-world scenarios. To achieve this, we redefine the optimization problem by introducing a new objective function, in which a similarity metric is used to construct a similarity loss. This guides the patch to follow predefined patterns while maximizing the victim model's loss function. Our technique directly modifies the pixel values in the patch, which gives higher flexibility and a larger space to incorporate multiple transformations compared to GAN-based techniques. Furthermore, most clothing-based physical attacks assume static objects and ignore the possible transformations caused by non-rigid deformation due to changes in a person's pose. To address this limitation, we incorporate a ``Creases Transformation'' (CT) block, i.e., a preprocessing block that follows an Expectation Over Transformation (EOT) block and generates a large variety of transformed patches for the training process, increasing robustness to possible real-world distortions (e.g., creases in the clothing, rotation, re-scaling, random noise, brightness and contrast variations, etc.). We demonstrate that the presence of such real-world variations in clothing and object poses leads to a drop in the performance of state-of-the-art attacks. For instance, these techniques achieve success rates of merely 20\% in the physical world and 30.8\% in the digital world, while our attack achieves superior success rates of up to 65\% and 84.56\%, respectively, when attacking the YOLOv3tiny detector deployed in smart cameras at the edge.
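The structure of such an EOT-style objective, combining a detection score to suppress with a similarity loss toward a predefined pattern, can be sketched on a toy problem. Everything below is a hypothetical stand-in (a linear "detector", a brightness transform, a 16-pixel patch), not the paper's actual detector, transforms, or CT block.

```python
import numpy as np

rng = np.random.default_rng(1)
w_detect = rng.normal(size=16)       # toy linear stand-in for the victim detector
pattern = np.linspace(0.0, 1.0, 16)  # predefined "naturalistic" pattern
lam = 0.5                            # similarity-loss weight

def objective(patch):
    """Deterministic objective: the detector score to suppress plus a
    similarity loss keeping the patch close to the predefined pattern."""
    return w_detect @ patch + lam * np.sum((patch - pattern) ** 2)

def eot_gradient(patch):
    """One stochastic gradient: a random brightness factor (a toy stand-in
    for the EOT/CT transformation blocks) is applied before the detector;
    the similarity term is differentiated exactly."""
    b = rng.uniform(0.8, 1.2)                    # sampled transformation
    return b * w_detect + 2.0 * lam * (patch - pattern)

patch = np.full(16, 0.5)
before = objective(patch)
for _ in range(200):                             # pixel-space SGD, no GAN
    patch = np.clip(patch - 0.05 * eot_gradient(patch), 0.0, 1.0)
print(objective(patch) < before)  # -> True
```

Optimizing directly over pixel values, rather than over a GAN latent, is what gives the method room to average the loss over many sampled transformations.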
V2X-Boosted Federated Learning for Cooperative Intelligent Transportation Systems with Contextual Client Selection
Abstract
Machine learning (ML) has revolutionized transportation systems, enabling autonomous driving and smart traffic services. Federated learning (FL) overcomes privacy constraints by training ML models in distributed systems, exchanging model parameters instead of raw data. However, the dynamic states of connected vehicles affect the network connection quality and influence the FL performance. To tackle this challenge, we propose a contextual client selection pipeline that uses Vehicle-to-Everything (V2X) messages to select clients based on the predicted communication latency. The pipeline includes: (i) fusing V2X messages, (ii) predicting future traffic topology, (iii) pre-clustering clients based on local data distribution similarity, and (iv) selecting clients with minimal latency for future model aggregation. Experiments show that our pipeline outperforms baselines on various datasets, particularly in non-iid settings.
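Steps (iii)-(iv) of the pipeline reduce to a simple selection rule: within each pre-built cluster of clients, pick those with the lowest predicted communication latency. The sketch below is a minimal illustration with made-up latencies and cluster labels, not the paper's implementation.

```python
import numpy as np

def select_clients(latency_pred, cluster_ids, per_cluster=1):
    """Within each cluster of clients (pre-grouped by local-data
    similarity), select the clients whose predicted V2X communication
    latency is lowest, so every cluster stays represented."""
    selected = []
    for c in np.unique(cluster_ids):
        members = np.where(cluster_ids == c)[0]
        order = members[np.argsort(latency_pred[members])]
        selected.extend(order[:per_cluster].tolist())
    return sorted(selected)

# 6 vehicles, 2 clusters; latencies (ms) predicted from fused V2X messages.
latency = np.array([120.0, 40.0, 80.0, 30.0, 95.0, 60.0])
clusters = np.array([0, 0, 0, 1, 1, 1])
print(select_clients(latency, clusters))  # -> [1, 3]
```

Pre-clustering before the latency cut is what preserves data-distribution diversity in the aggregation round, which matters in the non-iid settings the paper highlights.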
Region of Attraction Estimation Using Union Theorem in Sum-of-Squares Optimization
Abstract
Appropriate estimation of the Region of Attraction of a nonlinear dynamical system plays a key role in system analysis and control design. Sum-of-Squares optimization is a powerful tool enabling Region of Attraction estimation for polynomial dynamical systems. Employing a positive definite function called a shape function within the Sum-of-Squares procedure helps to find a richer representation of the Lyapunov function and a larger corresponding Region of Attraction estimate. However, existing Sum-of-Squares optimization techniques produce very conservative results. The main novelty of this paper is the Union theorem, which enables the use of multiple shape functions to create a polynomial Lyapunov function encompassing all the areas generated by the shape functions. The main contribution is a novel, computationally efficient numerical method for Region of Attraction estimation that remarkably improves estimation performance and overcomes the limitations of existing methods, while keeping the resulting Lyapunov function polynomial; this facilitates control system design and the construction of control Lyapunov functions with enhanced Regions of Attraction using conventional Sum-of-Squares tools. A mathematical proof of the Union theorem, along with its application to the numerical algorithm for Region of Attraction estimation, is provided. The method yields significantly enlarged Region of Attraction estimates even for systems with non-symmetric or unbounded Regions of Attraction, as demonstrated via simulations of several benchmark examples.
Information Technology Needs in Vehicle Dynamics Control
Abstract
Significant changes are occurring in the field of vehicle dynamics control. Consequently, vehicle dynamics control systems are expected to be as common as ABS systems in the near future. This paper focuses on the information technology related requirements posed by these advances in the area of vehicle dynamics control. Topics considered include vehicle dynamics simulation tools, hardware in the loop simulation systems, sensors and related fault tolerance/diagnostics, man machine interface problems and the implications of a networked architecture on controller design.
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
Abstract
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm that first learns a codebook to encode images as discrete codes, and then completes generation based on the learned codebook. However, they encode fixed-size image regions into fixed-length codes and ignore their naturally different information densities, which results in insufficiency in important regions and redundancy in unimportant ones, ultimately degrading the generation quality and speed. Moreover, the fixed-length coding leads to an unnatural raster-scan autoregressive generation order. To address these problems, we propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for an accurate and compact code representation; (2) DQ-Transformer, which then generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (detail regions with more codes) by modeling the position and content of codes at each granularity alternately, through a novel stacked-transformer architecture and shared-content, non-shared-position input layer designs. Comprehensive experiments on various generation tasks validate the superiority of our framework in both effectiveness and efficiency. Code will be released at https://github.com/CrossmodalGroup/DynamicVectorQuantization.
Non-stationary Projection-free Online Learning with Dynamic and Adaptive Regret Guarantees
Abstract
Projection-free online learning has drawn increasing interest due to its efficiency in solving high-dimensional problems with complicated constraints. However, most existing projection-free online methods focus on minimizing the static regret, which unfortunately fails to capture the challenge of changing environments. In this paper, we investigate non-stationary projection-free online learning, and choose dynamic regret and adaptive regret to measure the performance. Specifically, we first provide a novel dynamic regret analysis for an existing projection-free method named $\text{BOGD}_\text{IP}$, and establish an $\mathcal{O}(T^{3/4}(1+P_T))$ dynamic regret bound, where $P_T$ denotes the path-length of the comparator sequence. Then, we improve the upper bound to $\mathcal{O}(T^{3/4}(1+P_T)^{1/4})$ by running multiple $\text{BOGD}_\text{IP}$ algorithms with different step sizes in parallel, and tracking the best one on the fly. Our results are the first general-case dynamic regret bounds for projection-free online learning, and can recover the existing $\mathcal{O}(T^{3/4})$ static regret by setting $P_T = 0$. Furthermore, we propose a projection-free method to attain an $\tilde{\mathcal{O}}(\tau^{3/4})$ adaptive regret bound for any interval with length $\tau$, which nearly matches the static regret over that interval. The essential idea is to maintain a set of $\text{BOGD}_\text{IP}$ algorithms dynamically, and combine them by a meta algorithm. Moreover, we demonstrate that it is also equipped with an $\mathcal{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret bound. Finally, empirical studies verify our theoretical findings.
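The "run several base algorithms and track the best one" idea is an instance of the classic Hedge meta-algorithm. The sketch below uses scalar toy experts as stand-ins for the $\text{BOGD}_\text{IP}$ instances with different step sizes; it shows only the generic weighting scheme, not the paper's analysis.

```python
import numpy as np

def hedge_combine(expert_preds, losses, eta=0.5):
    """Generic 'track the best expert' meta-algorithm (Hedge): keep
    multiplicative weights over base learners and predict with their
    weighted average; experts that suffer loss get downweighted."""
    T, K = losses.shape
    w = np.ones(K)
    combined = []
    for t in range(T):
        p = w / w.sum()
        combined.append(p @ expert_preds[t])
        w = w * np.exp(-eta * losses[t])  # exponential downweighting
    return np.array(combined)

# Demo: expert 0 always suffers loss 0, expert 1 always suffers loss 1.
preds = np.tile([0.0, 1.0], (20, 1))
losses = np.tile([0.0, 1.0], (20, 1))
out = hedge_combine(preds, losses)
print(out[0], out[-1])  # starts at 0.5, converges toward expert 0's prediction
```

Running base learners tuned for different path-lengths and letting the meta-algorithm pick among them is what removes the need to know $P_T$ in advance.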
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Authors: Javier García Gilabert, Carlos Escolano, Marta R. Costa-Jussà
Abstract
Our proposed method, ReSeTOX (REdo SEarch if TOXic), addresses the issue of Neural Machine Translation (NMT) generating translation outputs that contain toxic words not present in the input. The objective is to mitigate the introduction of toxic language without the need for re-training. In the case of identified added toxicity during the inference process, ReSeTOX dynamically adjusts the key-value self-attention weights and re-evaluates the beam search hypotheses. Experimental results demonstrate that ReSeTOX achieves a remarkable 57% reduction in added toxicity while maintaining an average translation quality of 99.5% across 164 languages.
Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
Authors: Aran Nayebi, Rishi Rajalingham, Mehrdad Jazayeri, Guangyu Robert Yang
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
Abstract
Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-centric objectives, to models that future predict in the latent space of purely static image-based or dynamic video-based pretrained foundation models. We find strong differentiation across these model classes in their ability to predict neural and behavioral data both within and across diverse environments. In particular, we find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of sensorimotor tasks, reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation are thus far most consistent with being optimized to future predict on dynamic, reusable visual representations that are useful for embodied AI more generally.
Lifting Network Protocol Implementation to Precise Format Specification with Security Applications
Abstract
Inferring protocol formats is critical for many security applications. However, existing format-inference techniques often miss many formats, because almost all of them are in a fashion of dynamic analysis and rely on a limited number of network packets to drive their analysis. If a feature is not present in the input packets, the feature will be missed in the resulting formats. We develop a novel static program analysis for format inference. It is well-known that static analysis does not rely on any input packets and can achieve high coverage by scanning every piece of code. However, for efficiency and precision, we have to address two challenges, namely path explosion and disordered path constraints. To this end, our approach uses abstract interpretation to produce a novel data structure called the abstract format graph. It delimits precise but costly operations to only small regions, thus ensuring precision and efficiency at the same time. Our inferred formats are of high coverage and precisely specify both field boundaries and semantic constraints among packet fields. Our evaluation shows that we can infer formats for a protocol in one minute with >95% precision and recall, much better than four baseline techniques. Our inferred formats can substantially enhance existing protocol fuzzers, improving the coverage by 20% to 260% and discovering 53 zero-days with 47 assigned CVEs. We also provide case studies of adopting our inferred formats in other security applications including traffic auditing and intrusion detection.
Real-time and Robust Feature Detection of Continuous Marker Pattern for Dense 3-D Deformation Measurement
Authors: Mingxuan Li, Yen Hang Zhou, Liemin Li, Yao Jiang
Abstract
Visuotactile sensing technology has received much attention in recent years. This article proposes a feature detection method applicable to visuotactile sensors based on continuous marker patterns (CMP) to measure 3-D deformation. First, we construct a feature model of checkerboard-like corners under contact deformation and design a novel double-layer circular sampler. Then, we propose judging criteria and a response function for corner features by analyzing the sampling signals' amplitude-frequency characteristics and circular cross-correlation behavior. The proposed feature detection algorithm fully considers the boundary characteristics retained by corners under geometric distortion, enabling reliable detection at a low computational cost. The experimental results show that the proposed method has significant advantages in terms of real-time performance and robustness. Finally, we achieve high-density 3-D contact deformation visualization based on this detection method, which clearly records the process of contact deformation and thus enables inverse sensing of dynamic contact processes.
Abstract
We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.
Monte-Carlo Search for an Equilibrium in Dec-POMDPs
Authors: Yang You, Vincent Thomas, Francis Colas, Olivier Buffet
Abstract
Decentralized partially observable Markov decision processes (Dec-POMDPs) formalize the problem of designing individual controllers for a group of collaborative agents under stochastic dynamics and partial observability. Seeking a global optimum is difficult (NEXP-complete), but seeking a Nash equilibrium -- each agent's policy being a best response to the other agents -- is more accessible, and has allowed addressing infinite-horizon problems with solutions in the form of finite state controllers (FSCs). In this paper, we show that this approach can be adapted to cases where only a generative model (a simulator) of the Dec-POMDP is available. This requires relying on a simulation-based POMDP solver to construct an agent's FSC node by node. A related process is used to heuristically derive initial FSCs. Experiments on benchmarks show that our method, MC-JESP, is competitive with existing Dec-POMDP solvers, and even outperforms many offline methods that use explicit models.
Recommendations for Verifying HDR Subjective Testing Workflows
Authors: Vibhoothi, Angeliki Katsenou, John Squires, François Pitié, Anil Kokaram
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Abstract
Over the past few years, there has been an increase in the demand for and availability of High Dynamic Range (HDR) displays and content. To ensure the production of high-quality materials, human evaluation is required. However, ascertaining whether the full playback pipeline is indeed HDR-compliant can be challenging. In this paper, we present a set of recommendations for conformance testing to validate various aspects of the testing workflow, including playback, displays, brightness, colours, and viewing environment. We assess the effectiveness of the HDR conversion techniques used in current standards development (3GPP) for preparing source materials. Additionally, we evaluate HDR display technologies, including OLED and LCD, using both a consumer television and a reference monitor.
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs
Authors: Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam
Abstract
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency: poll the LLM multiple times and output the most frequent solution. Existing Self-Consistency techniques always draw a constant number of samples per question, whereas a better approach would be to non-uniformly distribute the available budget based on the amount of agreement in the samples drawn so far. In response, we introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question using a lightweight stopping criterion. Our experiments over 13 datasets and two LLMs demonstrate that Adaptive-Consistency reduces the sample budget by up to 6.0 times with an average accuracy drop of less than 0.1%.
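The agreement-based stopping idea can be sketched in a few lines. The paper uses a lightweight Dirichlet-based stopping criterion; a simple majority-fraction test (with a made-up minimum of 3 samples and a 0.95 threshold) stands in for it here.

```python
from collections import Counter

def adaptive_consistency(sample_fn, max_samples=40, threshold=0.95):
    """Keep drawing answers from the model until the empirical majority
    fraction exceeds a threshold, then return the majority answer and
    the number of samples actually spent."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_fn()] += 1
        top = counts.most_common(1)[0][1]
        if n >= 3 and top / n >= threshold:
            break
    return counts.most_common(1)[0][0], n

# Hypothetical sampler that always answers "42": stops after only 3 draws.
ans, used = adaptive_consistency(lambda: "42")
print(ans, used)  # -> 42 3
```

Easy questions, where samples agree quickly, consume few draws, while the full budget is reserved for questions where the model disagrees with itself.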
Consistent Conjectural Variations Equilibrium: Characterization & Stability for a Class of Continuous Games
Authors: Daniel J. Calderone, Benjamin J. Chasnov, Samuel A. Burden, Lillian J. Ratliff
Subjects: Computer Science and Game Theory (cs.GT)
Abstract
Leveraging tools from the study of linear fractional transformations and algebraic Riccati equations, a local characterization of consistent conjectural variations equilibrium is given for two player games on continuous action spaces with costs approximated by quadratic functions. A discrete time dynamical system in the space of conjectures is derived, a solution method for computing fixed points of these dynamics (equilibria) is given, local stability properties of the dynamics around the equilibria are characterized, and conditions are given that guarantee a unique stable equilibrium.
Keyword: adaptive
Information-Ordered Bottlenecks for Adaptive Semantic Compression
Authors: Matthew Ho, Xiaosheng Zhao, Benjamin Wandelt
Abstract
We present the information-ordered bottleneck (IOB), a neural layer designed to adaptively compress data into latent variables ordered by likelihood maximization. Without retraining, IOB nodes can be truncated at any bottleneck width, capturing the most crucial information in the first latent variables. Unifying several previous approaches, we show that IOBs achieve near-optimal compression for a given encoding architecture and can assign ordering to latent signals in a manner that is semantically meaningful. IOBs demonstrate a remarkable ability to compress embeddings of image and text data, leveraging the performance of SOTA architectures such as CNNs, transformers, and diffusion models. Moreover, we introduce a novel theory for estimating global intrinsic dimensionality with IOBs and show that they recover SOTA dimensionality estimates for complex synthetic data. Furthermore, we showcase the utility of these models for exploratory analysis through applications on heterogeneous datasets, enabling computer-aided discovery of dataset complexity.
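One plausible way to impose such an ordering, shown below as a nested-dropout-style sketch rather than IOB's actual likelihood-ordered training, is to randomly truncate the latent vector during training so that earlier units survive more often and are forced to carry the most information.

```python
import numpy as np

rng = np.random.default_rng(0)

def nested_truncation_mask(width, rng):
    """Sample a cutoff k and keep only the first k latents, so unit i
    survives more often than unit i+1. A nested-dropout-style stand-in
    for training an ordered, truncatable bottleneck."""
    k = rng.integers(1, width + 1)   # cutoff in {1, ..., width}
    mask = np.zeros(width)
    mask[:k] = 1.0
    return mask

# Empirical keep-rate per latent: monotonically decreasing with position.
keep_rate = np.mean([nested_truncation_mask(8, rng) for _ in range(5000)], axis=0)
print(keep_rate.round(2))
```

Because the network never knows where the cutoff will fall, any prefix of the latent vector must remain a usable code on its own, which is exactly the truncation-without-retraining property described above.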
AMII: Adaptive Multimodal Inter-personal and Intra-personal Model for Adapted Behavior Synthesis
Abstract
Socially Interactive Agents (SIAs) are physical or virtual embodied agents that display similar behavior as human multimodal behavior. Modeling SIAs' non-verbal behavior, such as speech and facial gestures, has always been a challenging task, given that a SIA can take the role of a speaker or a listener. A SIA must emit appropriate behavior adapted to its own speech, its previous behaviors (intra-personal), and the User's behaviors (inter-personal) for both roles. We propose AMII, a novel approach to synthesize adaptive facial gestures for SIAs while interacting with Users and acting interchangeably as a speaker or as a listener. AMII is characterized by modality memory encoding schema - where modality corresponds to either speech or facial gestures - and makes use of attention mechanisms to capture the intra-personal and inter-personal relationships. We validate our approach by conducting objective evaluations and comparing it with the state-of-the-art approaches.
SpikeCP: Delay-Adaptive Reliable Spiking Neural Networks via Conformal Prediction
Authors: Jiechen Chen, Sangwoo Park, Osvaldo Simeone
Abstract
Spiking neural networks (SNNs) process time-series data via internal event-driven neural dynamics whose energy consumption depends on the number of spikes exchanged between neurons over the course of the input presentation. In typical implementations of an SNN classifier, decisions are produced after the entire input sequence has been processed, resulting in latency and energy consumption levels that are fairly uniform across inputs. Recently introduced delay-adaptive SNNs tailor the inference latency -- and, with it, the energy consumption -- to the difficulty of each example, by producing an early decision when the SNN model is sufficiently ``confident''. In this paper, we start by observing that, as an SNN processes input samples, its classification decisions tend to be first under-confident and then over-confident with respect to the decision's ground-truth, unknown, test accuracy. This makes it difficult to determine a stopping time that ensures a desired level of accuracy. To address this problem, we introduce a novel delay-adaptive SNN-based inference methodology that, wrapping around any pre-trained SNN classifier, provides guaranteed reliability for the decisions produced at input-dependent stopping times. The approach entails minimal added complexity as compared to the underlying SNN, requiring only thresholding and counting operations at run time, and it leverages tools from conformal prediction (CP).
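The conformal-prediction machinery underneath such guarantees is simple. The sketch below shows only the generic split-conformal set construction from calibration scores; SpikeCP applies sets of this kind to per-timestep SNN outputs and stops once the set is confidently small, which is not reproduced here.

```python
import numpy as np

def conformal_set(scores, cal_scores, alpha=0.1):
    """Split conformal prediction: include every candidate label whose
    nonconformity score does not exceed the conformal quantile of the
    calibration scores, yielding model-agnostic coverage guarantees."""
    n = len(cal_scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # conformal rank
    q = np.sort(np.asarray(cal_scores))[k - 1]       # calibration cutoff
    return [y for y, s in enumerate(scores) if s <= q]

# 100 calibration scores 1..100; the 91st smallest (91.0) is the cutoff.
cal = np.arange(1.0, 101.0)
print(conformal_set([0.5, 50.0, 92.5], cal))  # -> [0, 1]
```

At run time, only the sorting of a fixed calibration set (done once) plus thresholding and counting is required, matching the low-complexity claim in the abstract.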
Coordinated Transformer with Position \& Sample-aware Central Loss for Anatomical Landmark Detection
Authors: Qikui Zhu, Yihui Bi, Danxin Wang, Xiangpeng Chu, Jie Chen, Yanqing Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Heatmap-based anatomical landmark detection still faces two unresolved challenges: 1) the inability to accurately evaluate the distribution of the heatmap; 2) the inability to effectively exploit global spatial structure information. To address the first challenge, we propose a novel position-aware and sample-aware central loss. Specifically, our central loss can absorb position information, enabling accurate evaluation of the heatmap distribution. Moreover, our central loss is sample-aware: it can adaptively distinguish easy and hard samples and make the model focus more on hard samples, addressing the challenge of extreme imbalance between landmarks and non-landmarks. To address the second challenge of ignored structure information, a Coordinated Transformer, called CoorTransformer, is proposed, which establishes long-range dependencies under the guidance of landmark coordinate information, making the attention focus more on the sparse landmarks while taking advantage of global spatial structure. Furthermore, CoorTransformer can speed up convergence, effectively avoiding the difficulty Transformers have converging in sparse representation learning. Using the advanced CoorTransformer and central loss, we propose a generalized detection model that can handle various scenarios, inherently exploiting the underlying relationship between landmarks and incorporating rich structural knowledge around the target landmarks. We analyzed and evaluated CoorTransformer and the central loss on three challenging landmark detection tasks. The experimental results show that our CoorTransformer outperforms state-of-the-art methods, and the central loss significantly improves the performance of the model (p-values < 0.05).
Bayesian Reparameterization of Reward-Conditioned Reinforcement Learning with Energy-based Models
Authors: Wenhao Ding, Tong Che, Ding Zhao, Marco Pavone
Abstract
Recently, reward-conditioned reinforcement learning (RCRL) has gained popularity due to its simplicity, flexibility, and off-policy nature. However, we will show that current RCRL approaches are fundamentally limited and fail to address two critical challenges of RCRL -- improving generalization on high reward-to-go (RTG) inputs, and avoiding out-of-distribution (OOD) RTG queries during testing time. To address these challenges when training vanilla RCRL architectures, we propose Bayesian Reparameterized RCRL (BR-RCRL), a novel set of inductive biases for RCRL inspired by Bayes' theorem. BR-RCRL removes a core obstacle preventing vanilla RCRL from generalizing on high RTG inputs -- a tendency that the model treats different RTG inputs as independent values, which we term ``RTG Independence". BR-RCRL also allows us to design an accompanying adaptive inference method, which maximizes total returns while avoiding OOD queries that yield unpredictable behaviors in vanilla RCRL methods. We show that BR-RCRL achieves state-of-the-art performance on the Gym-Mujoco and Atari offline RL benchmarks, improving upon vanilla RCRL by up to 11%.
Enhancing Transformer Backbone for Egocentric Video Action Segmentation
Abstract
Egocentric temporal action segmentation in videos is a crucial task in computer vision with applications in various fields such as mixed reality, human behavior analysis, and robotics. Although recent research has utilized advanced visual-language frameworks, transformers remain the backbone of action segmentation models. Therefore, it is necessary to improve transformers to enhance the robustness of action segmentation models. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism to adaptively capture hierarchical representations in both local-to-global and global-to-local contexts. Second, we incorporate cross-connections between the encoder and decoder blocks to prevent the loss of local context by the decoder. Additionally, we utilize state-of-the-art visual-language representation learning techniques to extract richer and more compact features for our transformer. Our proposed approach outperforms other state-of-the-art methods on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools datasets, and we validate our introduced components with ablation studies. The source code and supplementary materials are publicly available on https://www.sail-nu.com/dxformer.
Must the Communication Graph of MPC Protocols be an Expander?
Authors: Elette Boyle, Ran Cohen, Deepesh Data, Pavel Hubáček
Abstract
Secure multiparty computation (MPC) on incomplete communication networks has been studied within two primary models: (1) Where a partial network is fixed a priori, and thus corruptions can occur dependent on its structure, and (2) Where edges in the communication graph are determined dynamically as part of the protocol. Whereas a rich literature has succeeded in mapping out the feasibility and limitations of graph structures supporting secure computation in the fixed-graph model (including strong classical lower bounds), these bounds do not apply in the latter dynamic-graph setting, which has recently seen exciting new results, but remains relatively unexplored. In this work, we initiate a similar foundational study of MPC within the dynamic-graph model. As a first step, we investigate the property of graph expansion. All existing protocols (implicitly or explicitly) yield communication graphs which are expanders, but it is not clear whether this is inherent. Our results consist of two types (for a constant fraction of corruptions): Upper bounds: We demonstrate secure protocols whose induced communication graphs are not expander graphs, within a wide range of settings (computational, information theoretic, with low locality, even with low locality and adaptive security), each assuming some form of input-independent setup. Lower bounds: In the plain model (no setup) with adaptive corruptions, we demonstrate that for certain functionalities, no protocol can maintain a non-expanding communication graph against all adversarial strategies. Our lower bound relies only on protocol correctness (not privacy), and requires a surprisingly delicate argument. More generally, we provide a formal framework for analyzing the evolving communication graph of MPC protocols, giving a starting point for studying the relation between secure computation and further, more general graph properties.
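The expansion property at the center of this work can be illustrated in its spectral form: for a d-regular graph, expansion corresponds to a large gap between the two largest adjacency eigenvalues. A toy numpy check (our illustration, not from the paper), contrasting a strong expander (the complete graph) with a poor one (the cycle):

```python
import numpy as np

def spectral_gap(adj):
    """Spectral gap d - lambda_2 of a d-regular graph's adjacency matrix."""
    eig = np.sort(np.linalg.eigvalsh(adj))[::-1]  # eigenvalues, descending
    return eig[0] - eig[1]

# Complete graph K_5: a strong expander (adjacency eigenvalues are 4 and -1).
K5 = np.ones((5, 5)) - np.eye(5)

# Cycle C_8: a poor expander (eigenvalues 2*cos(2*pi*k/8) cluster near the top).
C8 = np.zeros((8, 8))
for i in range(8):
    C8[i, (i + 1) % 8] = C8[i, (i - 1) % 8] = 1

print(spectral_gap(K5))  # large gap: eigenvalues 4 and -1
print(spectral_gap(C8))  # small gap: 2 - sqrt(2), about 0.586
```

As the number of cycle vertices grows, the gap shrinks toward zero, which is exactly the non-expanding behavior the lower bound rules out in the plain model.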
Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment
Authors: Baorui Ma, Junsheng Zhou, Yu-Shen Liu, Zhizhong Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Neural signed distance functions (SDFs) have shown remarkable capability in representing geometry with details. However, without signed distance supervision, it is still a challenge to infer SDFs from point clouds or multi-view images using neural networks. In this paper, we claim that gradient consistency in the field, indicated by the parallelism of level sets, is the key factor affecting the inference accuracy. Hence, we propose a level set alignment loss to evaluate the parallelism of level sets, which can be minimized to achieve better gradient consistency. Our novelty lies in aligning all level sets to the zero level set by constraining gradients at queries and at their projections on the zero level set in an adaptive way. Our insight is to propagate the zero level set throughout the field via consistent gradients, eliminating the uncertainty in the field caused by the discreteness of 3D point clouds or the lack of observations from multi-view images. Our proposed loss is a general term which can be used with different methods to infer SDFs from 3D point clouds and multi-view images. Our numerical and visual comparisons demonstrate that our loss can significantly improve the accuracy of SDFs inferred from point clouds or multi-view images under various benchmarks. Code and data are available at https://github.com/mabaorui/TowardsBetterGradient.
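The projection-and-alignment idea can be sketched numerically: pull each query onto the zero level set along its normalized gradient, then penalize the deviation between the gradient directions at the query and at its projection. A toy numpy version on an analytic sphere SDF (our illustration of the mechanism, not the authors' training loss):

```python
import numpy as np

def sdf(p):
    """Toy analytic SDF: signed distance to the unit sphere."""
    return np.linalg.norm(p, axis=-1) - 1.0

def grad(p, eps=1e-4):
    """Central finite-difference gradient of the SDF at points p (N, 3)."""
    g = np.zeros_like(p)
    for i in range(p.shape[-1]):
        dp = np.zeros(p.shape[-1])
        dp[i] = eps
        g[..., i] = (sdf(p + dp) - sdf(p - dp)) / (2 * eps)
    return g

def alignment_loss(q):
    """1 - cosine similarity between gradients at queries and projections."""
    gq = grad(q)
    n = gq / np.linalg.norm(gq, axis=-1, keepdims=True)
    proj = q - sdf(q)[..., None] * n          # pull query onto zero level set
    gp = grad(proj)
    gp = gp / np.linalg.norm(gp, axis=-1, keepdims=True)
    return 1.0 - np.sum(n * gp, axis=-1)

q = np.array([[0.0, 0.0, 1.5], [0.3, -0.4, 0.8]])
print(alignment_loss(q))  # near zero: an exact SDF has parallel level sets
```

For a learned, imperfect SDF the loss is nonzero away from the surface, and minimizing it propagates the zero level set's gradient information into the rest of the field.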
Abstract
Robots with the ability to balance time against the thoroughness of search have the potential to provide time-critical assistance in applications such as search and rescue. Current advances in ergodic coverage-based search methods have enabled robots to completely explore and search an area in a fixed amount of time. However, optimizing time against the quality of autonomous ergodic search has yet to be demonstrated. In this paper, we investigate solutions to the time-optimal ergodic search problem for fast and adaptive robotic search and exploration. We pose the problem as a minimum time problem with an ergodic inequality constraint whose upper bound regulates and balances the granularity of search against time. Solutions to the problem are presented analytically using Pontryagin's conditions of optimality and demonstrated numerically through a direct transcription optimization approach. We show the efficacy of the approach in generating time-optimal ergodic search trajectories in simulation and with drone experiments in a cluttered environment. Obstacle avoidance is shown to be readily integrated into our formulation, and we perform ablation studies that investigate parameter dependence on optimized time and trajectory sensitivity for search.
Applying Ising Machines to Multi-objective QUBOs
Authors: Mayowa Ayodele, Richard Allmendinger, Manuel López-Ibáñez, Arnaud Liefooghe, Matthieu Parizy
Abstract
Multi-objective optimisation problems involve finding solutions with varying trade-offs between multiple and often conflicting objectives. Ising machines are physical devices that aim to find the absolute or approximate ground states of an Ising model. To apply Ising machines to multi-objective problems, a weighted-sum objective function is used to convert multi-objective into single-objective problems. However, deriving scalarisation weights that achieve evenly distributed solutions across the Pareto front is not trivial. Previous work has shown that adaptive weights based on dichotomic search, and adaptive weights based on averages of previously explored weights, can explore the Pareto front more quickly than uniformly generated weights. However, these adaptive methods have only been applied to bi-objective problems in the past. In this work, we extend the adaptive method based on averages in two ways: (i) we extend the adaptive method of deriving scalarisation weights to problems with two or more objectives, and (ii) we use an alternative measure of distance to improve performance. We compare the proposed method with existing ones and show that it leads to the best performance on multi-objective Unconstrained Binary Quadratic Programming (mUBQP) instances with 3 and 4 objectives and that it is competitive with the best one for instances with 2 objectives.
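The weighted-sum conversion is easy to make concrete: given objective matrices $Q_1, \dots, Q_m$ of a mUBQP instance and a weight vector $w$, the Ising machine is handed the single matrix $\sum_i w_i Q_i$. A minimal sketch with a brute-force solver standing in for the Ising machine (the adaptive derivation of $w$ from averages of previously explored weights is not shown here):

```python
import numpy as np

def scalarise(Qs, w):
    """Weighted sum of objective matrices -> a single QUBO matrix."""
    return sum(wi * Q for wi, Q in zip(w, Qs))

def qubo_value(Q, x):
    """Objective value x^T Q x for a binary vector x."""
    return x @ Q @ x

def brute_force_ground_state(Q):
    """Stand-in for an Ising machine, feasible only on tiny instances."""
    n = Q.shape[0]
    best_x, best_v = None, np.inf
    for bits in range(2 ** n):
        x = np.array([(bits >> i) & 1 for i in range(n)], dtype=float)
        v = qubo_value(Q, x)
        if v < best_v:
            best_x, best_v = x, v
    return best_x

rng = np.random.default_rng(0)
Qs = [rng.standard_normal((6, 6)) for _ in range(3)]  # 3 objectives, 6 variables
w = np.array([0.5, 0.3, 0.2])                          # one scalarisation weight vector
x = brute_force_ground_state(scalarise(Qs, w))
print([round(qubo_value(Q, x), 3) for Q in Qs])        # per-objective values of this solution
```

Sweeping different weight vectors and collecting the resulting solutions traces out (part of) the Pareto front; the adaptive methods discussed above choose where to place the next weight vector.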
Learning Global-aware Kernel for Image Harmonization
Authors: Xintian Shen, Jiangning Zhang, Jun Chen, Shipeng Bai, Yue Han, Yabiao Wang, Chengjie Wang, Yong Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects the powerful proximity prior and treats the foreground and background independently as whole regions for harmonization. As a result, they still show limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with both local and global awareness. Extensive experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving 39.53 dB PSNR, surpassing the best counterpart by +0.78 dB, and decreasing fMSE/MSE by 11.5%/6.7% compared with the SoTA method. Code will be available at https://github.com/XintianShen/GKNet.
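The kernel modulation branch amounts to dynamic filtering: every pixel is filtered with its own predicted kernel rather than one shared convolution kernel. A toy single-channel numpy sketch of that operation (our illustration of the mechanism, not GKNet's implementation):

```python
import numpy as np

def apply_dynamic_kernels(feat, kernels):
    """Filter each pixel of feat (H, W) with its own k x k kernel
    from kernels (H, W, k, k), the core of per-pixel kernel modulation."""
    h, w = feat.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    out = np.empty_like(feat)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + k, j:j + k]      # local neighborhood of (i, j)
            out[i, j] = np.sum(patch * kernels[i, j])
    return out

feat = np.arange(16.0).reshape(4, 4)
# Per-pixel identity kernels: output should reproduce the input exactly.
identity = np.zeros((4, 4, 3, 3))
identity[..., 1, 1] = 1.0
print(np.allclose(apply_dynamic_kernels(feat, identity), feat))  # True
```

In GKNet the kernels are predicted from fused global and local features, so the same mechanism can brighten, recolor, or smooth each foreground pixel differently depending on its background context.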
Non-stationary Projection-free Online Learning with Dynamic and Adaptive Regret Guarantees
Abstract
Projection-free online learning has drawn increasing interest due to its efficiency in solving high-dimensional problems with complicated constraints. However, most existing projection-free online methods focus on minimizing the static regret, which unfortunately fails to capture the challenge of changing environments. In this paper, we investigate non-stationary projection-free online learning, and choose dynamic regret and adaptive regret to measure the performance. Specifically, we first provide a novel dynamic regret analysis for an existing projection-free method named $\text{BOGD}_\text{IP}$, and establish an $\mathcal{O}(T^{3/4}(1+P_T))$ dynamic regret bound, where $P_T$ denotes the path-length of the comparator sequence. Then, we improve the upper bound to $\mathcal{O}(T^{3/4}(1+P_T)^{1/4})$ by running multiple $\text{BOGD}_\text{IP}$ algorithms with different step sizes in parallel, and tracking the best one on the fly. Our results are the first general-case dynamic regret bounds for projection-free online learning, and can recover the existing $\mathcal{O}(T^{3/4})$ static regret by setting $P_T = 0$. Furthermore, we propose a projection-free method to attain an $\tilde{\mathcal{O}}(\tau^{3/4})$ adaptive regret bound for any interval with length $\tau$, which nearly matches the static regret over that interval. The essential idea is to maintain a set of $\text{BOGD}_\text{IP}$ algorithms dynamically, and combine them by a meta algorithm. Moreover, we demonstrate that it is also equipped with an $\mathcal{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret bound. Finally, empirical studies verify our theoretical findings.
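The "tracking the best one on the fly" step is an exponential-weights (Hedge-style) meta-algorithm over base learners run in parallel. A minimal numpy sketch of that combining step, under a simplified setup of our own (scalar predictions, observed per-round losses), not the paper's exact construction:

```python
import numpy as np

def hedge_track(preds, losses, eta=1.0):
    """Combine K base learners (e.g. copies of BOGD_IP with different step
    sizes) via exponential weights, so the meta learner tracks the best one.

    preds:  (T, K) predictions of the K base learners over T rounds.
    losses: (T, K) loss suffered by each base learner at each round.
    """
    T, K = losses.shape
    w = np.ones(K) / K                     # uniform prior over base learners
    combined = np.empty(T)
    for t in range(T):
        combined[t] = w @ preds[t]         # weighted prediction this round
        w *= np.exp(-eta * losses[t])      # down-weight lossy learners
        w /= w.sum()
    return combined, w

T = 50
preds = np.stack([np.zeros(T), np.ones(T)], axis=1)             # learner 0 vs learner 1
losses = np.stack([np.full(T, 0.1), np.full(T, 1.0)], axis=1)   # learner 0 is better
out, w = hedge_track(preds, losses)
print(w.round(3))  # weight mass concentrates on learner 0
```

Running base learners at geometrically spaced step sizes and combining them this way is what lets the bound adapt to the unknown path-length $P_T$.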
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs
Authors: Pranjal Aggarwal, Aman Madaan, Yiming Yang, Mausam
Abstract
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency - poll the LLM multiple times and output the most frequent solution. Existing Self-Consistency techniques always draw a constant number of samples per question, whereas a better approach would be to non-uniformly distribute the available budget based on the amount of agreement in the samples drawn so far. In response, we introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question using a lightweight stopping criterion. Our experiments over 13 datasets and two LLMs demonstrate that Adaptive-Consistency reduces the sample budget by up to 6.0 times with an average accuracy drop of less than 0.1%.
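The sample-until-agreement loop can be sketched as follows. Note the stopping rule below is a simplified stand-in of our own (stop once the leading answer holds nearly all of the smoothed probability mass), not the paper's exact lightweight criterion:

```python
import random

def adaptive_consistency(sample_answer, max_samples=40, threshold=0.95):
    """Draw model answers one at a time; stop early once agreement is high.

    sample_answer: callable returning one model answer per call.
    Simplified criterion: stop when the top answer's add-one-smoothed share
    of the samples reaches `threshold` (after a minimum of 3 samples).
    """
    counts = {}
    for n in range(1, max_samples + 1):
        a = sample_answer()
        counts[a] = counts.get(a, 0) + 1
        top = max(counts.values())
        if n >= 3 and (top + 1) / (n + len(counts)) >= threshold:
            break
    return max(counts, key=counts.get), n

random.seed(0)
ans, used = adaptive_consistency(lambda: "42" if random.random() < 0.9 else "7")
print(ans, used)  # settles on "42" with far fewer than 40 samples
```

Easy questions, where the model agrees with itself immediately, terminate after a handful of samples; only contentious questions consume the full budget, which is where the cost savings come from.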
Adaptive identification of SISO linear infinite-dimensional systems
Abstract
We propose an adaptive algorithm for identifying the unknown parameter in a linear exponentially stable single-input single-output infinite-dimensional system. We assume that the transfer function of the infinite-dimensional system can be expressed as a ratio of two infinite series in s (the Laplace variable). We also assume that certain identifiability conditions, which include a persistency of excitation condition, hold. For a fixed integer n, we propose an update law driven by real-time input-output data for estimating the first n+1 coefficients in the numerator and the denominator of the transfer function. We show that the estimates for the transfer function coefficients generated by the update law are close to the true values at large times provided n is sufficiently large (the estimates converge to the true values as time and n tend to infinity). The unknown parameter can be reconstructed using the transfer function coefficient estimates obtained with n large and the algebraic expressions relating the transfer function coefficients to the unknown parameter. We also provide a numerical scheme for verifying the identifiability conditions and for choosing n sufficiently large so that the value of the reconstructed parameter is close to the true value. The class of systems to which our approach is applicable includes many partial differential equations with constant/spatially-varying coefficients and distributed/boundary input and output. We illustrate the efficacy of our approach using three examples: a delay system with four unknown scalars, a 1D heat equation with two unknown scalars and a 1D wave equation with an unknown spatially-varying coefficient.
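The real-time adaptive update law is the paper's contribution and is not reproduced here. As a rough offline analogue, the first n+1 numerator and denominator coefficients of a rational transfer function can be fit by linear least squares from frequency-response samples, via the identity H(s) den(s) = num(s) (a classic Levy-style fit, used purely as an illustration):

```python
import numpy as np

def fit_tf_coeffs(s_pts, H_vals, n):
    """Least-squares fit of transfer function coefficients from samples
    H(s_k), using H(s)*den(s) = num(s) and normalising a0 = 1.

    Unknowns: numerator b0..bn and denominator a1..an.
    """
    rows, rhs = [], []
    for s, H in zip(s_pts, H_vals):
        num_cols = [s ** k for k in range(n + 1)]           # b0..bn columns
        den_cols = [-H * s ** k for k in range(1, n + 1)]   # a1..an columns
        rows.append(num_cols + den_cols)
        rhs.append(H)                                       # = H * a0, a0 = 1
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return theta[: n + 1], np.concatenate(([1.0], theta[n + 1:]))

# Recover H(s) = 1 / (1 + 2s) from noiseless samples along the real axis.
s = np.linspace(0.1, 5, 20)
H = 1.0 / (1.0 + 2.0 * s)
b, a = fit_tf_coeffs(s, H, 1)
print(np.round(b, 6), np.round(a, 6))  # numerator near [1, 0], denominator near [1, 2]
```

The paper's update law does this recursively from input-output data rather than from frequency-response samples, but the role of n is the same: truncating the two infinite series at degree n and estimating the retained coefficients.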
Keyword: quantization
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
Abstract
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm that first learns a codebook to encode images as discrete codes, and then completes generation based on the learned codebook. However, they encode fixed-size image regions into fixed-length codes and ignore their naturally different information densities, which results in insufficiency in important regions and redundancy in unimportant ones, and ultimately degrades generation quality and speed. Moreover, the fixed-length coding leads to an unnatural raster-scan autoregressive generation order. To address these problems, we propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for an accurate and compact code representation; and (2) DQ-Transformer, which then generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (detail-rich regions with more codes) by modeling the position and content of codes in each granularity alternately, through a novel stacked-transformer architecture and shared-content, non-shared-position input layer designs. Comprehensive experiments on various generation tasks validate the superiority of our framework in both effectiveness and efficiency. Code will be released at https://github.com/CrossmodalGroup/DynamicVectorQuantization.
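The "information density" intuition can be caricatured with a fixed grid: give each image region a code budget proportional to its local variance, so detailed regions receive more codes. A toy sketch of that allocation (our illustration only; DQ-VAE learns the allocation end-to-end with variable-length discrete codes):

```python
import numpy as np

def allocate_codes(image, grid=4, budget=64):
    """Toy code allocation: budget per block proportional to its variance.
    Each block keeps at least one code, so the toy total can exceed `budget`."""
    h, w = image.shape
    blocks = image.reshape(grid, h // grid, grid, w // grid).swapaxes(1, 2)
    var = blocks.reshape(grid * grid, -1).var(axis=1)   # crude info density
    if var.sum() == 0:                                  # flat image: uniform split
        return np.full((grid, grid), budget // (grid * grid))
    n = np.maximum(1, np.floor(budget * var / var.sum())).astype(int)
    return n.reshape(grid, grid)

img = np.zeros((16, 16))
img[:4, :4] = np.arange(16.0).reshape(4, 4)  # detail only in the top-left block
print(allocate_codes(img))                    # top-left block receives the most codes
```

Under a fixed total budget, spending codes where the variance is would shorten the code sequence for smooth images, which is the source of both the quality and the speed gains claimed above.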
STOAT: Structured Data to Analytical Text With Controls
Abstract
Recent language models have made tremendous progress in the structured data to text generation task. However, these models still give sub-optimal performance where logical inference is required to generate the descriptions. In this work, we specifically focus on analytical text generation from structured data such as tables. Building on the taxonomy proposed in (Gupta et al., 2020), we focus on controllable table to text generation for the following reasoning categories: numerical reasoning, commonsense reasoning, temporal reasoning, table knowledge, and entity knowledge. We propose the STOAT model, which is table- and reasoning-aware, with vector quantization to infuse the given reasoning categories into the output. We observe that our model provides 10.19% and 1.13% improvements on the PARENT metric on ToTTo and InfoTabs, respectively, for the analytical sentence task. We also find that our model generates 15.3% more faithful and analytical descriptions compared to the baseline models in human evaluation. We curate and release two reasoning-category-annotated table-to-interesting-text generation datasets based on the ToTTo (Parikh et al., 2020) and InfoTabs (Gupta et al., 2020) datasets.
Keyword: efficient
DClEVerNet: Deep Combinatorial Learning for Efficient EV Charging Scheduling in Large-scale Networked Facilities
PDP: Parameter-free Differentiable Pruning is All You Need
Learning Agent Interactions from Density Evolution in 3D Regions With Obstacles
Efficient Vertical Federated Learning with Secure Aggregation
A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model
Constrained Environment Optimization for Prioritized Multi-Agent Navigation
On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation
BELLA: Black box model Explanations by Local Linear Approximations
Engineering an algorithm for constructing low-stretch geometric graphs with near-greedy average-degrees
Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
A Cyberattack Detection-Isolation Scheme For CAV Under Changing Driving Environment
Faster Parallel Exact Density Peaks Clustering
Data Redaction from Conditional Generative Models
Differentially Private Adapters for Parameter Efficient Acoustic Modeling
ALT: An Automatic System for Long Tail Scenario Modeling
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Efficient Mixed Transformer for Single Image Super-Resolution
LATTE: Label-efficient Incident Phenotyping from Longitudinal Electronic Health Records
JetSeg: Efficient Real-Time Semantic Segmentation Model for Low-Power GPU-Embedded Systems
PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction
Strix: An End-to-End Streaming Architecture with Two-Level Ciphertext Batching for Fully Homomorphic Encryption with Programmable Bootstrapping
Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
Coordinated Frequency-Constrained Stochastic Economic Dispatch for Integrated Transmission and Distribution System via Distributed Optimization
Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models
Evolutionary Diversity Optimisation in Constructing Satisfying Assignments
Generative Sliced MMD Flows with Riesz Kernels
Counterfactual Fairness Filter for Fair-Delay Multi-Robot Navigation
Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness
RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation
Nonconvex Robust High-Order Tensor Completion Using Randomized Low-Rank Approximation
A new family of fourth-order energy-preserving integrators
DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations
Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape
Flexible and Inherently Comprehensible Knowledge Representation for Data-Efficient Learning and Trustworthy Human-Machine Teaming in Manufacturing Environments
Tune-Mode ConvBN Blocks For Efficient Transfer Learning
LLM-Pruner: On the Structural Pruning of Large Language Models
Goal-Oriented Communications in Federated Learning via Feedback on Risk-Averse Participation
Distributed MIS with Low Energy and Time Complexities
Distribution-Free Matrix Prediction Under Arbitrary Missing Pattern
The fast reduced QMC matrix-vector product
Enhancing data security against cyberattacks in artificial intelligence based smartgrid systems with crypto agility
Region of Attraction Estimation Using Union Theorem in Sum-of-Squares Optimization
Some results on the antiprimitive BCH codes
Probabilistic Lexicase Selection
Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
RGCVAE: Relational Graph Conditioned Variational Autoencoder for Molecule Design
Efficient and Deterministic Search Strategy Based on Residual Projections for Point Cloud Registration
Direction Specific Ambisonics Source Separation with End-To-End Deep Learning
A One-Class Classifier for the Detection of GAN Manipulated Multi-Spectral Satellite Images
Making $\textsf{IP}=\textsf{PSPACE}$ Practical: Efficient Interactive Protocols for BDD Algorithms
Video Killed the HD-Map: Predicting Driving Behavior Directly From Drone Images
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs
Reducing Sequence Length by Predicting Edit Operations with Large Language Models
Adaptive identification of SISO linear infinite-dimensional systems
Keyword: faster
Improved and Partially-Tight Lower Bounds for Message-Passing Implementations of Multiplicity Queues
Engineering an algorithm for constructing low-stretch geometric graphs with near-greedy average-degrees
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Practical algorithms and experimentally validated incentives for equilibrium-based fair division (A-CEEI)
JetSeg: Efficient Real-Time Semantic Segmentation Model for Low-Power GPU-Embedded Systems
Beyond Exponential Graph: Communication-Efficient Topologies for Decentralized Learning via Finite-time Convergence
Strix: An End-to-End Streaming Architecture with Two-Level Ciphertext Batching for Fully Homomorphic Encryption with Programmable Bootstrapping
A new family of fourth-order energy-preserving integrators
Distributed MIS with Low Energy and Time Complexities
RGCVAE: Relational Graph Conditioned Variational Autoencoder for Molecule Design
Keyword: mobile
PDP: Parameter-free Differentiable Pruning is All You Need
Two-Tier UAV-based Low Power Wide Area Networks: A Testbed and Experimentation Study
RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration
Terraforming -- Environment Manipulation during Disruptions for Multi-Agent Pickup and Delivery
Keyword: pruning
PDP: Parameter-free Differentiable Pruning is All You Need
SFP: Spurious Feature-targeted Pruning for Out-of-Distribution Generalization
LLM-Pruner: On the Structural Pruning of Large Language Models
Keyword: diffusion
Information-Ordered Bottlenecks for Adaptive Semantic Compression
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model
Incomplete Multi-view Clustering via Diffusion Completion
DiffuSIA: A Spiral Interaction Architecture for Encoder-Decoder Text Diffusion
Late-Constraint Diffusion Guidance for Controllable Image Synthesis
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
Brain Captioning: Decoding human brain activity into images and text
Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
Few-shot 3D Shape Generation
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
The probability flow ODE is provably fast
Any-to-Any Generation via Composable Diffusion
Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models
Keyword: dynamic
Vanishing Activations: A Symptom of Deep Capsule Networks
PDP: Parameter-free Differentiable Pruning is All You Need
Learning Agent Interactions from Density Evolution in 3D Regions With Obstacles
Two-Tier UAV-based Low Power Wide Area Networks: A Testbed and Experimentation Study
Project-Based Learning for Robot Control Theory: A Robot Operating System (ROS) Based Approach
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation
SpikeCP: Delay-Adaptive Reliable Spiking Neural Networks via Conformal Prediction
Determinism of Multirelations
Modal Algebra of Multirelations
Understanding the World to Solve Social Dilemmas Using Multi-Agent Reinforcement Learning
Smart Pressure e-Mat for Human Sleeping Posture and Dynamic Activity Recognition
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding
Remembering What Is Important: A Factorised Multi-Head Retrieval and Auxiliary Memory Stabilisation Scheme for Human Motion Prediction
Must the Communication Graph of MPC Protocols be an Expander?
Coordinated Frequency-Constrained Stochastic Economic Dispatch for Integrated Transmission and Distribution System via Distributed Optimization
Learning Sequence Descriptor based on Spatiotemporal Attention for Visual Place Recognition
Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness
Learning Diverse Risk Preferences in Population-based Self-play
Terraforming -- Environment Manipulation during Disruptions for Multi-Agent Pickup and Delivery
Enhancing Short-Term Wind Speed Forecasting using Graph Attention and Frequency-Enhanced Mechanisms
Risk-Sensitive Extended Kalman Filter
Dynamic Regularized Sharpness Aware Minimization in Federated Learning: Approaching Global Consistency and Smooth Landscape
DAP: A Dynamic Adversarial Patch for Evading Person Detectors
V2X-Boosted Federated Learning for Cooperative Intelligent Transportation Systems with Contextual Client Selection
Region of Attraction Estimation Using Union Theorem in Sum-of-Squares Optimization
Information Technology Needs in Vehicle Dynamics Control
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
Non-stationary Projection-free Online Learning with Dynamic and Adaptive Regret Guarantees
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
Lifting Network Protocol Implementation to Precise Format Specification with Security Applications
Real-time and Robust Feature Detection of Continuous Marker Pattern for Dense 3-D Deformation Measurement
The probability flow ODE is provably fast
Monte-Carlo Search for an Equilibrium in Dec-POMDPs
Recommendations for Verifying HDR Subjective Testing Workflows
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs
Consistent Conjectural Variations Equilibrium: Characterization & Stability for a Class of Continuous Games
Keyword: adaptive
Information-Ordered Bottlenecks for Adaptive Semantic Compression
AMII: Adaptive Multimodal Inter-personal and Intra-personal Model for Adapted Behavior Synthesis
SpikeCP: Delay-Adaptive Reliable Spiking Neural Networks via Conformal Prediction
Coordinated Transformer with Position \& Sample-aware Central Loss for Anatomical Landmark Detection
Bayesian Reparameterization of Reward-Conditioned Reinforcement Learning with Energy-based Models
Enhancing Transformer Backbone for Egocentric Video Action Segmentation
Must the Communication Graph of MPC Protocols be an Expander?
Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment
Time Optimal Ergodic Search
Applying Ising Machines to Multi-objective QUBOs
Learning Global-aware Kernel for Image Harmonization
Non-stationary Projection-free Online Learning with Dynamic and Adaptive Regret Guarantees
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs
Adaptive identification of SISO linear infinite-dimensional systems
Keyword: quantization
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
STOAT: Structured Data to Analytical Text With Controls