New submissions for Mon, 9 Oct 23

Keyword: efficient

Progressive reduced order modeling: empowering data-driven modeling with selective knowledge transfer

Authors: Teeratorn Kadeethum, Daniel O'Malley, Youngsoo Choi, Hari S. Viswanathan, Hongkyu Yoon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2310.03770
Pdf link: https://arxiv.org/pdf/2310.03770
Abstract Data-driven modeling can suffer from a constant demand for data, leading to reduced accuracy and impractical for engineering applications due to the high cost and scarcity of information. To address this challenge, we propose a progressive reduced order modeling framework that minimizes data cravings and enhances data-driven modeling's practicality. Our approach selectively transfers knowledge from previously trained models through gates, similar to how humans selectively use valuable knowledge while ignoring unuseful information. By filtering relevant information from previous models, we can create a surrogate model with minimal turnaround time and a smaller training set that can still achieve high accuracy. We have tested our framework in several cases, including transport in porous media, gravity-driven flow, and finite deformation in hyperelastic materials. Our results illustrate that retaining information from previous models and utilizing a valuable portion of that knowledge can significantly improve the accuracy of the current model. We have demonstrated the importance of progressive knowledge transfer and its impact on model accuracy with reduced training samples. For instance, our framework with four parent models outperforms the no-parent counterpart trained on data nine times larger. Our research unlocks data-driven modeling's potential for practical engineering applications by mitigating the data scarcity issue. Our proposed framework is a significant step toward more efficient and cost-effective data-driven modeling, fostering advancements across various fields.
PrIeD-KIE: Towards Privacy Preserved Document Key Information Extraction
Authors: Saifullah Saifullah (1 and 2), Stefan Agne (2 and 3), Andreas Dengel (1 and 2), Sheraz Ahmed (2 and 3) ((1) Department of Computer Science, University of Kaiserslautern-Landau, Kaiserslautern, Rhineland-Palatinate, Germany, (2) German Research Center for Artificial Intelligence, DFKI GmbH, Kaiserslautern, Rhineland-Palatinate, Germany, (3) DeepReader GmbH, Kaiserlautern, Germany)
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.03777
Pdf link: https://arxiv.org/pdf/2310.03777
Abstract In this paper, we introduce strategies for developing private Key Information Extraction (KIE) systems by leveraging large pretrained document foundation models in conjunction with differential privacy (DP), federated learning (FL), and Differentially Private Federated Learning (DP-FL). Through extensive experimentation on six benchmark datasets (FUNSD, CORD, SROIE, WildReceipts, XFUND, and DOCILE), we demonstrate that large document foundation models can be effectively fine-tuned for the KIE task under private settings to achieve adequate performance while maintaining strong privacy guarantees. Moreover, by thoroughly analyzing the impact of various training and model parameters on model performance, we propose simple yet effective guidelines for achieving an optimal privacy-utility trade-off for the KIE task under global DP. Finally, we introduce FeAm-DP, a novel DP-FL algorithm that enables efficiently upscaling global DP from a standalone context to a multi-client federated environment. We conduct a comprehensive evaluation of the algorithm across various client and privacy settings, and demonstrate its capability to achieve comparable performance and privacy guarantees to standalone DP, even when accommodating an increasing number of participating clients. Overall, our study offers valuable insights into the development of private KIE systems, and highlights the potential of document foundation models for privacy-preserved Document AI applications. To the best of authors' knowledge, this is the first work that explores privacy preserved document KIE using document foundation models.
Quantitative passive imaging by iterative holography: The example of helioseismic holography
Authors: Björn Müller, Thorsten Hohage, Damien Fournier, Laurent Gizon
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.03837
Pdf link: https://arxiv.org/pdf/2310.03837
Abstract In passive imaging, one attempts to reconstruct some coefficients in a wave equation from correlations of observed randomly excited solutions to this wave equation. Many methods proposed for this class of inverse problem so far are only qualitative, e.g., trying to identify the support of a perturbation. Major challenges are the increase in dimensionality when computing correlations from primary data in a preprocessing step, and often very poor pointwise signal-to-noise ratios. In this paper, we propose an approach that addresses both of these challenges: It works only on the primary data while implicitly using the full information contained in the correlation data, and it provides quantitative estimates and convergence by iteration. Our work is motivated by helioseismic holography, a powerful imaging method to map heterogenities and flows in the solar interior. We show that the back-propagation used in classical helioseismic holography can be interpreted as the adjoint of the Fr\'echet derivative of the operator which maps the properties of the solar interior to the correlation data on the solar surface. The theoretical and numerical framework for passive imaging problems developed in this paper extends helioseismic holography to nonlinear problems and allows for quantitative reconstructions. We present a proof of concept in uniform media.
Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning
Authors: Harsh Chaudhari, Giorgio Severi, Alina Oprea, Jonathan Ullman
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03838
Pdf link: https://arxiv.org/pdf/2310.03838
Abstract The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.
ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures
Authors: Haoxuan Liu, Vasu Singh, Michał Filipiuk, Siva Kumar Sastry Hari
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2310.03841
Pdf link: https://arxiv.org/pdf/2310.03841
Abstract Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability. It is crucial to ensure the correctness of their execution in spite of potential errors such as transient hardware errors. We propose a novel algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis and protection of transformer-based architectures. First, our work develops an efficient process of computing and ranking the resilience of transformers layers. We find that due to the large size of transformer models, applying traditional network redundancy to a subset of the most vulnerable layers provides high error coverage albeit with impractically high overhead. We address this shortcoming by providing a software-directed, checksum-based error detection technique aimed at protecting the most vulnerable general matrix multiply (GEMM) layers in the transformer models that use either floating-point or integer arithmetic. Results show that our approach achieves over 99% coverage for errors that result in a mismatch at less than 0.2% computation overhead. Lastly, we present the applicability of our framework in various modern GPU architectures under different numerical precisions. We introduce an efficient self-correction mechanism for resolving erroneous detection with an average overhead of less than 0.002% (with a 2% overhead to resolve each erroneous detection).
Information Geometry for the Working Information Theorist
Authors: Kumar Vijay Mishra, M. Ashok Kumar, Ting-Kam Leonard Wong
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Differential Geometry (math.DG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.03884
Pdf link: https://arxiv.org/pdf/2310.03884
Abstract Information geometry is a study of statistical manifolds, that is, spaces of probability distributions from a geometric perspective. Its classical information-theoretic applications relate to statistical concepts such as Fisher information, sufficient statistics, and efficient estimators. Today, information geometry has emerged as an interdisciplinary field that finds applications in diverse areas such as radar sensing, array signal processing, quantum physics, deep learning, and optimal transport. This article presents an overview of essential information geometry to initiate an information theorist, who may be unfamiliar with this exciting area of research. We explain the concepts of divergences on statistical manifolds, generalized notions of distances, orthogonality, and geodesics, thereby paving the way for concrete applications and novel theoretical investigations. We also highlight some recent information-geometric developments, which are of interest to the broader information theory community.
Realizing XR Applications Using 5G-Based 3D Holographic Communication and Mobile Edge Computing
Authors: Dun Yuan, Ekram Hossain, Di Wu, Xue Liu, Gregory Dudek
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2310.03908
Pdf link: https://arxiv.org/pdf/2310.03908
Abstract 3D holographic communication has the potential to revolutionize the way people interact with each other in virtual spaces, offering immersive and realistic experiences. However, demands for high data rates, extremely low latency, and high computations to enable this technology pose a significant challenge. To address this challenge, we propose a novel job scheduling algorithm that leverages Mobile Edge Computing (MEC) servers in order to minimize the total latency in 3D holographic communication. One of the motivations for this work is to prevent the uncanny valley effect, which can occur when the latency hinders the seamless and real-time rendering of holographic content, leading to a less convincing and less engaging user experience. Our proposed algorithm dynamically allocates computation tasks to MEC servers, considering the network conditions, computational capabilities of the servers, and the requirements of the 3D holographic communication application. We conduct extensive experiments to evaluate the performance of our algorithm in terms of latency reduction, and the results demonstrate that our approach significantly outperforms other baseline methods. Furthermore, we present a practical scenario involving Augmented Reality (AR), which not only illustrates the applicability of our algorithm but also highlights the importance of minimizing latency in achieving high-quality holographic views. By efficiently distributing the computation workload among MEC servers and reducing the overall latency, our proposed algorithm enhances the user experience in 3D holographic communications and paves the way for the widespread adoption of this technology in various applications, such as telemedicine, remote collaboration, and entertainment.
RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels
Authors: Alexander Shmakov, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Antonio Guillen, Soumyendu Sarkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.03912
Pdf link: https://arxiv.org/pdf/2310.03912
Abstract Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and \textit{few-shot} multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.
Leveraging Low-Rank and Sparse Recurrent Connectivity for Robust Closed-Loop Control
Authors: Neehal Tumma, Mathias Lechner, Noel Loo, Ramin Hasani, Daniela Rus
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03915
Pdf link: https://arxiv.org/pdf/2310.03915
Abstract Developing autonomous agents that can interact with changing environments is an open challenge in machine learning. Robustness is particularly important in these settings as agents are often fit offline on expert demonstrations but deployed online where they must generalize to the closed feedback loop within the environment. In this work, we explore the application of recurrent neural networks to tasks of this nature and understand how a parameterization of their recurrent connectivity influences robustness in closed-loop settings. Specifically, we represent the recurrent connectivity as a function of rank and sparsity and show both theoretically and empirically that modulating these two variables has desirable effects on network dynamics. The proposed low-rank, sparse connectivity induces an interpretable prior on the network that proves to be most amenable for a class of models known as closed-form continuous-time neural networks (CfCs). We find that CfCs with fewer parameters can outperform their full-rank, fully-connected counterparts in the online setting under distribution shift. This yields memory-efficient and robust agents while opening a new perspective on how we can modulate network dynamics through connectivity.
An Efficient Content-based Time Series Retrieval System
Authors: Chin-Chia Michael Yeh, Huiyuan Chen, Xin Dai, Yan Zheng, Junpeng Wang, Vivian Lai, Yujie Fan, Audrey Der, Zhongfang Zhuang, Liang Wang, Wei Zhang, Jeff M. Phillips
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03919
Pdf link: https://arxiv.org/pdf/2310.03919
Abstract A Content-based Time Series Retrieval (CTSR) system is an information retrieval system for users to interact with time series emerged from multiple domains, such as finance, healthcare, and manufacturing. For example, users seeking to learn more about the source of a time series can submit the time series as a query to the CTSR system and retrieve a list of relevant time series with associated metadata. By analyzing the retrieved metadata, users can gather more information about the source of the time series. Because the CTSR system is required to work with time series data from diverse domains, it needs a high-capacity model to effectively measure the similarity between different time series. On top of that, the model within the CTSR system has to compute the similarity scores in an efficient manner as the users interact with the system in real-time. In this paper, we propose an effective and efficient CTSR model that outperforms alternative models, while still providing reasonable inference runtimes. To demonstrate the capability of the proposed method in solving business problems, we compare it against alternative models using our in-house transaction data. Our findings reveal that the proposed model is the most suitable solution compared to others for our transaction data problem.
EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Multilingual and Low Resource Scenarios
Authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.03938
Pdf link: https://arxiv.org/pdf/2310.03938
Abstract Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing SSL models could achieve superior performance compared to using one SSL model. However, fusion models have increased model parameter size, leading to longer inference times. In this paper, we propose a novel approach of predicting other SSL models' features from a single SSL model, resulting in a light-weight framework with competitive performance. Our experiments show that SSL feature prediction models outperform individual SSL models in multilingual speech recognition tasks. The leading prediction model achieves an average SUPERB score increase of 135.4 in ML-SUPERB benchmarks. Moreover, our proposed framework offers an efficient solution, as it reduces the resulting model parameter size and inference times compared to previous fusion models.
Ultimate limit on learning non-Markovian behavior: Fisher information rate and excess information
Authors: Paul M. Riechers
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Data Analysis, Statistics and Probability (physics.data-an)
Arxiv link: https://arxiv.org/abs/2310.03968
Pdf link: https://arxiv.org/pdf/2310.03968
Abstract We address the fundamental limits of learning unknown parameters of any stochastic process from time-series data, and discover exact closed-form expressions for how optimal inference scales with observation length. Given a parametrized class of candidate models, the Fisher information of observed sequence probabilities lower-bounds the variance in model estimation from finite data. As sequence-length increases, the minimal variance scales as the square inverse of the length -- with constant coefficient given by the information rate. We discover a simple closed-form expression for this information rate, even in the case of infinite Markov order. We furthermore obtain the exact analytic lower bound on model variance from the observation-induced metadynamic among belief states. We discover ephemeral, exponential, and more general modes of convergence to the asymptotic information rate. Surprisingly, this myopic information rate converges to the asymptotic Fisher information rate with exactly the same relaxation timescales that appear in the myopic entropy rate as it converges to the Shannon entropy rate for the process. We illustrate these results with a sequence of examples that highlight qualitatively distinct features of stochastic processes that shape optimal learning.
Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation
Authors: Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03986
Pdf link: https://arxiv.org/pdf/2310.03986
Abstract Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose simple and parameter-efficient adaptation procedures for pretrained multimodal networks. In particular, we exploit low-rank adaptation and modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge performance drop due to missing modalities and outperform independent, dedicated networks trained for the available modality combinations in some cases. The proposed adaptation requires extremely small number of parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct a series of experiments to highlight the robustness of our proposed method using diverse datasets for RGB-thermal and RGB-Depth semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis tasks. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
Nonlinear Methods for Shape Optimization Problems in Liquid Crystal Tactoids
Authors: James H. Adler, Anca S. Andrei, Timothy J. Atherton
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.04022
Pdf link: https://arxiv.org/pdf/2310.04022
Abstract Anisotropic fluids, such as nematic liquid crystals, can form non-spherical equilibrium shapes known as tactoids. Predicting the shape of these structures as a function of material parameters is challenging and paradigmatic of a broader class of problems that combine shape and order. Here, we develop a discrete shape optimization approach with finite elements to find the configuration of a two-dimensional tactoid using the Landau de Gennes framework and a Q-tensor representation. Efficient solution of the resulting constrained energy minimization problem is achieved using a quasi-Newton and nested iteration algorithm. Numerical validation is performed with benchmark solutions and compared against experimental data and earlier work. We explore physically motivated subproblems, whereby the shape and order are separately held fixed, to explore the role of both and examine material parameter dependence of the convergence. Nested iteration significantly improves both the computational cost and convergence of numerical solutions of these highly deformable materials.
A short report on preconditioned Anderson acceleration method
Authors: Kewang Chen, Ye Ji, Matthias Möller, Cornelis Vuik
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.04034
Pdf link: https://arxiv.org/pdf/2310.04034
Abstract In this report, we present a versatile and efficient preconditioned Anderson acceleration (PAA) method for fixed-point iterations. The proposed framework offers flexibility in balancing convergence rates (linear, super-linear, or quadratic) and computational costs related to the Jacobian matrix. Our approach recovers various fixed-point iteration techniques, including Picard, Newton, and quasi-Newton iterations. The PAA method can be interpreted as employing Anderson acceleration (AA) as its own preconditioner or as an accelerator for quasi-Newton methods when their convergence is insufficient. Adaptable to a wide range of problems with differing degrees of nonlinearity and complexity, the method achieves improved convergence rates and robustness by incorporating suitable preconditioners. We test multiple preconditioning strategies on various problems and investigate a delayed update strategy for preconditioners to further reduce the computational costs.
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Authors: Josh Alman, Zhao Song
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.04064
Pdf link: https://arxiv.org/pdf/2310.04064
Abstract In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations: $\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$ then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time. $\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory). We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.
Maximizing Performance with Minimal Resources for Real-Time Transition Detection
Authors: Zeynep Ozge Orhan, Andrea Dal Prete, Anastasia Bolotnikova, Marta Gandolla, Auke Ijspeert, Mohamed Bouri
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.04117
Pdf link: https://arxiv.org/pdf/2310.04117
Abstract Assistive devices, such as exoskeletons and prostheses, have revolutionized the field of rehabilitation and mobility assistance. Efficiently detecting transitions between different activities, such as walking, stair ascending and descending, and sitting, is crucial for ensuring adaptive control and enhancing user experience. We here present an approach for real-time transition detection, aimed at optimizing the processing-time performance. By establishing activity-specific threshold values through trained machine learning models, we effectively distinguish motion patterns and we identify transition moments between locomotion modes. This threshold-based method improves real-time embedded processing time performance by up to 11 times compared to machine learning approaches. The efficacy of the developed finite-state machine is validated using data collected from three different measurement systems. Moreover, experiments with healthy participants were conducted on an active pelvis orthosis to validate the robustness and reliability of our approach. The proposed algorithm achieved high accuracy in detecting transitions between activities. These promising results show the robustness and reliability of the method, reinforcing its potential for integration into practical applications.
Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning
Authors: Yinda Chen, Wei Huang, Shenglong Zhou, Qi Chen, Zhiwei Xiong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04148
Pdf link: https://arxiv.org/pdf/2310.04148
Abstract The performance of existing supervised neuron segmentation methods is highly dependent on the number of accurate annotations, especially when applied to large scale electron microscopy (EM) data. By extracting semantic information from unlabeled data, self-supervised methods can improve the performance of downstream tasks, among which the mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images. However, due to the high degree of structural locality in EM images, as well as the existence of considerable noise, many voxels contain little discriminative information, making MIM pretraining inefficient on the neuron segmentation task. To overcome this challenge, we propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy. Due to the vast exploration space, using single-agent RL for voxel prediction is impractical. Therefore, we treat each input patch as an agent with a shared behavior policy, allowing for multi-agent collaboration. Furthermore, this multi-agent model can capture dependencies between voxels, which is beneficial for the downstream segmentation task. Experiments conducted on representative EM datasets demonstrate that our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation. Code is available at \url{https://github.com/ydchen0806/dbMiM}.
Amortized Network Intervention to Steer the Excitatory Point Processes
Authors: Zitao Song, Wendi Ren, Shuang Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04159
Pdf link: https://arxiv.org/pdf/2310.04159
Abstract We tackle the challenge of large-scale network intervention for guiding excitatory point processes, such as infectious disease spread or traffic congestion control. Our model-based reinforcement learning utilizes neural ODEs to capture how the networked excitatory point processes will evolve subject to the time-varying changes in network topology. Our approach incorporates Gradient-Descent based Model Predictive Control (GD-MPC), offering policy flexibility to accommodate prior knowledge and constraints. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortize Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts, while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Our approach has broad applications, from curbing infectious disease spread to reducing carbon emissions through traffic light optimization, and thus has the potential to address critical societal and environmental challenges.
Light-LOAM: A Lightweight LiDAR Odometry and Mapping based on Graph-Matching
Authors: Shiquan Yi, Yang Lyu, Lin Hua, Quan Pan, Chunhui Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.04162
Pdf link: https://arxiv.org/pdf/2310.04162
Abstract Simultaneous Localization and Mapping (SLAM) plays an important role in robot autonomy. Reliability and efficiency are the two most valued features for applying SLAM in robot applications. In this paper, we consider achieving a reliable LiDAR-based SLAM function in computation-limited platforms, such as quadrotor UAVs based on graph-based point cloud association. First, contrary to most works selecting salient features for point cloud registration, we propose a non-conspicuous feature selection strategy for reliability and robustness purposes. Then a two-stage correspondence selection method is used to register the point cloud, which includes a KD-tree-based coarse matching followed by a graph-based matching method that uses geometric consistency to vote out incorrect correspondences. Additionally, we propose an odometry approach where the weight optimizations are guided by vote results from the aforementioned geometric consistency graph. In this way, the optimization of LiDAR odometry rapidly converges and evaluates a fairly accurate transformation resulting in the back-end module efficiently finishing the mapping task. Finally, we evaluate our proposed framework on the KITTI odometry dataset and real-world environments. Experiments show that our SLAM system achieves a comparative level or higher level of accuracy with more balanced computation efficiency compared with the mainstream LiDAR-based SLAM solutions.
Dynamic Relation-Attentive Graph Neural Networks for Fraud Detection
Authors: Heehyeon Kim, Jinhyeok Choi, Joyce Jiyoung Whang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2310.04171
Pdf link: https://arxiv.org/pdf/2310.04171
Abstract Fraud detection aims to discover fraudsters deceiving other users by, for example, leaving fake reviews or making abnormal transactions. Graph-based fraud detection methods consider this task as a classification problem with two classes: frauds or normal. We address this problem using Graph Neural Networks (GNNs) by proposing a dynamic relation-attentive aggregation mechanism. Based on the observation that many real-world graphs include different types of relations, we propose to learn a node representation per relation and aggregate the node representations using a learnable attention function that assigns a different attention coefficient to each relation. Furthermore, we combine the node representations from different layers to consider both the local and global structures of a target node, which is beneficial to improving the performance of fraud detection on graphs with heterophily. By employing dynamic graph attention in all the aggregation processes, our method adaptively computes the attention coefficients for each node. Experimental results show that our method, DRAG, outperforms state-of-the-art fraud detection methods on real-world benchmark datasets.
Towards 6D MCL for LiDARs in 3D TSDF Maps on Embedded Systems with GPUs
Authors: Marc Eisoldt, Alexander Mock, Mario Porrmann, Thomas Wiemann
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.04172
Pdf link: https://arxiv.org/pdf/2310.04172
Abstract Monte Carlo Localization is a widely used approach in the field of mobile robotics. While this problem has been well studied in the 2D case, global localization in 3D maps with six degrees of freedom has so far been too computationally demanding. Hence, no mobile robot system has yet been presented in literature that is able to solve it in real-time. The computationally most intensive step is the evaluation of the sensor model, but it also offers high parallelization potential. This work investigates the massive parallelization of the evaluation of particles in truncated signed distance fields for three-dimensional laser scanners on embedded GPUs. The implementation on the GPU is 30 times as fast and more than 50 times more energy efficient compared to a CPU implementation.
Cross-Edge Orchestration of Serverless Functions with Probabilistic Caching
Authors: Chen Chen, Manuel Herrera, Ge Zheng, Liqiao Xia, Zhengyang Ling, Jiangtao Wang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2310.04185
Pdf link: https://arxiv.org/pdf/2310.04185
Abstract Serverless edge computing adopts an event-based paradigm that provides back-end services on an as-used basis, resulting in efficient resource utilization. To improve the end-to-end latency and revenue, service providers need to optimize the number and placement of serverless containers while considering the system cost incurred by the provisioning. The particular reason for this circumstance is that frequently creating and destroying containers not only increases the system cost but also degrades the time responsiveness due to the cold-start process. Function caching is a common approach to mitigate the coldstart issue. However, function caching requires extra hardware resources and hence incurs extra system costs. Furthermore, the dynamic and bursty nature of serverless invocations remains an under-explored area. Hence, it is vitally important for service providers to conduct a context-aware request distribution and container caching policy for serverless edge computing. In this paper, we study the request distribution and container caching problem in serverless edge computing. We prove the proposed problem is NP-hard and hence difficult to find a global optimal solution. We jointly consider the distributed and resource constrained nature of edge computing and propose an optimized request distribution algorithm that adapts to the dynamics of serverless invocations with a theoretical performance guarantee. Also, we propose a context-aware probabilistic caching policy that incorporates a number of characteristics of serverless invocations. Via simulation and implementation results, we demonstrate the superiority of the proposed algorithm by outperforming existing caching policies in terms of the overall system cost and cold-start frequency by up to 62.1% and 69.1%, respectively.
Sparse Identification of Nonlinear Dynamics with Side Information (SINDy-SI)
Authors: Gabriel F. Machado, Morgan Jones
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.04227
Pdf link: https://arxiv.org/pdf/2310.04227
Abstract Modern societies have an abundance of data yet good system models are rare. Unfortunately, many of the current system identification and machine learning techniques fail to generalize outside of the training set, producing models that violate basic physical laws. This work proposes a novel method for the Sparse Identification of Nonlinear Dynamics with Side Information (SINDy-SI). SINDy-SI is an iterative method that uses Sum-of-Squares (SOS) programming to learn optimally fitted models while guaranteeing that the learned model satisfies side information, such as symmetry's and physical laws. Guided by the principle of Occam's razor, that the simplest or most regularized best fitted model is typically the superior choice, during each iteration SINDy-SI prunes the basis functions associated with small coefficients, yielding a sparse dynamical model upon termination. Through several numerical experiments we will show how the combination of side information constraints and sparse polynomial representation cultivates dynamical models that obey known physical laws while displaying impressive generalized performance beyond the training set.
Optimization with pattern-avoiding input
Authors: Benjamin Aram Berendsohn, László Kozma, Michal Opler
Subjects: Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)
Arxiv link: https://arxiv.org/abs/2310.04236
Pdf link: https://arxiv.org/pdf/2310.04236
Abstract Permutation pattern-avoidance is a central concept of both enumerative and extremal combinatorics. In this paper we study the effect of permutation pattern-avoidance on the complexity of optimization problems. In the context of the dynamic optimality conjecture (Sleator, Tarjan, STOC 1983), Chalermsook, Goswami, Kozma, Mehlhorn, and Saranurak (FOCS 2015) conjectured that the amortized access cost of an optimal binary search tree (BST) is $O(1)$ whenever the access sequence avoids some fixed pattern. They showed a bound of $2^{\alpha{(n)}^{O(1)}}$, which was recently improved to $2^{\alpha{(n)}(1+o(1))}$ by Chalermsook, Pettie, and Yingchareonthawornchai (2023); here $n$ is the BST size and $\alpha(\cdot)$ the inverse-Ackermann function. In this paper we resolve the conjecture, showing a tight $O(1)$ bound. This indicates a barrier to dynamic optimality: any candidate online BST (e.g., splay trees or greedy trees) must match this optimum, but current analysis techniques only give superconstant bounds. More broadly, we argue that the easiness of pattern-avoiding input is a general phenomenon, not limited to BSTs or even to data structures. To illustrate this, we show that when the input avoids an arbitrary, fixed, a priori unknown pattern, one can efficiently compute a $k$-server solution of $n$ requests from a unit interval, with total cost $n^{O(1/\log k)}$, in contrast to the worst-case $\Theta(n/k)$ bound; and a traveling salesman tour of $n$ points from a unit box, of length $O(\log{n})$, in contrast to the worst-case $\Theta(\sqrt{n})$ bound; similar results hold for the euclidean minimum spanning tree, Steiner tree, and nearest-neighbor graphs. We show both results to be tight. Our techniques build on the Marcus-Tardos proof of the Stanley-Wilf conjecture, and on the recently emerging concept of twin-width; we believe our techniques to be more generally applicable.
Comparing Auxiliary Tasks for Learning Representations for Reinforcement Learning
Authors: Moritz Lange, Noah Krystiniak, Raphael C. Engelhardt, Wolfgang Konen, Laurenz Wiskott
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04241
Pdf link: https://arxiv.org/pdf/2310.04241
Abstract Learning state representations has gained steady popularity in reinforcement learning (RL) due to its potential to improve both sample efficiency and returns on many environments. A straightforward and efficient method is to generate representations with a distinct neural network trained on an auxiliary task, i.e. a task that differs from the actual RL task. While a whole range of such auxiliary tasks has been proposed in the literature, a comparison on typical continuous control benchmark environments is computationally expensive and has, to the best of our knowledge, not been performed before. This paper presents such a comparison of common auxiliary tasks, based on hundreds of agents trained with state-of-the-art off-policy RL algorithms. We compare possible improvements in both sample efficiency and returns for environments ranging from simple pendulum to a complex simulated robotics task. Our findings show that representation learning with auxiliary tasks is beneficial for environments of higher dimension and complexity, and that learning environment dynamics is preferable to predicting rewards. We believe these insights will enable other researchers to make more informed decisions on how to utilize representation learning for their specific problem.
Fully discrete Galerkin scheme for a semilinear subdiffusion equation with nonsmooth data and time-dependent coefficient
Authors: Łukasz Płociniczak, Kacper Taźbierski
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.04246
Pdf link: https://arxiv.org/pdf/2310.04246
Abstract We couple the L1 discretization of the Caputo fractional derivative in time with the Galerkin scheme to devise a linear numerical method for the semilinear subdiffusion equation. Two important points that we make are: nonsmooth initial data and time-dependent diffusion coefficient. We prove the stability and convergence of the method under weak assumptions concerning regularity of the diffusivity. We find optimal pointwise in space and global in time errors, which are verified with several numerical experiments.
Compositional Servoing by Recombining Demonstrations
Authors: Max Argus, Abhijeet Nayak, Martin Büchner, Silvio Galesso, Abhinav Valada, Thomas Brox
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.04271
Pdf link: https://arxiv.org/pdf/2310.04271
Abstract Learning-based manipulation policies from image inputs often show weak task transfer capabilities. In contrast, visual servoing methods allow efficient task transfer in high-precision scenarios while requiring only a few demonstrations. In this work, we present a framework that formulates the visual servoing task as graph traversal. Our method not only extends the robustness of visual servoing, but also enables multitask capability based on a few task-specific demonstrations. We construct demonstration graphs by splitting existing demonstrations and recombining them. In order to traverse the demonstration graph in the inference case, we utilize a similarity function that helps select the best demonstration for a specific task. This enables us to compute the shortest path through the graph. Ultimately, we show that recombining demonstrations leads to higher task-respective success. We present extensive simulation and real-world experimental results that demonstrate the efficacy of our approach.
Latent Graph Inference with Limited Supervision
Authors: Jianglin Lu, Yi Xu, Huan Wang, Yue Bai, Yun Fu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04314
Pdf link: https://arxiv.org/pdf/2310.04314
Abstract Latent graph inference (LGI) aims to jointly learn the underlying graph structure and node representations from data features. However, existing LGI methods commonly suffer from the issue of supervision starvation, where massive edge weights are learned without semantic supervision and do not contribute to the training loss. Consequently, these supervision-starved weights, which may determine the predictions of testing samples, cannot be semantically optimal, resulting in poor generalization. In this paper, we observe that this issue is actually caused by the graph sparsification operation, which severely destroys the important connections established between pivotal nodes and labeled ones. To address this, we propose to restore the corrupted affinities and replenish the missed supervision for better LGI. The key challenge then lies in identifying the critical nodes and recovering the corrupted affinities. We begin by defining the pivotal nodes as $k$-hop starved nodes, which can be identified based on a given adjacency matrix. Considering the high computational burden, we further present a more efficient alternative inspired by CUR matrix decomposition. Subsequently, we eliminate the starved nodes by reconstructing the destroyed connections. Extensive experiments on representative benchmarks demonstrate that reducing the starved nodes consistently improves the performance of state-of-the-art LGI methods, especially under extremely limited supervision (6.12% improvement on Pubmed with a labeling rate of only 0.3%).
Adjustable Robust Reinforcement Learning for Online 3D Bin Packing
Authors: Yuxin Pan, Yize Chen, Fangzhen Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04323
Pdf link: https://arxiv.org/pdf/2310.04323
Abstract Designing effective policies for the online 3D bin packing problem (3D-BPP) has been a long-standing challenge, primarily due to the unpredictable nature of incoming box sequences and stringent physical constraints. While current deep reinforcement learning (DRL) methods for online 3D-BPP have shown promising results in optimizing average performance over an underlying box sequence distribution, they often fail in real-world settings where some worst-case scenarios can materialize. Standard robust DRL algorithms tend to overly prioritize optimizing the worst-case performance at the expense of performance under normal problem instance distribution. To address these issues, we first introduce a permutation-based attacker to investigate the practical robustness of both DRL-based and heuristic methods proposed for solving online 3D-BPP. Then, we propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights to achieve the desired balance of the policy's performance in average and worst-case environments. Specifically, we formulate the objective function as a weighted sum of expected and worst-case returns, and derive the lower performance bound by relating to the return under a mixture dynamics. To realize this lower bound, we adopt an iterative procedure that searches for the associated mixture dynamics and improves the corresponding policy. We integrate this procedure into two popular robust adversarial algorithms to develop the exact and approximate AR2L algorithms. Experiments demonstrate that AR2L is versatile in the sense that it improves policy robustness while maintaining an acceptable level of performance for the nominal case.
Amortizing intractable inference in large language models
Authors: Edward J. Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, Nikolay Malkin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.04363
Pdf link: https://arxiv.org/pdf/2310.04363
Abstract Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.
Enhanced Backpressure Routing with Wireless Link Features
Authors: Zhongyuan Zhao, Gunjan Verma, Ananthram Swami, Santiago Segarra
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2310.04364
Pdf link: https://arxiv.org/pdf/2310.04364
Abstract Backpressure (BP) routing is a well-established framework for distributed routing and scheduling in wireless multi-hop networks. However, the basic BP scheme suffers from poor end-to-end delay due to the drawbacks of slow startup, random walk, and the last packet problem. Biased BP with shortest path awareness can address the first two drawbacks, and sojourn time-based backlog metrics have been proposed for the last packet problem. Furthermore, these BP variations require no additional signaling overhead in each time step compared to the basic BP. In this work, we further address three long-standing challenges associated with the aforementioned low-cost BP variations, including optimal scaling of the biases, bias maintenance under mobility, and incorporating sojourn time awareness into biased BP. Our analysis and experimental results show that proper scaling of biases can be achieved with the help of common link features, which can effectively reduce end-to-end delay of BP by mitigating the random walk of packets under low-to-medium traffic, including the last packet scenario. In addition, our low-overhead bias maintenance scheme is shown to be effective under mobility, and our bio-inspired sojourn time-aware backlog metric is demonstrated to be more efficient and effective for the last packet problem than existing approaches when incorporated into biased BP.
Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors
Authors: Taha Shahroodi, Gagandeep Singh, Mahdi Zahedi, Haiyu Mao, Joel Lindegger, Can Firtina, Stephan Wong, Onur Mutlu, Said Hamdioui
Subjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Genomics (q-bio.GN)
Arxiv link: https://arxiv.org/abs/2310.04366
Pdf link: https://arxiv.org/pdf/2310.04366
Abstract Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately, these DNNs are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNNs. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations. This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast), open-source basecaller as a case study. Our experimental results using Sword-fish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7x, with an accuracy loss of 6.01%.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Authors: Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04378
Pdf link: https://arxiv.org/pdf/2310.04378
Abstract Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/
Keyword: faster

Fishnets: Information-Optimal, Scalable Aggregation for Sets and Graphs
Authors: T. Lucas Makinen, Justin Alsing, Benjamin D. Wandelt
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.03812
Pdf link: https://arxiv.org/pdf/2310.03812
Abstract Set-based learning is an essential component of modern deep learning and network science. Graph Neural Networks (GNNs) and their edge-free counterparts Deepsets have proven remarkably useful on ragged and topologically challenging datasets. The key to learning informative embeddings for set members is a specified aggregation function, usually a sum, max, or mean. We propose Fishnets, an aggregation strategy for learning information-optimal embeddings for sets of data for both Bayesian inference and graph aggregation. We demonstrate that i) Fishnets neural summaries can be scaled optimally to an arbitrary number of data objects, ii) Fishnets aggregations are robust to changes in data distribution, unlike standard deepsets, iii) Fishnets saturate Bayesian information content and extend to regimes where MCMC techniques fail and iv) Fishnets can be used as a drop-in aggregation scheme within GNNs. We show that by adopting a Fishnets aggregation scheme for message passing, GNNs can achieve state-of-the-art performance versus architecture size on ogbn-protein data over existing benchmarks with a fraction of learnable parameters and faster training time.
Accelerated Neural Network Training with Rooted Logistic Objectives
Authors: Zhu Wang, Praveen Raj Veluswami, Harsh Mishra, Sathya N. Ravi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.03890
Pdf link: https://arxiv.org/pdf/2310.03890
Abstract Many neural networks deployed in the real world scenarios are trained using cross entropy based loss functions. From the optimization perspective, it is known that the behavior of first order methods such as gradient descent crucially depend on the separability of datasets. In fact, even in the most simplest case of binary classification, the rate of convergence depends on two factors: (1) condition number of data matrix, and (2) separability of the dataset. With no further pre-processing techniques such as over-parametrization, data augmentation etc., separability is an intrinsic quantity of the data distribution under consideration. We focus on the landscape design of the logistic function and derive a novel sequence of {\em strictly} convex functions that are at least as strict as logistic loss. The minimizers of these functions coincide with those of the minimum norm solution wherever possible. The strict convexity of the derived function can be extended to finetune state-of-the-art models and applications. In empirical experimental analysis, we apply our proposed rooted logistic objective to multiple deep models, e.g., fully-connected neural networks and transformers, on various of classification benchmarks. Our results illustrate that training with rooted loss function is converged faster and gains performance improvements. Furthermore, we illustrate applications of our novel rooted loss function in generative modeling based downstream applications, such as finetuning StyleGAN model with the rooted loss. The code implementing our losses and models can be found here for open source software development purposes: https://anonymous.4open.science/r/rooted_loss.
PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability
Authors: Avisek Naug, Antonio Guillen, Ricardo Luna Gutiérrez, Vineet Gundecha, Dejan Markovikj, Lekhapriya Dheeraj Kashyap, Lorenz Krause, Sahand Ghorbanpour, Sajad Mousavi, Ashwin Ramesh Babu, Soumyendu Sarkar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03906
Pdf link: https://arxiv.org/pdf/2310.03906
Abstract The increasing global emphasis on sustainability and reducing carbon emissions is pushing governments and corporations to rethink their approach to data center design and operation. Given their high energy consumption and exponentially large computational workloads, data centers are prime candidates for optimizing power consumption, especially in areas such as cooling and IT energy usage. A significant challenge in this pursuit is the lack of a configurable and scalable thermal data center model that offers an end-to-end pipeline. Data centers consist of multiple IT components whose geometric configuration and heat dissipation make thermal modeling difficult. This paper presents PyDCM, a customizable Data Center Model implemented in Python, that allows users to create unique configurations of IT equipment with custom server specifications and geometric arrangements of IT cabinets. The use of vectorized thermal calculations makes PyDCM orders of magnitude faster (30 times) than current Energy Plus modeling implementations and scales sublinearly with the number of CPUs. Also, PyDCM enables the use of Deep Reinforcement Learning via the Gymnasium wrapper to optimize data center cooling and offers a user-friendly platform for testing various data center design prototypes.
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Authors: Josh Alman, Zhao Song
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.04064
Pdf link: https://arxiv.org/pdf/2310.04064
Abstract In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations: $\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$ then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time. $\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory). We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.
Reinforcement Learning with Fast and Forgetful Memory
Authors: Steven Morad, Ryan Kortvelesy, Stephan Liwicki, Amanda Prorok
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04128
Pdf link: https://arxiv.org/pdf/2310.04128
Abstract Nearly all real world tasks are inherently partially observable, necessitating the use of memory in Reinforcement Learning (RL). Most model-free approaches summarize the trajectory into a latent Markov state using memory models borrowed from Supervised Learning (SL), even though RL tends to exhibit different training and efficiency characteristics. Addressing this discrepancy, we introduce Fast and Forgetful Memory, an algorithm-agnostic memory model designed specifically for RL. Our approach constrains the model search space via strong structural priors inspired by computational psychology. It is a drop-in replacement for recurrent neural networks (RNNs) in recurrent RL algorithms, achieving greater reward than RNNs across various recurrent benchmarks and algorithms without changing any hyperparameters. Moreover, Fast and Forgetful Memory exhibits training speeds two orders of magnitude faster than RNNs, attributed to its logarithmic time and linear space complexity. Our implementation is available at https://github.com/proroklab/ffm.
Pika: Empowering Non-Programmers to Author Executable Governance Policies in Online Communities
Authors: Leijie Wang, Nicolas Vincent, Julija Rukanskaitė, Amy X. Zhang
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2310.04329
Pdf link: https://arxiv.org/pdf/2310.04329
Abstract Internet users have formed a wide array of online communities with nuanced and diverse community goals and norms. However, most online platforms only offer a limited set of governance models in their software infrastructure and leave little room for customization. Consequently, technical proficiency becomes a prerequisite for online communities to build governance policies in code, excluding non-programmers from participation in designing community governance. In this paper, we present Pika, a system that empowers non-programmers to author a wide range of executable governance policies. At its core, Pika incorporates a declarative language that decomposes governance policies into modular components, thereby facilitating expressive policy authoring through a user-friendly, form-based web interface. Our user studies with 17 participants show that Pika can empower non-programmers to author governance policies approximately 2.5 times faster than programmers who author in code. We also provide insights about Pika's expressivity in supporting diverse policies that online communities want.
Near-linear Time Dispersion of Mobile Agents
Authors: Yuichi Sudo, Masahiro Shibata, Junya Nakamura, Yonghwan Kim, Toshimitsu Masuzawa
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2310.04376
Pdf link: https://arxiv.org/pdf/2310.04376
Abstract Consider that there are $k\le n$ agents in a simple, connected, and undirected graph $G=(V,E)$ with $n$ nodes and $m$ edges. The goal of the dispersion problem is to move these $k$ agents to distinct nodes. Agents can communicate only when they are at the same node, and no other means of communication such as whiteboards are available. We assume that the agents operate synchronously. We consider two scenarios: when all agents are initially located at any single node (rooted setting) and when they are initially distributed over any one or more nodes (general setting). Kshemkalyani and Sharma presented a dispersion algorithm for the general setting, which uses $O(m_k)$ time and $\log(k+\delta)$ bits of memory per agent [OPODIS 2021]. Here, $m_k$ is the maximum number of edges in any induced subgraph of $G$ with $k$ nodes, and $\delta$ is the maximum degree of $G$. This algorithm is the fastest in the literature, as no algorithm with $o(m_k)$ time has been discovered even for the rooted setting. In this paper, we present faster algorithms for both the rooted and general settings. First, we present an algorithm for the rooted setting that solves the dispersion problem in $O(k\log \min(k,\delta))=O(k\log k)$ time using $O(\log \delta)$ bits of memory per agent. Next, we propose an algorithm for the general setting that achieves dispersion in $O(k (\log k)\cdot (\log \min(k,\delta))=O(k \log^2 k)$ time using $O(\log (k+\delta))$ bits.
Keyword: mobile

ECAvg: An Edge-Cloud Collaborative Learning Approach using Averaged Weights
Authors: Atah Nuh Mih, Hung Cao, Asfia Kawnine, Monica Wachowicz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.03823
Pdf link: https://arxiv.org/pdf/2310.03823
Abstract The use of edge devices together with cloud provides a collaborative relationship between both classes of devices where one complements the shortcomings of the other. Resource-constraint edge devices can benefit from the abundant computing power provided by servers by offloading computationally intensive tasks to the server. Meanwhile, edge devices can leverage their close proximity to the data source to perform less computationally intensive tasks on the data. In this paper, we propose a collaborative edge-cloud paradigm called ECAvg in which edge devices pre-train local models on their respective datasets and transfer the models to the server for fine-tuning. The server averages the pre-trained weights into a global model, which is fine-tuned on the combined data from the various edge devices. The local (edge) models are then updated with the weights of the global (server) model. We implement a CIFAR-10 classification task using MobileNetV2, a CIFAR-100 classification task using ResNet50, and an MNIST classification using a neural network with a single hidden layer. We observed performance improvement in the CIFAR-10 and CIFAR-100 classification tasks using our approach, where performance improved on the server model with averaged weights and the edge models had a better performance after model update. On the MNIST classification, averaging weights resulted in a drop in performance on both the server and edge models due to negative transfer learning. From the experiment results, we conclude that our approach is successful when implemented on deep neural networks such as MobileNetV2 and ResNet50 instead of simple neural networks.
Realizing XR Applications Using 5G-Based 3D Holographic Communication and Mobile Edge Computing
Authors: Dun Yuan, Ekram Hossain, Di Wu, Xue Liu, Gregory Dudek
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2310.03908
Pdf link: https://arxiv.org/pdf/2310.03908
Abstract 3D holographic communication has the potential to revolutionize the way people interact with each other in virtual spaces, offering immersive and realistic experiences. However, demands for high data rates, extremely low latency, and high computations to enable this technology pose a significant challenge. To address this challenge, we propose a novel job scheduling algorithm that leverages Mobile Edge Computing (MEC) servers in order to minimize the total latency in 3D holographic communication. One of the motivations for this work is to prevent the uncanny valley effect, which can occur when the latency hinders the seamless and real-time rendering of holographic content, leading to a less convincing and less engaging user experience. Our proposed algorithm dynamically allocates computation tasks to MEC servers, considering the network conditions, computational capabilities of the servers, and the requirements of the 3D holographic communication application. We conduct extensive experiments to evaluate the performance of our algorithm in terms of latency reduction, and the results demonstrate that our approach significantly outperforms other baseline methods. Furthermore, we present a practical scenario involving Augmented Reality (AR), which not only illustrates the applicability of our algorithm but also highlights the importance of minimizing latency in achieving high-quality holographic views. By efficiently distributing the computation workload among MEC servers and reducing the overall latency, our proposed algorithm enhances the user experience in 3D holographic communications and paves the way for the widespread adoption of this technology in various applications, such as telemedicine, remote collaboration, and entertainment.
Quantized Transformer Language Model Implementations on Edge Devices
Authors: Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, Soheil Salehi
Subjects: Computation and Language (cs.CL); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2310.03971
Pdf link: https://arxiv.org/pdf/2310.03971
Abstract Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications, wherein these models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task. One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency. In order to overcome these limitations, such large-scale models can be converted to an optimized FlatBuffer format, tailored for deployment on resource-constrained edge devices. Herein, we evaluate the performance of such FlatBuffer transformed MobileBERT models on three different edge devices, fine-tuned for Reputation analysis of English language tweets in the RepLab 2013 dataset. In addition, this study encompassed an evaluation of the deployed models, wherein their latency, performance, and resource efficiency were meticulously assessed. Our experiment results show that, compared to the original BERT large model, the converted and quantized MobileBERT models have 160$\times$ smaller footprints for a 4.1% drop in accuracy while analyzing at least one tweet per second on edge devices. Furthermore, our study highlights the privacy-preserving aspect of TinyML systems as all data is processed locally within a serverless environment.
DEFT: A new distance-based feature set for keystroke dynamics
Authors: Nuwan Kaluarachchi, Sevvandi Kandanaarachchi, Kristen Moore, Arathi Arakala
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04059
Pdf link: https://arxiv.org/pdf/2310.04059
Abstract Keystroke dynamics is a behavioural biometric utilised for user identification and authentication. We propose a new set of features based on the distance between keys on the keyboard, a concept that has not been considered before in keystroke dynamics. We combine flight times, a popular metric, with the distance between keys on the keyboard and call them as Distance Enhanced Flight Time features (DEFT). This novel approach provides comprehensive insights into a person's typing behaviour, surpassing typing velocity alone. We build a DEFT model by combining DEFT features with other previously used keystroke dynamic features. The DEFT model is designed to be device-agnostic, allowing us to evaluate its effectiveness across three commonly used devices: desktop, mobile, and tablet. The DEFT model outperforms the existing state-of-the-art methods when we evaluate its effectiveness across two datasets. We obtain accuracy rates exceeding 99% and equal error rates below 10% on all three devices.
Towards 6D MCL for LiDARs in 3D TSDF Maps on Embedded Systems with GPUs
Authors: Marc Eisoldt, Alexander Mock, Mario Porrmann, Thomas Wiemann
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.04172
Pdf link: https://arxiv.org/pdf/2310.04172
Abstract Monte Carlo Localization is a widely used approach in the field of mobile robotics. While this problem has been well studied in the 2D case, global localization in 3D maps with six degrees of freedom has so far been too computationally demanding. Hence, no mobile robot system has yet been presented in literature that is able to solve it in real-time. The computationally most intensive step is the evaluation of the sensor model, but it also offers high parallelization potential. This work investigates the massive parallelization of the evaluation of particles in truncated signed distance fields for three-dimensional laser scanners on embedded GPUs. The implementation on the GPU is 30 times as fast and more than 50 times more energy efficient compared to a CPU implementation.
Entropic Score metric: Decoupling Topology and Size in Training-free NAS
Authors: Niccolò Cavagnero, Luca Robbiano, Francesca Pistilli, Barbara Caputo, Giuseppe Averta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04179
Pdf link: https://arxiv.org/pdf/2310.04179
Abstract Neural Networks design is a complex and often daunting task, particularly for resource-constrained scenarios typical of mobile-sized models. Neural Architecture Search is a promising approach to automate this process, but existing competitive methods require large training time and computational resources to generate accurate models. To overcome these limits, this paper contributes with: i) a novel training-free metric, named Entropic Score, to estimate model expressivity through the aggregated element-wise entropy of its activations; ii) a cyclic search algorithm to separately yet synergistically search model size and topology. Entropic Score shows remarkable ability in searching for the topology of the network, and a proper combination with LogSynflow, to search for model size, yields superior capability to completely design high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, resulting in the fastest and most accurate NAS method for ImageNet classification.
Keyword: pruning

Non-Redundant Graph Neural Networks with Improved Expressiveness
Authors: Franka Bause, Samir Moustafa, Johannes Langguth, Wilfried N. Gansterer, Nils M. Kriege
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04190
Pdf link: https://arxiv.org/pdf/2310.04190
Abstract Message passing graph neural networks iteratively compute node embeddings by aggregating messages from all neighbors. This procedure can be viewed as a neural variant of the Weisfeiler-Leman method, which limits their expressive power. Moreover, oversmoothing and oversquashing restrict the number of layers these networks can effectively utilize. The repeated exchange and encoding of identical information in message passing amplifies oversquashing. We propose a novel aggregation scheme based on neighborhood trees, which allows for controlling the redundancy by pruning branches of the unfolding trees underlying standard message passing. We prove that reducing redundancy improves expressivity and experimentally show that it alleviates oversquashing. We investigate the interaction between redundancy in message passing and redundancy in computation and propose a compact representation of neighborhood trees, from which we compute node and graph embeddings via a neural tree canonization technique. Our method is provably more expressive than the Weisfeiler-Leman method, less susceptible to oversquashing than message passing neural networks, and provides high classification accuracy on widely-used benchmark datasets.
Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach
Authors: Junkun Chen, Jian Xue, Peidong Wang, Jing Pan, Jinyu Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.04399
Pdf link: https://arxiv.org/pdf/2310.04399
Abstract Simultaneous Speech-to-Text translation serves a critical role in real-time crosslingual communication. Despite the advancements in recent years, challenges remain in achieving stability in the translation process, a concern primarily manifested in the flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to completely eliminate flickering. The experiments demonstrate the proposed method can significantly improve the decoding stability without compromising substantially on the translation quality.
Keyword: diffusion

Generative Hyperelasticity with Physics-Informed Probabilistic Diffusion Fields
Authors: Vahidullah Tac, Manuel K Rausch, Ilias Bilionis, Francisco Sahli Costabal, Adrian Buganza Tepole
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03745
Pdf link: https://arxiv.org/pdf/2310.03745
Abstract Many natural materials exhibit highly complex, nonlinear, anisotropic, and heterogeneous mechanical properties. Recently, it has been demonstrated that data-driven strain energy functions possess the flexibility to capture the behavior of these complex materials with high accuracy while satisfying physics-based constraints. However, most of these approaches disregard the uncertainty in the estimates and the spatial heterogeneity of these materials. In this work, we leverage recent advances in generative models to address these issues. We use as building block neural ordinary equations (NODE) that -- by construction -- create polyconvex strain energy functions, a key property of realistic hyperelastic material models. We combine this approach with probabilistic diffusion models to generate new samples of strain energy functions. This technique allows us to sample a vector of Gaussian white noise and translate it to NODE parameters thereby representing plausible strain energy functions. We extend our approach to spatially correlated diffusion resulting in heterogeneous material properties for arbitrary geometries. We extensively test our method with synthetic and experimental data on biological tissues and run finite element simulations with various degrees of spatial heterogeneity. We believe this approach is a major step forward including uncertainty in predictive, data-driven models of hyperelasticity
Characterizing the Features of Mitotic Figures Using a Conditional Diffusion Probabilistic Model
Authors: Cagla Deniz Bahadir, Benjamin Liechty, David J. Pisapia, Mert R. Sabuncu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.03893
Pdf link: https://arxiv.org/pdf/2310.03893
Abstract Mitotic figure detection in histology images is a hard-to-define, yet clinically significant task, where labels are generated with pathologist interpretations and where there is no ``gold-standard'' independent ground-truth. However, it is well-established that these interpretation based labels are often unreliable, in part, due to differences in expertise levels and human subjectivity. In this paper, our goal is to shed light on the inherent uncertainty of mitosis labels and characterize the mitotic figure classification task in a human interpretable manner. We train a probabilistic diffusion model to synthesize patches of cell nuclei for a given mitosis label condition. Using this model, we can then generate a sequence of synthetic images that correspond to the same nucleus transitioning into the mitotic state. This allows us to identify different image features associated with mitosis, such as cytoplasm granularity, nuclear density, nuclear irregularity and high contrast between the nucleus and the cell body. Our approach offers a new tool for pathologists to interpret and communicate the features driving the decision to recognize a mitotic figure.
Diffusion Models as Masked Audio-Video Learners
Authors: Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.03937
Pdf link: https://arxiv.org/pdf/2310.03937
Abstract Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. The incorporation of diffusion into MAViL, combined with various training efficiency methodologies that include the utilization of a masking ratio curriculum and adaptive batch sizing, results in a notable 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall clock time. Crucially, this enhanced efficiency does not compromise the model's performance in downstream audio-classification tasks when compared to MAViL's performance.
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Authors: Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.04004
Pdf link: https://arxiv.org/pdf/2310.04004
Abstract Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
Observation-Guided Diffusion Probabilistic Models
Authors: Junoh Kang, Jinyoung Choi, Sungik Choi, Bohyung Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04041
Pdf link: https://arxiv.org/pdf/2310.04041
Abstract We propose a novel diffusion model called observation-guided diffusion probabilistic model (OGDM), which effectively addresses the trade-off between quality control and fast sampling. Our approach reestablishes the training objective by integrating the guidance of the observation process with the Markov chain in a principled way. This is achieved by introducing an additional loss term derived from the observation based on the conditional discriminator on noise level, which employs Bernoulli distribution indicating whether its input lies on the (noisy) real manifold or not. This strategy allows us to optimize the more accurate negative log-likelihood induced in the inference stage especially when the number of function evaluations is limited. The proposed training method is also advantageous even when incorporated only into the fine-tuning process, and it is compatible with various fast inference strategies since our method yields better denoising networks using the exactly same inference procedure without incurring extra computational cost. We demonstrate the effectiveness of the proposed training algorithm using diverse inference methods on strong diffusion model baselines.
VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification
Authors: Han Huang, Yan Huang, Liang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.04122
Pdf link: https://arxiv.org/pdf/2310.04122
Abstract Visible-Infrared person re-identification (VI-ReID) in real-world scenarios poses a significant challenge due to the high cost of cross-modality data annotation. Different sensing cameras, such as RGB/IR cameras for good/poor lighting conditions, make it costly and error-prone to identify the same person across modalities. To overcome this, we explore the use of single-modality labeled data for the VI-ReID task, which is more cost-effective and practical. By labeling pedestrians in only one modality (e.g., visible images) and retrieving in another modality (e.g., infrared images), we aim to create a training set containing both originally labeled and modality-translated data using unpaired image-to-image translation techniques. In this paper, we propose VI-Diff, a diffusion model that effectively addresses the task of Visible-Infrared person image translation. Through comprehensive experiments, we demonstrate that VI-Diff outperforms existing diffusion and GAN models, making it a promising solution for VI-ReID with single-modality labeled data. Our approach can be a promising solution to the VI-ReID task with single-modality labeled data and serves as a good starting point for future study. Code will be available.
Fully discrete Galerkin scheme for a semilinear subdiffusion equation with nonsmooth data and time-dependent coefficient
Authors: Łukasz Płociniczak, Kacper Taźbierski
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.04246
Pdf link: https://arxiv.org/pdf/2310.04246
Abstract We couple the L1 discretization of the Caputo fractional derivative in time with the Galerkin scheme to devise a linear numerical method for the semilinear subdiffusion equation. Two important points that we make are: nonsmooth initial data and time-dependent diffusion coefficient. We prove the stability and convergence of the method under weak assumptions concerning regularity of the diffusivity. We find optimal pointwise in space and global in time errors, which are verified with several numerical experiments.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Authors: Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04378
Pdf link: https://arxiv.org/pdf/2310.04378
Abstract Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis
Authors: Xiaoxiao Sun, Xingjian Leng, Zijian Wang, Yang Yang, Zi Huang, Liang Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.04414
Pdf link: https://arxiv.org/pdf/2310.04414
Abstract Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.
Keyword: adaptive

Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning
Authors: Harsh Chaudhari, Giorgio Severi, Alina Oprea, Jonathan Ullman
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03838
Pdf link: https://arxiv.org/pdf/2310.03838
Abstract The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.
Diffusion Models as Masked Audio-Video Learners
Authors: Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.03937
Pdf link: https://arxiv.org/pdf/2310.03937
Abstract Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. The incorporation of diffusion into MAViL, combined with various training efficiency methodologies that include the utilization of a masking ratio curriculum and adaptive batch sizing, results in a notable 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall clock time. Crucially, this enhanced efficiency does not compromise the model's performance in downstream audio-classification tasks when compared to MAViL's performance.
A Learnable Counter-condition Analysis Framework for Functional Connectivity-based Neurological Disorder Diagnosis
Authors: Eunsong Kang, Da-woon Heo, Jiwon Lee, Heung-Il Suk
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2310.03964
Pdf link: https://arxiv.org/pdf/2310.03964
Abstract To understand the biological characteristics of neurological disorders with functional connectivity (FC), recent studies have widely utilized deep learning-based models to identify the disease and conducted post-hoc analyses via explainable models to discover disease-related biomarkers. Most existing frameworks consist of three stages, namely, feature selection, feature extraction for classification, and analysis, where each stage is implemented separately. However, if the results at each stage lack reliability, it can cause misdiagnosis and incorrect analysis in afterward stages. In this study, we propose a novel unified framework that systemically integrates diagnoses (i.e., feature selection and feature extraction) and explanations. Notably, we devised an adaptive attention network as a feature selection approach to identify individual-specific disease-related connections. We also propose a functional network relational encoder that summarizes the global topological properties of FC by learning the inter-network relations without pre-defined edges between functional networks. Last but not least, our framework provides a novel explanatory power for neuroscientific interpretation, also termed counter-condition analysis. We simulated the FC that reverses the diagnostic information (i.e., counter-condition FC): converting a normal brain to be abnormal and vice versa. We validated the effectiveness of our framework by using two large resting-state functional magnetic resonance imaging (fMRI) datasets, Autism Brain Imaging Data Exchange (ABIDE) and REST-meta-MDD, and demonstrated that our framework outperforms other competing methods for disease identification. Furthermore, we analyzed the disease-related neurological patterns based on counter-condition analysis.
Adaptive Computation of an Elliptic Eigenvalue Optimization Problem with a Phase-Field Approach
Authors: Jing Li, Yifeng Xu, Shengfeng Zhu
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.03970
Pdf link: https://arxiv.org/pdf/2310.03970
Abstract In this paper, we discuss adaptive approximations of an elliptic eigenvalue optimization problem in a phase-field setting by a conforming finite element method. An adaptive algorithm is proposed and implemented in several two dimensional numerical examples for illustration of efficiency and accuracy. Theoretical findings consist in the vanishing limit of a subsequence of estimators and the convergence of the relevant subsequence of adaptively-generated solutions to a solution to the continuous optimality system.
AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement
Authors: Zhenghai Xue, Qingpeng Cai, Tianyou Zuo, Bin Yang, Lantao Hu, Peng Jiang, Kun Gai, Bo An
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03984
Pdf link: https://arxiv.org/pdf/2310.03984
Abstract Growing attention has been paid to Reinforcement Learning (RL) algorithms when optimizing long-term user engagement in sequential recommendation tasks. One challenge in large-scale online recommendation systems is the constant and complicated changes in users' behavior patterns, such as interaction rates and retention tendencies. When formulated as a Markov Decision Process (MDP), the dynamics and reward functions of the recommendation system are continuously affected by these changes. Existing RL algorithms for recommendation systems will suffer from distribution shift and struggle to adapt in such an MDP. In this paper, we introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address this issue. AdaRec proposes a new distance-based representation loss to extract latent information from users' interaction trajectories. Such information reflects how RL policy fits to current user behavior patterns, and helps the policy to identify subtle changes in the recommendation system. To make rapid adaptation to these changes, AdaRec encourages exploration with the idea of optimism under uncertainty. The exploration is further guarded by zero-order action optimization to ensure stable recommendation quality in complicated environments. We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks, where AdaRec exhibits superior long-term performance compared to all baseline algorithms.
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Authors: Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.04004
Pdf link: https://arxiv.org/pdf/2310.04004
Abstract Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
Maximizing Performance with Minimal Resources for Real-Time Transition Detection
Authors: Zeynep Ozge Orhan, Andrea Dal Prete, Anastasia Bolotnikova, Marta Gandolla, Auke Ijspeert, Mohamed Bouri
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.04117
Pdf link: https://arxiv.org/pdf/2310.04117
Abstract Assistive devices, such as exoskeletons and prostheses, have revolutionized the field of rehabilitation and mobility assistance. Efficiently detecting transitions between different activities, such as walking, stair ascending and descending, and sitting, is crucial for ensuring adaptive control and enhancing user experience. We here present an approach for real-time transition detection, aimed at optimizing the processing-time performance. By establishing activity-specific threshold values through trained machine learning models, we effectively distinguish motion patterns and we identify transition moments between locomotion modes. This threshold-based method improves real-time embedded processing time performance by up to 11 times compared to machine learning approaches. The efficacy of the developed finite-state machine is validated using data collected from three different measurement systems. Moreover, experiments with healthy participants were conducted on an active pelvis orthosis to validate the robustness and reliability of our approach. The proposed algorithm achieved high accuracy in detecting transitions between activities. These promising results show the robustness and reliability of the method, reinforcing its potential for integration into practical applications.
Dynamic Relation-Attentive Graph Neural Networks for Fraud Detection
Authors: Heehyeon Kim, Jinhyeok Choi, Joyce Jiyoung Whang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2310.04171
Pdf link: https://arxiv.org/pdf/2310.04171
Abstract Fraud detection aims to discover fraudsters deceiving other users by, for example, leaving fake reviews or making abnormal transactions. Graph-based fraud detection methods consider this task as a classification problem with two classes: frauds or normal. We address this problem using Graph Neural Networks (GNNs) by proposing a dynamic relation-attentive aggregation mechanism. Based on the observation that many real-world graphs include different types of relations, we propose to learn a node representation per relation and aggregate the node representations using a learnable attention function that assigns a different attention coefficient to each relation. Furthermore, we combine the node representations from different layers to consider both the local and global structures of a target node, which is beneficial to improving the performance of fraud detection on graphs with heterophily. By employing dynamic graph attention in all the aggregation processes, our method adaptively computes the attention coefficients for each node. Experimental results show that our method, DRAG, outperforms state-of-the-art fraud detection methods on real-world benchmark datasets.
Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution
Authors: Qingguo Liu, Pan Gao, Kang Han, Ningzhong Liu, Wei Xiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.04180
Pdf link: https://arxiv.org/pdf/2310.04180
Abstract Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. Source code is available at: https://github.com/I2-Multimedia-Lab/DSAT/tree/main.
Workload-aware and Learned Z-Indexes
Authors: Sachith Pai, Michael Mathioudakis, Yanhao Wang
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2310.04268
Pdf link: https://arxiv.org/pdf/2310.04268
Abstract In this paper, a learned and workload-aware variant of a Z-index, which jointly optimizes storage layout and search structures, as a viable solution for the above challenges of spatial indexing. Specifically, we first formulate a cost function to measure the performance of a Z-index on a dataset for a range-query workload. Then, we optimize the Z-index structure by minimizing the cost function through adaptive partitioning and ordering for index construction. Moreover, we design a novel page-skipping mechanism to improve its query performance by reducing access to irrelevant data pages. Our extensive experiments show that our index improves range query time by 40% on average over the baselines, while always performing better or comparably to state-of-the-art spatial indexes. Additionally, our index maintains good point query performance while providing favourable construction time and index size tradeoffs.
Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning
Authors: Qing Zhu, Qirong Mao, Jialin Zhang, Xiaohua Huang, Wenming Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04306
Pdf link: https://arxiv.org/pdf/2310.04306
Abstract Group-level emotion recognition (GER) is an inseparable part of human behavior analysis, aiming to recognize an overall emotion in a multi-person scene. However, the existing methods are devoted to combing diverse emotion cues while ignoring the inherent uncertainties under unconstrained environments, such as congestion and occlusion occurring within a group. Additionally, since only group-level labels are available, inconsistent emotion predictions among individuals in one group can confuse the network. In this paper, we propose an uncertainty-aware learning (UAL) method to extract more robust representations for GER. By explicitly modeling the uncertainty of each individual, we utilize stochastic embedding drawn from a Gaussian distribution instead of deterministic point embedding. This representation captures the probabilities of different emotions and generates diverse predictions through this stochasticity during the inference stage. Furthermore, uncertainty-sensitive scores are adaptively assigned as the fusion weights of individuals' face within each group. Moreover, we develop an image enhancement module to enhance the model's robustness against severe noise. The overall three-branch model, encompassing face, object, and scene component, is guided by a proportional-weighted fusion strategy and integrates the proposed uncertainty-aware method to produce the final group-level output. Experimental results demonstrate the effectiveness and generalization ability of our method across three widely used databases.
Confronting Reward Model Overoptimization with Constrained RLHF
Authors: Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen McAleer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.04373
Pdf link: https://arxiv.org/pdf/2310.04373
Abstract Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally given by the Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
Authors: Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, Yu-Xiong Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.04406
Pdf link: https://arxiv.org/pdf/2310.04406
Abstract While large language models (LLMs) have demonstrated impressive performance on a range of decision-making tasks, they rely on simple acting processes and fall short of broad deployment as autonomous agents. We introduce LATS (Language Agent Tree Search), a general framework that synergizes the capabilities of LLMs in planning, acting, and reasoning. Drawing inspiration from Monte Carlo tree search in model-based reinforcement learning, LATS employs LLMs as agents, value functions, and optimizers, repurposing their latent strengths for enhanced decision-making. What is crucial in this method is the use of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism that moves beyond the limitations of existing techniques. Our experimental evaluation across diverse domains, such as programming, HotPotQA, and WebShop, illustrates the applicability of LATS for both reasoning and acting. In particular, LATS achieves 94.4\% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.
Keyword: quantization

Learning A Disentangling Representation For PU Learning
Authors: Omar Zamzam, Haleh Akrami, Mahdi Soltanolkotabi, Richard Leahy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.03833
Pdf link: https://arxiv.org/pdf/2310.03833
Abstract In this paper, we address the problem of learning a binary (positive vs. negative) classifier given Positive and Unlabeled data commonly referred to as PU learning. Although rudimentary techniques like clustering, out-of-distribution detection, or positive density estimation can be used to solve the problem in low-dimensional settings, their efficacy progressively deteriorates with higher dimensions due to the increasing complexities in the data distribution. In this paper we propose to learn a neural network-based data representation using a loss function that can be used to project the unlabeled data into two (positive and negative) clusters that can be easily identified using simple clustering techniques, effectively emulating the phenomenon observed in low-dimensional settings. We adopt a vector quantization technique for the learned representations to amplify the separation between the learned unlabeled data clusters. We conduct experiments on simulated PU data that demonstrate the improved performance of our proposed method compared to the current state-of-the-art approaches. We also provide some theoretical justification for our two cluster-based approach and our algorithmic choices.
Sub-token ViT Embedding via Stochastic Resonance Transformers
Authors: Dong Lao, Yangchao Wu, Tian Yu Liu, Alex Wong, Stefano Soatto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.03967
Pdf link: https://arxiv.org/pdf/2310.03967
Abstract We discover the presence of quantization artifacts in Vision Transformers (ViTs), which arise due to the image tokenization step inherent in these architectures. These artifacts result in coarsely quantized features, which negatively impact performance, especially on downstream dense prediction tasks. We present a zero-shot method to improve how pre-trained ViTs handle spatial quantization. In particular, we propose to ensemble the features obtained from perturbing input images via sub-token spatial translations, inspired by Stochastic Resonance, a method traditionally applied to climate dynamics and signal processing. We term our method ``Stochastic Resonance Transformer" (SRT), which we show can effectively super-resolve features of pre-trained ViTs, capturing more of the local fine-grained structures that might otherwise be neglected as a result of tokenization. SRT can be applied at any layer, on any task, and does not require any fine-tuning. The advantage of the former is evident when applied to monocular depth prediction, where we show that ensembling model outputs are detrimental while applying SRT on intermediate ViT features outperforms the baseline models by an average of 4.7% and 14.9% on the RMSE and RMSE-log metrics across three different architectures. When applied to semi-supervised video object segmentation, SRT also improves over the baseline models uniformly across all metrics, and by an average of 2.4% in F&J score. We further show that these quantization artifacts can be attenuated to some extent via self-distillation. On the unsupervised salient region segmentation, SRT improves upon the base model by an average of 2.1% on the maxF metric. Finally, despite operating purely on pixel-level features, SRT generalizes to non-dense prediction tasks such as image retrieval and object discovery, yielding consistent improvements of up to 2.6% and 1.0% respectively.

A-suozhang / GetArxivDaily

New submissions for Mon, 9 Oct 23 #170

Keyword: efficient

Progressive reduced order modeling: empowering data-driven modeling with selective knowledge transfer

PrIeD-KIE: Towards Privacy Preserved Document Key Information Extraction

Quantitative passive imaging by iterative holography: The example of helioseismic holography

Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures

Information Geometry for the Working Information Theorist

Realizing XR Applications Using 5G-Based 3D Holographic Communication and Mobile Edge Computing

RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels

Leveraging Low-Rank and Sparse Recurrent Connectivity for Robust Closed-Loop Control

An Efficient Content-based Time Series Retrieval System

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Multilingual and Low Resource Scenarios

Ultimate limit on learning non-Markovian behavior: Fisher information rate and excess information

Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation

Nonlinear Methods for Shape Optimization Problems in Liquid Crystal Tactoids

A short report on preconditioned Anderson acceleration method

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

Maximizing Performance with Minimal Resources for Real-Time Transition Detection

Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning

Amortized Network Intervention to Steer the Excitatory Point Processes

Light-LOAM: A Lightweight LiDAR Odometry and Mapping based on Graph-Matching

Dynamic Relation-Attentive Graph Neural Networks for Fraud Detection

Towards 6D MCL for LiDARs in 3D TSDF Maps on Embedded Systems with GPUs

Cross-Edge Orchestration of Serverless Functions with Probabilistic Caching

Sparse Identification of Nonlinear Dynamics with Side Information (SINDy-SI)

Optimization with pattern-avoiding input

Comparing Auxiliary Tasks for Learning Representations for Reinforcement Learning

Fully discrete Galerkin scheme for a semilinear subdiffusion equation with nonsmooth data and time-dependent coefficient

Compositional Servoing by Recombining Demonstrations

Latent Graph Inference with Limited Supervision

Adjustable Robust Reinforcement Learning for Online 3D Bin Packing

Amortizing intractable inference in large language models

Enhanced Backpressure Routing with Wireless Link Features

Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Keyword: faster

Fishnets: Information-Optimal, Scalable Aggregation for Sets and Graphs

Accelerated Neural Network Training with Rooted Logistic Objectives

PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation

Reinforcement Learning with Fast and Forgetful Memory

Pika: Empowering Non-Programmers to Author Executable Governance Policies in Online Communities

Near-linear Time Dispersion of Mobile Agents

Keyword: mobile

ECAvg: An Edge-Cloud Collaborative Learning Approach using Averaged Weights

Realizing XR Applications Using 5G-Based 3D Holographic Communication and Mobile Edge Computing

Quantized Transformer Language Model Implementations on Edge Devices

DEFT: A new distance-based feature set for keystroke dynamics

Towards 6D MCL for LiDARs in 3D TSDF Maps on Embedded Systems with GPUs

Entropic Score metric: Decoupling Topology and Size in Training-free NAS

Keyword: pruning

Non-Redundant Graph Neural Networks with Improved Expressiveness

Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Keyword: diffusion

Generative Hyperelasticity with Physics-Informed Probabilistic Diffusion Fields

Characterizing the Features of Mitotic Figures Using a Conditional Diffusion Probabilistic Model

Diffusion Models as Masked Audio-Video Learners

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

Observation-Guided Diffusion Probabilistic Models

VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification

Fully discrete Galerkin scheme for a semilinear subdiffusion equation with nonsmooth data and time-dependent coefficient

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis

Keyword: adaptive

Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Diffusion Models as Masked Audio-Video Learners

A Learnable Counter-condition Analysis Framework for Functional Connectivity-based Neurological Disorder Diagnosis

Adaptive Computation of an Elliptic Eigenvalue Optimization Problem with a Phase-Field Approach

AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

Maximizing Performance with Minimal Resources for Real-Time Transition Detection

Dynamic Relation-Attentive Graph Neural Networks for Fraud Detection

Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

Workload-aware and Learned Z-Indexes

Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning

Confronting Reward Model Overoptimization with Constrained RLHF

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Keyword: quantization