New submissions for Fri, 27 Oct 23

Keyword: efficient

Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs

Authors: Chao Qian, Tianheng Ling, Gregor Schiele
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.16842
Pdf link: https://arxiv.org/pdf/2310.16842
Abstract To process sensor data in the Internet of Things(IoTs), embedded deep learning for 1-dimensional data is an important technique. In the past, CNNs were frequently used because they are simple to optimise for special embedded hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices. Using the traffic speed prediction as a case study, a vanilla LSTM model with the optimised LSTM cell achieves 17534 inferences per second while consuming only 3.8 $\mu$J per inference on the FPGA \textit{XC7S15} from \textit{Spartan-7} family. It achieves at least 5.4$\times$ faster throughput and 1.37$\times$ more energy efficient than existing approaches.
Hardware-Algorithm Co-design Enabling Processing-in-Pixel-in-Memory (P2M) for Neuromorphic Vision Sensors
Authors: Md Abdullah-Al Kaiser, Akhilesh R. Jaiswal
Subjects: Hardware Architecture (cs.AR); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2310.16844
Pdf link: https://arxiv.org/pdf/2310.16844
Abstract The high volume of data transmission between the edge sensor and the cloud processor leads to energy and throughput bottlenecks for resource-constrained edge devices focused on computer vision. Hence, researchers are investigating different approaches (e.g., near-sensor processing, in-sensor processing, in-pixel processing) by executing computations closer to the sensor to reduce the transmission bandwidth. Specifically, in-pixel processing for neuromorphic vision sensors (e.g., dynamic vision sensors (DVS)) involves incorporating asynchronous multiply-accumulate (MAC) operations within the pixel array, resulting in improved energy efficiency. In a CMOS implementation, low overhead energy-efficient analog MAC accumulates charges on a passive capacitor; however, the capacitor's limited charge retention time affects the algorithmic integration time choices, impacting the algorithmic accuracy, bandwidth, energy, and training efficiency. Consequently, this results in a design trade-off on the hardware aspect-creating a need for a low-leakage compute unit while maintaining the area and energy benefits. In this work, we present a holistic analysis of the hardware-algorithm co-design trade-off based on the limited integration time posed by the hardware and techniques to improve the leakage performance of the in-pixel analog MAC operations.
A Type System for Julia
Authors: Benjamin Chung
Subjects: Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2310.16866
Pdf link: https://arxiv.org/pdf/2310.16866
Abstract The Julia programming language was designed to fill the needs of scientific computing by combining the benefits of productivity and performance languages. Julia allows users to write untyped scripts easily without needing to worry about many implementation details, as do other productivity languages. If one just wants to get the work done-regardless of how efficient or general the program might be, such a paradigm is ideal. Simultaneously, Julia also allows library developers to write efficient generic code that can run as fast as implementations in performance languages such as C or Fortran. This combination of user-facing ease and library developer-facing performance has proven quite attractive, and the language has increasing adoption. With adoption comes combinatorial challenges to correctness. Multiple dispatch -- Julia's key mechanism for abstraction -- allows many libraries to compose "out of the box." However, it creates bugs where one library's requirements do not match what another provides. Typing could address this at the cost of Julia's flexibility for scripting. I developed a "best of both worlds" solution: gradual typing for Julia. My system forms the core of a gradual type system for Julia, laying the foundation for improving the correctness of Julia programs while not getting in the way of script writers. My framework allows methods to be individually typed or untyped, allowing users to write untyped code that interacts with typed library code and vice versa. Typed methods then get a soundness guarantee that is robust in the presence of both dynamically typed code and dynamically generated definitions. I additionally describe protocols, a mechanism for typing abstraction over concrete implementation that accommodates one common pattern in Julia libraries, and describe its implementation into my typed Julia framework.
Diagnosing Alzheimer's Disease using Early-Late Multimodal Data Fusion with Jacobian Maps
Authors: Yasmine Mustafa, Tie Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.16936
Pdf link: https://arxiv.org/pdf/2310.16936
Abstract Alzheimer's disease (AD) is a prevalent and debilitating neurodegenerative disorder impacting a large aging population. Detecting AD in all its presymptomatic and symptomatic stages is crucial for early intervention and treatment. An active research direction is to explore machine learning methods that harness multimodal data fusion to outperform human inspection of medical scans. However, existing multimodal fusion models have limitations, including redundant computation, complex architecture, and simplistic handling of missing data. Moreover, the preprocessing pipelines of medical scans remain inadequately detailed and are seldom optimized for individual subjects. In this paper, we propose an efficient early-late fusion (ELF) approach, which leverages a convolutional neural network for automated feature extraction and random forests for their competitive performance on small datasets. Additionally, we introduce a robust preprocessing pipeline that adapts to the unique characteristics of individual subjects and makes use of whole brain images rather than slices or patches. Moreover, to tackle the challenge of detecting subtle changes in brain volume, we transform images into the Jacobian domain (JD) to enhance both accuracy and robustness in our classification. Using MRI and CT images from the OASIS-3 dataset, our experiments demonstrate the effectiveness of the ELF approach in classifying AD into four stages with an accuracy of 97.19%.
The Teenager's Problem: Efficient Garment Decluttering With Grasp Optimization
Authors: Aviv Adler (1), Ayah Ahmad (1), Shengyin Wang (2), Wisdom C. Agboh (1 and 2), Edith Llontop (1), Tianshuang Qiu (1), Jeffrey Ichnowski (3), Mehmet Dogar (2), Thomas Kollar (4), Richard Cheng (4), Ken Goldberg (1) ((1) AUTOLab at the University of California, Berkeley, (2) University of Leeds, (3) Carnegie Mellon University, (4) Toyota Research Institute)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.16951
Pdf link: https://arxiv.org/pdf/2310.16951
Abstract This paper addresses the ''Teenager's Problem'': efficiently removing scattered garments from a planar surface. As grasping and transporting individual garments is highly inefficient, we propose analytical policies to select grasp locations for multiple garments using an overhead camera. Two classes of methods are considered: depth-based, which use overhead depth data to find efficient grasps, and segment-based, which use segmentation on the RGB overhead image (without requiring any depth data); grasp efficiency is measured by Objects per Transport, which denotes the average number of objects removed per trip to the laundry basket. Experiments suggest that both depth- and segment-based methods easily reduce Objects per Transport (OpT) by $20\%$; furthermore, these approaches complement each other, with combined hybrid methods yielding improvements of $34\%$. Finally, a method employing consolidation (with segmentation) is considered, which manipulates the garments on the work surface to increase OpT; this yields an improvement of $67\%$ over the baseline, though at a cost of additional physical actions.
Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning
Authors: Ananth Balashankar, Xiao Ma, Aradhana Sinha, Ahmad Beirami, Yao Qin, Jilin Chen, Alex Beutel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.16959
Pdf link: https://arxiv.org/pdf/2310.16959
Abstract As large language models (LLMs) are widely adopted, new safety issues and policies emerge, to which existing safety classifiers do not generalize well. If we have only observed a few examples of violations of a new safety rule, how can we build a classifier to detect violations? In this paper, we study the novel setting of domain-generalized few-shot learning for LLM-based text safety classifiers. Unlike prior few-shot work, these new safety issues can be hard to uncover and we do not get to choose the few examples. We demonstrate that existing few-shot techniques do not perform well in this setting, and rather we propose to do parameter-efficient fine-tuning (PEFT) combined with augmenting training data based on similar examples in prior existing rules. We empirically show that our approach of similarity-based data-augmentation + prompt-tuning (DAPT) consistently outperforms baselines that either do not rely on data augmentation or on PEFT by 7-17% F1 score in the Social Chemistry moral judgement and 9-13% AUC in the Toxicity detection tasks, even when the new rule is loosely correlated with existing ones.
WaLiN-GUI: a graphical and auditory tool for neuron-based encoding
Authors: Simon F. Müller-Cleve, Fernando M. Quintana, Vittorio Fra, Pedro L. Galindo, Fernando Perez-Peña, Gianvito Urgese, Chiara Bartolozzi
Subjects: Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2310.16983
Pdf link: https://arxiv.org/pdf/2310.16983
Abstract Neuromorphic computing relies on spike-based, energy-efficient communication, inherently implying the need for conversion between real-valued (sensory) data and binary, sparse spiking representation. This is usually accomplished using the real valued data as current input to a spiking neuron model, and tuning the neuron's parameters to match a desired, often biologically inspired behaviour. We developed a tool, the WaLiN-GUI, that supports the investigation of neuron models and parameter combinations to identify suitable configurations for neuron-based encoding of sample-based data into spike trains. Due to the generalized LIF model implemented by default, next to the LIF and Izhikevich neuron models, many spiking behaviors can be investigated out of the box, thus offering the possibility of tuning biologically plausible responses to the input data. The GUI is provided open source and with documentation, being easy to extend with further neuron models and personalize with data analysis functions.
Packed to the Brim: Investigating the Impact of Highly Responsive Prefixes on Internet-wide Measurement Campaigns
Authors: Patrick Sattler, Johannes Zirngibl, Mattijs Jonker, Oliver Gasser, Georg Carle, Ralph Holz
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2310.17012
Pdf link: https://arxiv.org/pdf/2310.17012
Abstract Internet-wide scans are an important tool to evaluate the deployment of services. To enable large-scale application layer scans, a fast, stateless port scan (e.g., using ZMap) is often performed ahead of time to collect responsive targets. It is a common expectation that port scans on the entire IPv4 address space provide a relatively unbiased view as they cover the complete address space. Previous work, however, has found prefixes where all addresses share particular properties. In IPv6, aliased prefixes and fully responsive prefixes, i.e., prefixes where all addresses are responsive, are a well-known phenomenon. However, there is no such in-depth analysis for prefixes with these responsiveness patterns in IPv4. This paper delves into the underlying factors of this phenomenon in the context of IPv4 and evaluates port scans on a total of 161 ports (142 TCP & 19 UDP ports) from three different vantage points. To account for packet loss and other scanning artifacts, we propose the notion of a new category of prefixes, which we call highly responsive prefixes (HRPs). Our findings show that the share of HRPs can make up 70 % of responsive addresses on selected ports. Regarding specific ports, we observe that CDNs contribute to the largest fraction of HRPs on TCP/80 and TCP/443, while TCP proxies emerge as the primary cause of HRPs on other ports. Our analysis also reveals that application layer handshakes to targets outside HRPs are, depending on the chosen service, up to three times more likely to be successful compared to handshakes with targets located in HRPs. To improve future scanning campaigns conducted by the research community, we make our study's data publicly available and provide a tool for detecting HRPs. Furthermore, we propose an approach for a more efficient, ethical, and sustainable application layer target selection.
Streaming Factor Trajectory Learning for Temporal Tensor Decomposition
Authors: Shikai Fang, Xin Yu, Shibo Li, Zheng Wang, Robert Kirby, Shandian Zhe
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17021
Pdf link: https://arxiv.org/pdf/2310.17021
Abstract Practical tensor data is often along with time information. Most existing temporal decomposition approaches estimate a set of fixed factors for the objects in each tensor mode, and hence cannot capture the temporal evolution of the objects' representation. More important, we lack an effective approach to capture such evolution from streaming data, which is common in real-world applications. To address these issues, we propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition. We use Gaussian processes (GPs) to model the trajectory of factors so as to flexibly estimate their temporal evolution. To address the computational challenges in handling streaming data, we convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE). We develop an efficient online filtering algorithm to estimate a decoupled running posterior of the involved factor states upon receiving new data. The decoupled estimation enables us to conduct standard Rauch-Tung-Striebel smoothing to compute the full posterior of all the trajectories in parallel, without the need for revisiting any previous data. We have shown the advantage of SFTL in both synthetic tasks and real-world applications.
On Surgical Fine-tuning for Language Encoders
Authors: Abhilasha Lodha, Gayatri Belapurkar, Saloni Chalkapurkar, Yuanming Tao, Reshmi Ghosh, Samyadeep Basu, Dmitrii Petrov, Soundararajan Srinivasan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2310.17041
Pdf link: https://arxiv.org/pdf/2310.17041
Abstract Fine-tuning all the layers of a pre-trained neural language encoder (either using all the parameters or using parameter-efficient methods) is often the de-facto way of adapting it to a new task. We show evidence that for different downstream language tasks, fine-tuning only a subset of layers is sufficient to obtain performance that is close to and often better than fine-tuning all the layers in the language encoder. We propose an efficient metric based on the diagonal of the Fisher information matrix (FIM score), to select the candidate layers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE tasks and across distinct language encoders, that this metric can effectively select layers leading to a strong downstream performance. Our work highlights that task-specific information corresponding to a given downstream task is often localized within a few layers, and tuning only those is sufficient for strong performance. Additionally, we demonstrate the robustness of the FIM score to rank layers in a manner that remains constant during the optimization process.
Learning to Rank for Active Learning via Multi-Task Bilevel Optimization
Authors: Zixin Ding, Si Chen, Ruoxi Jia, Yuxin Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17044
Pdf link: https://arxiv.org/pdf/2310.17044
Abstract Active learning is a promising paradigm to reduce the labeling cost by strategically requesting labels to improve model performance. However, existing active learning methods often rely on expensive acquisition function to compute, extensive modeling retraining and multiple rounds of interaction with annotators. To address these limitations, we propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition. A key challenge in this approach is developing an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time. Our novel algorithmic contribution is a bilevel multi-task bilevel optimization framework that predicts the relative utility -- measured by the validation accuracy -- of different training sets, and ensures the learned acquisition function generalizes effectively. For cases where validation accuracy is expensive to evaluate, we introduce efficient interpolation-based surrogate models to estimate the utility function, reducing the evaluation cost. We demonstrate the performance of our approach through extensive experiments on standard active classification benchmarks. By employing our learned utility function, we show significant improvements over traditional techniques, paving the way for more efficient and effective utility maximization in active learning applications.
Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer
Authors: Jianwei Zhang, Suren Jayasuriya, Visar Berisha
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.17049
Pdf link: https://arxiv.org/pdf/2310.17049
Abstract A good supervised embedding for a specific machine learning task is only sensitive to changes in the label of interest and is invariant to other confounding factors. We leverage the concept of repeatability from measurement theory to describe this property and propose to use the intra-class correlation coefficient (ICC) to evaluate the repeatability of embeddings. We then propose a novel regularizer, the ICC regularizer, as a complementary component for contrastive losses to guide deep neural networks to produce embeddings with higher repeatability. We use simulated data to explain why the ICC regularizer works better on minimizing the intra-class variance than the contrastive loss alone. We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice. The experimental results demonstrate that adding an ICC regularizer can improve the repeatability of learned embeddings compared to only using the contrastive loss; further, these embeddings lead to improved performance in these downstream tasks.
BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation
Authors: Yufei Tian, Felix Zhang, Nanyun Peng
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17054
Pdf link: https://arxiv.org/pdf/2310.17054
Abstract Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.
A Method for Network Intrusion Detection Using Flow Sequence and BERT Framework
Authors: Loc Gia Nguyen, Kohei Watabe
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2310.17127
Pdf link: https://arxiv.org/pdf/2310.17127
Abstract A Network Intrusion Detection System (NIDS) is a tool that identifies potential threats to a network. Recently, different flow-based NIDS designs utilizing Machine Learning (ML) algorithms have been proposed as solutions to detect intrusions efficiently. However, conventional ML-based classifiers have not seen widespread adoption in the real world due to their poor domain adaptation capability. In this research, our goal is to explore the possibility of using sequences of flows to improve the domain adaptation capability of network intrusion detection systems. Our proposal employs natural language processing techniques and Bidirectional Encoder Representations from Transformers framework, which is an effective technique for modeling data with respect to its context. Early empirical results show that our approach has improved domain adaptation capability compared to previous approaches. The proposed approach provides a new research method for building a robust intrusion detection system.
Max-min Rate Optimization of Low-Complexity Hybrid Multi-User Beamforming Maintaining Rate-Fairness
Authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, H. V. Poor, L. Hanzo
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2310.17155
Pdf link: https://arxiv.org/pdf/2310.17155
Abstract A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We develop a convexsolver based algorithm, which iteratively invokes a convex problem of the same beamformer size for its solution. We then introduce the soft max-min rate objective function and develop a scalable algorithm for its optimization. Our simulation results demonstrate the striking fact that soft max-min rate optimization not only approaches the minimum user rate obtained by max-min rate optimization but it also achieves a sum rate similar to that of sum-rate maximization. Thus, the soft max-min rate optimization based beamforming design conceived offers a new technique of simultaneously achieving a high individual quality-of-service for all users and a high total network throughput.
Content-based Controls For Music Large Language Modeling
Authors: Liwei Lin, Gus Xia, Junyan Jiang, Yixiao Zhang
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.17162
Pdf link: https://arxiv.org/pdf/2310.17162
Abstract Recent years have witnessed a rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the control power of text controls on music is intrinsically limited, as they can only describe music indirectly through meta-data (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip the models with direct and content-based controls on innate music languages such as pitch, chords and drum track. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieved high-quality music generation with low-resource semi-supervised learning, tuning with less than 4% parameters compared to the original model and training on a small dataset with fewer than 300 songs. Moreover, our approach enables effective content-based controls, and we illustrate the control power via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and style transfer. Our source codes and demos are available online.
MO-YOLO: End-to-End Multiple-Object Tracking Method with YOLO and MOTR
Authors: Liao Pan, Yang Feng, Wu Di, Liu Bo, Zhang Xingle
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17170
Pdf link: https://arxiv.org/pdf/2310.17170
Abstract This paper aims to address critical issues in the field of Multi-Object Tracking (MOT) by proposing an efficient and computationally resource-efficient end-to-end multi-object tracking model, named MO-YOLO. Traditional MOT methods typically involve two separate steps: object detection and object tracking, leading to computational complexity and error propagation issues. Recent research has demonstrated outstanding performance in end-to-end MOT models based on Transformer architectures, but they require substantial hardware support. MO-YOLO combines the strengths of YOLO and RT-DETR models to construct a high-efficiency, lightweight, and resource-efficient end-to-end multi-object tracking network, offering new opportunities in the multi-object tracking domain. On the MOT17 dataset, MOTR\cite{zeng2022motr} requires training with 8 GeForce 2080 Ti GPUs for 4 days to achieve satisfactory results, while MO-YOLO only requires 1 GeForce 2080 Ti GPU and 12 hours of training to achieve comparable performance.
A Deep Learning Approach to Teeth Segmentation and Orientation from Panoramic X-rays
Authors: Mrinal Kanti Dhar, Mou Deb, D. Madhab, Zeyun Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17176
Pdf link: https://arxiv.org/pdf/2310.17176
Abstract Accurate teeth segmentation and orientation are fundamental in modern oral healthcare, enabling precise diagnosis, treatment planning, and dental implant design. In this study, we present a comprehensive approach to teeth segmentation and orientation from panoramic X-ray images, leveraging deep learning techniques. We build our model based on FUSegNet, a popular model originally developed for wound segmentation, and introduce modifications by incorporating grid-based attention gates into the skip connections. We introduce oriented bounding box (OBB) generation through principal component analysis (PCA) for precise tooth orientation estimation. Evaluating our approach on the publicly available DNS dataset, comprising 543 panoramic X-ray images, we achieve the highest Intersection-over-Union (IoU) score of 82.43% and Dice Similarity Coefficient (DSC) score of 90.37% among compared models in teeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU) score of 82.82%. We also conduct detailed analyses of individual tooth labels and categorical performance, shedding light on strengths and weaknesses. The proposed model's accuracy and versatility offer promising prospects for improving dental diagnoses, treatment planning, and personalized healthcare in the oral domain. Our generated OBB coordinates and codes are available at https://github.com/mrinal054/Instance_teeth_segmentation.
Graphical Object-Centric Actor-Critic
Authors: Leonid Ugadiarov, Aleksandr I. Panov
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.17178
Pdf link: https://arxiv.org/pdf/2310.17178
Abstract There have recently been significant advances in the problem of unsupervised object-centric representation learning and its application to downstream tasks. The latest works support the argument that employing disentangled object representations in image-based object-centric reinforcement learning tasks facilitates policy learning. We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches to utilize these representations effectively. In our approach, we use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment. The proposed method fills a research gap in developing efficient object-centric world models for reinforcement learning settings that can be used for environments with discrete or continuous action spaces. Our algorithm performs better in a visually complex 3D robotic environment and a 2D environment with compositional structure than the state-of-the-art model-free actor-critic algorithm built upon transformer architecture and the state-of-the-art monolithic model-based algorithm.
TST$^\mathrm{R}$: Target Similarity Tuning Meets the Real World
Authors: Anirudh Khatry, Sumit Gulwani, Priyanshu Gupta, Vu Le, Ananya Singha, Mukul Singh, Gust Verbruggen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2310.17228
Pdf link: https://arxiv.org/pdf/2310.17228
Abstract Target similarity tuning (TST) is a method of selecting relevant examples in natural language (NL) to code generation through large language models (LLMs) to improve performance. Its goal is to adapt a sentence embedding model to have the similarity between two NL inputs match the similarity between their associated code outputs. In this paper, we propose different methods to apply and improve TST in the real world. First, we replace the sentence transformer with embeddings from a larger model, which reduces sensitivity to the language distribution and thus provides more flexibility in synthetic generation of examples, and we train a tiny model that transforms these embeddings to a space where embedding similarity matches code similarity, which allows the model to remain a black box and only requires a few matrix multiplications at inference time. Second, we how to efficiently select a smaller number of training examples to train the TST model. Third, we introduce a ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
fairret: a Framework for Differentiable Fairness Regularization Terms
Authors: Maarten Buyl, MaryBeth Defrance, Tijl De Bie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17256
Pdf link: https://arxiv.org/pdf/2310.17256
Abstract Current tools for machine learning fairness only admit a limited range of fairness definitions and have seen little integration with automatic differentiation libraries, despite the central role these libraries play in modern machine learning pipelines. We introduce a framework of fairness regularization terms (fairrets) which quantify bias as modular objectives that are easily integrated in automatic differentiation pipelines. By employing a general definition of fairness in terms of linear-fractional statistics, a wide class of fairrets can be computed efficiently. Experiments show the behavior of their gradients and their utility in enforcing fairness with minimal loss of predictive power compared to baselines. Our contribution includes a PyTorch implementation of the fairret framework.
BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds
Authors: Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17281
Pdf link: https://arxiv.org/pdf/2310.17281
Abstract We present a surprisingly simple and efficient method for self-supervision of 3D backbone on automotive Lidar point clouds. We design a contrastive loss between features of Lidar scans captured in the same scene. Several such approaches have been proposed in the literature from PointConstrast, which uses a contrast at the level of points, to the state-of-the-art TARL, which uses a contrast at the level of segments, roughly corresponding to objects. While the former enjoys a great simplicity of implementation, it is surpassed by the latter, which however requires a costly pre-processing. In BEVContrast, we define our contrast at the level of 2D cells in the Bird's Eye View plane. Resulting cell-level representations offer a good trade-off between the point-level representations exploited in PointContrast and segment-level representations exploited in TARL: we retain the simplicity of PointContrast (cell representations are cheap to compute) while surpassing the performance of TARL in downstream semantic segmentation.
RAN Functional Split Options for Integrated Terrestrial and Non-Terrestrial 6G Networks
Authors: Mohamed Rihan, Tim Due, MohammadAmin Vakilifard, Dirk Wubben, Armin Dekorsy
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17317
Pdf link: https://arxiv.org/pdf/2310.17317
Abstract Leveraging non-terrestrial platforms in 6G networks holds immense significance as it opens up opportunities to expand network coverage, enhance connectivity, and support a wide range of innovative applications, including global-scale Internet of Things and ultra-high-definition content delivery. To accomplish the seamless integration between terrestrial and non-terrestrial networks, substantial changes in radio access network (RAN) architecture are required. These changes involve the development of new RAN solutions that can efficiently manage the diverse characteristics of both terrestrial and non-terrestrial components, ensuring smooth handovers, resource allocation, and quality of service across the integrated network ecosystem. Additionally, the establishment of robust interconnection and communication protocols between terrestrial and non-terrestrial elements will be pivotal to utilize the full potential of 6G technology. Additionally, innovative approaches have been introduced to split the functionalities within the RAN into centralized and distributed domains. These novel paradigms are designed to enhance RAN's flexibility while simultaneously lowering the costs associated with infrastructure deployment, all while ensuring that the quality of service for end-users remains unaffected. In this work, we provide an extensive examination of various Non-Terrestrial Networks (NTN) architectures and the necessary adaptations required on the existing 5G RAN architecture to align with the distinct attributes of NTN. Of particular significance, we emphasize the crucial RAN functional split choices essential for the seamless integration of terrestrial and non-terrestrial components within advanced 6G networks.
Mode Selection for Component Mode Synthesis with Guaranteed Assembly Accuracy
Authors: Lars A.L. Janssen, Rob H.B. Fey, Bart Besselink, Nathan van de Wouw
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17320
Pdf link: https://arxiv.org/pdf/2310.17320
Abstract In this work, a modular approach is introduced to select the most important eigenmodes for each component of a composed structural dynamics system to obtain the required accuracy of the reduced-order assembly model. To enable the use of models of complex (structural) dynamical systems in engineering practice, e.g., in a design, optimization and/or control context, the complexity of the models needs to be reduced. When the model consist of an assembly of multiple interconnected structural components, component mode synthesis is often the preferred model reduction method. The standard approach to component mode synthesis for such system is to select the eigenmodes of a component that are most important to accurately model the dynamic behavior of this component in a certain frequency range of interest. However, often, a more relevant goal is to obtain, in this frequency range, an accurate model of the assembly. In the proposed approach, accuracy requirements on the level of the assembly are translated to accuracy requirements on component level, by employing techniques from the field of systems and control. With these component-level requirements, the eigenmodes that are most important to accurately model the dynamic behavior of the assembly can be selected in a modular fashion. We demonstrate with two structural dynamics benchmark systems that this method based on assembly accuracy allows for a computationally efficient selection of eigenmodes that 1) guarantees satisfaction of the assembly accuracy requirements and 2) results in most cases in reduced-order models of significantly lower order with respect to the industrial standard approach in which component eigenmodes are selected using a frequency criterion.
CQM: Curriculum Reinforcement Learning with a Quantized World Model
Authors: Seungjae Lee, Daesol Cho, Jonghae Park, H. Jin Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17330
Pdf link: https://arxiv.org/pdf/2310.17330
Abstract Recent curriculum Reinforcement Learning (RL) has shown notable progress in solving complex tasks by proposing sequences of surrogate tasks. However, the previous approaches often face challenges when they generate curriculum goals in a high-dimensional space. Thus, they usually rely on manually specified goal spaces. To alleviate this limitation and improve the scalability of the curriculum, we propose a novel curriculum method that automatically defines the semantic goal space which contains vital information for the curriculum process, and suggests curriculum goals over it. To define the semantic goal space, our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations by a graph. Concurrently, ours suggests uncertainty and temporal distance-aware curriculum goals that converges to the final goals over the automatically composed goal space. We demonstrate that the proposed method allows efficient explorations in an uninformed environment with raw goal examples only. Also, ours outperforms the state-of-the-art curriculum RL methods on data efficiency and performance, in various goal-reaching tasks even with ego-centric visual inputs.
Acceleration and restart for the randomized Bregman-Kaczmarz method
Authors: Lionel Tondji, Ion Necoara, Dirk A. Lorenz
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.17338
Pdf link: https://arxiv.org/pdf/2310.17338
Abstract Optimizing strongly convex functions subject to linear constraints is a fundamental problem with numerous applications. In this work, we propose a block (accelerated) randomized Bregman-Kaczmarz method that only uses a block of constraints in each iteration to tackle this problem. We consider a dual formulation of this problem in order to deal in an efficient way with the linear constraints. Using convex tools, we show that the corresponding dual function satisfies the Polyak-Lojasiewicz (PL) property, provided that the primal objective function is strongly convex and verifies additionally some other mild assumptions. However, adapting the existing theory on coordinate descent methods to our dual formulation can only give us sublinear convergence results in the dual space. In order to obtain convergence results in some criterion corresponding to the primal (original) problem, we transfer our algorithm to the primal space, which combined with the PL property allows us to get linear convergence rates. More specifically, we provide a theoretical analysis of the convergence of our proposed method under different assumptions on the objective and demonstrate in the numerical experiments its superior efficiency and speed up compared to existing methods for the same problem.
Exploring the Trie of Rules: a fast data structure for the representation of association rules
Authors: Mikhail Kudriavtsev, Dr Marija Bezbradica, Dr Andrew McCarren
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17355
Pdf link: https://arxiv.org/pdf/2310.17355
Abstract Association rule mining techniques can generate a large volume of sequential data when implemented on transactional databases. Extracting insights from a large set of association rules has been found to be a challenging process. When examining a ruleset, the fundamental question is how to summarise and represent meaningful mined knowledge efficiently. Many algorithms and strategies have been developed to address issue of knowledge extraction; however, the effectiveness of this process can be limited by the data structures. A better data structure can sufficiently affect the speed of the knowledge extraction process. This paper proposes a novel data structure, called the Trie of rules, for storing a ruleset that is generated by association rule mining. The resulting data structure is a prefix-tree graph structure made of pre-mined rules. This graph stores the rules as paths within the prefix-tree in a way that similar rules overlay each other. Each node in the tree represents a rule where a consequent is this node, and an antecedent is a path from this node to the root of the tree. The evaluation showed that the proposed representation technique is promising. It compresses a ruleset with almost no data loss and benefits in terms of time for basic operations such as searching for a specific rule and sorting, which is the base for many knowledge discovery methods. Moreover, our method demonstrated a significant improvement in traversing time, achieving an 8-fold increase compared to traditional data structures.
YOLO-BEV: Generating Bird's-Eye View in the Same Way as 2D Object Detection
Authors: Chang Liu, Liguo Zhou, Yanliang Huang, Alois Knoll
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17379
Pdf link: https://arxiv.org/pdf/2310.17379
Abstract Vehicle perception systems strive to achieve comprehensive and rapid visual interpretation of their surroundings for improved safety and navigation. We introduce YOLO-BEV, an efficient framework that harnesses a unique surrounding cameras setup to generate a 2D bird's-eye view of the vehicular environment. By strategically positioning eight cameras, each at a 45-degree interval, our system captures and integrates imagery into a coherent 3x3 grid format, leaving the center blank, providing an enriched spatial representation that facilitates efficient processing. In our approach, we employ YOLO's detection mechanism, favoring its inherent advantages of swift response and compact model structure. Instead of leveraging the conventional YOLO detection head, we augment it with a custom-designed detection head, translating the panoramically captured data into a unified bird's-eye view map of ego car. Preliminary results validate the feasibility of YOLO-BEV in real-time vehicular perception tasks. With its streamlined architecture and potential for rapid deployment due to minimized parameters, YOLO-BEV poses as a promising tool that may reshape future perspectives in autonomous driving systems.
Invariance Measures for Neural Networks
Authors: Facundo Manuel Quiroga, Jordina Torrents-Barrena, Laura Cristina Lanzarini, Domenec Puig-Valls
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2310.17404
Pdf link: https://arxiv.org/pdf/2310.17404
Abstract Invariances in neural networks are useful and necessary for many tasks. However, the representation of the invariance of most neural network models has not been characterized. We propose measures to quantify the invariance of neural networks in terms of their internal representation. The measures are efficient and interpretable, and can be applied to any neural network model. They are also more sensitive to invariance than previously defined measures. We validate the measures and their properties in the domain of affine transformations and the CIFAR10 and MNIST datasets, including their stability and interpretability. Using the measures, we perform a first analysis of CNN models and show that their internal invariance is remarkably stable to random weight initializations, but not to changes in dataset or transformation. We believe the measures will enable new avenues of research in invariance representation.
Synthesizing Efficiently Monitorable Formulas in Metric Temporal Logic
Authors: Ritam Raha, Rajarshi Roy, Nathanael Fijalkow, Daniel Neider, Guillermo A. Perez
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2310.17410
Pdf link: https://arxiv.org/pdf/2310.17410
Abstract In runtime verification, manually formalizing a specification for monitoring system executions is a tedious and error-prone process. To address this issue, we consider the problem of automatically synthesizing formal specifications from system executions. To demonstrate our approach, we consider the popular specification language Metric Temporal Logic (MTL), which is particularly tailored towards specifying temporal properties for cyber-physical systems (CPS). Most of the classical approaches for synthesizing temporal logic formulas aim at minimizing the size of the formula. However, for efficiency in monitoring, along with the size, the amount of "lookahead" required for the specification becomes relevant, especially for safety-critical applications. We formalize this notion and devise a learning algorithm that synthesizes concise formulas having bounded lookahead. To do so, our algorithm reduces the synthesis task to a series of satisfiability problems in Linear Real Arithmetic (LRA) and generates MTL formulas from their satisfying assignments. The reduction uses a novel encoding of a popular MTL monitoring procedure using LRA. Finally, we implement our algorithm in a tool called TEAL and demonstrate its ability to synthesize efficiently monitorable MTL formulas in a CPS application.
LEI2JSON: Schema-based Validation and Conversion of Livestock Event Information
Authors: Mahir Habib, Muhammad Ashad Kabir, Lihong Zheng
Subjects: Systems and Control (eess.SY); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2310.17414
Pdf link: https://arxiv.org/pdf/2310.17414
Abstract Livestock producers often need help in standardising (i.e., converting and validating) their livestock event data. This article introduces a novel solution, LEI2JSON (Livestock Event Information To JSON). The tool is an add-on for Google Sheets, adhering to the livestock event information (LEI) schema. The core objective of LEI2JSON is to provide livestock producers with an efficient mechanism to standardise their data, leading to substantial savings in time and resources. This is achieved by building the spreadsheet template with the appropriate column headers, notes, and validation rules, converting the spreadsheet data into JSON format, and validating the output against the schema. LEI2JSON facilitates the seamless storage of livestock event information locally or on Google Drive in JSON. Additionally, we have conducted an extensive experimental evaluation to assess the effectiveness of the tool.
Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs
Authors: Kaushik Dey, Satheesh K. Perepu, Abir Das
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2310.17416
Pdf link: https://arxiv.org/pdf/2310.17416
Abstract Intent-based management will play a critical role in achieving customers' expectations in the next-generation mobile networks. Traditional methods cannot perform efficient resource management since they tend to handle each expectation independently. Existing approaches, e.g., based on multi-agent reinforcement learning (MARL) allocate resources in an efficient fashion when there are conflicting expectations on the network slice. However, in reality, systems are often far more complex to be addressed by a standalone MARL formulation. Often there exists a hierarchical structure of intent fulfilment where multiple pre-trained, self-interested agents may need to be further orchestrated by a supervisor or controller agent. Such agents may arrive in the system adhoc, which then needs to be orchestrated along with other available agents. Retraining the whole system every time is often infeasible given the associated time and cost. Given the challenges, such adhoc coordination of pre-trained systems could be achieved through an intelligent supervisor agent which incentivizes pre-trained RL/MARL agents through sets of dynamic contracts (goals or bonuses) and encourages them to act as a cohesive unit towards fulfilling a global expectation. Some approaches use a rule-based supervisor agent and deploy the hierarchical constituent agents sequentially, based on human-coded rules. In the current work, we propose a framework whereby pre-trained agents can be orchestrated in parallel leveraging an AI-based supervisor agent. For this, we propose to use Adhoc-Teaming approaches which assign optimal goals to the MARL agents and incentivize them to exhibit certain desired behaviours. Results on the network emulator show that the proposed approach results in faster and improved fulfilment of expectations when compared to rule-based approaches and even generalizes to changes in environments.
AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors
Authors: You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17419
Pdf link: https://arxiv.org/pdf/2310.17419
Abstract Deep generative models can create remarkably photorealistic fake images while raising concerns about misinformation and copyright infringement, known as deepfake threats. Deepfake detection technique is developed to distinguish between real and fake images, where the existing methods typically learn classifiers in the image domain or various feature domains. However, the generalizability of deepfake detection against emerging and more advanced generative models remains challenging. In this paper, being inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach using VLMs (e.g. InstructBLIP) and prompt tuning techniques to improve the deepfake detection accuracy over unseen data. We formulate deepfake detection as a visual question answering problem, and tune soft prompts for InstructBLIP to answer the real/fake information of a query image. We conduct full-spectrum experiments on datasets from 3 held-in and 13 held-out generative models, covering modern text-to-image generation, image editing and image attacks. Results demonstrate that (1) the deepfake detection accuracy can be significantly and consistently improved (from 58.8% to 91.31%, in average accuracy over unseen data) using pretrained vision-language models with prompt tuning; (2) our superior performance is at less cost of trainable parameters, resulting in an effective and efficient solution for deepfake detection. Code and models can be found at https://github.com/nctu-eva-lab/AntifakePrompt.
''Fifty Shades of Bias'': Normative Ratings of Gender Bias in GPT Generated English Text
Authors: Rishav Hada, Agrima Seth, Harshita Diddee, Kalika Bali
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.17428
Pdf link: https://arxiv.org/pdf/2310.17428
Abstract Language serves as a powerful tool for the manifestation of societal belief systems. In doing so, it also perpetuates the prevalent biases in our society. Gender bias is one of the most pervasive biases in our society and is seen in online and offline discourses. With LLMs increasingly gaining human-like fluency in text generation, gaining a nuanced understanding of the biases these systems can generate is imperative. Prior work often treats gender bias as a binary classification task. However, acknowledging that bias must be perceived at a relative scale; we investigate the generation and consequent receptivity of manual annotators to bias of varying degrees. Specifically, we create the first dataset of GPT-generated English text with normative ratings of gender bias. Ratings were obtained using Best--Worst Scaling -- an efficient comparative annotation framework. Next, we systematically analyze the variation of themes of gender biases in the observed ranking and show that identity-attack is most closely related to gender bias. Finally, we show the performance of existing automated models trained on related concepts on our dataset.
Efficient safe learning for controller tuning with experimental validation
Authors: Marta Zagorowska, Christopher König, Hanlin Yu, Efe C. Balta, Alisa Rupenyan, John Lygeros
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17431
Pdf link: https://arxiv.org/pdf/2310.17431
Abstract Optimization-based controller tuning is challenging because it requires formulating optimization problems explicitly as functions of controller parameters. Safe learning algorithms overcome the challenge by creating surrogate models from measured data. To ensure safety, such data-driven algorithms often rely on exhaustive grid search, which is computationally inefficient. In this paper, we propose a novel approach to safe learning by formulating a series of optimization problems instead of a grid search. We also develop a method for initializing the optimization problems to guarantee feasibility while using numerical solvers. The performance of the new method is first validated in a simulated precision motion system, demonstrating improved computational efficiency, and illustrating the role of exploiting numerical solvers to reach the desired precision. Experimental validation on an industrial-grade precision motion system confirms that the proposed algorithm achieves 30% better tracking at sub-micrometer precision as a state-of-the-art safe learning algorithm, improves the default auto-tuning solution, and reduces the computational cost seven times compared to learning algorithms based on exhaustive search.
Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings
Authors: Yifei Peng, Yu Jin, Zhexu Luo, Yao-Xiang Ding, Wang-Zhou Dai, Zhong Ren, Kun Zhou
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2310.17451
Pdf link: https://arxiv.org/pdf/2310.17451
Abstract Despite the great success of neural visual generative models in recent years, integrating them with strong symbolic knowledge reasoning systems remains a challenging task. The main challenges are two-fold: one is symbol assignment, i.e. bonding latent factors of neural visual generators with meaningful symbols from knowledge reasoning systems. Another is rule learning, i.e. learning new rules, which govern the generative process of the data, to augment the knowledge reasoning systems. To deal with these symbol grounding problems, we propose a neural-symbolic learning approach, Abductive Visual Generation (AbdGen), for integrating logic programming systems with neural visual generative models based on the abductive learning framework. To achieve reliable and efficient symbol assignment, the quantized abduction method is introduced for generating abduction proposals by the nearest-neighbor lookups within semantic codebooks. To achieve precise rule learning, the contrastive meta-abduction method is proposed to eliminate wrong rules with positive cases and avoid less-informative rules with negative cases simultaneously. Experimental results on various benchmark datasets show that compared to the baselines, AbdGen requires significantly fewer instance-level labeling information for symbol assignment. Furthermore, our approach can effectively learn underlying logical generative rules from data, which is out of the capability of existing approaches.
Cross-modal Active Complementary Learning with Self-refining Correspondence
Authors: Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17468
Pdf link: https://arxiv.org/pdf/2310.17468
Abstract Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.
Analytical model for large-scale design of sidewalk delivery robot systems
Authors: Hai Yang, Yuchen Du, Tho V. Le, Joseph Y. J. Chow
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2310.17475
Pdf link: https://arxiv.org/pdf/2310.17475
Abstract With the rise in demand for local deliveries and e-commerce, robotic deliveries are being considered as efficient and sustainable solutions. However, the deployment of such systems can be highly complex due to numerous factors involving stochastic demand, stochastic charging and maintenance needs, complex routing, etc. We propose a model that uses continuous approximation methods for evaluating service trade-offs that consider the unique characteristics of large-scale sidewalk delivery robot systems used to serve online food deliveries. The model captures both the initial cost and the operation cost of the delivery system and evaluates the impact of constraints and operation strategies on the deployment. By minimizing the system cost, variables related to the system design can be determined. First, the minimization problem is formulated based on a homogeneous area, and the optimal system cost can be derived as a closed-form expression. By evaluating the expression, relationships between variables and the system cost can be directly obtained. We then apply the model in neighborhoods in New York City to evaluate the cost of deploying the sidewalk delivery robot system in a real-world scenario. The results shed light on the potential of deploying such a system in the future.
FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing
Authors: Terence Jie Chua, Wenhan Yu, Jun Zhao, Kwok-Yan Lam
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2310.17491
Pdf link: https://arxiv.org/pdf/2310.17491
Abstract The emergence of foundation models, including language and vision models, has reshaped AI's landscape, offering capabilities across various applications. Deploying and fine-tuning these large models, like GPT-3 and BERT, presents challenges, especially in the current foundation model era. We introduce Emulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning (PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we expand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses adapters, emulators, and PEFT for federated model tuning, enhancing model privacy and memory efficiency. Adapters adjust pre-trained models, while emulators give a compact representation of original models, addressing both privacy and efficiency. Adaptable to various neural networks, our approach also uses deep reinforcement learning for hyper-parameter optimization. We tested FedPEAT in a unique scenario with a server participating in collaborative federated tuning, showcasing its potential in tackling foundation model challenges.
Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach
Authors: Wenhan Yu, Terence Jie Chua, Jun Zhao
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2310.17492
Pdf link: https://arxiv.org/pdf/2310.17492
Abstract The efficient deployment and fine-tuning of foundation models are pivotal in contemporary artificial intelligence. In this study, we present a groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation models, specifically designed to enhance local task performance on user equipment (UE). Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules. This design not only conserves computational resources but also ensures adaptability and fine-tuning efficiency for downstream tasks. Additionally, we introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings. To address the challenges presented by this system, we employ a hybrid multi-agent Deep Reinforcement Learning (DRL) strategy, adept at handling mixed discrete-continuous action spaces, ensuring dynamic and optimal resource allocations. Our comprehensive simulations and validations underscore the practical viability of our approach, demonstrating its robustness, efficiency, and scalability. Collectively, this work offers a fresh perspective on deploying foundation models and balancing computational efficiency with task proficiency.
The IMS Toucan System for the Blizzard Challenge 2023
Authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2310.17499
Pdf link: https://arxiv.org/pdf/2310.17499
Abstract For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.
A Lightweight, Compiler-Assisted Register File Cache for GPGPU
Authors: Mojtaba Abaie Shoushtary, Jose Maria Arnau, Jordi Tubella Murgadas, Antonio Gonzalez
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2310.17501
Pdf link: https://arxiv.org/pdf/2310.17501
Abstract Modern GPUs require an enormous register file (RF) to store the context of thousands of active threads. It consumes considerable energy and contains multiple large banks to provide enough throughput. Thus, a RF caching mechanism can significantly improve the performance and energy consumption of the GPUs by avoiding reads from the large banks that consume significant energy and may cause port conflicts. This paper introduces an energy-efficient RF caching mechanism called Malekeh that repurposes an existing component in GPUs' RF to operate as a cache in addition to its original functionality. In this way, Malekeh minimizes the overhead of adding a RF cache to GPUs. Besides, Malekeh leverages an issue scheduling policy that utilizes the reuse distance of the values in the RF cache and is controlled by a dynamic algorithm. The goal is to adapt the issue policy to the runtime program characteristics to maximize the GPU's performance and the hit ratio of the RF cache. The reuse distance is approximated by the compiler using profiling and is used at run time by the proposed caching scheme. We show that Malekeh reduces the number of reads to the RF banks by 46.4% and the dynamic energy of the RF by 28.3%. Besides, it improves performance by 6.1% while adding only 2KB of extra storage per core to the baseline RF of 256KB, which represents a negligible overhead of 0.78%.
Revisiting the Distillation of Image Representations into Point Clouds for Autonomous Driving
Authors: Gilles Puy, Spyros Gidaris, Alexandre Boulch, Oriane Siméoni, Corentin Sautier, Patrick Pérez, Andrei Bursuc, Renaud Marlet
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17504
Pdf link: https://arxiv.org/pdf/2310.17504
Abstract Self-supervised image networks can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. However, self-supervised 3D networks on lidar data do not perform as well for now. A few methods therefore propose to distill high-quality self-supervised 2D features into 3D networks. The most recent ones doing so on autonomous driving data show promising results. Yet, a performance gap persists between these distilled features and fully-supervised ones. In this work, we revisit 2D-to-3D distillation. First, we propose, for semantic segmentation, a simple approach that leads to a significant improvement compared to prior 3D distillation methods. Second, we show that distillation in high capacity 3D networks is key to reach high quality 3D features. This actually allows us to significantly close the gap between unsupervised distilled 3D features and fully-supervised ones. Last, we show that our high-quality distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.
The Expressive Power of Low-Rank Adaptation
Authors: Yuchen Zeng, Kangwook Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.17513
Pdf link: https://arxiv.org/pdf/2310.17513
Abstract Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.
FLARE: Fast Learning of Animatable and Relightable Mesh Avatars
Authors: Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, Victoria Fernandez-Abrevaya
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17519
Pdf link: https://arxiv.org/pdf/2310.17519
Abstract Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce \moniker, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.
Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction
Authors: Feng Wang, Zilong Chen, Guokang Wang, Yafei Song, Huaping Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17527
Pdf link: https://arxiv.org/pdf/2310.17527
Abstract In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas that result in redundancy in storage and computations, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask which is guided by an uncertainty-based objective to reflect the spatial and temporal importance of each 3D position. With this design, our method can reduce the hash collision rate by avoiding redundant queries and modifications on static areas, making it feasible to represent a large number of space-time voxels by hash tables with small size.Besides, without the requirements to fit the large numbers of temporally redundant features independently, our method is easier to optimize and converge rapidly with only twenty minutes of training for a 300-frame dynamic scene.As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training time and 130 MB of memory storage. Code is available at https://github.com/masked-spacetime-hashing/msth
Learning Regularized Graphon Mean-Field Games with Unknown Graphons
Authors: Fengzhuo Zhang, Vincent Y. F. Tan, Zhaoran Wang, Zhuoran Yang
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.17531
Pdf link: https://arxiv.org/pdf/2310.17531
Abstract We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games (GMFGs). In contrast to previous works that require the precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons are unknown. Our contributions are threefold. First, we propose the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, improving on a previous work by Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient algorithms to estimate the transition kernels, reward functions, and graphons from sampled agents. Convergence rates are then derived when the positions of the agents are either known or unknown. Results for the combination of the optimization algorithm GMFG-PPO and the estimation algorithm are then provided. These algorithms are the first specifically designed for learning graphons from sampled agents. Finally, the efficacy of the proposed algorithms are corroborated through simulations. These simulations demonstrate that learning the unknown graphons reduces the exploitability effectively.
Test Bench Study on Attitude Estimation in Ground Effect Region Based on Motor Current for In-Flight Inductive Power Transfer of Drones
Authors: Kota Fujimoto, Sakahisa Nagai, Nguyen Binh Minh, Hiroshi Fujimoto
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17541
Pdf link: https://arxiv.org/pdf/2310.17541
Abstract To overcome the short flight duration of drones, research on in-flight inductive power transfer has been recognized as an essential solution. Thus, it is important to accurately estimate and control the attitude of the drones which operate close to the charging surface. To this end, this paper proposes an attitude estimation method based solely on the motor current for precision flight control in the ground effect region. The model for the estimation is derived based on the motor equation when it rotates at a constant rotational speed. The proposed method is verified on the simulations and experiments. It allows simultaneous estimation of altitude and pitch angle with the accuracy of 0.30$\hspace{0.5mm}$m and 0.04 rad, respectively. The minimum transmission efficiency of the in-flight power transfer system based on the proposed estimation is calculated as 95.3 %, which is sufficient for the efficient system.
Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent
Authors: Yixiao Chen, Hao Xie, Han Wang
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2310.17556
Pdf link: https://arxiv.org/pdf/2310.17556
Abstract We propose a new algorithm for efficiently solving the damped Fisher matrix in large-scale scenarios where the number of parameters significantly exceeds the number of available samples. This problem is fundamental for natural gradient descent and stochastic reconfiguration. Our algorithm is based on Cholesky decomposition and is generally applicable. Benchmark results show that the algorithm is significantly faster than existing methods.
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation
Authors: Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.17570
Pdf link: https://arxiv.org/pdf/2310.17570
Abstract While Diffusion Generative Models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation especially translation tasks remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, it is not optimal to brutally apply discrete diffusion on the speech unit sequence while disregarding the continuous space structure, which will degrade the generation performance significantly. In this paper, we propose a novel diffusion model by applying the diffusion forward process in the \textit{continuous} speech representation space, while employing the diffusion backward process in the \textit{discrete} speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models. We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves comparable results to the computationally intensive auto-regressive baselines (500 steps on average) with significantly fewer decoding steps (50 steps).
An efficient frequency-independent numerical method for computing the far-field pattern induced by polygonal obstacles
Authors: A. Gibbs, S. Langdon
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.17603
Pdf link: https://arxiv.org/pdf/2310.17603
Abstract For problems of time-harmonic scattering by rational polygonal obstacles, embedding formulae express the far-field pattern induced by any incident plane wave in terms of the far-field patterns for a relatively small (frequency-independent) set of canonical incident angles. Although these remarkable formulae are exact in theory, here we demonstrate that: (i) they are highly sensitive to numerical errors in practice, and; (ii) direct calculation of the coefficients in these formulae may be impossible for particular sets of canonical incident angles, even in exact arithmetic. Only by overcoming these practical issues can embedding formulae provide a highly efficient approach to computing the far-field pattern induced by a large number of incident angles. Here we propose solutions for problems (i) and (ii), backed up by theory and numerical experiments. Problem (i) is solved using techniques from computational complex analysis: we reformulate the embedding formula as a complex contour integral and prove that this is much less sensitive to numerical errors. In practice, this contour integral can be efficiently evaluated by residue calculus. Problem (ii) is addressed using techniques from numerical linear algebra: we oversample, considering more canonical incident angles than are necessary, thus expanding the space of valid coefficients vectors. The coefficients vectors can then be selected using either a least squares approach or column subset selection.
Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana
Authors: Owen Henkel, Hannah Horne-Robinson, Libby Hills, Bill Roberts, Joshua McGrane
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17606
Pdf link: https://arxiv.org/pdf/2310.17606
Abstract This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2 wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17631
Pdf link: https://arxiv.org/pdf/2310.17631
Abstract Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat.
Grow Your Limits: Continuous Improvement with Real-World RL for Robotic Locomotion
Authors: Laura Smith, Yunhao Cao, Sergey Levine
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17634
Pdf link: https://arxiv.org/pdf/2310.17634
Abstract Deep reinforcement learning (RL) can enable robots to autonomously acquire complex behaviors, such as legged locomotion. However, RL in the real world is complicated by constraints on efficiency, safety, and overall training stability, which limits its practical applicability. We present APRL, a policy regularization framework that modulates the robot's exploration over the course of training, striking a balance between flexible improvement potential and focused, efficient exploration. APRL enables a quadrupedal robot to efficiently learn to walk entirely in the real world within minutes and continue to improve with more training where prior work saturates in performance. We demonstrate that continued training with APRL results in a policy that is substantially more capable of navigating challenging situations and is able to adapt to changes in dynamics with continued training.
High-Dimensional Prediction for Sequential Decision Making
Authors: Georgy Noarov, Ramya Ramalingam, Aaron Roth, Stephan Xie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17651
Pdf link: https://arxiv.org/pdf/2310.17651
Abstract We study the problem of making predictions of an adversarially chosen high-dimensional state that are unbiased subject to an arbitrary collection of conditioning events, with the goal of tailoring these events to downstream decision makers. We give efficient algorithms for solving this problem, as well as a number of applications that stem from choosing an appropriate set of conditioning events.
Keyword: faster

Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs
Authors: Chao Qian, Tianheng Ling, Gregor Schiele
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.16842
Pdf link: https://arxiv.org/pdf/2310.16842
Abstract To process sensor data in the Internet of Things(IoTs), embedded deep learning for 1-dimensional data is an important technique. In the past, CNNs were frequently used because they are simple to optimise for special embedded hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices. Using the traffic speed prediction as a case study, a vanilla LSTM model with the optimised LSTM cell achieves 17534 inferences per second while consuming only 3.8 $\mu$J per inference on the FPGA \textit{XC7S15} from \textit{Spartan-7} family. It achieves at least 5.4$\times$ faster throughput and 1.37$\times$ more energy efficient than existing approaches.
An Explainable Deep Learning-Based Method For Schizophrenia Diagnosis Using Generative Data-Augmentation
Authors: Mehrshad Saadatinia, Armin Salimi-Badr
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2310.16867
Pdf link: https://arxiv.org/pdf/2310.16867
Abstract In this study, we leverage a deep learning-based method for the automatic diagnosis of schizophrenia using EEG brain recordings. This approach utilizes generative data augmentation, a powerful technique that enhances the accuracy of the diagnosis. To enable the utilization of time-frequency features, spectrograms were extracted from the raw signals. After exploring several neural network architectural setups, a proper convolutional neural network (CNN) was used for the initial diagnosis. Subsequently, using Wasserstein GAN with Gradient Penalty (WGAN-GP) and Variational Autoencoder (VAE), two different synthetic datasets were generated in order to augment the initial dataset and address the over-fitting issue. The augmented dataset using VAE achieved a 3.0\% improvement in accuracy reaching up to 99.0\% and yielded a lower loss value as well as a faster convergence. Finally, we addressed the lack of trust in black-box models using the Local Interpretable Model-agnostic Explanations (LIME) algorithm to determine the most important superpixels (frequencies) in the diagnosis process.
Faster Recalibration of an Online Predictor via Approachability
Authors: Princewill Okoroafor, Robert Kleinberg, Wen Sun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17002
Pdf link: https://arxiv.org/pdf/2310.17002
Abstract Predictive models in ML need to be trustworthy and reliable, which often at the very least means outputting calibrated probabilities. This can be particularly difficult to guarantee in the online prediction setting when the outcome sequence can be generated adversarially. In this paper we introduce a technique using Blackwell's approachability theorem for taking an online predictive model which might not be calibrated and transforming its predictions to calibrated predictions without much increase to the loss of the original model. Our proposed algorithm achieves calibration and accuracy at a faster rate than existing techniques arXiv:1607.03594 and is the first algorithm to offer a flexible tradeoff between calibration error and accuracy in the online setting. We demonstrate this by characterizing the space of jointly achievable calibration and regret using our technique.
HyperFields: Towards Zero-Shot Generation of NeRFs from Text
Authors: Sudarshan Babu, Richard Liu, Avery Zhou, Michael Maire, Greg Shakhnarovich, Rana Hanocka
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17075
Pdf link: https://arxiv.org/pdf/2310.17075
Abstract We introduce HyperFields, a method for generating text-conditioned Neural Radiance Fields (NeRFs) with a single forward pass and (optionally) some fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF distillation training, which distills scenes encoded in individual NeRFs into one dynamic hypernetwork. These techniques enable a single network to fit over a hundred unique scenes. We further demonstrate that HyperFields learns a more general map between text and NeRFs, and consequently is capable of predicting novel in-distribution and out-of-distribution scenes -- either zero-shot or with a few finetuning steps. Finetuning HyperFields benefits from accelerated convergence thanks to the learned general map, and is capable of synthesizing novel scenes 5 to 10 times faster than existing neural optimization-based methods. Our ablation experiments show that both the dynamic architecture and NeRF distillation are critical to the expressivity of HyperFields.
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
Authors: Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.17086
Pdf link: https://arxiv.org/pdf/2310.17086
Abstract Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformers layer; this suggests that Transformers have an comparable rate of convergence with high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17157
Pdf link: https://arxiv.org/pdf/2310.17157
Abstract Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, which are small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise
Authors: Zhenkai Zhang, Krista A. Ehinger, Tom Drummond
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17167
Pdf link: https://arxiv.org/pdf/2310.17167
Abstract This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution involves reparameterizing the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\displaystyle \sqrt{\bar{\alpha}}=\cos(\eta)$. This reparameterization eliminates two singularities and allows for the expression of diffusion evolution as a well-behaved ordinary differential equation (ODE). In turn, this allows higher order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which enables more stable calculations of the update step in the inverse diffusion steps, as accurate estimation of both the image and noise are crucial at different stages of the process. Together with these changes, our model achieves faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning
Authors: Fengyuan Shi, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17177
Pdf link: https://arxiv.org/pdf/2310.17177
Abstract Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works present dynamic vision transformers to accelerate inference by pruning redundant tokens. A key to improving token pruning is using well-trained models as initialization for faster convergence and better performance. However, current base models usually adopt full image training, i.e., using full images as inputs and keeping the whole feature maps through the forward process, which causes inconsistencies with dynamic models that gradually reduce tokens, including calculation pattern, information amount and token selection strategy inconsistencies. Inspired by MAE which performs masking and reconstruction self-supervised task, we devise masked fine-tuning to bridge the gaps between pre-trained base models used for initialization and token pruning based dynamic vision transformers, by masking image patches and predicting the image class label based on left unmasked patches. Extensive experiments on ImageNet demonstrate that base models via masked fine-tuning gain strong occlusion robustness and ability against information loss. With this better initialization, Dynamic ViT achieves higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs. 81.3%, and 62.3% vs. 58.9% for DeiT based Dynamic ViT/0.8 and Dynamic ViT/0.3). Moreover, we apply our method into different token pruning based dynamic vision transformers, different pre-trained models and randomly initialized models to demonstrate the generalization ability.
CuRobo: Parallelized Collision-Free Minimum-Jerk Robot Motion Generation
Authors: Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, Dieter Fox
Subjects: Robotics (cs.RO); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2310.17274
Pdf link: https://arxiv.org/pdf/2310.17274
Abstract This paper explores the problem of collision-free motion generation for manipulators by formulating it as a global motion optimization problem. We develop a parallel optimization technique to solve this problem and demonstrate its effectiveness on massively parallel GPUs. We show that combining simple optimization techniques with many parallel seeds leads to solving difficult motion generation problems within 50ms on average, 60x faster than state-of-the-art (SOTA) trajectory optimization methods. We achieve SOTA performance by combining L-BFGS step direction estimation with a novel parallel noisy line search scheme and a particle-based optimization solver. To further aid trajectory optimization, we develop a parallel geometric planner that plans within 20ms and also introduce a collision-free IK solver that can solve over 7000 queries/s. We package our contributions into a state of the art GPU accelerated motion generation library, CuRobo and release it to enrich the robotics community. Additional details are available at https://curobo.org
Boolean Abstractions for Realizability Modulo Theories (Extended version)
Authors: Andoni Rodriguez, Cesar Sanchez
Subjects: Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2310.17292
Pdf link: https://arxiv.org/pdf/2310.17292
Abstract In this paper, we address the problem of the (reactive) realizability of specifications of theories richer than Booleans, including arithmetic theories. Our approach transforms theory specifications into purely Boolean specifications by (1) substituting theory literals by Boolean variables, and (2) computing an additional Boolean requirement that captures the dependencies between the new variables imposed by the literals. The resulting specification can be passed to existing Boolean off-the-shelf realizability tools, and is realizable if and only if the original specification is realizable. The first contribution is a brute-force version of our method, which requires a number of SMT queries that is doubly exponential in the number of input literals. Then, we present a faster method that exploits a nested encoding of the search for the extra requirement and uses SAT solving for faster traversing the search space and uses SMT queries internally. Another contribution is a prototype in Z3-Python. Finally, we report an empirical evaluation using specifications inspired in real industrial cases. To the best of our knowledge, this is the first method that succeeds in non-Boolean LTL realizability.
Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs
Authors: Kaushik Dey, Satheesh K. Perepu, Abir Das
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2310.17416
Pdf link: https://arxiv.org/pdf/2310.17416
Abstract Intent-based management will play a critical role in achieving customers' expectations in the next-generation mobile networks. Traditional methods cannot perform efficient resource management since they tend to handle each expectation independently. Existing approaches, e.g., based on multi-agent reinforcement learning (MARL) allocate resources in an efficient fashion when there are conflicting expectations on the network slice. However, in reality, systems are often far more complex to be addressed by a standalone MARL formulation. Often there exists a hierarchical structure of intent fulfilment where multiple pre-trained, self-interested agents may need to be further orchestrated by a supervisor or controller agent. Such agents may arrive in the system adhoc, which then needs to be orchestrated along with other available agents. Retraining the whole system every time is often infeasible given the associated time and cost. Given the challenges, such adhoc coordination of pre-trained systems could be achieved through an intelligent supervisor agent which incentivizes pre-trained RL/MARL agents through sets of dynamic contracts (goals or bonuses) and encourages them to act as a cohesive unit towards fulfilling a global expectation. Some approaches use a rule-based supervisor agent and deploy the hierarchical constituent agents sequentially, based on human-coded rules. In the current work, we propose a framework whereby pre-trained agents can be orchestrated in parallel leveraging an AI-based supervisor agent. For this, we propose to use Adhoc-Teaming approaches which assign optimal goals to the MARL agents and incentivize them to exhibit certain desired behaviours. Results on the network emulator show that the proposed approach results in faster and improved fulfilment of expectations when compared to rule-based approaches and even generalizes to changes in environments.
Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent
Authors: Yixiao Chen, Hao Xie, Han Wang
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2310.17556
Pdf link: https://arxiv.org/pdf/2310.17556
Abstract We propose a new algorithm for efficiently solving the damped Fisher matrix in large-scale scenarios where the number of parameters significantly exceeds the number of available samples. This problem is fundamental for natural gradient descent and stochastic reconfiguration. Our algorithm is based on Cholesky decomposition and is generally applicable. Benchmark results show that the algorithm is significantly faster than existing methods.
Keyword: mobile

How does Module Tracking for Agrivoltaics Differ from Standard Photovoltaics? Performance & Technoeconomic Implications
Authors: Habeel Alam, Nauman Zafar Butt
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2310.16946
Pdf link: https://arxiv.org/pdf/2310.16946
Abstract Spatial-temporal sharing of sunlight between solar modules and crops needs to be designed optimally in agrivoltaics (AV). For AV with fixed module tilts, the sunlight balance is governed through the spatial density and elevation of the modules which cannot be manipulated after the installation. For flexible food-energy balancing across various seasons and crop rotations, modules with single or dual axis mobility can be best suitable. AV tracking must be geared towards ensuring a desired sunlight balance that may depend on many factors including the crop type, module array density, socio-economic factors, and local policies. Here, we explore single axis customized tracking (CT) for the mobile AV using a techno-economic model that incorporates design parameters including crop's shade sensitivity, module to land area ratio, and module types, as well as the economic parameters including soft and hardware costs for modules, feed-in-tariff, and crop income. CT is implemented through standard tracking that tracks the sun around noon hours and its orthogonal, i.e., anti-tracking around sunrise and sunset. We evaluate the optimal CT schemes that can maximize economic performance while ensuring the desired food-energy yield thresholds. Economic feasibility for AV is evaluated in terms of the ratio (ppr) of the price for the module system customizations to the performance benefit due to the crop income. A case study for Punjab, Pakistan shows that CT schemes for moderate shade sensitive crops and typically dense AV module arrays can require 30 to 40 percent increase in the reference FIT to ensure the food-energy yield threshold of 80 percent relative to standalone food-energy farms for high and low value crops, respectively. CT schemes for a lower crop yield threshold of 70 percent require the corresponding increase in FIT to 10 to 20 percent, respectively.
Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs
Authors: Kaushik Dey, Satheesh K. Perepu, Abir Das
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2310.17416
Pdf link: https://arxiv.org/pdf/2310.17416
Abstract Intent-based management will play a critical role in achieving customers' expectations in the next-generation mobile networks. Traditional methods cannot perform efficient resource management since they tend to handle each expectation independently. Existing approaches, e.g., based on multi-agent reinforcement learning (MARL) allocate resources in an efficient fashion when there are conflicting expectations on the network slice. However, in reality, systems are often far more complex to be addressed by a standalone MARL formulation. Often there exists a hierarchical structure of intent fulfilment where multiple pre-trained, self-interested agents may need to be further orchestrated by a supervisor or controller agent. Such agents may arrive in the system adhoc, which then needs to be orchestrated along with other available agents. Retraining the whole system every time is often infeasible given the associated time and cost. Given the challenges, such adhoc coordination of pre-trained systems could be achieved through an intelligent supervisor agent which incentivizes pre-trained RL/MARL agents through sets of dynamic contracts (goals or bonuses) and encourages them to act as a cohesive unit towards fulfilling a global expectation. Some approaches use a rule-based supervisor agent and deploy the hierarchical constituent agents sequentially, based on human-coded rules. In the current work, we propose a framework whereby pre-trained agents can be orchestrated in parallel leveraging an AI-based supervisor agent. For this, we propose to use Adhoc-Teaming approaches which assign optimal goals to the MARL agents and incentivize them to exhibit certain desired behaviours. Results on the network emulator show that the proposed approach results in faster and improved fulfilment of expectations when compared to rule-based approaches and even generalizes to changes in environments.
Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach
Authors: Wenhan Yu, Terence Jie Chua, Jun Zhao
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2310.17492
Pdf link: https://arxiv.org/pdf/2310.17492
Abstract The efficient deployment and fine-tuning of foundation models are pivotal in contemporary artificial intelligence. In this study, we present a groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation models, specifically designed to enhance local task performance on user equipment (UE). Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules. This design not only conserves computational resources but also ensures adaptability and fine-tuning efficiency for downstream tasks. Additionally, we introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings. To address the challenges presented by this system, we employ a hybrid multi-agent Deep Reinforcement Learning (DRL) strategy, adept at handling mixed discrete-continuous action spaces, ensuring dynamic and optimal resource allocations. Our comprehensive simulations and validations underscore the practical viability of our approach, demonstrating its robustness, efficiency, and scalability. Collectively, this work offers a fresh perspective on deploying foundation models and balancing computational efficiency with task proficiency.
Adaptive Resource Management for Edge Network Slicing using Incremental Multi-Agent Deep Reinforcement Learning
Authors: Haiyuan Li, Yuelin Liu, Xueqing Zhou, Xenofon Vasilakos, Reza Nejabati, Shuangyi Yan, Dimitra Simenidou
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17523
Pdf link: https://arxiv.org/pdf/2310.17523
Abstract Multi-access edge computing provides local resources in mobile networks as the essential means for meeting the demands of emerging ultra-reliable low-latency communications. At the edge, dynamic computing requests require advanced resource management for adaptive network slicing, including resource allocations, function scaling and load balancing to utilize only the necessary resources in resource-constraint networks. Recent solutions are designed for a static number of slices. Therefore, the painful process of optimization is required again with any update on the number of slices. In addition, these solutions intend to maximize instant rewards, neglecting long-term resource scheduling. Unlike these efforts, we propose an algorithmic approach based on multi-agent deep deterministic policy gradient (MADDPG) for optimizing resource management for edge network slicing. Our objective is two-fold: (i) maximizing long-term network slicing benefits in terms of delay and energy consumption, and (ii) adapting to slice number changes. Through simulations, we demonstrate that MADDPG outperforms benchmark solutions including a static slicing-based one from the literature, achieving stable and high long-term performance. Additionally, we leverage incremental learning to facilitate a dynamic number of edge slices, with enhanced performance compared to pre-trained base models. Remarkably, this approach yields superior reward performance while saving approximately 90% of training time costs.
Keyword: pruning

GraFT: Gradual Fusion Transformer for Multimodal Re-Identification
Authors: Haoli Yin, Jiayao Li (Emily), Eva Schiller, Luke McDermott, Daniel Cummings
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.16856
Pdf link: https://arxiv.org/pdf/2310.16856
Abstract Object Re-Identification (ReID) is pivotal in computer vision, witnessing an escalating demand for adept multimodal representation learning. Current models, although promising, reveal scalability limitations with increasing modalities as they rely heavily on late fusion, which postpones the integration of specific modality insights. Addressing this, we introduce the \textbf{Gradual Fusion Transformer (GraFT)} for multimodal ReID. At its core, GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features. Further bolstering its efficacy, we introduce a novel training paradigm combined with an augmented triplet loss, optimizing the ReID feature embedding space. We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks. Additionally, aiming for deployment versatility, we've integrated neural network pruning into GraFT, offering a balance between model size and performance.
Wide Flat Minimum Watermarking for Robust Ownership Verification of GANs
Authors: Jianwei Fei, Zhihua Xia, Benedetta Tondi, Mauro Barni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.16919
Pdf link: https://arxiv.org/pdf/2310.16919
Abstract We propose a novel multi-bit box-free watermarking method for the protection of Intellectual Property Rights (IPR) of GANs with improved robustness against white-box attacks like fine-tuning, pruning, quantization, and surrogate model attacks. The watermark is embedded by adding an extra watermarking loss term during GAN training, ensuring that the images generated by the GAN contain an invisible watermark that can be retrieved by a pre-trained watermark decoder. In order to improve the robustness against white-box model-level attacks, we make sure that the model converges to a wide flat minimum of the watermarking loss term, in such a way that any modification of the model parameters does not erase the watermark. To do so, we add random noise vectors to the parameters of the generator and require that the watermarking loss term is as invariant as possible with respect to the presence of noise. This procedure forces the generator to converge to a wide flat minimum of the watermarking loss. The proposed method is architectureand dataset-agnostic, thus being applicable to many different generation tasks and models, as well as to CNN-based image processing architectures. We present the results of extensive experiments showing that the presence of the watermark has a negligible impact on the quality of the generated images, and proving the superior robustness of the watermark against model modification and surrogate model attacks.
Optimal Robotic Assembly Sequence Planning: A Sequential Decision-Making Approach
Authors: Kartik Nagpal, Negar Mehr
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2310.17115
Pdf link: https://arxiv.org/pdf/2310.17115
Abstract The optimal robot assembly planning problem is challenging due to the necessity of finding the optimal solution amongst an exponentially vast number of possible plans, all while satisfying a selection of constraints. Traditionally, robotic assembly planning problems have been solved using heuristics, but these methods are specific to a given objective structure or set of problem parameters. In this paper, we propose a novel approach to robotic assembly planning that poses assembly sequencing as a sequential decision making problem, enabling us to harness methods that far outperform the state-of-the-art. We formulate the problem as a Markov Decision Process (MDP) and utilize Dynamic Programming (DP) to find optimal assembly policies for moderately sized strictures. We further expand our framework to exploit the deterministic nature of assembly planning and introduce a class of optimal Graph Exploration Assembly Planners (GEAPs). For larger structures, we show how Reinforcement Learning (RL) enables us to learn policies that generate high reward assembly sequences. We evaluate our approach on a variety of robotic assembly problems, such as the assembly of the Hubble Space Telescope, the International Space Station, and the James Webb Space Telescope. We further showcase how our DP, GEAP, and RL implementations are capable of finding optimal solutions under a variety of different objective functions and how our formulation allows us to translate precedence constraints to branch pruning and thus further improve performance. We have published our code at https://github.com/labicon/ORASP-Code.
Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning
Authors: Fengyuan Shi, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17177
Pdf link: https://arxiv.org/pdf/2310.17177
Abstract Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works present dynamic vision transformers to accelerate inference by pruning redundant tokens. A key to improving token pruning is using well-trained models as initialization for faster convergence and better performance. However, current base models usually adopt full image training, i.e., using full images as inputs and keeping the whole feature maps through the forward process, which causes inconsistencies with dynamic models that gradually reduce tokens, including calculation pattern, information amount and token selection strategy inconsistencies. Inspired by MAE which performs masking and reconstruction self-supervised task, we devise masked fine-tuning to bridge the gaps between pre-trained base models used for initialization and token pruning based dynamic vision transformers, by masking image patches and predicting the image class label based on left unmasked patches. Extensive experiments on ImageNet demonstrate that base models via masked fine-tuning gain strong occlusion robustness and ability against information loss. With this better initialization, Dynamic ViT achieves higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs. 81.3%, and 62.3% vs. 58.9% for DeiT based Dynamic ViT/0.8 and Dynamic ViT/0.3). Moreover, we apply our method into different token pruning based dynamic vision transformers, different pre-trained models and randomly initialized models to demonstrate the generalization ability.
PAC-tuning:Fine-tuning Pretrained Language Models with PAC-driven Perturbed Gradient Descent
Authors: Guangliang Liu, Zhiyu Xue, Xitong Zhang, Kristen Marie Johnson, Rongrong Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.17588
Pdf link: https://arxiv.org/pdf/2310.17588
Abstract Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale optimization problem, in which the choice of the training algorithm critically determines how well the trained model can generalize to unseen test data, especially in the context of few-shot learning. To achieve good generalization performance and avoid overfitting, techniques such as data augmentation and pruning are often applied. However, adding these regularizations necessitates heavy tuning of the hyperparameters of optimization algorithms, such as the popular Adam optimizer. In this paper, we propose a two-stage fine-tuning method, PAC-tuning, to address this optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly minimizes the PAC-Bayes generalization bound to learn proper parameter distribution. Second, PAC-tuning modifies the gradient by injecting noise with the variance learned in the first stage into the model parameters during training, resulting in a variant of perturbed gradient descent (PGD). In the past, the few-shot scenario posed difficulties for PAC-Bayes training because the PAC-Bayes bound, when applied to large models with limited training data, might not be stringent. Our experimental results across 5 GLUE benchmark tasks demonstrate that PAC-tuning successfully handles the challenges of fine-tuning tasks and outperforms strong baseline methods by a visible margin, further confirming the potential to apply PAC training for any other settings where the Adam optimizer is currently used for training.
Keyword: diffusion

Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration
Authors: Longlin Yu, Tianyu Xie, Yu Zhu, Tong Yang, Xiangyu Zhang, Cheng Zhang
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2310.17153
Pdf link: https://arxiv.org/pdf/2310.17153
Abstract Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.
Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise
Authors: Zhenkai Zhang, Krista A. Ehinger, Tom Drummond
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17167
Pdf link: https://arxiv.org/pdf/2310.17167
Abstract This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution involves reparameterizing the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\displaystyle \sqrt{\bar{\alpha}}=\cos(\eta)$. This reparameterization eliminates two singularities and allows for the expression of diffusion evolution as a well-behaved ordinary differential equation (ODE). In turn, this allows higher order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which enables more stable calculations of the update step in the inverse diffusion steps, as accurate estimation of both the image and noise are crucial at different stages of the process. Together with these changes, our model achieves faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
Exploring Iterative Refinement with Diffusion Models for Video Grounding
Authors: Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17189
Pdf link: https://arxiv.org/pdf/2310.17189
Abstract Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span in a single-shot manner, resulting in the absence of a systematical prediction refinement process. In this paper, we propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task, where the target span is generated from Gaussian noise inputs and interatively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. In inference, DiffusionVG can generate the target span from Gaussian noise inputs by the learned reverse diffusion process conditioned on the video-sentence representations. Our DiffusionVG follows the encoder-decoder architecture, which firstly encodes the video-sentence features and iteratively denoises the predicted spans in its specialized span refining decoder. Without bells and whistles, our DiffusionVG demonstrates competitive or even superior performance compared to existing well-crafted models on mainstream Charades-STA and ActivityNet Captions benchmarks.
Attribute Based Interpretable Evaluation Metrics for Generative Models
Authors: Dongkyun Kim, Mingi Kwon, Youngjung Uh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17261
Pdf link: https://arxiv.org/pdf/2310.17261
Abstract When the training dataset comprises a 1:1 proportion of dogs to cats, a generative model that produces 1:1 dogs and cats better resembles the training species distribution than another model with 3:1 dogs and cats. Can we capture this phenomenon using existing metrics? Unfortunately, we cannot, because these metrics do not provide any interpretability beyond "diversity". In this context, we propose a new evaluation protocol that measures the divergence of a set of generated images from the training set regarding the distribution of attribute strengths as follows. Single-attribute Divergence (SaD) measures the divergence regarding PDFs of a single attribute. Paired-attribute Divergence (PaD) measures the divergence regarding joint PDFs of a pair of attributes. They provide which attributes the models struggle. For measuring the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS) which measures the cosine similarity between image and text vectors with heterogeneous initial points. With SaD and PaD, we reveal the following about existing generative models. ProjectedGAN generates implausible attribute relationships such as a baby with a beard even though it has competitive scores of existing metrics. Diffusion models struggle to capture diverse colors in the datasets. The larger sampling timesteps of latent diffusion model generate the more minor objects including earrings and necklaces. Stable Diffusion v1.5 better captures the attributes than v2.1. Our metrics lay a foundation for explainable evaluations of generative models.
Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics
Authors: Shuai Yang, Zhifei Chen, Pengguang Chen, Xi Fang, Shu Liu, Yingcong Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17316
Pdf link: https://arxiv.org/pdf/2310.17316
Abstract Defect inspection is paramount within the closed-loop manufacturing system. However, existing datasets for defect inspection often lack precision and semantic granularity required for practical applications. In this paper, we introduce the Defect Spectrum, a comprehensive benchmark that offers precise, semantic-abundant, and large-scale annotations for a wide range of industrial defects. Building on four key industrial benchmarks, our dataset refines existing annotations and introduces rich semantic details, distinguishing multiple defect types within a single image. Furthermore, we introduce Defect-Gen, a two-stage diffusion-based generator designed to create high-quality and diverse defective images, even when working with limited datasets. The synthetic images generated by Defect-Gen significantly enhance the efficacy of defect inspection models. Overall, The Defect Spectrum dataset demonstrates its potential in defect inspection research, offering a solid platform for testing and refining advanced models.
CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling
Authors: Seyedmorteza Sadat, Jakob Buhmann, Derek Bradely, Otmar Hilliges, Romann M. Weber
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17347
Pdf link: https://arxiv.org/pdf/2310.17347
Abstract While conditional diffusion models are known to have good coverage of the data distribution, they still face limitations in output diversity, particularly when sampled with a high classifier-free guidance scale for optimal image quality or when trained on small datasets. We attribute this problem to the role of the conditioning signal in inference and offer an improved sampling strategy for diffusion models that can increase generation diversity, especially at high guidance scales, with minimal loss of sample quality. Our sampling strategy anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise to the conditioning vector during inference to balance diversity and condition alignment. Our Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained model and sampling algorithm, and we show that it boosts the diversity of diffusion models in various conditional generation tasks. Further, using an existing pretrained diffusion model, CADS achieves a new state-of-the-art FID of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 and 512$\times$512 respectively.
SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation
Authors: Haobo Jiang, Mathieu Salzmann, Zheng Dang, Jin Xie, Jian Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17359
Pdf link: https://arxiv.org/pdf/2310.17359
Abstract In this paper, we introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios. Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud to obtain a precise alignment with the model point cloud. Training our framework involves two operations: An SE(3) diffusion process and an SE(3) reverse process. The SE(3) diffusion process gradually perturbs the optimal rigid transformation of a pair of point clouds by continuously injecting noise (perturbation transformation). By contrast, the SE(3) reverse process focuses on learning a denoising network that refines the noisy transformation step-by-step, bringing it closer to the optimal transformation for accurate pose estimation. Unlike standard diffusion models used in linear Euclidean spaces, our diffusion model operates on the SE(3) manifold. This requires exploiting the linear Lie algebra $\mathfrak{se}(3)$ associated with SE(3) to constrain the transformation transitions during the diffusion and reverse processes. Additionally, to effectively train our denoising network, we derive a registration-specific variational lower bound as the optimization objective for model learning. Furthermore, we show that our denoising network can be constructed with a surrogate registration model, making our approach applicable to different deep registration networks. Extensive experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal Graph Learning
Authors: Junfeng Hu, Xu Liu, Zhencheng Fan, Yuxuan Liang, Roger Zimmermann
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17360
Pdf link: https://arxiv.org/pdf/2310.17360
Abstract Spatio-temporal graph learning is a fundamental problem in the Web of Things era, which enables a plethora of Web applications such as smart cities, human mobility and climate analysis. Existing approaches tackle different learning tasks independently, tailoring their models to unique task characteristics. These methods, however, fall short of modeling intrinsic uncertainties in the spatio-temporal data. Meanwhile, their specialized designs limit their universality as general spatio-temporal learning solutions. In this paper, we propose to model the learning tasks in a unified perspective, viewing them as predictions based on conditional information with shared spatio-temporal patterns. Based on this proposal, we introduce Unified Spatio-Temporal Diffusion Models (USTD) to address the tasks uniformly within the uncertainty-aware diffusion framework. USTD is holistically designed, comprising a shared spatio-temporal encoder and attention-based denoising networks that are task-specific. The shared encoder, optimized by a pre-training strategy, effectively captures conditional spatio-temporal patterns. The denoising networks, utilizing both cross- and self-attention, integrate conditional dependencies and generate predictions. Opting for forecasting and kriging as downstream tasks, we design Gated Attention (SGA) and Temporal Gated Attention (TGA) for each task, with different emphases on the spatial and temporal dimensions, respectively. By combining the advantages of deterministic encoders and probabilistic diffusion models, USTD achieves state-of-the-art performances compared to deterministic and probabilistic baselines in both tasks, while also providing valuable uncertainty estimates.
Exploring the Potential of Generative AI for the World Wide Web
Authors: Nouar AlDahoul, Joseph Hong, Matteo Varvello, Yasir Zaki
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2310.17370
Pdf link: https://arxiv.org/pdf/2310.17370
Abstract Generative Artificial Intelligence (AI) is a cutting-edge technology capable of producing text, images, and various media content leveraging generative models and user prompts. Between 2022 and 2023, generative AI surged in popularity with a plethora of applications spanning from AI-powered movies to chatbots. In this paper, we delve into the potential of generative AI within the realm of the World Wide Web, specifically focusing on image generation. Web developers already harness generative AI to help crafting text and images, while Web browsers might use it in the future to locally generate images for tasks like repairing broken webpages, conserving bandwidth, and enhancing privacy. To explore this research area, we have developed WebDiffusion, a tool that allows to simulate a Web powered by stable diffusion, a popular text-to-image model, from both a client and server perspective. WebDiffusion further supports crowdsourcing of user opinions, which we use to evaluate the quality and accuracy of 409 AI-generated images sourced from 60 webpages. Our findings suggest that generative AI is already capable of producing pertinent and high-quality Web images, even without requiring Web designers to manually input prompts, just by leveraging contextual information available within the webpages. However, we acknowledge that direct in-browser image generation remains a challenge, as only highly powerful GPUs, such as the A40 and A100, can (partially) compete with classic image downloads. Nevertheless, this approach could be valuable for a subset of the images, for example when fixing broken webpages or handling highly private content.
Causal Modeling with Stationary Diffusions
Authors: Lars Lorch, Andreas Krause, Bernhard Schölkopf
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17405
Pdf link: https://arxiv.org/pdf/2310.17405
Abstract We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.
Likelihood-based Out-of-Distribution Detection with Denoising Diffusion Probabilistic Models
Authors: Joseph Goodier, Neill D.F. Campbell
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17432
Pdf link: https://arxiv.org/pdf/2310.17432
Abstract Out-of-Distribution detection between dataset pairs has been extensively explored with generative models. We show that likelihood-based Out-of-Distribution detection can be extended to diffusion models by leveraging the fact that they, like other likelihood-based generative models, are dramatically affected by the input sample complexity. Currently, all Out-of-Distribution detection methods with Diffusion Models are reconstruction-based. We propose a new likelihood ratio for Out-of-Distribution detection with Deep Denoising Diffusion Models, which we call the Complexity Corrected Likelihood Ratio. Our likelihood ratio is constructed using Evidence Lower-Bound evaluations from an individual model at various noising levels. We present results that are comparable to state-of-the-art Out-of-Distribution detection methods with generative models.
The Expressive Power of Low-Rank Adaptation
Authors: Yuchen Zeng, Kangwook Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.17513
Pdf link: https://arxiv.org/pdf/2310.17513
Abstract Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
Authors: Xinghui Li, Jingyi Lu, Kai Han, Victor Prisacariu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17569
Pdf link: https://arxiv.org/pdf/2310.17569
Abstract In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.
DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation
Authors: Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.17570
Pdf link: https://arxiv.org/pdf/2310.17570
Abstract While Diffusion Generative Models have achieved great success on image generation tasks, how to efficiently and effectively incorporate them into speech generation especially translation tasks remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, it is not optimal to brutally apply discrete diffusion on the speech unit sequence while disregarding the continuous space structure, which will degrade the generation performance significantly. In this paper, we propose a novel diffusion model by applying the diffusion forward process in the \textit{continuous} speech representation space, while employing the diffusion backward process in the \textit{discrete} speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models. We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves comparable results to the computationally intensive auto-regressive baselines (500 steps on average) with significantly fewer decoding steps (50 steps).
Global Structure-Aware Diffusion Process for Low-Light Image Enhancement
Authors: Jinhui Hou, Zhiyu Zhu, Junhui Hou, Hui Liu, Huanqiang Zeng, Hui Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17577
Pdf link: https://arxiv.org/pdf/2310.17577
Abstract This paper studies a diffusion-based framework to address the low-light image enhancement problem. To harness the capabilities of diffusion models, we delve into this intricate process and advocate for the regularization of its inherent ODE-trajectory. To be specific, inspired by the recent research that low curvature ODE-trajectory results in a stable and effective diffusion process, we formulate a curvature regularization term anchored in the intrinsic non-local structures of image data, i.e., global structure-aware regularization, which gradually facilitates the preservation of complicated details and the augmentation of contrast during the diffusion process. This incorporation mitigates the adverse effects of noise and artifacts resulting from the diffusion process, leading to a more precise and flexible enhancement. To additionally promote learning in challenging regions, we introduce an uncertainty-guided regularization technique, which wisely relaxes constraints on the most extreme regions of the image. Experimental evaluations reveal that the proposed diffusion-based framework, complemented by rank-informed regularization, attains distinguished performance in low-light enhancement. The outcomes indicate substantial advancements in image quality, noise suppression, and contrast amplification in comparison with state-of-the-art methods. We believe this innovative approach will stimulate further exploration and advancement in low-light image processing, with potential implications for other applications of diffusion models. The code is publicly available at https://github.com/jinnh/GSAD.
Noise-Free Score Distillation
Authors: Oren Katzir, Or Patashnik, Daniel Cohen-Or, Dani Lischinski
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17590
Pdf link: https://arxiv.org/pdf/2310.17590
Abstract Score Distillation Sampling (SDS) has emerged as the de facto approach for text-to-content generation in non-image domains. In this paper, we reexamine the SDS process and introduce a straightforward interpretation that demystifies the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the distillation of an undesired noise term. Building upon our interpretation, we propose a novel Noise-Free Score Distillation (NFSD) process, which requires minimal modifications to the original SDS framework. Through this streamlined design, we achieve more effective distillation of pre-trained text-to-image diffusion models while using a nominal CFG scale. This strategic choice allows us to prevent the over-smoothing of results, ensuring that the generated data is both realistic and complies with the desired prompt. To demonstrate the efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as well as several other methods.
Generative Fractional Diffusion Models
Authors: Gabriel Nobis, Marco Aversa, Maximilian Springenberg, Michael Detzel, Stefano Ermon, Shinichi Nakajima, Roderick Murray-Smith, Sebastian Lapuschkin, Christoph Knochenhauer, Luis Oala, Wojciech Samek
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2310.17638
Pdf link: https://arxiv.org/pdf/2310.17638
Abstract We generalize the continuous time framework for score-based generative models from an underlying Brownian motion (BM) to an approximation of fractional Brownian motion (FBM). We derive a continuous reparameterization trick and the reverse time model by representing FBM as a stochastic integral over a family of Ornstein-Uhlenbeck processes to define generative fractional diffusion models (GFDM) with driving noise converging to a non-Markovian process of infinite quadratic variation. The Hurst index $H\in(0,1)$ of FBM enables control of the roughness of the distribution transforming path. To the best of our knowledge, this is the first attempt to build a generative model upon a stochastic process with infinite quadratic variation.
6-DoF Stability Field via Diffusion Models
Authors: Takuma Yoneda, Tianchong Jiang, Gregory Shakhnarovich, Matthew R. Walter
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17649
Pdf link: https://arxiv.org/pdf/2310.17649
Abstract A core capability for robot manipulation is reasoning over where and how to stably place objects in cluttered environments. Traditionally, robots have relied on object-specific, hand-crafted heuristics in order to perform such reasoning, with limited generalizability beyond a small number of object instances and object interaction patterns. Recent approaches instead learn notions of physical interaction, namely motion prediction, but require supervision in the form of labeled object information or come at the cost of high sample complexity, and do not directly reason over stability or object placement. We present 6-DoFusion, a generative model capable of generating 3D poses of an object that produces a stable configuration of a given scene. Underlying 6-DoFusion is a diffusion model that incrementally refines a randomly initialized SE(3) pose to generate a sample from a learned, context-dependent distribution over stable poses. We evaluate our model on different object placement and stacking tasks, demonstrating its ability to construct stable scenes that involve novel object classes as well as to improve the accuracy of state-of-the-art 3D pose estimation methods.
Keyword: adaptive

Optimal approximation of infinite-dimensional holomorphic functions II: recovery from i.i.d. pointwise samples
Authors: Ben Adcock, Nick Dexter, Sebastian Moraga
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2310.16940
Pdf link: https://arxiv.org/pdf/2310.16940
Abstract Infinite-dimensional, holomorphic functions have been studied in detail over the last several decades, due to their relevance to parametric differential equations and computational uncertainty quantification. The approximation of such functions from finitely many samples is of particular interest, due to the practical importance of constructing surrogate models to complex mathematical models of physical processes. In a previous work, [5] we studied the approximation of so-called Banach-valued, $(\boldsymbol{b},\varepsilon)$-holomorphic functions on the infinite-dimensional hypercube $[-1,1]^{\mathbb{N}}$ from $m$ (potentially adaptive) samples. In particular, we derived lower bounds for the adaptive $m$-widths for classes of such functions, which showed that certain algebraic rates of the form $m^{1/2-1/p}$ are the best possible regardless of the sampling-recovery pair. In this work, we continue this investigation by focusing on the practical case where the samples are pointwise evaluations drawn identically and independently from a probability measure. Specifically, for Hilbert-valued $(\boldsymbol{b},\varepsilon)$-holomorphic functions, we show that the same rates can be achieved (up to a small polylogarithmic or algebraic factor) for essentially arbitrary tensor-product Jacobi (ultraspherical) measures. Our reconstruction maps are based on least squares and compressed sensing procedures using the corresponding orthonormal Jacobi polynomials. In doing so, we strengthen and generalize past work that has derived weaker nonuniform guarantees for the uniform and Chebyshev measures (and corresponding polynomials) only. We also extend various best $s$-term polynomial approximation error bounds to arbitrary Jacobi polynomial expansions. Overall, we demonstrate that i.i.d.\ pointwise samples are near-optimal for the recovery of infinite-dimensional, holomorphic functions.
Detecting stealthy cyberattacks on adaptive cruise control vehicles: A machine learning approach
Authors: Tianyi Li, Mingfeng Shang, Shian Wang, Raphael Stern
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.17091
Pdf link: https://arxiv.org/pdf/2310.17091
Abstract With the advent of vehicles equipped with advanced driver-assistance systems, such as adaptive cruise control (ACC) and other automated driving features, the potential for cyberattacks on these automated vehicles (AVs) has emerged. While overt attacks that force vehicles to collide may be easily identified, more insidious attacks, which only slightly alter driving behavior, can result in network-wide increases in congestion, fuel consumption, and even crash risk without being easily detected. To address the detection of such attacks, we first present a traffic model framework for three types of potential cyberattacks: malicious manipulation of vehicle control commands, false data injection attacks on sensor measurements, and denial-of-service (DoS) attacks. We then investigate the impacts of these attacks at both the individual vehicle (micro) and traffic flow (macro) levels. A novel generative adversarial network (GAN)-based anomaly detection model is proposed for real-time identification of such attacks using vehicle trajectory data. We provide numerical evidence {to demonstrate} the efficacy of our machine learning approach in detecting cyberattacks on ACC-equipped vehicles. The proposed method is compared against some recently proposed neural network models and observed to have higher accuracy in identifying anomalous driving behaviors of ACC vehicles.
Adaptive important sampling for Deep Ritz
Authors: Xiaoliang Wan, Tao Zhou, Yuancheng Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2310.17185
Pdf link: https://arxiv.org/pdf/2310.17185
Abstract We introduce an adaptive sampling method for the Deep Ritz method aimed at solving partial differential equations (PDEs). Two deep neural networks are used. One network is employed to approximate the solution of PDEs, while the other one is a deep generative model used to generate new collocation points to refine the training set. The adaptive sampling procedure consists of two main steps. The first step is solving the PDEs using the Deep Ritz method by minimizing an associated variational loss discretized by the collocation points in the training set. The second step involves generating a new training set, which is then used in subsequent computations to further improve the accuracy of the current approximate solution. We treat the integrand in the variational loss as an unnormalized probability density function (PDF) and approximate it using a deep generative model called bounded KRnet. The new samples and their associated PDF values are obtained from the bounded KRnet. With these new samples and their associated PDF values, the variational loss can be approximated more accurately by importance sampling. Compared to the original Deep Ritz method, the proposed adaptive method improves accuracy, especially for problems characterized by low regularity and high dimensionality. We demonstrate the effectiveness of our new method through a series of numerical experiments.
Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction Network for Tone Mapping
Authors: Feng Zhang, Ming Tian, Zhiqiang Li, Bin Xu, Qingbo Lu, Changxin Gao, Nong Sang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2310.17190
Pdf link: https://arxiv.org/pdf/2310.17190
Abstract Tone mapping aims to convert high dynamic range (HDR) images to low dynamic range (LDR) representations, a critical task in the camera imaging pipeline. In recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained attention due to their ability to strike a favorable balance between enhancement performance and computational efficiency. However, these methods often fail to deliver satisfactory results in local areas since the look-up table is a global operator for tone mapping, which works based on pixel values and fails to incorporate crucial local information. To this end, this paper aims to address this issue by exploring a novel strategy that integrates global and local operators by utilizing closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we employ image-adaptive 3D LUTs to manipulate the tone in the low-frequency image by leveraging the specific characteristics of the frequency information. Furthermore, we utilize local Laplacian filters to refine the edge details in the high-frequency components in an adaptive manner. Local Laplacian filters are widely used to preserve edge details in photographs, but their conventional usage involves manual tuning and fixed implementation within camera imaging pipelines or photo editing tools. We propose to learn parameter value maps progressively for local Laplacian filters from annotated data using a lightweight network. Our model achieves simultaneous global tone manipulation and local edge detail preservation in an end-to-end manner. Extensive experimental results on two benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
Quickest Change Detection with Controlled Sensing
Authors: Venugopal V. Veeravalli, Georgios Fellouris, George V. Moustakides
Subjects: Information Theory (cs.IT); Statistics Theory (math.ST)
Arxiv link: https://arxiv.org/abs/2310.17223
Pdf link: https://arxiv.org/pdf/2310.17223
Abstract In the problem of quickest change detection, a change occurs at some unknown time in the distribution of a sequence of random vectors that are monitored in real time, and the goal is to detect this change as quickly as possible subject to a certain false alarm constraint. In this work we consider this problem in the presence of parametric uncertainty in the post-change regime and controlled sensing. That is, the post-change distribution contains an unknown parameter, and the distribution of each observation, before and after the change, is affected by a control action. In this context, in addition to a stopping rule that determines the time at which it is declared that the change has occurred, one also needs to determine a sequential control policy, which chooses the control action at each time based on the already collected observations. We formulate this problem mathematically using Lorden's minimax criterion, and assuming that there are finitely many possible actions and post-change parameter values. We then propose a specific procedure for this problem that employs an adaptive CuSum statistic in which (i) the estimate of the parameter is based on a fixed number of the more recent observations, and (ii) each action is selected to maximize the Kullback-Leibler divergence of the next observation based on the current parameter estimate, apart from a small number of exploration times. We show that this procedure, which we call the Windowed Chernoff-CuSum (WCC), is first-order asymptotically optimal under Lorden's minimax criterion, for every possible possible value of the unknown post-change parameter, as the mean time to false alarm goes to infinity. We also provide simulation results to illustrate the performance of the WCC procedure.
Towards the decentralized coordination of multiple self-adaptive systems
Authors: Paul-Andrei Dragan, Andreas Metzger, Klaus Pohl
Subjects: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2310.17224
Pdf link: https://arxiv.org/pdf/2310.17224
Abstract When multiple self-adaptive systems share the same environment and have common goals, they may coordinate their adaptations at runtime to avoid conflicts and to satisfy their goals. There are two approaches to coordination. (1) Logically centralized, where a supervisor has complete control over the individual self-adaptive systems. Such approach is infeasible when the systems have different owners or administrative domains. (2) Logically decentralized, where coordination is achieved through direct interactions. Because the individual systems have control over the information they share, decentralized coordination accommodates multiple administrative domains. However, existing techniques do not account simultaneously for both local concerns, e.g., preferences, and shared concerns, e.g., conflicts, which may lead to goals not being achieved as expected. Our idea to address this shortcoming is to express both types of concerns within the same constraint optimization problem. We propose CoADAPT, a decentralized coordination technique introducing two types of constraints: preference constraints, expressing local concerns, and consistency constraints, expressing shared concerns. At runtime, the problem is solved in a decentralized way using distributed constraint optimization algorithms implemented by each self-adaptive system. As a first step in realizing CoADAPT, we focus in this work on the coordination of adaptation planning strategies, traditionally addressed only with centralized techniques. We show the feasibility of CoADAPT in an exemplar from cloud computing and analyze experimentally its scalability.
Computationally Feasible Strategies
Authors: Catalin Dima, Wojciech Jamroga
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2310.17234
Pdf link: https://arxiv.org/pdf/2310.17234
Abstract Real-life agents seldom have unlimited reasoning power. In this paper, we propose and study a new formal notion of computationally bounded strategic ability in multi-agent systems. The notion characterizes the ability of a set of agents to synthesize an executable strategy in the form of a Turing machine within a given complexity class, that ensures the satisfaction of a temporal objective in a parameterized game arena. We show that the new concept induces a proper hierarchy of strategic abilities -- in particular, polynomial-time abilities are strictly weaker than the exponential-time ones. We also propose an ``adaptive'' variant of computational ability which allows for different strategies for each parameter value, and show that the two notions do not coincide. Finally, we define and study the model-checking problem for computational strategies. We show that the problem is undecidable even for severely restricted inputs, and present our first steps towards decidable fragments.
Generalizing to Unseen Domains in Diabetic Retinopathy Classification
Authors: Chamuditha Jayanga Galappaththige, Gayal Kuruppu, Muhammad Haris Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17255
Pdf link: https://arxiv.org/pdf/2310.17255
Abstract Diabetic retinopathy (DR). is caused by long-standing diabetes and is among the fifth leading cause for visual impairments. The process of early diagnosis and treatments could be helpful in curing the disease, however, the detection procedure is rather challenging and mostly tedious. Therefore, automated diabetic retinopathy classification using deep learning techniques has gained interest in the medical imaging community. Akin to several other real-world applications of deep learning, the typical assumption of i.i.d data is also violated in DR classification that relies on deep learning. Therefore, developing DR classification methods robust to unseen distributions is of great value. In this paper, we study the problem of generalizing a model to unseen distributions or domains (a.k.a domain generalization) in DR classification. To this end, we propose a simple and effective domain generalization (DG) approach that achieves self-distillation in vision transformers (ViT) via a novel prediction softening mechanism. This prediction softening is an adaptive convex combination one-hot labels with the model's own knowledge. We perform extensive experiments on challenging open-source DR classification datasets under both multi-source and single-source DG settings with three different ViT backbones to establish the efficacy and applicability of our approach against competing methods. For the first time, we report the performance of several state-of-the-art DG methods on open-source DR classification datasets after conducting thorough experiments. Finally, our method is also capable of delivering improved calibration performance than other methods, showing its suitability for safety-critical applications, including healthcare. We hope that our contributions would investigate more DG research across the medical imaging community.
Scale-Adaptive Feature Aggregation for Efficient Space-Time Video Super-Resolution
Authors: Zhewei Huang, Ailin Huang, Xiaotao Hu, Chen Hu, Jun Xu, Shuchang Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17294
Pdf link: https://arxiv.org/pdf/2310.17294
Abstract The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual quality of videos, by simultaneously performing video frame interpolation (VFI) and video super-resolution (VSR). However, facing the challenge of the additional temporal dimension and scale inconsistency, most existing STVSR methods are complex and inflexible in dynamically modeling different motion amplitudes. In this work, we find that choosing an appropriate processing scale achieves remarkable benefits in flow-based feature propagation. We propose a novel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects sub-networks with different processing scales for individual samples. Experiments on four public STVSR benchmarks demonstrate that SAFA achieves state-of-the-art performance. Our SAFA network outperforms recent state-of-the-art methods such as TMNet and VideoINR by an average improvement of over 0.5dB on PSNR, while requiring less than half the number of parameters and only 1/3 computational costs.
A multi-artifact EEG denoising by frequency-based deep learning
Authors: Matteo Gabardi, Aurora Saibene, Francesca Gasparini, Daniele Rizzo, Fabio Antonio Stella
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2310.17335
Pdf link: https://arxiv.org/pdf/2310.17335
Abstract Electroencephalographic (EEG) signals are fundamental to neuroscience research and clinical applications such as brain-computer interfaces and neurological disorder diagnosis. These signals are typically a combination of neurological activity and noise, originating from various sources, including physiological artifacts like ocular and muscular movements. Under this setting, we tackle the challenge of distinguishing neurological activity from noise-related sources. We develop a novel EEG denoising model that operates in the frequency domain, leveraging prior knowledge about noise spectral features to adaptively compute optimal convolutional filters for noise separation. The model is trained to learn an empirical relationship connecting the spectral characteristics of noise and noisy signal to a non-linear transformation which allows signal denoising. Performance evaluation on the EEGdenoiseNet dataset shows that the proposed model achieves optimal results according to both temporal and spectral metrics. The model is found to remove physiological artifacts from input EEG data, thus achieving effective EEG denoising. Indeed, the model performance either matches or outperforms that achieved by benchmark models, proving to effectively remove both muscle and ocular artifacts without the need to perform any training on the particular type of artifact.
Distributed Adaptive Control for Uncertain Networks
Authors: Venkatraman Renganathan, Anders Rantzer, Olle Kjellqvist
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17364
Pdf link: https://arxiv.org/pdf/2310.17364
Abstract Control of network systems with uncertain local dynamics has remained an open problem for a long time. In this paper, a distributed minimax adaptive control algorithm is proposed for such networks whose local dynamics has an uncertain parameter possibly taking finite number of values. To hedge against this uncertainty, each node in the network collects the historical data of its neighboring nodes to decide its control action along its edges by finding the parameter that best describes the observed disturbance trajectory. Our proposed distributed adaptive controller is scalable and we give both lower and upper bounds for its $\ell{2}$ gain. Numerical simulations demonstrate that once each node has sufficiently estimated its local uncertainty, the distributed minimax adaptive controller behaves like the optimal distributed $H{\infty}$ controller in hindsight.
A near-autonomous and incremental intrusion detection system through active learning of known and unknown attacks
Authors: Lynda Boukela, Gongxuan Zhang, Meziane Yacoub, Samia Bouzefrane
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2310.17430
Pdf link: https://arxiv.org/pdf/2310.17430
Abstract Intrusion detection is a traditional practice of security experts, however, there are several issues which still need to be tackled. Therefore, in this paper, after highlighting these issues, we present an architecture for a hybrid Intrusion Detection System (IDS) for an adaptive and incremental detection of both known and unknown attacks. The IDS is composed of supervised and unsupervised modules, namely, a Deep Neural Network (DNN) and the K-Nearest Neighbors (KNN) algorithm, respectively. The proposed system is near-autonomous since the intervention of the expert is minimized through the active learning (AL) approach. A query strategy for the labeling process is presented, it aims at teaching the supervised module to detect unknown attacks and improve the detection of the already-known attacks. This teaching is achieved through sliding windows (SW) in an incremental fashion where the DNN is retrained when the data is available over time, thus rendering the IDS adaptive to cope with the evolutionary aspect of the network traffic. A set of experiments was conducted on the CICIDS2017 dataset in order to evaluate the performance of the IDS, promising results were obtained.
Adaptive Resource Management for Edge Network Slicing using Incremental Multi-Agent Deep Reinforcement Learning
Authors: Haiyuan Li, Yuelin Liu, Xueqing Zhou, Xenofon Vasilakos, Reza Nejabati, Shuangyi Yan, Dimitra Simenidou
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2310.17523
Pdf link: https://arxiv.org/pdf/2310.17523
Abstract Multi-access edge computing provides local resources in mobile networks as the essential means for meeting the demands of emerging ultra-reliable low-latency communications. At the edge, dynamic computing requests require advanced resource management for adaptive network slicing, including resource allocations, function scaling and load balancing to utilize only the necessary resources in resource-constraint networks. Recent solutions are designed for a static number of slices. Therefore, the painful process of optimization is required again with any update on the number of slices. In addition, these solutions intend to maximize instant rewards, neglecting long-term resource scheduling. Unlike these efforts, we propose an algorithmic approach based on multi-agent deep deterministic policy gradient (MADDPG) for optimizing resource management for edge network slicing. Our objective is two-fold: (i) maximizing long-term network slicing benefits in terms of delay and energy consumption, and (ii) adapting to slice number changes. Through simulations, we demonstrate that MADDPG outperforms benchmark solutions including a static slicing-based one from the literature, achieving stable and high long-term performance. Additionally, we leverage incremental learning to facilitate a dynamic number of edge slices, with enhanced performance compared to pre-trained base models. Remarkably, this approach yields superior reward performance while saving approximately 90% of training time costs.
Positivity-preserving and entropy-bounded discontinuous Galerkin method for the chemically reacting, compressible Navier-Stokes equations
Authors: Eric J. Ching, Ryan F. Johnson, Sarah Burrows, Jacklyn Higgs, Andrew D. Kercher
Subjects: Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2310.17637
Pdf link: https://arxiv.org/pdf/2310.17637
Abstract This article concerns the development of a fully conservative, positivity-preserving, and entropy-bounded discontinuous Galerkin scheme for simulating the multicomponent, chemically reacting, compressible Navier-Stokes equations with complex thermodynamics. In particular, we extend to viscous flows the fully conservative, positivity-preserving, and entropy-bounded discontinuous Galerkin method for the chemically reacting Euler equations that we previously introduced. An important component of the formulation is the positivity-preserving Lax-Friedrichs-type viscous flux function devised by Zhang [J. Comput. Phys., 328 (2017), pp. 301-343], which was adapted to multicomponent flows by Du and Yang [J. Comput. Phys., 469 (2022), pp. 111548] in a manner that treats the inviscid and viscous fluxes as a single flux. Here, we similarly extend the aforementioned flux function to multicomponent flows but separate the inviscid and viscous fluxes. This separation of the fluxes allows for use of other inviscid flux functions, as well as enforcement of entropy boundedness on only the convective contribution to the evolved state, as motivated by physical and mathematical principles. We also discuss in detail how to account for boundary conditions and incorporate previously developed pressure-equilibrium-preserving techniques into the positivity-preserving framework. Comparisons between the Lax-Friedrichs-type viscous flux function and more conventional flux functions are provided, the results of which motivate an adaptive solution procedure that employs the former only when the element-local solution average has negative species concentrations, nonpositive density, or nonpositive pressure. A variety of multicomponent, viscous flows is computed, ranging from a one-dimensional shock tube problem to multidimensional detonation waves and shock/mixing-layer interaction.
Keyword: quantization

General Point Model with Autoencoding and Autoregressive
Authors: Zhe Li, Zhangyang Gao, Cheng Tan, Stan Z. Li, Laurence T. Yang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.16861
Pdf link: https://arxiv.org/pdf/2310.16861
Abstract The pre-training architectures of large language models encompass various types, including autoencoding models, autoregressive models, and encoder-decoder models. We posit that any modality can potentially benefit from a large language model, as long as it undergoes vector quantization to become discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which seamlessly integrates autoencoding and autoregressive tasks in point cloud transformer. This model is versatile, allowing fine-tuning for downstream point cloud representation tasks, as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks, leading to improved performance in point cloud understanding. Additionally, GPM demonstrates highly competitive results in unconditional point cloud generation tasks, even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves superior performance in point cloud understanding tasks. Furthermore, the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks.
Wide Flat Minimum Watermarking for Robust Ownership Verification of GANs
Authors: Jianwei Fei, Zhihua Xia, Benedetta Tondi, Mauro Barni
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2310.16919
Pdf link: https://arxiv.org/pdf/2310.16919
Abstract We propose a novel multi-bit box-free watermarking method for the protection of Intellectual Property Rights (IPR) of GANs with improved robustness against white-box attacks like fine-tuning, pruning, quantization, and surrogate model attacks. The watermark is embedded by adding an extra watermarking loss term during GAN training, ensuring that the images generated by the GAN contain an invisible watermark that can be retrieved by a pre-trained watermark decoder. In order to improve the robustness against white-box model-level attacks, we make sure that the model converges to a wide flat minimum of the watermarking loss term, in such a way that any modification of the model parameters does not erase the watermark. To do so, we add random noise vectors to the parameters of the generator and require that the watermarking loss term is as invariant as possible with respect to the presence of noise. This procedure forces the generator to converge to a wide flat minimum of the watermarking loss. The proposed method is architectureand dataset-agnostic, thus being applicable to many different generation tasks and models, as well as to CNN-based image processing architectures. We present the results of extensive experiments showing that the presence of the watermark has a negligible impact on the quality of the generated images, and proving the superior robustness of the watermark against model modification and surrogate model attacks.
Neural Distributed Compressor Discovers Binning
Authors: Ezgi Ozyilkan, Johannes Ballé, Elza Erkip
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2310.16961
Pdf link: https://arxiv.org/pdf/2310.16961
Abstract We consider lossy compression of an information source when the decoder has lossless access to a correlated one. This setup, also known as the Wyner-Ziv problem, is a special case of distributed source coding. To this day, practical approaches for the Wyner-Ziv problem have neither been fully developed nor heavily investigated. We propose a data-driven method based on machine learning that leverages the universal function approximation capability of artificial neural networks. We find that our neural network-based compression scheme, based on variational vector quantization, recovers some principles of the optimum theoretical solution of the Wyner-Ziv setup, such as binning in the source space as well as optimal combination of the quantization index and side information, for exemplary sources. These behaviors emerge although no structure exploiting knowledge of the source distributions was imposed. Binning is a widely used tool in information theoretic proofs and methods, and to our knowledge, this is the first time it has been explicitly observed to emerge from data-driven learning.
Deep Imbalanced Regression via Hierarchical Classification Adjustment
Authors: Haipeng Xiong, Angela Yao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2310.17154
Pdf link: https://arxiv.org/pdf/2310.17154
Abstract Regression tasks in computer vision, such as age estimation or counting, are often formulated into classification by quantizing the target space into classes. Yet real-world data is often imbalanced -- the majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range. By selecting the class quantization, one can adjust imbalanced regression targets into balanced classification outputs, though there are trade-offs in balancing classification accuracy and quantization error. To improve regression performance over the entire range of data, we propose to construct hierarchical classifiers for solving imbalanced regression tasks. The fine-grained classifiers limit the quantization error while being modulated by the coarse predictions to ensure high accuracy. Standard hierarchical classification approaches, however, when applied to the regression problem, fail to ensure that predicted ranges remain consistent across the hierarchy. As such, we propose a range-preserving distillation process that can effectively learn a single classifier from the set of hierarchical classifiers. Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks: age estimation, crowd counting and depth estimation. We will release the source code upon acceptance.
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Authors: Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2310.17230
Pdf link: https://arxiv.org/pdf/2310.17230
Abstract Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.

A-suozhang / GetArxivDaily

New submissions for Fri, 27 Oct 23 #186

Keyword: efficient

Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs

Hardware-Algorithm Co-design Enabling Processing-in-Pixel-in-Memory (P2M) for Neuromorphic Vision Sensors

A Type System for Julia

Diagnosing Alzheimer's Disease using Early-Late Multimodal Data Fusion with Jacobian Maps

The Teenager's Problem: Efficient Garment Decluttering With Grasp Optimization

Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning

WaLiN-GUI: a graphical and auditory tool for neuron-based encoding

Packed to the Brim: Investigating the Impact of Highly Responsive Prefixes on Internet-wide Measurement Campaigns

Streaming Factor Trajectory Learning for Temporal Tensor Decomposition

On Surgical Fine-tuning for Language Encoders

Learning to Rank for Active Learning via Multi-Task Bilevel Optimization

Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer

BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation

A Method for Network Intrusion Detection Using Flow Sequence and BERT Framework

Max-min Rate Optimization of Low-Complexity Hybrid Multi-User Beamforming Maintaining Rate-Fairness

Content-based Controls For Music Large Language Modeling

MO-YOLO: End-to-End Multiple-Object Tracking Method with YOLO and MOTR

A Deep Learning Approach to Teeth Segmentation and Orientation from Panoramic X-rays

Graphical Object-Centric Actor-Critic

TST$^\mathrm{R}$: Target Similarity Tuning Meets the Real World

fairret: a Framework for Differentiable Fairness Regularization Terms

BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds

RAN Functional Split Options for Integrated Terrestrial and Non-Terrestrial 6G Networks

Mode Selection for Component Mode Synthesis with Guaranteed Assembly Accuracy

CQM: Curriculum Reinforcement Learning with a Quantized World Model

Acceleration and restart for the randomized Bregman-Kaczmarz method

Exploring the Trie of Rules: a fast data structure for the representation of association rules

YOLO-BEV: Generating Bird's-Eye View in the Same Way as 2D Object Detection

Invariance Measures for Neural Networks

Synthesizing Efficiently Monitorable Formulas in Metric Temporal Logic

LEI2JSON: Schema-based Validation and Conversion of Livestock Event Information

Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs

AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

''Fifty Shades of Bias'': Normative Ratings of Gender Bias in GPT Generated English Text

Efficient safe learning for controller tuning with experimental validation

Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings

Cross-modal Active Complementary Learning with Self-refining Correspondence

Analytical model for large-scale design of sidewalk delivery robot systems

FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing

Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach

The IMS Toucan System for the Blizzard Challenge 2023

A Lightweight, Compiler-Assisted Register File Cache for GPGPU

Revisiting the Distillation of Image Representations into Point Clouds for Autonomous Driving

The Expressive Power of Low-Rank Adaptation

FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction

Learning Regularized Graphon Mean-Field Games with Unknown Graphons

Test Bench Study on Attitude Estimation in Ground Effect Region Based on Motor Current for In-Flight Inductive Power Transfer of Drones

Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

An efficient frequency-independent numerical method for computing the far-field pattern induced by polygonal obstacles

Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Grow Your Limits: Continuous Improvement with Real-World RL for Robotic Locomotion

High-Dimensional Prediction for Sequential Decision Making

Keyword: faster

Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs

An Explainable Deep Learning-Based Method For Schizophrenia Diagnosis Using Generative Data-Augmentation

Faster Recalibration of an Online Predictor via Approachability

HyperFields: Towards Zero-Shot Generation of NeRFs from Text

Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning

CuRobo: Parallelized Collision-Free Minimum-Jerk Robot Motion Generation

Boolean Abstractions for Realizability Modulo Theories (Extended version)

Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs

Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent

Keyword: mobile

How does Module Tracking for Agrivoltaics Differ from Standard Photovoltaics? Performance & Technoeconomic Implications

Goals are Enough: Inducing AdHoc cooperation among unseen Multi-Agent systems in IMFs

Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach

Adaptive Resource Management for Edge Network Slicing using Incremental Multi-Agent Deep Reinforcement Learning

Keyword: pruning

GraFT: Gradual Fusion Transformer for Multimodal Re-Identification

Wide Flat Minimum Watermarking for Robust Ownership Verification of GANs

Optimal Robotic Assembly Sequence Planning: A Sequential Decision-Making Approach