Abstract
Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of spreadsheets and relational databases. But this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate training data, which undermines the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks. But this creates a trade-off between the quality and privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but suffer from unstable adversarial training and mode collapse, which are exacerbated by the privacy constraints and the challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher-quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially-private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By implementing TableDiffusion to predict the added noise, we enable it to bypass the challenges of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves vastly more data- and privacy-efficient than the adversarial paradigm, due to augmented re-use of each data batch and a smoother iterative training process.
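As a concrete illustration of the noise-prediction objective described above, consider the following sketch of a single diffusion training step on encoded tabular rows. The network eps_model, the linear noise schedule, and the step count are illustrative assumptions; the DP-SGD machinery (per-example gradient clipping plus calibrated noise) that would spend the privacy budget is elided.

import torch

def diffusion_loss(eps_model, x0, T=1000):
    """One step: corrupt a batch of encoded tabular rows x0 with Gaussian
    noise and train the model to predict that noise (illustrative sketch)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # random timestep per row
    beta = torch.linspace(1e-4, 0.02, T)                 # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)[t].unsqueeze(1)
    eps = torch.randn_like(x0)                           # the noise to be predicted
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
    # Predicting eps, rather than reconstructing x0, sidesteps decoding
    # mixed categorical/continuous columns inside the model.
    return torch.nn.functional.mse_loss(eps_model(x_t, t), eps)

In a DP setting this loss would be minimized with a DP-SGD-style optimizer, so each data batch contributes gradient signal many times across the iterative denoising schedule, which is one intuition for the data and privacy efficiency noted above.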
CLNeRF: Continual Learning Meets NeRF
Authors: Zhipeng Cai, Matthias Mueller
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Novel view synthesis aims to render unseen views given a set of calibrated images. In practical applications, the coverage, appearance or geometry of the scene may change over time, with new images continuously being captured. Efficiently incorporating such continuous change is an open challenge. Standard NeRF benchmarks only involve scene coverage expansion. To study other practical scene changes, we propose a new dataset, World Across Time (WAT), consisting of scenes that change in appearance and geometry over time. We also propose a simple yet effective method, CLNeRF, which introduces continual learning (CL) to Neural Radiance Fields (NeRFs). CLNeRF combines generative replay and the Instant Neural Graphics Primitives (NGP) architecture to effectively prevent catastrophic forgetting and efficiently update the model when new data arrives. We also add trainable appearance and geometry embeddings to NGP, allowing a single compact model to handle complex scene changes. Without the need to store historical images, CLNeRF trained sequentially over multiple scans of a changing scene performs on par with the upper-bound model trained on all scans at once. Compared to other CL baselines, CLNeRF performs much better across standard benchmarks and WAT. The source code and the WAT dataset are available at https://github.com/IntelLabs/CLNeRF. A video presentation is available at https://youtu.be/nLRt6OoDGq0?si=8yD6k-8MMBJInQPs
Continual Learning with Dynamic Sparse Training: Exploring Algorithms for Effective Model Updates
Authors: Murat Onur Yildirim, Elif Ceren Gok Yildirim, Ghada Sokar, Decebal Constantin Mocanu, Joaquin Vanschoren
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Continual learning (CL) refers to the ability of an intelligent system to sequentially acquire and retain knowledge from a stream of data with as little computational overhead as possible. To this end, regularization, replay, architecture, and parameter isolation approaches have been introduced in the literature. Parameter isolation uses a sparse network, which makes it possible to allocate distinct parts of the neural network to different tasks while also allowing parameters to be shared between tasks if they are similar. Dynamic Sparse Training (DST) is a prominent way to find these sparse networks and isolate them for each task. This paper is the first empirical study investigating the effect of different DST components under the CL paradigm, filling a critical research gap and shedding light on the optimal configuration of DST for CL, if it exists. Therefore, we perform a comprehensive study in which we investigate various DST components to find the best topology per task on the well-known CIFAR100 and miniImageNet benchmarks in a task-incremental CL setup, since our primary focus is to evaluate the performance of various DST criteria rather than the process of mask selection. We found that, at a low sparsity level, Erdos-Renyi Kernel (ERK) initialization utilizes the backbone more efficiently and allows task increments to be learned effectively. At a high sparsity level, however, uniform initialization demonstrates more reliable and robust performance. In terms of growth strategy, performance depends on the chosen initialization strategy and the extent of sparsity. Finally, adaptivity within DST components is a promising way to obtain better continual learners.
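For reference, a simplified sketch of ERK-style density allocation for fully-connected layers appears below; the layer shapes are illustrative, the kernel-dimension terms for convolutional layers are omitted, and densities above 1 are simply clipped without redistribution.

def erk_densities(layer_shapes, global_density):
    """Assign per-layer densities proportional to (fan_in + fan_out) / (fan_in * fan_out),
    scaled so the total number of kept weights matches the global density."""
    params = [n_in * n_out for n_in, n_out in layer_shapes]
    scores = [(n_in + n_out) / (n_in * n_out) for n_in, n_out in layer_shapes]
    # Solve for eps in: sum(eps * score_l * params_l) = global_density * sum(params_l)
    eps = global_density * sum(params) / sum(s * p for s, p in zip(scores, params))
    return [min(1.0, eps * s) for s in scores]

# Small layers with skewed fan-in/fan-out receive proportionally higher density:
print(erk_densities([(784, 256), (256, 256), (256, 10)], global_density=0.1))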
Abstract
We consider the problem of graph analytics on evolving graphs. In this scenario, a query typically needs to be applied to different snapshots of the graph over an extended time window. We propose CommonGraph, an approach for efficient processing of queries on evolving graphs. We first observe that edge deletions are significantly more expensive than addition operations. CommonGraph converts all deletions to additions by finding a common graph that exists across all snapshots. After computing the query on this graph, to reach any snapshot, we simply need to add the missing edges and incrementally update the query results. CommonGraph also allows sharing of common additions among snapshots that require them, and breaks the sequential dependency inherent in the traditional streaming approach where snapshots are processed in sequence, enabling additional opportunities for parallelism. We incorporate the CommonGraph approach by extending the KickStarter streaming framework. CommonGraph achieves 1.38x-8.17x improvement in performance over KickStarter across multiple benchmarks.
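The deletion-to-addition conversion can be sketched in a few lines; this toy version uses edge sets and omits the incremental query updates and the KickStarter integration.

def common_graph(snapshots):
    """snapshots: list of edge sets. Returns the graph common to all snapshots
    and the additions needed to reach each one (no deletions ever required)."""
    common = set.intersection(*snapshots)
    additions = [s - common for s in snapshots]
    return common, additions

snaps = [{(0, 1), (1, 2), (2, 3)}, {(0, 1), (1, 2), (3, 4)}]
core, adds = common_graph(snaps)
print(core)   # {(0, 1), (1, 2)} -- compute the query here once
print(adds)   # per-snapshot edges to add, processable in parallel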
Scalable and Configurable Tracking for Any Rowhammer Threshold
Authors: Anish Saxena, Moinuddin Qureshi
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Abstract
The Rowhammer vulnerability continues to worsen, with the Rowhammer Threshold (TRH) reducing from 139K activations to 4.8K activations over the last decade. Typical Rowhammer mitigations rely on tracking aggressor rows. The number of possible aggressors increases with lowering thresholds, making it difficult to reliably track such rows in a storage-efficient manner. At lower thresholds, academic trackers such as Graphene require prohibitive SRAM overheads (hundreds of KBs to MBs). Recent in-DRAM trackers from industry, such as DSAC-TRR, perform approximate tracking, sacrificing guaranteed protection for reduced storage overheads and leaving DRAM vulnerable to Rowhammer attacks. Ideally, we seek a scalable tracker that tracks securely and precisely, incurs negligible dedicated SRAM and performance overheads, and can still track arbitrarily low thresholds. To that end, we propose START - a Scalable Tracker for Any Rowhammer Threshold. Rather than relying on dedicated SRAM structures, START dynamically repurposes a small fraction of the Last-Level Cache (LLC) to store tracking metadata. START is based on the observation that while the memory contains millions of rows, typical workloads touch only a small subset of rows within a refresh period of 64ms, so allocating tracking entries on demand significantly reduces storage. If the application does not access many rows in memory, START does not reserve any LLC capacity. Otherwise, START dynamically uses 1-way, 2-way, or 8-way of the cache set based on demand. START consumes, on average, 9.4% of the LLC capacity to store metadata, which is 5X lower compared to dedicating a counter in the LLC for each row in memory. We also propose START-M, a memory-mapped START for large-memory systems. Our designs require only 4KB of SRAM for newly added structures and perform within 1% of idealized tracking even at a TRH of less than 100.
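A behavioural sketch of the demand-based allocation idea follows; this is a software model only (the threshold value is illustrative, and the actual design stores counters in repurposed LLC ways rather than a hash map).

class DemandTracker:
    """Allocate a counter only for rows actually activated in the current window."""
    def __init__(self, threshold=4800):
        self.threshold = threshold
        self.counters = {}                    # row -> activation count, on demand

    def activate(self, row):
        self.counters[row] = self.counters.get(row, 0) + 1
        if self.counters[row] >= self.threshold:
            self.counters[row] = 0
            return True                       # mitigate: refresh the victim rows
        return False

    def end_refresh_window(self):
        self.counters.clear()                 # storage reclaimed every 64ms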
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
Authors: Zijia Lu, Ehsan Elhamifar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, which suffers from high computational cost and cannot capture action dependencies well over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments, performs temporal modeling on the frame and action levels in parallel, and maintains a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses a transformer to learn action-level dependencies with a small set of action tokens, and (iii) cross-attentions to allow communication between the two branches. We apply and extend a set-prediction objective to allow each action token to represent one or multiple action segments, thus avoiding the need to learn a large number of tokens over long videos with many segments. Thanks to the design of our action branch, we can also seamlessly leverage textual transcripts of videos (when available) to help action segmentation by using them to initialize the action tokens. We evaluate our model on four video datasets (two egocentric and two third-person) for action segmentation with and without transcripts, showing that BIT significantly improves state-of-the-art accuracy with much lower computational cost (30 times faster) compared to existing transformer-based methods.
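A minimal sketch of the cross-attention step in which learned action tokens query frame features is given below; the tensor sizes and the single attention layer are illustrative, whereas the actual model interleaves frame convolutions, an action transformer, and bidirectional cross-attentions.

import torch
import torch.nn as nn

num_tokens, dim, T = 20, 256, 1000               # assumed sizes
action_tokens = nn.Parameter(torch.randn(1, num_tokens, dim))
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

frame_feats = torch.randn(2, T, dim)             # (batch, frames, dim) from the frame branch
tokens = action_tokens.expand(2, -1, -1)         # one token set per video
# Each action token aggregates evidence from all frames, so the attention cost
# is O(num_tokens * T) rather than the O(T^2) of frame-level transformers.
segments, _ = cross_attn(query=tokens, key=frame_feats, value=frame_feats)
print(segments.shape)                            # torch.Size([2, 20, 256])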
Uncertainty-driven Affordance Discovery for Efficient Robotics Manipulation
Authors: Pietro Mazzaglia, Taco Cohen, Daniel Dijkman
Abstract
Robotics affordances, providing information about what actions can be taken in a given situation, can aid robotics manipulation. However, learning about affordances requires expensive large annotated datasets of interactions or demonstrations. In this work, we show that active learning can mitigate this problem and propose the use of uncertainty to drive an interactive affordance discovery process. We show that our method enables the efficient discovery of visual affordances for several action primitives, such as grasping, stacking objects, or opening drawers, strongly improving data efficiency and allowing us to learn grasping affordances on a real-world setup with an xArm 6 robot arm in a small number of trials.
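One common way to realize such uncertainty-driven discovery is ensemble disagreement; the sketch below is a generic acquisition rule under that assumption, not necessarily the authors' exact estimator.

import numpy as np

def next_interaction(models, candidates, observation):
    """Pick the candidate action location where an ensemble of affordance
    predictors disagrees most (highest predictive variance)."""
    preds = np.stack([m(observation, candidates) for m in models])  # (M, N) success probs
    uncertainty = preds.var(axis=0)               # epistemic proxy: ensemble variance
    return candidates[int(uncertainty.argmax())]  # most informative action to try next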
Abstract
A Markov decision process can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning, as evidenced by their presence in the Bellman equations. In our inquiry into the various kinds of ``costs'' associated with reinforcement learning, inspired by the demands of robotic applications, we find that rewards are central to understanding the structure of a Markov decision process and that reward-centric notions can elucidate important concepts in reinforcement learning. Specifically, we studied the sample complexity of policy evaluation and developed a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value. Under the online regret minimization setting, we refined the transition-based MDP constant, the diameter, into a reward-based constant, the maximum expected hitting cost, and with it provided a theoretical explanation for how a well-known technique, potential-based reward shaping, can accelerate learning with expert knowledge. In an attempt to study safe reinforcement learning, we modeled hazardous environments with irrecoverability and proposed a quantitative notion of safe learning via reset efficiency. In this setting, we modified a classic algorithm to account for resets, achieving promising preliminary numerical results. Lastly, for MDPs with multiple reward functions, we developed a computationally efficient planning algorithm that finds Pareto-optimal stochastic policies.
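For reference, potential-based reward shaping (Ng et al., 1999) replaces the reward $r$ with the shaped reward

$\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s),$

where $\Phi$ is an arbitrary potential function over states. This transformation provably preserves the set of optimal policies, and a potential encoding expert knowledge can lower the maximum expected hitting cost and thereby accelerate learning.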
Optimal Economic Gas Turbine Dispatch with Deep Reinforcement Learning
Authors: Manuel Sage, Martin Staniszewski, Yaoyao Fiona Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract
Dispatching strategies for gas turbines (GTs) are changing in modern electricity grids. A growing share of intermittent renewable energy requires GTs to operate in more, but shorter, cycles and more frequently at partial loads. Deep reinforcement learning (DRL) has recently emerged as a tool that can cope with this development and dispatch GTs economically. The key advantages of DRL are model-free optimization and the ability to handle uncertainties, such as those introduced by varying loads or renewable energy production. In this study, three popular DRL algorithms are implemented for an economic GT dispatch problem on a case study in Alberta, Canada. We highlight the benefits of DRL by incorporating an existing thermodynamic software provided by Siemens Energy into the environment model and by simulating uncertainty via varying electricity prices, loads, and ambient conditions. Among the tested algorithms and baseline methods, Deep Q-Networks (DQN) obtained the highest rewards, while Proximal Policy Optimization (PPO) was the most sample-efficient. We further propose and implement a method to assign GT operation and maintenance costs dynamically based on operating hours and cycles. Compared to existing methods, our approach better approximates the true cost of modern GT dispatch and hence leads to more realistic policies.
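A sketch of the kind of dynamic operation-and-maintenance cost assignment described above follows; the cost coefficients are hypothetical placeholders rather than the paper's calibrated values.

def om_cost(hours_run, num_starts, c_hour=25.0, c_start=800.0):
    """Charge maintenance in proportion to accumulated operating hours and
    start/stop cycles, so frequent cycling is penalized realistically."""
    return c_hour * hours_run + c_start * num_starts

# Two 4h runs incur more O&M than one continuous 8h run, reflecting cycling wear:
print(om_cost(8.0, 2), "vs", om_cost(8.0, 1))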
Monus semantics in vector addition systems with states
Authors: Pascal Baumann, Khushraj Madnani, Filip Mazowiecki, Georg Zetzsche
Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL)
Abstract
Vector addition systems with states (VASS) are a popular model for concurrent systems. However, many decision problems have prohibitively high complexity. Therefore, it is sometimes useful to consider overapproximating semantics in which these problems can be decided more efficiently. We study an overapproximation, called monus semantics, that slightly relaxes the semantics of decrements: A key property of vector addition systems is that in order to decrement a counter, this counter must have a positive value. In contrast, our semantics allows decrements of zero-valued counters: If such a transition is executed, the counter just remains zero. It turns out that if only a subset of transitions is used with monus semantics (and the others with classical semantics), then reachability is undecidable. However, we show that if monus semantics is used throughout, reachability remains decidable. In particular, we show that reachability for VASS with monus semantics is as hard as that of classical VASS (i.e. Ackermann-hard), while zero-reachability and coverability are easier (i.e. EXPSPACE-complete and NP-complete, respectively). We provide a comprehensive account of the complexity of the general reachability problem, reachability of zero configurations, and coverability under monus semantics. We study these problems in general VASS, two-dimensional VASS, and one-dimensional VASS, with unary and binary counter updates.
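The relaxed update is simply truncated (monus) subtraction; a small sketch contrasting the two semantics, with configurations as counter tuples:

def step_classical(config, delta):
    """Classical VASS: the transition is enabled only if no counter goes negative."""
    new = tuple(c + d for c, d in zip(config, delta))
    return new if all(c >= 0 for c in new) else None   # None: transition blocked

def step_monus(config, delta):
    """Monus semantics: decrementing a zero-valued counter leaves it at zero."""
    return tuple(max(0, c + d) for c, d in zip(config, delta))

print(step_classical((0, 2), (-1, 1)))   # None -- blocked classically
print(step_monus((0, 2), (-1, 1)))       # (0, 3) -- allowed under monus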
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Authors: Samuel Horvath, Stefanos Laskaridis, Shashank Rajput, Hongyi Wang
Abstract
Deep Neural Networks (DNNs) have been a large driver and enabler of AI breakthroughs in recent years. These models have been getting larger in their attempt to become more accurate and tackle new upcoming use-cases, including AR/VR and intelligent assistants. However, the training of such large models is costly and time-consuming, and typically yields a single model to fit all targets. To mitigate this, various techniques have been proposed in the literature, including pruning, sparsification, or quantization of the model weights and updates. While able to achieve high compression rates, these often incur computational overheads or accuracy penalties. Alternatively, factorization methods have been leveraged to incorporate low-rank compression into the training process. However, such techniques (e.g., SVD) frequently rely on the computationally expensive decomposition of layers and are potentially sub-optimal for non-linear models, such as DNNs. In this work, we take a further step in designing efficient low-rank models and propose Maestro, a framework for trainable low-rank layers. Instead of regularly applying a priori decompositions such as SVD, the low-rank structure is built into the training process through a generalized variant of Ordered Dropout. This method imposes an importance ordering via sampling on the decomposed DNN structure. Our theoretical analysis demonstrates that our method recovers the SVD decomposition of a linear mapping on uniformly distributed data and PCA for linear autoencoders. We further apply our technique to DNNs and empirically illustrate that Maestro enables the extraction of lower-footprint models that preserve model performance while allowing for a graceful accuracy-latency tradeoff for deployment to devices of different capabilities.
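A minimal sketch of ordered-dropout-style rank sampling on a factorized linear layer appears below; the uniform rank distribution and layer sizes are illustrative, and Maestro's generalized variant with importance ordering is richer than this.

import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """y = x V[:, :r] U[:, :r]^T with a randomly truncated rank r per training step,
    so the leading factors are pushed to capture the most important directions."""
    def __init__(self, d_in, d_out, max_rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, max_rank) / max_rank**0.5)
        self.V = nn.Parameter(torch.randn(d_in, max_rank) / max_rank**0.5)

    def forward(self, x):
        r = torch.randint(1, self.U.shape[1] + 1, ()).item() if self.training \
            else self.U.shape[1]
        return x @ self.V[:, :r] @ self.U[:, :r].T

layer = LowRankLinear(128, 64, max_rank=32)
print(layer(torch.randn(4, 128)).shape)          # torch.Size([4, 64])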
Auto-Prompting SAM for Mobile Friendly 3D Medical Image Segmentation
Authors: Chengyin Li, Prashant Khanduri, Yao Qiang, Rafi Ibn Sultan, Indrin Chetty, Dongxiao Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
The Segment Anything Model (SAM) has rapidly been adopted for segmenting a wide range of natural images. However, recent studies have indicated that SAM exhibits subpar performance on 3D medical image segmentation tasks. In addition to the domain gaps between natural and medical images, disparities in the spatial arrangement between 2D and 3D images, the substantial computational burden that demands powerful GPU servers, and the time-consuming manual prompt generation impede the extension of SAM to a broader spectrum of medical image segmentation applications. To address these challenges, in this work, we introduce a novel method, AutoSAM Adapter, designed specifically for 3D multi-organ CT-based segmentation. We employ parameter-efficient adaptation techniques to develop an automatic prompt learning paradigm that facilitates transferring the SAM model's capabilities to 3D medical image segmentation, eliminating the need for manually generated prompts. Furthermore, we effectively transfer the acquired knowledge of the AutoSAM Adapter to other lightweight models specifically tailored for 3D medical image analysis, achieving state-of-the-art (SOTA) performance on medical image segmentation tasks. Through extensive experimental evaluation, we demonstrate that the AutoSAM Adapter serves as a critical foundation for effectively leveraging the emerging capabilities of foundation models in 2D natural image segmentation for 3D medical image segmentation.
Entropy-based Guidance of Deep Neural Networks for Accelerated Convergence and Improved Performance
Authors: Mackenzie J. Meni, Ryan T. White, Michael Mayo, Kevin Pilkiewicz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
Neural networks have dramatically increased our capacity to learn from large, high-dimensional datasets across innumerable disciplines. However, their decisions are not easily interpretable, their computational costs are high, and building and training them are uncertain processes. To add structure to these efforts, we derive new mathematical results to efficiently measure the changes in entropy as fully-connected and convolutional neural networks process data, and introduce entropy-based loss terms. Experiments in image compression and image classification on benchmark datasets demonstrate these losses guide neural networks to learn rich latent data representations in fewer dimensions, converge in fewer training epochs, and achieve better test metrics.
Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation
Abstract
Graph Neural Network (GNN) training and inference pose significant scalability challenges with respect to both model size and number of layers, resulting in degraded efficiency and accuracy for large and deep GNNs. We present an end-to-end solution that aims to address these challenges for efficient GNNs in resource-constrained environments while avoiding the oversmoothing problem in deep GNNs. We introduce a quantization-based approach for all stages of GNNs, from message passing in training to node classification, compressing the model and enabling efficient processing. The proposed GNN quantizer learns quantization ranges and reduces the model size with comparable accuracy even under low-bit quantization. To scale with the number of layers, we devise a message propagation mechanism in training that controls layer-wise changes of similarities between neighboring nodes. This objective is incorporated into a Lagrangian function with constraints, and a differential multiplier method is utilized to iteratively find optimal embeddings. This mitigates oversmoothing and suppresses the quantization error to a bound. Significant improvements are demonstrated over state-of-the-art quantization methods and deep GNN approaches in both full-precision and quantized models. The proposed quantizer demonstrates superior performance in INT2 configurations across all stages of the GNN, achieving a notable level of accuracy. In contrast, existing quantization approaches fail to generate satisfactory accuracy levels. Finally, inference with INT2 and INT4 representations exhibits speedups of 5.11$\times$ and 4.70$\times$ compared to the full-precision counterparts, respectively.
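A sketch of a quantizer with a learnable clipping range and a straight-through estimator, in the spirit of the learned quantization ranges mentioned above; the bit-width and parameterization are illustrative.

import torch
import torch.nn as nn

class LearnedRangeQuant(nn.Module):
    """Fake-quantize activations to 2**bits levels inside a trainable range [0, alpha]."""
    def __init__(self, bits=2, init_alpha=6.0):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.alpha = nn.Parameter(torch.tensor(init_alpha))   # learned clip value

    def forward(self, x):
        x = x.clamp(min=0).minimum(self.alpha)                # clip into [0, alpha]
        scale = self.alpha / self.levels
        q = torch.round(x / scale) * scale
        return x + (q - x).detach()    # straight-through: quantize forward only

q = LearnedRangeQuant(bits=2)
print(q(torch.randn(5) * 3))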
Robust topology optimisation of lattice structures with spatially correlated uncertainties
Authors: Ismael Ben-Yelun, Ahmet Oguzhan Yuksel, Fehmi Cirak
Abstract
The uncertainties in material and other properties of structures are usually spatially correlated. We introduce an efficient technique for representing and processing spatially correlated random fields in robust topology optimisation of lattice structures. Robust optimisation considers the statistics of the structural response to obtain a design whose performance is less sensitive to the specific realisation of the random field. We represent Gaussian random fields on lattices by leveraging the established link between random fields and stochastic partial differential equations (SPDEs). It is known that the precision matrix, i.e. the inverse of the covariance matrix, of a random field with Mat\'ern covariance is equal to the finite element stiffness matrix of a possibly fractional PDE with a second-order elliptic operator. We consider the discretisation of the PDE on the lattice to obtain a random field which, by design, considers its geometry and connectivity. The so-obtained random field can be interpreted as a physics-informed prior by the hypothesis that the elliptic SPDE models the physical processes occurring during manufacturing, like heat and mass diffusion. Although the proposed approach is general, we demonstrate its application to lattices modelled as pin-jointed trusses with uncertainties in member Young's moduli. We consider as a cost function the weighted sum of the expectation and standard deviation of the structural compliance. To compute the expectation and standard deviation and their gradients with respect to member cross-sections we use a first-order Taylor series approximation. The cost function and its gradient are computed using only sparse matrix operations. We demonstrate the efficiency of the proposed approach using several lattice examples with isotropic, anisotropic and non-stationary random fields and up to eighty thousand random and optimisation variables.
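The link invoked above is the classical Whittle--Mat\'ern identity: a Gaussian field $u$ with Mat\'ern covariance is, in distribution, the solution of the fractional SPDE

$(\kappa^2 - \Delta)^{\alpha/2} u = \mathcal{W} \quad \text{in } \Omega,$

where $\mathcal{W}$ is spatial Gaussian white noise, $\kappa$ sets the correlation length and $\alpha$ the smoothness. Discretising this elliptic operator on the lattice therefore yields a sparse precision matrix with the structure of a finite element stiffness matrix, which is what makes sampling and optimisation tractable.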
Streaming Compression of Scientific Data via weak-SINDy
Authors: Benjamin P. Russo, M. Paul Laiu, Richard Archibald
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Abstract
In this paper, a streaming weak-SINDy algorithm is developed specifically for compressing streaming scientific data. The production of scientific data, either via simulation or experiments, is undergoing a stage of exponential growth, which makes data compression important and often necessary for storing and utilizing large scientific data sets. As opposed to classical ``offline'' compression algorithms that perform compression on a readily available data set, streaming compression algorithms compress data ``online'' while the data generated from simulation or experiments is still flowing through the system. This feature makes streaming compression algorithms well-suited for scientific data compression, where storing the full data set offline is often infeasible. This work proposes a new streaming compression algorithm, streaming weak-SINDy, which takes advantage of the underlying data characteristics during compression. The streaming weak-SINDy algorithm constructs feature matrices and target vectors in the online stage via a streaming integration method in a memory-efficient manner. The feature matrices and target vectors are then used in the offline stage to build a model through a regression process that aims to recover equations that govern the evolution of the data. For compressing high-dimensional streaming data, we adopt a streaming proper orthogonal decomposition (POD) process to reduce the data dimension and then use the streaming weak-SINDy algorithm to compress the temporal data of the POD expansion. We propose modifications to the streaming weak-SINDy algorithm to accommodate the dynamically updated POD basis. By combining the built model from the streaming weak-SINDy algorithm and a small number of data samples, the full data flow can be reconstructed accurately at a low memory cost, as shown in the numerical tests.
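A toy sketch of the online stage follows: the weak-form feature matrix and target vector are accumulated one sample at a time via a running quadrature, so no full trajectory is ever stored. The rectangle-rule weights, monomial feature, and sine test function are illustrative placeholders.

import numpy as np

class StreamingWeakSindy:
    """Accumulate G and b online; G a = b is solved offline, with
    G_ij = int phi_i(t) theta_j(u) dt and b_i = -int phi_i'(t) u dt
    (integration by parts moves the derivative onto the test functions)."""
    def __init__(self, test_fns, test_dfns, features):
        self.phi, self.dphi, self.theta = test_fns, test_dfns, features
        self.G = np.zeros((len(test_fns), len(features)))
        self.b = np.zeros(len(test_fns))

    def push(self, t, u, dt):                      # one streamed sample, then discarded
        for i, (p, dp) in enumerate(zip(self.phi, self.dphi)):
            self.b[i] += -dp(t) * u * dt
            for j, th in enumerate(self.theta):
                self.G[i, j] += p(t) * th(u) * dt

    def fit(self):
        return np.linalg.lstsq(self.G, self.b, rcond=None)[0]

# Example: stream u(t) = exp(t) on [0, 1] and recover du/dt = 1.0 * u.
ws = StreamingWeakSindy(
    test_fns=[lambda t: np.sin(np.pi * t)],        # vanishes at both endpoints
    test_dfns=[lambda t: np.pi * np.cos(np.pi * t)],
    features=[lambda u: u],
)
for k in range(1000):
    ws.push(k / 1000, np.exp(k / 1000), dt=1e-3)
print(ws.fit())   # approximately [1.0]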
CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction
Authors: Umar Khalid, Hasan Iqbal, Saeed Vahidian, Jing Hua, Chen Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Human-robot interaction (HRI) is a rapidly growing field that encompasses social and industrial applications. Machine learning plays a vital role in industrial HRI by enhancing the adaptability and autonomy of robots in complex environments. However, data privacy is a crucial concern in the interaction between humans and robots, as companies need to protect sensitive data while machine learning algorithms require access to large datasets. Federated Learning (FL) offers a solution by enabling the distributed training of models without sharing raw data. Despite extensive research on FL for tasks such as natural language processing (NLP) and image classification, the question of how to use FL for HRI remains an open research problem. The traditional FL approach involves transmitting large neural network parameter matrices between the server and clients, which can lead to high communication costs and often becomes a bottleneck in FL. This paper proposes a communication-efficient FL framework for human-robot interaction (CEFHRI) to address the challenges of data heterogeneity and communication costs. The framework leverages pre-trained models and introduces a trainable spatiotemporal adapter for video understanding tasks in HRI. Experimental results on three human-robot interaction benchmark datasets (HRI30, InHARD, and COIN) demonstrate the superiority of CEFHRI over full fine-tuning in terms of communication costs. The proposed methodology provides a secure and efficient approach to HRI federated learning, particularly in industrial environments with data privacy concerns and limited communication bandwidth. Our code is available at https://github.com/umarkhalidAI/CEFHRI-Efficient-Federated-Learning.
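The communication saving comes from exchanging only the small adapter, as in the sketch below; the parameter-name filter is illustrative, and the real framework places spatiotemporal adapters inside a pre-trained video backbone.

import torch

def client_update(model, loader, loss_fn, lr=1e-3):
    """Train only parameters whose name contains 'adapter'; the backbone stays frozen."""
    adapter_params = [p for n, p in model.named_parameters() if "adapter" in n]
    opt = torch.optim.SGD(adapter_params, lr=lr)
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Only the adapter weights travel to the server -- a tiny fraction of the model.
    return {n: p.detach().clone() for n, p in model.named_parameters() if "adapter" in n}

def server_aggregate(client_states):
    """Plain FedAvg over the adapter tensors."""
    return {k: torch.stack([s[k] for s in client_states]).mean(0)
            for k in client_states[0]}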
Reprogramming under constraints: Revisiting efficient and reliable transferability of lottery tickets
Abstract
In the era of foundation models with huge pre-training budgets, downstream tasks have shifted to the narrative of efficient and fast adaptation. For classification-based tasks in the domain of computer vision, the two most efficient approaches have been linear probing (LP) and visual prompting/reprogramming (VP); the former aims to learn a classifier in the form of a linear head on the features extracted by the pre-trained model, while the latter maps the input data to the domain of the source data on which the model was originally pre-trained. Although extensive studies have demonstrated the differences between LP and VP in terms of downstream performance, we explore the capabilities of the two aforementioned methods along the sparsity axis: (a) Data sparsity: the impact of few-shot adaptation, and (b) Model sparsity: the impact of lottery tickets (LT). We demonstrate that LTs are not universal reprogrammers, i.e., for certain target datasets, reprogramming an LT yields significantly lower performance than the reprogrammed dense model, although their corresponding upstream performance is similar. Further, we demonstrate that the calibration of dense models is always superior to that of their lottery ticket counterparts under both LP and VP regimes. Our empirical study opens a new avenue of research into VP for sparse models and encourages further understanding of the performance beyond the accuracy achieved by VP under constraints of sparsity. Code and logs can be accessed at \url{https://github.com/landskape-ai/Reprogram_LT}.
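For concreteness, a minimal visual-prompting sketch: a trainable border is added to every input while the (possibly sparse) pre-trained model stays frozen. The image size and prompt width are illustrative.

import torch
import torch.nn as nn

class VisualPrompt(nn.Module):
    """Learn an additive frame of width pad around the image; only the prompt
    (plus, for LP, a linear head) is trained."""
    def __init__(self, image_size=224, pad=16):
        super().__init__()
        mask = torch.zeros(1, 3, image_size, image_size)
        mask[..., :pad, :] = 1
        mask[..., -pad:, :] = 1
        mask[..., :, :pad] = 1
        mask[..., :, -pad:] = 1
        self.register_buffer("mask", mask)
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x):
        return x + self.delta * self.mask     # perturb only the border pixels

prompt = VisualPrompt()
print(prompt(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 3, 224, 224])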
Distributed multi-agent target search and tracking with Gaussian process and reinforcement learning
Abstract
Deploying multiple robots for target search and tracking has many practical applications, yet the challenge of planning over unknown or partially known targets remains difficult to address. With recent advances in deep learning, intelligent control techniques such as reinforcement learning have enabled agents to learn autonomously from environment interactions with little to no prior knowledge. Such methods can address the exploration-exploitation tradeoff of planning over unknown targets in a data-driven manner, eliminating the reliance on heuristics typical of traditional approaches and streamlining the decision-making pipeline with end-to-end training. In this paper, we propose a multi-agent reinforcement learning technique with target map building based on distributed Gaussian processes. We leverage the distributed Gaussian process to encode belief over the target locations and efficiently plan over unknown targets. We evaluate the performance and transferability of the trained policy in simulation and demonstrate the method on a swarm of micro unmanned aerial vehicles with hardware experiments.
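A sketch of the GP target-belief map described above, with scikit-learn standing in for the distributed implementation; the kernel, grid, and UCB-style waypoint rule are illustrative choices.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Sparse target observations collected by the agents: (x, y) -> detection score.
X_obs = np.array([[0.1, 0.2], [0.8, 0.5], [0.4, 0.9]])
y_obs = np.array([0.0, 1.0, 0.2])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3)).fit(X_obs, y_obs)

# Belief and uncertainty over a search grid: the mean guides exploitation,
# the standard deviation guides exploration of unvisited regions.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 25),
                            np.linspace(0, 1, 25)), -1).reshape(-1, 2)
mean, std = gp.predict(grid, return_std=True)
print(grid[np.argmax(mean + std)])    # simple UCB-style next waypoint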
Generative Model for Models: Rapid DNN Customization for Diverse Tasks and Resource Constraints
Authors: Wenxing Xu, Yuanchun Li, Jiacheng Liu, Yi Sun, Zhengyang Cao, Yixuan Li, Hao Wen, Yunxin Liu
Abstract
Unlike cloud-based deep learning models that are often large and uniform, edge-deployed models usually demand customization for domain-specific tasks and resource-limited environments. Such customization processes can be costly and time-consuming due to the diversity of edge scenarios and the training load for each scenario. Although various approaches have been proposed for rapid resource-oriented customization and task-oriented customization respectively, achieving both of them at the same time is challenging. Drawing inspiration from generative AI and the modular composability of neural networks, we introduce NN-Factory, a one-for-all framework to generate customized lightweight models for diverse edge scenarios. The key idea is to use a generative model to directly produce the customized models, instead of training them. The main components of NN-Factory include a modular supernet with pretrained modules that can be conditionally activated to accomplish different tasks and a generative module assembler that manipulates the modules according to task and sparsity requirements. Given an edge scenario, NN-Factory can efficiently customize a compact model specialized for the edge task while satisfying the edge resource constraints by searching for the optimal strategy to assemble the modules. Based on experiments on image classification and object detection tasks with different edge devices, NN-Factory is able to generate high-quality task- and resource-specific models within a few seconds, faster than conventional model customization approaches by orders of magnitude.
PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
Abstract
We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, and can capture a text with a complex shape by varying the polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, while polygon-points-based methods need to utilize a different number of points. 2) It can distinguish adjacent or overlapping texts as they have clearly different curve coefficients, while segmentation-based or points-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer, which can directly generate smooth text contours sampled from predicted curves without interpolation. A parameter-free cross-scale pixel attention (CPA) module is employed to highlight the feature map of a suitable scale while suppressing the other feature maps. This simple operation can help detect small-scale texts and is compatible with the one-stage DETR framework, where no NMS postprocessing is required. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces the piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to the previous state-of-the-art text detectors on arbitrary-shaped text datasets.
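A sketch of the Polynomial Band idea: low-degree polynomials bound the text region and contour points are sampled directly from the curves. Only the top and bottom curves are modelled here for brevity (the full representation also fits the left and right sides), and the coefficients are made up for illustration.

import numpy as np

def pb_contour(top, bottom, x0, x1, n=20):
    """top/bottom: polynomial coefficients in numpy.polyval order.
    Returns a closed contour sampled from the two horizontal curves."""
    xs = np.linspace(x0, x1, n)
    upper = np.stack([xs, np.polyval(top, xs)], axis=1)
    lower = np.stack([xs[::-1], np.polyval(bottom, xs[::-1])], axis=1)
    return np.concatenate([upper, lower])    # smooth contour, no interpolation step

# A gently curved text instance: quadratic top and bottom curves.
contour = pb_contour(top=[0.002, -0.5, 60], bottom=[0.002, -0.5, 90], x0=0, x1=300)
print(contour.shape)                          # (40, 2)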
Fast immersed boundary method based on weighted quadrature
Authors: Benjamin Marussig, René Hiemstra, Dominik Schillinger
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Combining sum factorization, weighted quadrature, and row-based assembly enables efficient higher-order computations for tensor product splines. We aim to transfer these concepts to immersed boundary methods, which perform simulations on a regular background mesh cut by a boundary representation that defines the domain of interest. Therefore, we present a novel concept to divide the support of cut basis functions to obtain regular parts suited for sum factorization. These regions require special discontinuous weighted quadrature rules, while Gauss-like quadrature rules integrate the remaining support. Two linear elasticity benchmark problems confirm the derived estimate for the computational costs of the different integration routines and their combination. Although the presence of cut elements reduces the speed-up, its contribution to the overall computation time declines with h-refinement.
R^3: On-device Real-Time Deep Reinforcement Learning for Autonomous Robotics
Authors: Zexin Li, Aritra Samanta, Yufei Li, Andrea Soltoggio, Hyoseung Kim, Cong Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract
Autonomous robotic systems, like autonomous vehicles and robotic search and rescue, require efficient on-device training for continuous adaptation of Deep Reinforcement Learning (DRL) models in dynamic environments. This research is fundamentally motivated by the need to understand and address the challenges of on-device real-time DRL, which involves balancing timing and algorithm performance under memory constraints, as exposed through our extensive empirical studies. This intricate balance requires co-optimizing two pivotal parameters of DRL training -- batch size and replay buffer size. Configuring these parameters significantly affects timing and algorithm performance, while both (unfortunately) require substantial memory allocation to achieve near-optimal performance. This paper presents R^3, a holistic solution for managing timing, memory, and algorithm performance in on-device real-time DRL training. R^3 employs (i) a deadline-driven feedback loop with dynamic batch sizing for optimizing timing, (ii) efficient memory management to reduce memory footprint and allow larger replay buffer sizes, and (iii) a runtime coordinator guided by heuristic analysis and a runtime profiler for dynamically adjusting memory resource reservations. These components collaboratively tackle the trade-offs in on-device DRL training, improving timing and algorithm performance while minimizing the risk of out-of-memory (OOM) errors. We implemented and evaluated R^3 extensively across various DRL frameworks and benchmarks on three hardware platforms commonly adopted by autonomous robotic systems. Additionally, we integrate R^3 with a popular realistic autonomous car simulator to demonstrate its real-world applicability. Evaluation results show that R^3 achieves efficacy across diverse platforms, ensuring consistent latency performance and timing predictability with minimal overhead.
Motion Priority Optimization Framework towards Automated and Teleoperated Robot Cooperation in Industrial Recovery Scenarios
Abstract
In this study, we present an optimization framework for efficient motion priority design between automated and teleoperated robots in an industrial recovery scenario. Although robots have recently become increasingly common in industrial sites, there are still challenges in achieving human-robot collaboration/cooperation (HRC), where human workers and robots are engaged in collaborative and cooperative tasks in a shared workspace. For example, the corresponding factory cell must be suspended for safety when an industrial robot drops an assembly part in the workspace. After that, a human worker is allowed to enter the robot workspace to address the robot recovery. This process causes non-continuous manufacturing, which leads to a productivity reduction. Recently, robotic teleoperation technology has emerged as a promising solution that enables people to perform tasks remotely and safely. This technology can be used in the recovery process in manufacturing failure scenarios. Our proposition involves the design of an appropriate priority function that aids in collision avoidance between the manufacturing and recovery robots and facilitates continuous processes with minimal production loss within an acceptable risk level. This paper presents a framework, including an HRC simulator and an optimization formulation, for finding optimal parameters of the priority function. Through quantitative and qualitative experiments, we provide a proof of our novel concept and demonstrate its feasibility.
Better Prefix Authentication
Authors: Aljoscha Meyer
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Abstract
We present new schemes for solving prefix authentication and secure relative timestamping. By casting a new light on antimonotone linking schemes, we improve upon the state of the art in prefix authentication, and in timestamping with rounds of bounded length. Our designs can serve as more efficient alternatives to certificate transparency logs.
Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli
Authors: Robin Müller, Willi Meier, Christoph F. Wildfeuer
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Performance (cs.PF)
Abstract
Modular reduction is a crucial operation in many post-quantum cryptographic schemes, including the Kyber key exchange method and the Dilithium signature scheme. However, it can be computationally expensive and pose a performance bottleneck in hardware implementations. To address this issue, we propose a novel approach for computing modular reduction efficiently in hardware for arbitrary static moduli. Unlike other commonly used methods such as Barrett or Montgomery reduction, the method does not require any multiplications. It does not depend on properties of any particular choice of modulus for good performance and low area consumption. Its major strength lies in its low area consumption, which is 60% lower than optimized and up to 90% lower than generic Barrett implementations for Kyber and Dilithium. Additionally, it is well suited for parallelization and pipelining and scales linearly in hardware resource consumption with increasing operation width. All operations can be performed in the bit-width of the modulus, rather than the size of the number being reduced. This shortens carry chains and allows for faster clocking. Moreover, our method can be executed in constant time, which is essential for cryptographic applications where timing attacks can be used to obtain information about the secret key.
Learning to Upsample by Learning to Sample
Authors: Wenze Liu, Hao Lu, Hongtao Fu, Zhiguo Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We present DySample, an ultra-lightweight and effective dynamic upsampler. While impressive performance gains have been witnessed from recent kernel-based dynamic upsamplers such as CARAFE, FADE, and SAPA, they introduce considerable computational overhead, mostly due to the time-consuming dynamic convolution and the additional sub-network used to generate dynamic kernels. Further, the need of FADE and SAPA for high-resolution feature guidance somewhat limits their application scenarios. To address these concerns, we bypass dynamic convolution and formulate upsampling from the perspective of point sampling, which is more resource-efficient and can be easily implemented with a standard built-in function in PyTorch. We first showcase a naive design, and then demonstrate how to strengthen its upsampling behavior step by step towards our new upsampler, DySample. Compared with former kernel-based dynamic upsamplers, DySample requires no customized CUDA package and has much fewer parameters, FLOPs, GPU memory, and latency. Besides the lightweight characteristics, DySample outperforms other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation. Code is available at https://github.com/tiny-smart/dysample.
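The point-sampling formulation maps directly onto a built-in PyTorch function, grid_sample; the naive sketch below uses a single conv as offset generator, far simpler than DySample's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

def naive_point_upsample(x, offset_conv, scale=2):
    """Upsample by sampling the input at learned, content-aware point locations."""
    b, c, h, w = x.shape
    H, W = h * scale, w * scale
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(b, H, W, 2)   # regular grid
    off = offset_conv(x)                                      # (b, 2, h, w) offsets
    off = F.interpolate(off, size=(H, W), mode="nearest").permute(0, 2, 3, 1)
    return F.grid_sample(x, base + 0.25 * off.tanh(), align_corners=True)

conv = nn.Conv2d(16, 2, kernel_size=3, padding=1)
print(naive_point_upsample(torch.randn(1, 16, 32, 32), conv).shape)  # (1, 16, 64, 64)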
FedChain: An Efficient and Secure Consensus Protocol based on Proof of Useful Federated Learning for Blockchain
Authors: Peiran Wang
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Blockchain has become a popular decentralized paradigm for various applications in the zero-trust environment. The core of the blockchain is the consensus protocol, which establishes consensus among all the participants. PoW (Proof-of-Work) is one of the most popular consensus protocols. However, the PoW consensus protocol, which incentivizes participants to use their computing power to solve a meaningless hash puzzle, has been repeatedly criticized as energy-wasting. To address these issues, we propose an efficient and secure consensus protocol based on proof of useful federated learning for blockchain (called FedChain). We first propose a secure and robust blockchain architecture that takes federated learning tasks as proof of work. Then a pool aggregation mechanism is integrated to improve the efficiency of the FedChain architecture. To protect model parameter privacy for each participant within a mining pool, a secret-sharing-based ring all-reduce architecture is designed. We also introduce a data distribution-based federated learning model optimization algorithm to improve the model performance of FedChain. Finally, a zero-knowledge proof-based federated learning model verification is introduced to preserve the privacy of federated learning participants while proving the model performance of federated learning participants. Our approach has been tested and validated through extensive experiments, demonstrating its performance.
Probabilistic Dataset Reconstruction from Interpretable Models
Abstract
Interpretability is often pointed out as a key requirement for trustworthy machine learning. However, learning and releasing models that are inherently interpretable leaks information regarding the underlying training data. As such disclosure may directly conflict with privacy, a precise quantification of the privacy impact of such a breach is a fundamental problem. For instance, previous work has shown that the structure of a decision tree can be leveraged to build a probabilistic reconstruction of its training dataset, with the uncertainty of the reconstruction being a relevant metric for the information leak. In this paper, we propose a novel framework generalizing these probabilistic reconstructions in the sense that it can handle other forms of interpretable models and more generic types of knowledge. In addition, we demonstrate that under realistic assumptions regarding the interpretable models' structure, the uncertainty of the reconstruction can be computed efficiently. Finally, we illustrate the applicability of our approach on both decision trees and rule lists, by comparing the theoretical information leak associated with either exact or heuristic learning algorithms. Our results suggest that, for a given accuracy level, optimal interpretable models are often more compact and leak less information regarding their training data than greedily-built ones.
Mixup-Augmented Meta-Learning for Sample-Efficient Fine-Tuning of Protein Simulators
Authors: Jingbang Chen, Yian Wang, Xingwei Qu, Shuangjia Zheng, Yaodong Yang, Hao Dong, Jie Fu
Abstract
Molecular dynamics simulations have emerged as a fundamental instrument for studying biomolecules. At the same time, it is desirable to perform simulations of a collection of particles under various conditions in which the molecules can fluctuate. In this paper, we explore and adapt the soft prompt-based learning method to molecular dynamics tasks. Our model can remarkably generalize to unseen and out-of-distribution scenarios with limited training data. While our work focuses on temperature as a test case, the versatility of our approach allows for efficient simulation under any continuous dynamic conditions, such as pressure and volume. Our framework has two stages: 1) pre-training with a data-mixing technique that augments molecular structure data and temperature prompts, applying a curriculum learning method by smoothly increasing the mixing ratio; and 2) a meta-learning-based fine-tuning framework that improves the sample-efficiency of fine-tuning and gives soft prompt-tuning better initialization points. Comprehensive experiments reveal that our framework excels in accuracy for in-domain data and demonstrates strong generalization capabilities for unseen and out-of-distribution samples.
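For reference, the data-mixing step in stage 1 resembles standard mixup; the sketch below is a generic version in which structure features and condition prompts are mixed jointly, with placeholder tensors.

import torch

def mixup(x, prompt, alpha=0.2):
    """Convexly combine pairs of examples together with their condition prompts."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.shape[0])
    x_mix = lam * x + (1 - lam) * x[idx]             # mixed structure features
    p_mix = lam * prompt + (1 - lam) * prompt[idx]   # mixed temperature prompts
    return x_mix, p_mix

x_mix, p_mix = mixup(torch.randn(8, 64), torch.randn(8, 1))
print(x_mix.shape, p_mix.shape)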
SpikeBERT: A Language Spikformer Trained with Two-Stage Knowledge Distillation from BERT
Abstract
Spiking neural networks (SNNs) offer a promising avenue to implement deep neural networks in a more energy-efficient way. However, the network architectures of existing SNNs for language tasks are too simplistic, and deep architectures have not been fully explored, resulting in a significant performance gap compared to mainstream transformer-based networks such as BERT. To this end, we improve a recently-proposed spiking transformer (i.e., Spikformer) to make it possible to process language tasks and propose a two-stage knowledge distillation method for training it: first, pre-training by distilling knowledge from BERT with a large collection of unlabelled texts; second, fine-tuning with task-specific instances via knowledge distillation again from the BERT fine-tuned on the same training examples. Through extensive experimentation, we show that the models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve comparable results to BERT on text classification tasks for both English and Chinese with much less energy consumption.
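Both stages rely on the standard logit-distillation loss; a compact sketch (the temperature is illustrative, and any distillation of internal spiking representations is omitted):

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

print(distill_loss(torch.randn(4, 10), torch.randn(4, 10)))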
Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
Authors: Hiroyuki Ootomo, Rio Yokota
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, whose theoretical peak performance is more than 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Cores is to supply the input matrices from shared memory, which has higher bandwidth than global memory. However, the Bytes-per-Flop (B/F) ratio of the shared memory and Tensor Cores is small since the performance of Tensor Cores is high. Thus, it is important to reduce the shared memory footprint for efficient Tensor Core usage. In this paper, we analyze simple matrix-matrix multiplication on Tensor Cores using the roofline model and find that shared memory bandwidth can limit performance when using the WMMA API. To alleviate this issue, we provide a WMMA API extension library to boost the throughput of the computation, which has two components. The first allows for flexibly manipulating the array of registers input to Tensor Cores. We evaluate the performance improvement of this library. The outcome of our evaluation shows that our library reduces the shared memory footprint and speeds up the computation using Tensor Cores. The second is an API for SGEMM emulation on Tensor Cores without additional shared memory usage. We demonstrate that a single-precision emulating batch SGEMM implementation on Tensor Cores using this library achieves 54.2 TFlop/s on the A100 GPU, which outperforms the theoretical peak performance of FP32 SIMT Cores while achieving the same level of accuracy as cuBLAS. This throughput cannot be reached with the same amount of register usage without the shared memory footprint reduction performed by our library.
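The B/F imbalance can be made concrete with back-of-the-envelope numbers; the shared memory bandwidth below is an assumed figure for illustration, not a vendor specification.

# One 16x16x16 FP16 matrix-multiply-accumulate fragment:
flops = 2 * 16 * 16 * 16                  # multiply-adds counted as two flops
bytes_loaded = 2 * (16 * 16) * 2          # two input fragments, 2 bytes per element
required_bf = bytes_loaded / flops
print(required_bf)                         # 0.125 bytes per flop needed

tensor_core_peak = 312e12                  # A100 dense FP16 Tensor Core peak, flop/s
shared_mem_bw = 19e12                      # assumed aggregate shared memory bytes/s
print(shared_mem_bw / tensor_core_peak)    # ~0.06 B/F available < 0.125 required,
                                           # so shared memory becomes the bottleneck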
MCMS-RBM: Multi-Component Multi-State Reduced Basis Method toward Efficient Transition Pathway Identification for Crystals and Quasicrystals
Authors: Yajie Ji, Lijie Ji, Yanlai Chen, Zhenli Xu
Abstract
Due to quasicrystals having long-range orientational order but no translational symmetry, traditional numerical methods usually suffer when applied as is. In the past decade, the projection method has emerged as a prominent solver for quasiperiodic problems. By transforming them into higher-dimensional but periodic problems, the projection method facilitates the application of the fast Fourier transform. However, the computational complexity inevitably becomes high, which significantly impedes e.g. the generation of the phase diagram, since a high-fidelity simulation of a problem whose dimension is doubled must be performed numerous times. To address the computational challenge of quasiperiodic problems based on the projection method, this paper proposes a multi-component multi-state reduced basis method (MCMS-RBM). Featuring multiple components, each providing reduction functionality for one branch of the problem induced by one part of the parameter domain, the MCMS-RBM does not resort to the parameter domain configurations (e.g. phase diagrams) a priori. It enriches each component in a greedy fashion via a phase-transition guided exploration of the multiple states inherent to the problem. Adopting the empirical interpolation method, the resulting online-efficient method vastly accelerates the generation of a delicate phase diagram to a matter of minutes for a parametrized two-turn-four-dimensional Lifshitz-Petrich model with two length scales. Moreover, it furnishes surrogate and equally accurate field variables anywhere in the parameter domain.
Efficient Almost-Egalitarian Allocation of Goods and Bads
Authors: Israel Jacobovich, Erel Segal-Halevi
Subjects: Computer Science and Game Theory (cs.GT)
Abstract
We consider the allocation of indivisible objects among agents with different valuations, which can be positive or negative. An egalitarian allocation is an allocation that maximizes the smallest value given to an agent; finding such an allocation is NP-hard. We present a simple polynomial-time algorithm that finds an allocation that is Pareto-efficient and almost-egalitarian: each agent's value is at least his value in an egalitarian allocation, minus the absolute value of a single object. The main tool is an algorithm for rounding a fractional allocation to a discrete allocation, by which each agent loses at most one good or gains at most one chore. Our algorithm generalizes and simplifies three previous algorithms. We discuss several aspects and observations about the algorithm and the problem at hand that open doors for efficient and robust implementations.
Benchmarking the Generation of Fact Checking Explanations
Authors: Daniel Russo, Serra Sinem Tekiroglu, Marco Guerini
Abstract
Fighting misinformation is a challenging, yet crucial, task. Despite the growing number of experts being involved in manual fact-checking, this activity is time-consuming and cannot keep up with the ever-increasing amount of Fake News produced daily. Hence, automating this process is necessary to help curb misinformation. Thus far, researchers have mainly focused on claim veracity classification. In this paper, instead, we address the generation of justifications (textual explanations of why a claim is classified as either true or false) and benchmark it with novel datasets and advanced baselines. In particular, we focus on summarization approaches over unstructured knowledge (i.e. news articles) and we experiment with several extractive and abstractive strategies. We employed two datasets with different styles and structures, in order to assess the generalizability of our findings. Results show that, in justification production, summarization benefits from the claim information, and, in particular, that a claim-driven extractive step improves abstractive summarization performance. Finally, we show that although cross-dataset experiments suffer from performance degradation, a unique model trained on a combination of the two datasets is able to retain style information in an efficient manner.
Structural Node Embeddings with Homomorphism Counts
Authors: Hinrikus Wolf, Luca Oeljeklaus, Pascal Kühner, Martin Grohe
Abstract
Graph homomorphism counts, first explored by Lov\'asz in 1967, have recently garnered interest as a powerful tool in graph-based machine learning. Grohe (PODS 2020) proposed the theoretical foundations for using homomorphism counts in machine learning on graph-level as well as node-level tasks. By their very nature, these capture local structural information, which enables the creation of robust structural embeddings. While a first approach for graph-level tasks was made by Nguyen and Maehara (ICML 2020), we experimentally show the effectiveness of homomorphism-count-based node embeddings. Enriched with node labels, node weights, and edge weights, these offer an interpretable representation of graph data, allowing for enhanced explainability of machine learning models. We propose a theoretical framework for isomorphism-invariant homomorphism-count-based embeddings which lend themselves to a wide variety of downstream tasks. Our approach capitalises on the efficient computability of graph homomorphism counts for bounded treewidth graph classes, rendering it a practical solution for real-world applications. We demonstrate their expressivity through experiments on benchmark datasets. Although our results do not match the accuracy of state-of-the-art neural architectures, they are comparable to other advanced graph learning models. Remarkably, our approach demarcates itself by ensuring explainability for each individual feature. By integrating interpretable machine learning algorithms like SVMs or Random Forests, we establish a seamless, end-to-end explainable pipeline. Our study contributes to the advancement of graph-based techniques that offer both performance and interpretability.
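A brute-force sketch of the rooted homomorphism counts underlying such node embeddings appears below; it is exponential in the pattern size, whereas the actual approach exploits bounded treewidth for efficiency.

from itertools import product

def hom_count(pattern_edges, root, k, adj, node):
    """Count maps f: {0..k-1} -> V(G) with f(root) = node preserving all pattern edges."""
    count = 0
    for f in product(list(adj), repeat=k):
        if f[root] == node and all(f[v] in adj[f[u]] for u, v in pattern_edges):
            count += 1
    return count

# Embed each node by its homomorphism counts from small rooted patterns,
# e.g. a rooted triangle (both edge directions listed for an undirected graph):
tri = [(0, 1), (1, 2), (2, 0), (1, 0), (2, 1), (0, 2)]
G = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}   # a triangle with a pendant node
print([hom_count(tri, 0, 3, G, v) for v in G])     # [2, 2, 2, 0]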
MSFlow: Multi-Scale Flow-based Framework for Unsupervised Anomaly Detection
Abstract
Unsupervised anomaly detection (UAD), where only anomaly-free samples are available for training, attracts substantial research interest and drives widespread applications. Some UAD applications further aim to locate the anomalous regions without any anomaly information. Although the absence of anomalous samples and annotations deteriorates UAD performance, an inconspicuous yet powerful statistical model, the normalizing flow, is well suited for anomaly detection and localization in an unsupervised fashion. Flow-based probabilistic models, trained only on anomaly-free data, can efficiently distinguish unpredictable anomalies by assigning them much lower likelihoods than normal data. Nevertheless, the size variation of unpredictable anomalies introduces a further difficulty for flow-based methods in high-precision anomaly detection and localization. To generalize across anomaly size variation, we propose a novel Multi-Scale Flow-based framework, dubbed MSFlow, composed of asymmetrical parallel flows followed by a fusion flow that exchanges multi-scale perceptions. Moreover, different multi-scale aggregation strategies are adopted for image-wise anomaly detection and pixel-wise anomaly localization according to the discrepancy between the two tasks. The proposed MSFlow is evaluated on three anomaly detection datasets, significantly outperforming existing methods. Notably, on the challenging MVTec AD benchmark, our MSFlow achieves a new state-of-the-art with a detection AUROC score of up to 99.7%, a localization AUROC score of 98.8%, and a PRO score of 97.1%. The reproducible code is available at https://github.com/cool-xuan/msflow.
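As a rough illustration of the flow-based scoring rule described above (not the MSFlow architecture itself), here is a minimal sketch in which a toy invertible affine map stands in for a trained flow and anomalies are flagged by low likelihood; all names and shapes are ours.

```python
import torch

class AffineFlow(torch.nn.Module):
    """Toy invertible map z = (x - mu) * exp(-log_s); a stand-in for one flow block."""
    def __init__(self, dim):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(dim))
        self.log_s = torch.nn.Parameter(torch.zeros(dim))

    def log_prob(self, x):
        z = (x - self.mu) * torch.exp(-self.log_s)
        base = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        log_det = -self.log_s.sum()          # log|det dz/dx| of the affine map
        return base + log_det

flow = AffineFlow(dim=8)
# After training on anomaly-free features, anomalies receive low likelihood:
x = torch.randn(4, 8)
anomaly_score = -flow.log_prob(x)            # higher = more anomalous
print(anomaly_score)
```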
Compositional maps for registration in complex geometries
Abstract
We develop and analyze a parametric registration procedure for manifolds associated with the solutions to parametric partial differential equations in two-dimensional domains. Given the domain $\Omega \subset \mathbb{R}^2$ and the manifold $\mathcal{M}=\{ u_{\mu} : \mu\in \mathcal{P} \}$ associated with the parameter domain $\mathcal{P} \subset \mathbb{R}^P$ and the parametric field $\mu\mapsto u_{\mu} \in L^2(\Omega)$, our approach takes as input a set of snapshots from $\mathcal{M}$ and returns a parameter-dependent mapping $\Phi: \Omega \times \mathcal{P} \to \Omega$, which tracks coherent features (e.g., shocks, shear layers) of the solution field and ultimately simplifies the task of model reduction. We consider mappings of the form $\Phi=\texttt{N}(\mathbf{a})$, where $\texttt{N}:\mathbb{R}^M \to {\rm Lip}(\Omega; \mathbb{R}^2)$ is a suitable linear or nonlinear operator; then, we state the registration problem as an unconstrained optimization statement for the coefficients $\mathbf{a}$. We identify minimal requirements for the operator $\texttt{N}$ to ensure the satisfaction of the bijectivity constraint; we propose a class of compositional maps that satisfy the desired requirements and enable non-trivial deformations over curved boundaries of $\Omega$; we develop a thorough analysis of the proposed ansatz for polytopal domains and we discuss the approximation properties for general curved domains. We perform numerical experiments for a parametric inviscid transonic compressible flow past a cascade of turbine blades to illustrate the many features of the method.
On-Device Learning with Binary Neural Networks
Authors: Lorenzo Vorabbi, Davide Maltoni, Stefano Santi
Abstract
Existing Continual Learning (CL) solutions only partially address the constraints on power, memory and computation of deep learning models deployed on low-power embedded CPUs. In this paper, we propose a CL solution that embraces recent advancements in the CL field and the efficiency of Binary Neural Networks (BNNs), which use 1-bit weights and activations to execute deep learning models efficiently. We propose a hybrid quantization of CWR* (an effective CL approach) that treats the forward and backward passes differently, in order to retain more precision during the gradient update step while minimizing the latency overhead. The choice of a binary network as backbone is essential to meet the constraints of low-power devices and, to the best of the authors' knowledge, this is the first attempt to demonstrate on-device learning with BNNs. The experimental validation confirms the validity and suitability of the proposed method.
Practice of Alibaba Cloud on Elastic Resource Provisioning for Large-scale Microservices Cluster
Authors: Minxian Xu, Lei Yang, Yang Wang, Chengxi Gao, Linfeng Wen, Guoyao Xu, Liping Zhang, Kejiang Ye, Chengzhong Xu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Cloud-native architecture is becoming increasingly crucial for today's cloud computing environments due to the need for speed and flexibility in developing applications. It utilizes microservice technology to break down traditional monolithic applications into lightweight and self-contained microservice components. However, as microservices grow in scale and have dynamic inter-dependencies, they also pose new challenges in resource provisioning that cannot be fully addressed by traditional resource scheduling approaches. The various microservices with different resource needs and latency requirements can create complex calling chains, making it difficult to provide fine-grained and accurate resource allocation to each component while maintaining the overall quality of service in the chain. In this work, we aim to address the research problem of how to efficiently provision resources for the growing scale of microservice platforms and ensure the performance of latency-critical microservices. To address the problem, we present in-depth analyses of Alibaba's microservice cluster and propose optimized resource provisioning algorithms to enhance resource utilization while ensuring the latency requirement. First, we analyze the distinct features of microservices in Alibaba's cluster compared to traditional applications. Then we present Alibaba's resource capacity provisioning workflow and framework to address challenges in resource provisioning for large-scale and latency-critical microservice clusters. Finally, we propose enhanced resource provisioning algorithms over Alibaba's current practice by making both proactive and reactive scheduling decisions based on different workload patterns, which can improve resource usage by 10-15% in Alibaba's clusters while maintaining the necessary latency for microservices.
Adaptivity in Local Kernel Based Methods for Approximating the Action of Linear Operators
Abstract
Building on the successes of local kernel methods for approximating the solutions to partial differential equations (PDEs) and the evaluation of definite integrals (quadrature/cubature), a local estimate of the error in such approximations is developed. This estimate is useful for determining locations in the solution domain where increased node density (equivalently, reduced spacing between nodes) can decrease the error in the solution. An adaptive procedure for adding nodes to the domain for both the approximation of derivatives and the approximate evaluation of definite integrals is described. This method efficiently computes the error estimate at a set of prescribed points and adds new nodes for approximation where the error is too large. Computational experiments demonstrate close agreement between the error estimate and the actual absolute error in the approximation. Such methods are necessary or desirable when approximating solutions to PDEs (or integrands, in the case of quadrature/cubature) that exhibit localized features requiring significant refinement to resolve, and where uniform increases in node density across the entire computational domain are not possible or are too burdensome.
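The estimate-then-refine loop described above can be sketched generically; in the sketch below the paper's kernel-based error estimator is abstracted behind a callback, and the 1D bisection rule, tolerance, and toy estimator are illustrative assumptions of ours.

```python
import numpy as np

def refine(nodes, error_estimate, tol=1e-6, max_iter=20):
    """Generic estimate-then-refine loop: insert midpoints where the local
    error estimate exceeds tol. 1D sketch; the kernel-based estimator from
    the paper is abstracted behind `error_estimate`."""
    for _ in range(max_iter):
        errs = error_estimate(nodes)                # one estimate per interval
        bad = np.where(errs > tol)[0]
        if bad.size == 0:
            break
        mids = 0.5 * (nodes[bad] + nodes[bad + 1])  # bisect flagged intervals
        nodes = np.sort(np.concatenate([nodes, mids]))
    return nodes

# Toy estimator: pretend the local error scales with the spacing squared.
est = lambda x: np.diff(x) ** 2
print(len(refine(np.linspace(0.0, 1.0, 11), est, tol=1e-3)))
```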
Enhancing Robot Learning through Learned Human-Attention Feature Maps
Authors: Daniel Scheuchenstuhl, Stefan Ulmer, Felix Resch, Luigi Berducci, Radu Grosu
Abstract
Robust and efficient learning remains a challenging problem in robotics, in particular with complex visual inputs. Inspired by the human attention mechanism, which lets us quickly process complex visual scenes and react to changes in the environment, we posit that embedding auxiliary information about the focus point into robot learning would enhance the efficiency and robustness of the learning process. In this paper, we propose a novel approach to model and emulate human attention with an approximate prediction model. We then leverage this output and feed it as a structured auxiliary feature map into downstream learning tasks. We validate this idea by learning a prediction model from human-gaze recordings of manual driving in the real world. We test our approach on two learning tasks - object detection and imitation learning. Our experiments demonstrate that the inclusion of predicted human attention leads to improved robustness of the trained models to out-of-distribution samples and faster learning in low-data regimes. Our work highlights the potential of incorporating structured auxiliary information in representation learning for robotics and opens up new avenues for research in this direction. All code and data are available online.
IndGIC: Supervised Action Recognition under Low Illumination
Authors: Jingbo Zeng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Human action recognition in the dark is attracting increasing attention, driven by demand in surveillance, motion control and human-computer interaction. However, owing to the limitations of image enhancement methods and of low-light video datasets (e.g., labeling cost), existing methods face several problems. Some video-based approaches are effective and efficient on specific datasets but cannot generalize to most cases, while other methods using multiple sensors rely heavily on prior knowledge to deal with the noisy nature of the video stream. In this paper, we propose an action recognition method using a deep multi-input network. Furthermore, we propose Independent Gamma Intensity Correction (Ind-GIC) to enhance poorly illuminated video, generating one gamma per frame to increase enhancement performance. To show that our method is effective, we evaluate it against existing methods. Experimental results show that our model achieves high accuracy on the ARID dataset.
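The abstract does not give Ind-GIC's rule for choosing the per-frame gamma, so the sketch below only illustrates the one-gamma-per-frame idea with a simple brightness-matching heuristic of our own; the function names and target mean are assumptions.

```python
import numpy as np

def gamma_correct(frame, gamma):
    """Apply intensity correction I' = I ** (1/gamma) on a [0, 1] frame."""
    return np.clip(frame, 0.0, 1.0) ** (1.0 / gamma)

def enhance_video(frames, target_mean=0.4):
    """Per-frame gamma in the spirit of Ind-GIC: one gamma per frame.
    Here gamma is chosen so the frame mean maps roughly to target_mean
    (solving m ** (1/gamma) = target_mean); purely illustrative."""
    out = []
    for f in frames:
        m = max(f.mean(), 1e-6)
        gamma = np.log(m) / np.log(target_mean)
        out.append(gamma_correct(f, max(gamma, 1e-3)))
    return out

dark = [np.random.rand(4, 4) * 0.1 for _ in range(3)]   # underexposed frames
print([f.mean().round(2) for f in enhance_video(dark)])
```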
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
Abstract
Large language models (LLMs) have emerged as a new paradigm for the Text-to-SQL task. However, the absence of a systematic benchmark inhibits the design of effective, efficient and economic LLM-based Text-to-SQL solutions. To address this challenge, in this paper we first conduct a systematic and extensive comparison of existing prompt engineering methods, covering question representation, example selection and example organization, and with these experimental results we elaborate on their pros and cons. Based on these findings, we propose a new integrated solution, named DAIL-SQL, which refreshes the Spider leaderboard with 86.6% execution accuracy and sets a new bar. Towards an efficient and economic LLM-based Text-to-SQL solution, we emphasize token efficiency in prompt engineering and compare the prior studies under this metric. Additionally, we investigate open-source LLMs in in-context learning, and further enhance their performance with task-specific supervised fine-tuning. Our explorations highlight open-source LLMs' potential in Text-to-SQL, as well as the advantages and disadvantages of task-specific supervised fine-tuning. We hope that our work provides a deeper understanding of Text-to-SQL with LLMs and inspires further investigation and broader applications.
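To make the three prompt-engineering knobs concrete, here is a hypothetical sketch of prompt assembly for Text-to-SQL; the code-style schema representation and full question-SQL example organization shown are illustrative choices of ours, not the DAIL-SQL recipe.

```python
def build_prompt(question, schema, examples):
    """Assemble a Text-to-SQL prompt from the three knobs the benchmark
    varies: question representation (here: code-style CREATE TABLEs),
    example selection (delegated to the caller), and example organization
    (here: full question-SQL pairs). Illustrative only."""
    parts = [f"/* Schema */\n{schema}"]
    for q, sql in examples:                      # example organization
        parts.append(f"/* Example */\n-- {q}\n{sql}")
    parts.append(f"-- {question}\nSELECT")       # question representation
    return "\n\n".join(parts)

schema = "CREATE TABLE singer (singer_id INT, name TEXT, age INT);"
examples = [("How many singers are there?", "SELECT COUNT(*) FROM singer;")]
print(build_prompt("List names of singers older than 30.", schema, examples))
```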
Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation
Authors: Fu-En Yang, Chien-Yi Wang, Yu-Chiang Frank Wang
Abstract
Federated learning (FL) emerges as a decentralized learning framework which trains models from multiple distributed clients without sharing their data, to preserve privacy. Recently, large-scale pre-trained models (e.g., Vision Transformer) have shown a strong capability of deriving robust representations. However, the data heterogeneity among clients, the limited computation resources, and the communication bandwidth restrict the deployment of large-scale models in FL frameworks. To leverage robust representations from large-scale models while enabling efficient model personalization for heterogeneous clients, we propose a novel personalized FL framework of client-specific Prompt Generation (pFedPG), which learns to deploy a personalized prompt generator at the server for producing client-specific visual prompts that efficiently adapt frozen backbones to local data distributions. Our proposed framework jointly optimizes the stages of personalized prompt adaptation locally and personalized prompt generation globally. The former aims to train visual prompts that adapt foundation models to each client, while the latter observes local optimization directions to generate personalized prompts for all clients. Through extensive experiments on benchmark datasets, we show that our pFedPG is favorable against state-of-the-art personalized FL methods under various types of data heterogeneity, allowing computation- and communication-efficient model personalization.
A Reduced-Order Model for Nonlinear Radiative Transfer Problems Based on Moment Equations and POD-Petrov-Galerkin Projection of the Normalized Boltzmann Transport Equation
Abstract
A data-driven projection-based reduced-order model (ROM) for nonlinear thermal radiative transfer (TRT) problems is presented. The TRT ROM is formulated by (i) a hierarchy of low-order quasidiffusion (aka variable Eddington factor) equations for moments of the radiation intensity and (ii) the normalized Boltzmann transport equation (BTE). The multilevel system of moment equations is derived by projection of the BTE onto a sequence of subspaces which represent elements of the phase space of the problem. Exact closure for the moment equations is provided by the Eddington tensor. A Petrov-Galerkin (PG) projection of the normalized BTE is formulated using a proper orthogonal decomposition (POD) basis representing the normalized radiation intensity over the whole phase space and time. The Eddington tensor linearly depends on the solution of the normalized BTE. By linear superposition of the POD basis functions, a low-rank expansion of the Eddington tensor is constructed with coefficients defined by the PG projected normalized BTE. The material energy balance (MEB) equation is coupled with the effective grey low-order equations which exist on the same dimensional scale as the MEB equation. The resulting TRT ROM is structure and asymptotic preserving. A detailed analysis of the ROM is performed on the classical Fleck-Cummings (F-C) TRT multigroup test problem in 2D geometry. Numerical results are presented to demonstrate the ROM's effectiveness in the simulation of radiation wave phenomena. The ROM is shown to produce solutions with sufficiently high accuracy while using low-rank approximation of the normalized BTE solution. Essential physical characteristics of supersonic radiation wave are preserved in the ROM solutions.
Bayesian Integration of Information Using Top-Down Modulated WTA Networks
Authors: Otto van der Himst, Leila Bagheriye, Johan Kwisthout
Abstract
Winner-Take-All (WTA) circuits, a type of Spiking Neural Network (SNN), have been suggested as facilitating the brain's ability to process information in a Bayesian manner. Research has shown that WTA circuits are capable of approximating hierarchical Bayesian models via Expectation Maximization (EM). So far, research in this direction has focused on bottom-up processes. This is contrary to neuroscientific evidence showing that, besides bottom-up processes, top-down processes too play a key role in information processing by the human brain. Several functions ascribed to top-down processes include direction of attention, adjusting for expectations, facilitation of encoding and recall of learned information, and imagery. This paper explores whether WTA circuits are suitable for further integrating information represented in separate WTA networks. Furthermore, it explores whether, and under what circumstances, top-down processes can improve WTA network performance with respect to inference and learning. The results show that WTA circuits are capable of integrating the probabilistic information represented by other WTA networks, and that top-down processes can improve a WTA network's inference and learning performance. Notably, the network is able to do this according to key neuromorphic principles, making it ideal for low-latency and energy-efficient implementation on neuromorphic hardware.
Adversarial Low Degree Testing
Authors: Dor Minzer, Kai Zhe Zheng
Subjects: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT)
Abstract
In the $t$-online-erasure model in property testing, an adversary is allowed to erase $t$ values of a queried function for each query the tester makes. This model was recently formulated by Kalemaj, Raskhodnikova and Varma, who showed that linearity and quadraticity of functions can be tested with $O_t(1)$ many queries: $O(\log t)$ for linearity and $2^{2^{O(t)}}$ for quadraticity. They asked whether the more general property of low-degreeness can be tested in the online-erasure model, whether better testers exist for quadraticity, and whether similar results hold when "erasures" are replaced with "corruptions". We show that, in the $t$-online-erasure model, for a prime power $q$, given query access to a function $f: \mathbb{F}_q^n \to \mathbb{F}_q$, one can distinguish in $\mathrm{poly}(\log^{d+q}(t)/\delta)$ queries between the case that $f$ has degree at most $d$ and the case that $f$ is $\delta$-far from any degree-$d$ function (with respect to the fractional Hamming distance). This answers the aforementioned questions and brings the query complexity to nearly match the query complexity of low-degree testing in the classical property testing model. Our results are based on the observation that the property of low-degreeness admits a large and versatile family of query-efficient testers. Our testers operate by querying a uniformly random, sufficiently large set of points in a large enough affine subspace, and finding a tester for low-degreeness that only utilizes queries from that set of points. We believe that this tester may find other applications to algorithms in the online-erasure model or other related models, and may be of independent interest.
On the hardness of inclusion-wise minimal separators enumeration
Authors: Caroline Brosse, Oscar Defrain, Kazuhiro Kurita, Vincent Limouzy, Takeaki Uno, Kunihiro Wasa
Abstract
Enumeration problems are often encountered as key subroutines in the exact computation of graph parameters such as chromatic number, treewidth, or treedepth. In the case of treedepth computation, the enumeration of inclusion-wise minimal separators plays a crucial role. However, and quite surprisingly, the complexity status of this problem has remained unsettled since it was posed as an open direction by Kloks and Kratsch in 1998. Recently, at the PACE 2020 competition dedicated to treedepth computation, solvers circumvented this by listing all minimal $a$-$b$ separators and filtering out those that are not inclusion-wise minimal, at the cost of efficiency. Naturally, an efficient algorithm for listing inclusion-wise minimal separators would drastically improve such practical algorithms. In this note, however, we show that no efficient algorithm is to be expected from an output-sensitive perspective; namely, we prove that there is no output-polynomial time algorithm for inclusion-wise minimal separator enumeration unless P = NP.
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer
Abstract
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g., formality) to authorship (e.g., Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles, or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
Canonical Factors for Hybrid Neural Fields
Authors: Brent Yi, Weijia Zeng, Sam Buchanan, Yi Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract
Factored feature volumes offer a simple way to build more compact, efficient, and interpretable neural fields, but also introduce biases that are not necessarily beneficial for real-world data. In this work, we (1) characterize the undesirable biases that these architectures have for axis-aligned signals -- they can lead to radiance field reconstruction differences of as high as 2 PSNR -- and (2) explore how learning a set of canonicalizing transformations can improve representations by removing these biases. We prove in a two-dimensional model problem that simultaneously learning these transformations together with scene appearance succeeds with drastically improved efficiency. We validate the resulting architectures, which we call TILTED, using image, signed distance, and radiance field reconstruction tasks, where we observe improvements across quality, robustness, compactness, and runtime. Results demonstrate that TILTED can enable capabilities comparable to baselines that are 2x larger, while highlighting weaknesses of neural field evaluation procedures.
A Comparative Study of Loss Functions: Traffic Predictions in Regular and Congestion Scenarios
Abstract
Spatiotemporal graph neural networks have achieved state-of-the-art performance in traffic forecasting. However, they often struggle to forecast congestion accurately due to the limitations of traditional loss functions. While accurate forecasting of regular traffic conditions is crucial, a reliable AI system must also accurately forecast congestion scenarios to maintain safe and efficient transportation. In this paper, we explore various loss functions inspired by heavy tail analysis and imbalanced classification problems to address this issue. We evaluate the efficacy of these loss functions in forecasting traffic speed, with an emphasis on congestion scenarios. Through extensive experiments on real-world traffic datasets, we discovered that when optimizing for Mean Absolute Error (MAE), the MAE-Focal Loss function stands out as the most effective. When optimizing Mean Squared Error (MSE), Gumbel Loss proves to be the superior choice. These choices effectively forecast traffic congestion events without compromising the accuracy of regular traffic speed forecasts. This research enhances deep learning models' capabilities in forecasting sudden speed changes due to congestion and underscores the need for more research in this direction. By elevating the accuracy of congestion forecasting, we advocate for AI systems that are reliable, secure, and resilient in practical traffic management scenarios.
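The exact loss definitions live in the paper; the sketch below shows one plausible form of each, assembled from standard heavy-tail (Gumbel negative log-likelihood) and focal-weighting ideas, so the exact formulas and default hyperparameters should be treated as assumptions.

```python
import torch

def mae_focal_loss(pred, target, gamma=2.0, c=1.0):
    """MAE with a focal-style weight: large errors (rare congestion
    events) are up-weighted. One plausible form; the paper's exact
    definition may differ."""
    err = (pred - target).abs()
    weight = (1.0 - torch.exp(-err / c)) ** gamma
    return (weight * err).mean()

def gumbel_loss(pred, target, beta=1.0):
    """Negative log-likelihood of the error under a Gumbel(0, beta)
    distribution -- a heavy-tail-aware alternative to MSE."""
    z = (pred - target) / beta
    return (z + torch.exp(-z)).mean()

pred, target = torch.randn(16), torch.randn(16)
print(mae_focal_loss(pred, target), gumbel_loss(pred, target))
```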
Graph Theory and its Uses in Graph Algorithms and Beyond
Authors: Rachit Nimavat
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC)
Abstract
Graphs are fundamental objects that find widespread applications across computer science and beyond. Graph Theory has yielded deep insights about structural properties of various families of graphs, which are leveraged in the design and analysis of algorithms for graph optimization problems and other computational optimization problems. These insights have also proved helpful in understanding the limits of efficient computation by providing constructions of hard problem instances. At the same time, algorithmic tools and techniques provide a fresh perspective on graph theoretic problems, often leading to novel discoveries. In this thesis, we exploit this symbiotic relationship between graph theory and algorithms for graph optimization problems and beyond. This thesis consists of three parts. In the first part, we study a graph routing problem called the Node-Disjoint Paths (NDP) problem. Given a graph and a set of source-destination pairs of its vertices, the goal is to route the maximum number of pairs via node-disjoint paths. We come close to resolving the approximability of NDP by showing that it is $n^{\Omega(1/\mathrm{poly}\log\log n)}$-hard to approximate, even on grid graphs, where $n$ is the number of vertices. In the second part of this thesis, we use graph decomposition techniques developed for efficient algorithms to derive a graph theoretic result. We show that for every $n$-vertex expander graph $G$, if $H$ is any graph with at most $O(n/\log n)$ vertices and edges, then $H$ is a minor of $G$. In the last part, we show that graph theoretic tools and graph algorithmic techniques can shed light on problems seemingly unrelated to graphs. We show that the randomized space complexity of the Longest Increasing Subsequence (LIS) problem in the streaming model is intrinsically tied to the query complexity of the Non-Crossing Matching problem on graphs in a new model of computation that we define.
A General-Purpose Self-Supervised Model for Computational Pathology
Authors: Richard J. Chen, Tong Ding, Ming Y. Lu, Drew F. K. Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H. Song, Muhammad Shaban, Mane Williams, Anurag Vaidya, Sharifa Sahai, Lukas Oldenburg, Luca L. Weishaupt, Judy J. Wang, Walt Williams, Long Phi Le, Georg Gerber, Faisal Mahmood
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Tissues and Organs (q-bio.TO)
Abstract
Tissue phenotyping is a fundamental computational pathology (CPath) task in learning objective characterizations of histopathologic biomarkers in anatomic pathology. However, whole-slide imaging (WSI) poses a complex computer vision problem in which the large-scale image resolutions of WSIs and the enormous diversity of morphological phenotypes preclude large-scale data annotation. Current efforts have proposed using pretrained image encoders with either transfer learning from natural image datasets or self-supervised pretraining on publicly-available histopathology datasets, but have not been extensively developed and evaluated across diverse tissue types at scale. We introduce UNI, a general-purpose self-supervised model for pathology, pretrained using over 100 million tissue patches from over 100,000 diagnostic haematoxylin and eosin-stained WSIs across 20 major tissue types, and evaluated on 33 representative clinical tasks in CPath of varying diagnostic difficulty. In addition to outperforming previous state-of-the-art models, we demonstrate new modeling capabilities in CPath such as resolution-agnostic tissue classification, slide classification using few-shot class prototypes, and disease subtyping generalization in classifying up to 108 cancer types in the OncoTree code classification system. UNI advances unsupervised representation learning at scale in CPath in terms of both pretraining data and downstream evaluation, enabling data-efficient AI models that can generalize and transfer to a gamut of diagnostically-challenging tasks and clinical workflows in anatomic pathology.
Keyword: faster
CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering
Authors: Sami Diaf
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract
Document scaling has been a key component in text-as-data applications for social scientists and a major field of interest for political researchers, who aim at uncovering differences between speakers or parties with the help of different probabilistic and non-probabilistic approaches. Yet, most of these techniques are either built upon the agnostic bag-of-words hypothesis or use prior information borrowed from external sources that might bias the results significantly. While a corpus has long been considered a collection of documents, it can also be seen as a dense network of connected words whose structure can be clustered to differentiate independent groups of words, known as communities, based on their co-occurrences in documents. This paper introduces CommunityFish as an augmented version of Wordfish based on hierarchical clustering, namely the Louvain algorithm, on the word space, to yield communities as semantic and independent n-grams emerging from the corpus and to use them as input to the Wordfish method instead of the raw word space. This strategy emphasizes the interpretability of the results, since communities have a non-overlapping structure and hence crucial informative power in discriminating parties or speakers, in addition to allowing a faster execution of the Poisson scaling model. Aside from yielding communities, assumed to be subtopic proxies, this technique outperforms the classic Wordfish model by highlighting historical developments in the U.S. State of the Union addresses, and it replicates the prevailing political stance in Germany when applied to the corpus of parties' legislative manifestos.
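A minimal sketch of the community-detection step, assuming a word co-occurrence graph has already been built from the corpus; the toy edge weights are invented, and the downstream Poisson (Wordfish-style) scaling is only indicated in the comments.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy word co-occurrence graph (edges weighted by document co-occurrence
# counts); in CommunityFish this graph is built from the corpus.
G = nx.Graph()
G.add_weighted_edges_from([
    ("tax", "budget", 5), ("budget", "deficit", 4), ("tax", "deficit", 3),
    ("school", "teacher", 6), ("teacher", "student", 5), ("school", "student", 4),
    ("budget", "school", 1),   # weak cross-topic link
])

# Louvain step: the resulting communities become the features that replace
# raw words before running the Poisson scaling model.
communities = louvain_communities(G, weight="weight", seed=0)
print([sorted(c) for c in communities])
# Each document is then represented by counts over these communities
# instead of over the full word space.
```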
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
Authors: Zijia Lu, Ehsan Elhamifar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, which suffers from high computational cost and cannot well capture action dependencies over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments, performs temporal modeling on the frame and action levels in parallel, and yet maintains a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses a transformer to learn action-level dependencies with a small set of action tokens and (iii) cross-attentions to allow communication between the two branches. We apply and extend a set-prediction objective to allow each action token to represent one or multiple action segments, which avoids learning a large number of tokens over long videos with many segments. Thanks to the design of our action branch, we can also seamlessly leverage textual transcripts of videos (when available) to help action segmentation by using them to initialize the action tokens. We evaluate our model on four video datasets (two egocentric and two third-person) for action segmentation with and without transcripts, showing that BIT significantly improves the state-of-the-art accuracy with much lower computational cost (30 times faster) compared to existing transformer-based methods.
Generative Model for Models: Rapid DNN Customization for Diverse Tasks and Resource Constraints
Authors: Wenxing Xu, Yuanchun Li, Jiacheng Liu, Yi Sun, Zhengyang Cao, Yixuan Li, Hao Wen, Yunxin Liu
Abstract
Unlike cloud-based deep learning models that are often large and uniform, edge-deployed models usually demand customization for domain-specific tasks and resource-limited environments. Such customization processes can be costly and time-consuming due to the diversity of edge scenarios and the training load for each scenario. Although various approaches have been proposed for rapid resource-oriented customization and task-oriented customization respectively, achieving both of them at the same time is challenging. Drawing inspiration from generative AI and the modular composability of neural networks, we introduce NN-Factory, a one-for-all framework to generate customized lightweight models for diverse edge scenarios. The key idea is to use a generative model to directly produce the customized models, instead of training them. The main components of NN-Factory include a modular supernet with pretrained modules that can be conditionally activated to accomplish different tasks and a generative module assembler that manipulates the modules according to task and sparsity requirements. Given an edge scenario, NN-Factory can efficiently customize a compact model specialized for the edge task while satisfying the edge resource constraints by searching for the optimal strategy to assemble the modules. Based on experiments on image classification and object detection tasks with different edge devices, NN-Factory is able to generate high-quality task- and resource-specific models within a few seconds, faster than conventional model customization approaches by orders of magnitude.
Massively Parallel Continuous Local Search for Hybrid SAT Solving on GPUs
Authors: Yunuo Cen, Zhiwei Zhang, Xuanyao Fong
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Abstract
Although state-of-the-art (SOTA) SAT solvers based on conflict-driven clause learning (CDCL) have achieved remarkable engineering success, their sequential nature limits the parallelism that may be extracted for acceleration on platforms such as the graphics processing unit (GPU). In this work, we propose FastFourierSAT, a highly parallel hybrid SAT solver based on gradient-driven continuous local search (CLS). This is realized by a novel parallel algorithm inspired by the Fast Fourier Transform (FFT)-based convolution for computing the elementary symmetric polynomials (ESPs), which is the major computational task in previous CLS methods. The complexity of our algorithm matches the best previous result. Furthermore, the substantial parallelism inherent in our algorithm can leverage the GPU for acceleration, demonstrating significant improvement over the previous CLS approaches. We also propose to incorporate the restart heuristics in CLS to improve search efficiency. We compare our approach with the SOTA parallel SAT solvers on several benchmarks. Our results show that FastFourierSAT computes the gradient 100+ times faster than previous prototypes implemented on CPU. Moreover, FastFourierSAT solves most instances and demonstrates promising performance on larger-size instances.
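The ESP computation reduces to polynomial products, since $e_k$ is the coefficient of $z^k$ in $\prod_i (1 + x_i z)$; a minimal divide-and-conquer sketch of ours follows, with the FFT-based convolution that FastFourierSAT parallelises on GPU abstracted behind np.convolve.

```python
import numpy as np

def esp_coeffs(xs):
    """Coefficients [e_0, e_1, ..., e_n] of prod_i (1 + x_i * z), where
    e_k is the k-th elementary symmetric polynomial of xs. Divide-and-
    conquer polynomial products; FastFourierSAT replaces np.convolve
    with FFT-based convolution to parallelise this on GPU."""
    if len(xs) == 1:
        return np.array([1.0, xs[0]])
    mid = len(xs) // 2
    return np.convolve(esp_coeffs(xs[:mid]), esp_coeffs(xs[mid:]))

xs = np.array([2.0, 3.0, 5.0])
print(esp_coeffs(xs))   # [1, 10, 31, 30]: e1=2+3+5, e2=6+10+15, e3=30
```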
Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli
Authors: Robin Müller, Willi Meier, Christoph F. Wildfeuer
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Performance (cs.PF)
Abstract
Modular reduction is a crucial operation in many post-quantum cryptographic schemes, including the Kyber key exchange method or Dilithium signature scheme. However, it can be computationally expensive and pose a performance bottleneck in hardware implementations. To address this issue, we propose a novel approach for computing modular reduction efficiently in hardware for arbitrary static moduli. Unlike other commonly used methods such as Barrett or Montgomery reduction, the method does not require any multiplications. It is not dependent on properties of any particular choice of modulus for good performance and low area consumption. Its major strength lies in its low area consumption, which was reduced by 60% for optimized and up to 90% for generic Barrett implementations for Kyber and Dilithium. Additionally, it is well suited for parallelization and pipelining and scales linearly in hardware resource consumption with increasing operation width. All operations can be performed in the bit-width of the modulus, rather than the size of the number being reduced. This shortens carry chains and allows for faster clocking. Moreover, our method can be executed in constant time, which is essential for cryptography applications where timing attacks can be used to obtain information about the secret key.
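The paper's exact circuit is not given in the abstract; the sketch below shows one classical multiplication-free strategy consistent with its description, folding the high bits of the input into the low word via a precomputed table of $2^i \bmod q$, using only lookups, additions, and a final conditional subtraction.

```python
def make_reducer(q, in_bits):
    """Multiplication-free x mod q for a static modulus q: precompute
    (2**i) % q for every high bit position; reduction is then table
    lookups and additions, as in shift-add hardware. A sketch of the
    general strategy, not the paper's exact circuit."""
    k = q.bit_length()
    table = {i: pow(2, i, q) for i in range(k, in_bits)}

    def reduce(x):
        while x >= (1 << k):                 # fold high bits into low word
            acc = x & ((1 << k) - 1)         # keep the low k bits
            hi, i = x >> k, k
            while hi:
                if hi & 1:
                    acc += table.get(i, pow(2, i, q))
                hi >>= 1
                i += 1
            x = acc                          # congruent to x mod q, smaller
        return x if x < q else x - q         # final conditional subtract
    return reduce

red = make_reducer(q=3329, in_bits=24)       # Kyber's modulus
assert red(2**23 + 12345) == (2**23 + 12345) % 3329
```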
Best Memory Architecture Exploration under Parameters Variations accelerated with Machine Learning
Abstract
The design of an effective memory architecture is of utmost importance in modern computing systems. However, the design of memory subsystems is even harder today because process variation and modern design techniques like dynamic voltage scaling mean that performance metrics for memory assessment must be treated as random variables instead of scalars at design time. Most previous works have studied memory design from the yield-analysis perspective, leaving open the question of the best memory organization on average. Because examining all possible combinations of design parameter values of a memory chip would require a prohibitive amount of time, in this work we propose Best Arm Identification (BAI) algorithms to accelerate the exploration for the best memory architecture on average under parameter variations. Our experimental results demonstrate that we can arrive at the best memory organization 99% of the time, 5x faster than an exhaustive search of all possible conditions.
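As a rough sketch of the BAI idea (the paper's specific algorithms may differ), here is a successive-halving-style elimination over candidate memory configurations with a noisy measurement oracle; all names, the toy latency means, and the noise level are invented.

```python
import random

def best_arm(configs, measure, rounds=200, keep=0.5):
    """Successive-halving-style Best Arm Identification: sample the noisy
    performance metric for each surviving configuration, then drop the
    worst half each stage (lower mean = better, e.g. latency)."""
    arms = {c: [] for c in configs}
    while len(arms) > 1:
        for c in arms:
            for _ in range(rounds):
                arms[c].append(measure(c))       # noisy latency/energy sample
        ranked = sorted(arms, key=lambda c: sum(arms[c]) / len(arms[c]))
        arms = {c: arms[c] for c in ranked[: max(1, int(len(ranked) * keep))]}
    return next(iter(arms))

# Toy oracle: true mean latency per config plus process-variation noise.
means = {"cfg_a": 1.0, "cfg_b": 0.8, "cfg_c": 1.3}
measure = lambda c: random.gauss(means[c], 0.2)
print(best_arm(means.keys(), measure))           # almost always 'cfg_b'
```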
CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
Authors: Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, Yong Wang
Subjects: Data Structures and Algorithms (cs.DS); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Abstract
Approximate Nearest Neighbor Search (ANNS) plays a critical role in various disciplines spanning data mining and artificial intelligence, from information retrieval and computer vision to natural language processing and recommender systems. Data volumes have soared in recent years and the computational cost of an exhaustive exact nearest neighbor search is often prohibitive, necessitating the adoption of approximate techniques. The balanced performance and recall of graph-based approaches have more recently garnered significant attention in ANNS algorithms; however, only a few studies have explored harnessing the power of GPUs and multi-core processors despite the widespread use of massively parallel and general-purpose computing. To bridge this gap, we introduce a novel parallel computing hardware-based proximity graph and search algorithm. By leveraging the high-performance capabilities of modern hardware, our approach achieves remarkable efficiency gains. In particular, our method surpasses existing CPU- and GPU-based methods in constructing the proximity graph, demonstrating higher throughput in both large- and small-batch searches while maintaining comparable accuracy. In graph construction time, our method, CAGRA, is 2.2~27x faster than HNSW, one of the CPU SOTA implementations. In large-batch query throughput in the 90% to 95% recall range, our method is 33~77x faster than HNSW, and is 3.8~8.8x faster than the SOTA implementations for GPU. For a single query, our method is 3.4~53x faster than HNSW at 95% recall.
Enhancing Robot Learning through Learned Human-Attention Feature Maps
Authors: Daniel Scheuchenstuhl, Stefan Ulmer, Felix Resch, Luigi Berducci, Radu Grosu
Abstract
Robust and efficient learning remains a challenging problem in robotics, in particular with complex visual inputs. Inspired by the human attention mechanism, which lets us quickly process complex visual scenes and react to changes in the environment, we posit that embedding auxiliary information about the focus point into robot learning would enhance the efficiency and robustness of the learning process. In this paper, we propose a novel approach to model and emulate human attention with an approximate prediction model. We then leverage this output and feed it as a structured auxiliary feature map into downstream learning tasks. We validate this idea by learning a prediction model from human-gaze recordings of manual driving in the real world. We test our approach on two learning tasks - object detection and imitation learning. Our experiments demonstrate that the inclusion of predicted human attention leads to improved robustness of the trained models to out-of-distribution samples and faster learning in low-data regimes. Our work highlights the potential of incorporating structured auxiliary information in representation learning for robotics and opens up new avenues for research in this direction. All code and data are available online.
Keyword: mobile
SynthDistill: Face Recognition with Knowledge Distillation from Synthetic Data
Abstract
State-of-the-art face recognition networks are often computationally expensive and cannot be used for mobile applications. Training lightweight face recognition models also requires large identity-labeled datasets. Meanwhile, there are privacy and ethical concerns with collecting and using large face recognition datasets. While generating synthetic datasets for training face recognition models is an alternative option, it is challenging to generate synthetic data with sufficient intra-class variations. In addition, there is still a considerable gap between the performance of models trained on real and synthetic data. In this paper, we propose a new framework (named SynthDistill) to train lightweight face recognition models by distilling the knowledge of a pretrained teacher face recognition model using synthetic data. We use a pretrained face generator network to generate synthetic face images and use the synthesized images to learn a lightweight student network. We use synthetic face images without identity labels, mitigating the problems in the intra-class variation generation of synthetic datasets. Instead, we propose a novel dynamic sampling strategy from the intermediate latent space of the face generator network to include new variations of the challenging images while further exploring new face images in the training batch. The results on five different face recognition datasets demonstrate the superiority of our lightweight model compared to models trained on previous synthetic datasets, achieving a verification accuracy of 99.52% on the LFW dataset with a lightweight network. The results also show that our proposed framework significantly reduces the gap between training with real and synthetic data. The source code for replicating the experiments is publicly released.
Improving Reinforcement Learning Training Regimes for Social Robot Navigation
Authors: Adam Sigal, Hsiu-Chin Lin, AJung Moon
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract
In order for autonomous mobile robots to navigate in human spaces, they must abide by our social norms. Reinforcement learning (RL) has emerged as an effective method to train robot navigation policies that are able to respect these norms. However, a large portion of existing work in the field conducts both RL training and testing in simplistic environments. This limits the generalization potential of these models to unseen environments, and the meaningfulness of their reported results. We propose a method to improve the generalization performance of RL social navigation methods using curriculum learning. By employing multiple environment types and by modeling pedestrians using multiple dynamics models, we are able to progressively diversify and escalate difficulty in training. Our results show that the use of curriculum learning in training can achieve better generalization performance than previous training methods. We also show that many existing state-of-the-art RL social navigation works do not evaluate their methods outside of their training environments, and thus do not reveal their policies' failure to adequately generalize to out-of-distribution scenarios. In response, we validate our training approach on larger and more crowded testing environments than those used in training, allowing for more meaningful measurements of model performance.
Satellite-MEC Integration for 6G Internet of Things: Minimal Structures, Advances, and Prospects
Authors: Yueshan Lin, Wei Feng, Yanmin Wang, Yunfei Chen, Yongxu Zhu, Ximu Zhang, Ning Ge
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Abstract
The sixth-generation (6G) network is envisioned to shift its focus from the service requirements of human beings to those of Internet-of-Things (IoT) devices. Satellite communications are indispensable in 6G to support IoT devices operating in rural or disaster-stricken areas. However, satellite networks face the inherent challenges of low data rate and large latency, which may not support computation-intensive and delay-sensitive IoT applications. Mobile Edge Computing (MEC) is a burgeoning paradigm that extends cloud computing capabilities to the network edge. By utilizing MEC technologies, resource-limited IoT devices can access abundant computation resources with low latency, which enables highly demanding applications while meeting strict delay requirements. Therefore, an integration of satellite communications and MEC technologies is necessary to better enable 6G IoT. In this survey, we provide a holistic overview of satellite-MEC integration. We first discuss the main challenges of the integrated satellite-MEC network and propose three minimal integrating structures. For each minimal structure, we summarize the current advances in terms of their research topics, after which we discuss the lessons learned and future directions of the minimal structure. Finally, we outline potential research issues to envision a more intelligent, more secure, and greener integrated satellite-MEC network.
LAMBO: Large Language Model Empowered Edge Intelligence
Authors: Li Dong, Feibo Jiang, Yubo Peng, Kezhi Wang, Kun Yang, Cunhua Pan, Robert Schober
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract
Next-generation edge intelligence is anticipated to bring huge benefits to various applications, e.g., offloading systems. However, traditional deep offloading architectures face several issues, including heterogeneous constraints, partial perception, uncertain generalization, and lack of tractability. In this context, the integration of offloading with large language models (LLMs) presents numerous advantages. Therefore, we propose an LLM-Based Offloading (LAMBO) framework for mobile edge computing (MEC), which comprises four components: (i) Input embedding (IE), which is used to represent the information of the offloading system with constraints and prompts through high-quality learnable vectors; (ii) an Asymmetric encoder-decoder (AED) model, which is a decision-making module with a deep encoder and a shallow decoder that can achieve high performance based on multi-head self-attention schemes; (iii) an Actor-critic reinforcement learning (ACRL) module, which is employed to pre-train the whole AED for different optimization tasks under corresponding prompts; and (iv) Active learning from expert feedback (ALEF), which can be used to fine-tune the decoder part of the AED while adapting to dynamic environmental changes. Our simulation results corroborate the advantages of the proposed LAMBO framework.
A lightweight 3D dense facial landmark estimation model from position map data
Authors: Shubhajit Basak, Sathish Mangapuram, Gabriel Costache, Rachel McDonnell, Michael Schukat
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
The incorporation of 3D data in facial analysis tasks has gained popularity in recent years. Though it provides a more accurate and detailed representation of the human face, acquiring 3D face data is more complex and expensive than collecting 2D face images. One has to rely either on expensive 3D scanners or on depth sensors, which are prone to noise. An alternative option is the reconstruction of 3D faces from uncalibrated 2D images in an unsupervised way without any ground-truth 3D data. However, such approaches are computationally expensive and the learned model size is not suitable for mobile or other edge device applications. Predicting dense 3D landmarks over the whole face can overcome this issue. As there is no public dataset available containing dense landmarks, we propose a pipeline to create a dense-keypoint training dataset containing 520 key points across the whole face from existing facial position map data. We train a lightweight MobileNet-based regressor model with the generated data. As we do not have access to an evaluation dataset with dense landmarks, we evaluate our model on the 68-keypoint detection task. Experimental results show that our trained model outperforms many of the existing methods in spite of its smaller model size and minimal computational cost. The qualitative evaluation also shows the efficiency of our trained models at extreme head pose angles, as well as under other facial variations and occlusions.
Empowering LLM to use Smartphone for Intelligent Task Automation
Authors: Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu
Abstract
Mobile task automation is an attractive technique that aims to enable voice-based hands-free user interaction with smartphones. However, existing approaches suffer from poor scalability due to the limited language understanding ability and the non-trivial manual efforts required from developers or end-users. The recent advance of large language models (LLMs) in language understanding and reasoning inspires us to rethink the problem from a model-centric perspective, where task preparation, comprehension, and execution are handled by a unified language model. In this work, we introduce AutoDroid, a mobile task automation system that can handle arbitrary tasks on any Android application without manual efforts. The key insight is to combine the commonsense knowledge of LLMs and domain-specific knowledge of apps through automated dynamic analysis. The main components include a functionality-aware UI representation method that bridges the UI with the LLM, exploration-based memory injection techniques that augment the app-specific domain knowledge of LLM, and a multi-granularity query optimization module that reduces the cost of model inference. We integrate AutoDroid with off-the-shelf LLMs including online GPT-4/GPT-3.5 and on-device Vicuna, and evaluate its performance on a new benchmark for memory-augmented Android task automation with 158 common tasks. The results demonstrated that AutoDroid is able to precisely generate actions with an accuracy of 90.9%, and complete tasks with a success rate of 71.3%, outperforming the GPT-4-powered baselines by 36.4% and 39.7%. The demo, benchmark suites, and source code of AutoDroid will be released at https://autodroid-sys.github.io/.
Keyword: pruning
Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Authors: Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang
Abstract
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., the target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for a transformer, known as sparse attention, to improve the TTS model's generalization abilities. In particular, we prune off redundant connections from self-attention layers whose attention weights are below a threshold. To flexibly determine the pruning strength when searching for the optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to learn the thresholds automatically. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
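A minimal sketch of the fixed-threshold variant described above (the differentiable, learned-threshold version is not shown); the tensor shapes and the threshold value are our assumptions.

```python
import torch

def prune_attention(scores, threshold=0.05):
    """Sparse attention as described: zero out attention weights below a
    threshold, then renormalise the survivors. `threshold` is fixed here;
    the paper also learns it via a differentiable relaxation."""
    weights = torch.softmax(scores, dim=-1)
    mask = (weights >= threshold).float()
    pruned = weights * mask
    return pruned / pruned.sum(dim=-1, keepdim=True).clamp_min(1e-12)

scores = torch.randn(2, 4, 4)        # (batch, query, key) attention logits
w = prune_attention(scores)
print((w == 0).float().mean())       # fraction of pruned connections
```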
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Authors: Samuel Horvath, Stefanos Laskaridis, Shashank Rajput, Hongyi Wang
Abstract
Deep Neural Networks (DNNs) have been a large driver and enabler for AI breakthroughs in recent years. These models have been getting larger in their attempt to become more accurate and tackle new upcoming use-cases, including AR/VR and intelligent assistants. However, training such large models is costly and time-consuming, and typically yields a single model to fit all targets. To mitigate this, various techniques have been proposed in the literature, including pruning, sparsification or quantization of the model weights and updates. While able to achieve high compression rates, they often incur computational overheads or accuracy penalties. Alternatively, factorization methods have been leveraged to incorporate low-rank compression in the training process. However, such techniques (e.g., SVD) frequently rely on the computationally expensive decomposition of layers and are potentially sub-optimal for non-linear models, such as DNNs. In this work, we take a further step in designing efficient low-rank models and propose Maestro, a framework for trainable low-rank layers. Instead of regularly applying a priori decompositions such as SVD, the low-rank structure is built into the training process through a generalized variant of Ordered Dropout. This method imposes an importance ordering via sampling on the decomposed DNN structure. Our theoretical analysis demonstrates that our method recovers the SVD decomposition of linear mappings on uniformly distributed data and PCA for linear autoencoders. We further apply our technique to DNNs and empirically illustrate that Maestro enables the extraction of lower-footprint models that preserve model performance while allowing for a graceful accuracy-latency tradeoff for deployment to devices of different capabilities.
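To illustrate the ordered-dropout mechanism on a factorised layer (a sketch of the idea, not the Maestro implementation), the layer below samples a rank each training step and uses only the leading factor columns, which imposes an importance ordering on them; the class name and initialisation are ours.

```python
import torch

class LowRankLinear(torch.nn.Module):
    """Factorised layer W ~ U @ V^T trained with ordered dropout: each
    training step samples a rank k and uses only the first k factor
    columns, so earlier columns are trained more often and become more
    important. Illustrative sketch."""
    def __init__(self, d_in, d_out, max_rank):
        super().__init__()
        self.U = torch.nn.Parameter(torch.randn(d_out, max_rank) * 0.02)
        self.V = torch.nn.Parameter(torch.randn(d_in, max_rank) * 0.02)
        self.max_rank = max_rank

    def forward(self, x):
        if self.training:
            k = int(torch.randint(1, self.max_rank + 1, (1,)))
        else:
            k = self.max_rank
        return x @ self.V[:, :k] @ self.U[:, :k].T

layer = LowRankLinear(d_in=16, d_out=8, max_rank=4)
layer.train()
print(layer(torch.randn(5, 16)).shape)    # torch.Size([5, 8])
```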
Keyword: diffusion
Unified Concept Editing in Diffusion Models
Authors: Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method, Unified Concept Editing (UCE), edits the model without training using a closed-form solution, and scales seamlessly to concurrent edits on text-conditional diffusion models. We demonstrate scalable simultaneous debiasing, style erasure, and content moderation by editing text-to-image projections, and we present extensive experiments demonstrating improved efficacy and scalability over prior work. Our code is available at https://unified.baulab.info
Generating tabular datasets under differential privacy
Abstract
Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of spreadsheets and relational databases. But this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate training data, which undermines the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks. But this creates a trade-off between the quality and privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but suffer from unstable adversarial training and mode collapse, which are exacerbated by the privacy constraints and challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially-private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By implementing TableDiffusion to predict the added noise, we enabled it to bypass the challenges of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves vastly more data and privacy efficient than the adversarial paradigm, due to augmented re-use of each data batch and a smoother iterative training process.
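The noise-prediction trick the abstract highlights is the standard epsilon-prediction diffusion objective; a generic sketch of it follows, with the noise schedule, timestep sampling, and toy denoiser all assumed rather than taken from TableDiffusion.

```python
import torch

def noise_prediction_loss(model, x0, T=1000):
    """Standard epsilon-prediction diffusion objective: corrupt a data
    batch with Gaussian noise at a random timestep and regress the noise.
    Predicting the noise, rather than reconstructing mixed-type rows, is
    the trick the abstract highlights; schedule and model are generic."""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps     # forward noising
    return torch.nn.functional.mse_loss(model(x_t, t), eps)

# Toy denoiser that ignores t, just to make the sketch executable:
model = lambda x, t: torch.zeros_like(x)
print(noise_prediction_loss(model, torch.randn(8, 5)))
```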
Identifying and Mitigating the Security Risks of Generative AI
Authors: Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, Diyi Yang
Abstract
Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.
Robust topology optimisation of lattice structures with spatially correlated uncertainties
Authors: Ismael Ben-Yelun, Ahmet Oguzhan Yuksel, Fehmi Cirak
Abstract
The uncertainties in material and other properties of structures are usually spatially correlated. We introduce an efficient technique for representing and processing spatially correlated random fields in robust topology optimisation of lattice structures. Robust optimisation considers the statistics of the structural response to obtain a design whose performance is less sensitive to the specific realisation of the random field. We represent Gaussian random fields on lattices by leveraging the established link between random fields and stochastic partial differential equations (SPDEs). It is known that the precision matrix, i.e., the inverse of the covariance matrix, of a random field with Matérn covariance is equal to the finite element stiffness matrix of a possibly fractional PDE with a second-order elliptic operator. We consider the discretisation of the PDE on the lattice to obtain a random field which, by design, considers its geometry and connectivity. The so-obtained random field can be interpreted as a physics-informed prior under the hypothesis that the elliptic SPDE models the physical processes occurring during manufacturing, like heat and mass diffusion. Although the proposed approach is general, we demonstrate its application to lattices modelled as pin-jointed trusses with uncertainties in member Young's moduli. We consider as a cost function the weighted sum of the expectation and standard deviation of the structural compliance. To compute the expectation and standard deviation and their gradients with respect to member cross-sections we use a first-order Taylor series approximation. The cost function and its gradient are computed using only sparse matrix operations. We demonstrate the efficiency of the proposed approach using several lattice examples with isotropic, anisotropic and non-stationary random fields and up to eighty thousand random and optimisation variables.
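The SPDE link admits a compact numerical sketch. Assuming a toy 1D chain of joints and a graph-Laplacian stand-in for the finite element stiffness matrix, the precision matrix is assembled directly from the lattice connectivity, and a geometry-aware sample follows from one Cholesky factorisation.

import numpy as np

def lattice_laplacian(edges, n):
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    return L

n = 50
edges = [(i, i + 1) for i in range(n - 1)]   # toy 1D lattice of truss joints
kappa = 0.5                                  # larger kappa -> shorter correlation range
Q = kappa**2 * np.eye(n) + lattice_laplacian(edges, n)   # precision matrix on the lattice

# sampling: if Q = R^T R (Cholesky), then x = R^{-1} z has covariance Q^{-1}
rng = np.random.default_rng(0)
R = np.linalg.cholesky(Q).T                  # upper-triangular factor
x = np.linalg.solve(R, rng.standard_normal(n))
print(x[:5])   # one spatially correlated realisation, e.g. of member Young's moduli

Because Q is sparse for any realistic lattice, this sampling and the downstream gradient computations stay within sparse matrix operations, as the abstract emphasises.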
C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model
Authors: Longbin Ji, Pengfei Wei, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly in generating high-fidelity and comprehensive gestures. Additionally, these methods lack effective control over speaker identity and temporal editing of the generated gestures. Focusing on capturing temporal latent information and applying practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2, namely a speaker-specific decoder to generate speaker-related real-length skeletons and a repainting strategy for flexible gesture generation/editing. Extensive experiments on benchmark gesture datasets verify the effectiveness of our proposed C2G2 compared with several state-of-the-art baselines. The project demo page can be found at https://c2g2-gesture.github.io/c2_gesture
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
Authors: Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Ben Fei, Bo Dai, Wanli Ouyang, Yu Qiao, Chao Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We present DiffBIR, which leverages pretrained text-to-image diffusion models for the blind image restoration problem. Our framework adopts a two-stage pipeline. In the first stage, we pretrain a restoration module across diversified degradations to improve generalization capability in real-world scenarios. The second stage leverages the generative ability of latent diffusion models to achieve realistic image restoration. Specifically, we introduce an injective modulation sub-network -- LAControlNet -- for finetuning, while the pre-trained Stable Diffusion is kept frozen to maintain its generative ability. Finally, we introduce a controllable module that allows users to balance quality and fidelity by introducing the latent image guidance in the denoising process during inference. Extensive experiments have demonstrated its superiority over state-of-the-art approaches for both blind image super-resolution and blind face restoration tasks on synthetic and real-world datasets. The code is available at https://github.com/XPixelGroup/DiffBIR.
DiffusionVMR: Diffusion Model for Video Moment Retrieval
Authors: Henghao Zhao, Kevin Qinghong Lin, Rui Yan, Zechao Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Video moment retrieval is a fundamental visual-language task that aims to retrieve target moments from an untrimmed video based on a language query. Existing methods typically generate numerous proposals manually or via generative networks in advance as the support set for retrieval, which is not only inflexible but also time-consuming. Inspired by the success of diffusion models on object detection, this work aims at reformulating video moment retrieval as a denoising generation process to get rid of the inflexible and time-consuming proposal generation. To this end, we propose a novel proposal-free framework, namely DiffusionVMR, which directly samples random spans from noise as candidates and introduces denoising learning to ground target moments. During training, Gaussian noise is added to the real moments, and the model is trained to learn how to reverse this process. In inference, a set of time spans is progressively refined from the initial noise to the final output. Notably, the training and inference of DiffusionVMR are decoupled, and an arbitrary number of random spans can be used in inference without being consistent with the training phase. Extensive experiments conducted on three widely-used benchmarks (i.e., QVHighlight, Charades-STA, and TACoS) demonstrate the effectiveness of the proposed DiffusionVMR by comparing it with state-of-the-art methods.
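A schematic training step for the noising-and-reversing idea follows; the (centre, width) span parameterisation, toy context features, and schedule are all assumptions made for illustration, not the paper's architecture.

import torch
import torch.nn as nn

T = 1000
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)

class SpanDenoiser(nn.Module):
    def __init__(self, d_ctx):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + d_ctx + 1, 64), nn.ReLU(), nn.Linear(64, 2))
    def forward(self, noisy, ctx, t):
        t_feat = (t.float() / T).view(-1, 1, 1).expand_as(noisy[..., :1])
        return self.net(torch.cat([noisy, ctx, t_feat], dim=-1))

B, N, d_ctx = 4, 8, 32
spans = torch.rand(B, N, 2)                          # ground-truth (centre, width) in [0, 1]
ctx = torch.randn(B, 1, d_ctx).expand(B, N, d_ctx)   # stand-in video + query features

t = torch.randint(0, T, (B,))
noise = torch.randn_like(spans)
ab = alpha_bar[t].view(B, 1, 1)
noisy = ab.sqrt() * spans + (1 - ab).sqrt() * noise  # forward (noising) process on real moments

model = SpanDenoiser(d_ctx)
loss = nn.functional.l1_loss(model(noisy, ctx, t), spans)  # learn to reverse the noising
loss.backward()
# at inference, any number of random spans can be refined step by step,
# independent of the N used during training, which is the decoupling noted above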
Elucidating the Exposure Bias in Diffusion Models
Authors: Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, Itir Onal Ertugrul
Abstract
Diffusion models have demonstrated impressive generative capabilities, but their 'exposure bias' problem, described as the input mismatch between training and sampling, lacks in-depth exploration. In this paper, we systematically investigate the exposure bias problem in diffusion models by first analytically modelling the sampling distribution, based on which we identify the prediction error at each sampling step as the root cause of the exposure bias issue. Furthermore, we discuss potential solutions to this issue and propose an intuitive metric for it. Along with the elucidation of exposure bias, we propose a simple, yet effective, training-free method called Epsilon Scaling to alleviate the exposure bias. We show that Epsilon Scaling explicitly moves the sampling trajectory closer to the vector field learned in the training phase by scaling down the network output (Epsilon), mitigating the input mismatch between training and sampling. Experiments on various diffusion frameworks (ADM, DDPM/DDIM, LDM), unconditional and conditional settings, and deterministic vs. stochastic sampling verify the effectiveness of our method.
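Since Epsilon Scaling is training-free, it amounts to a one-line change inside an otherwise standard sampler. A DDIM-style sketch with a constant scaling factor (an assumed simplification; the paper's choice of factor may vary across steps):

import torch

@torch.no_grad()
def ddim_sample(model, alpha_bar, shape, lam=1.01):
    T = alpha_bar.numel()
    x = torch.randn(shape)
    for t in reversed(range(1, T)):
        eps = model(x, torch.full((shape[0],), t)) / lam      # <- the only change: scale down eps
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        x0 = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()      # predicted clean sample
        x = ab_prev.sqrt() * x0 + (1 - ab_prev).sqrt() * eps  # deterministic DDIM step
    return x

# toy usage with a dummy "model" that ignores its inputs
model = lambda x, t: torch.zeros_like(x)
print(ddim_sample(model, torch.cumprod(1 - torch.linspace(1e-4, 0.02, 50), 0), (2, 4)).shape)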
A Reduced-Order Model for Nonlinear Radiative Transfer Problems Based on Moment Equations and POD-Petrov-Galerkin Projection of the Normalized Boltzmann Transport Equation
Abstract
A data-driven projection-based reduced-order model (ROM) for nonlinear thermal radiative transfer (TRT) problems is presented. The TRT ROM is formulated by (i) a hierarchy of low-order quasidiffusion (aka variable Eddington factor) equations for moments of the radiation intensity and (ii) the normalized Boltzmann transport equation (BTE). The multilevel system of moment equations is derived by projection of the BTE onto a sequence of subspaces which represent elements of the phase space of the problem. Exact closure for the moment equations is provided by the Eddington tensor. A Petrov-Galerkin (PG) projection of the normalized BTE is formulated using a proper orthogonal decomposition (POD) basis representing the normalized radiation intensity over the whole phase space and time. The Eddington tensor linearly depends on the solution of the normalized BTE. By linear superposition of the POD basis functions, a low-rank expansion of the Eddington tensor is constructed with coefficients defined by the PG projected normalized BTE. The material energy balance (MEB) equation is coupled with the effective grey low-order equations which exist on the same dimensional scale as the MEB equation. The resulting TRT ROM is structure and asymptotic preserving. A detailed analysis of the ROM is performed on the classical Fleck-Cummings (F-C) TRT multigroup test problem in 2D geometry. Numerical results are presented to demonstrate the ROM's effectiveness in the simulation of radiation wave phenomena. The ROM is shown to produce solutions with sufficiently high accuracy while using low-rank approximation of the normalized BTE solution. Essential physical characteristics of the supersonic radiation wave are preserved in the ROM solutions.
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer
Abstract
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g., formality) to authorship (e.g., Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
Keyword: adaptive
Robust Activity Recognition for Adaptive Worker-Robot Interaction using Transfer Learning
Abstract
Human activity recognition (HAR) using machine learning has shown tremendous promise in detecting construction workers' activities. HAR has many applications in human-robot interaction research to enable robots' understanding of human counterparts' activities. However, many existing HAR approaches lack robustness, generalizability, and adaptability. This paper proposes a transfer learning methodology for activity recognition of construction workers that requires orders of magnitude less data and compute time for comparable or better classification accuracy. The developed algorithm transfers features from a model pre-trained by the original authors and fine-tunes them for the downstream task of activity recognition in construction. The model was pre-trained on Kinetics-400, a large-scale video-based human activity recognition dataset with 400 distinct classes. The model was fine-tuned and tested using videos captured from manual material handling (MMH) activities found on YouTube. Results indicate that the fine-tuned model can recognize distinct MMH tasks in a robust and adaptive manner, which is crucial for the widespread deployment of collaborative robots in construction.
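A plausible minimal recipe for this kind of transfer uses a Kinetics-400 pretrained backbone from torchvision; the specific backbone, frozen-feature strategy, and class count below are assumptions, since the abstract does not pin them down.

import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

num_mmh_classes = 5   # assumed number of manual-material-handling activities
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)  # Kinetics-400 pretrained features
for p in model.parameters():
    p.requires_grad = False            # freeze the pretrained spatiotemporal features
model.fc = nn.Linear(model.fc.in_features, num_mmh_classes)  # new trainable head

# fine-tuning only the head is what yields the orders-of-magnitude saving
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")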
NAS-X: Neural Adaptive Smoothing via Twisting
Authors: Dieterich Lawson, Michael Li, Scott Linderman
Abstract
We present Neural Adaptive Smoothing via Twisting (NAS-X), a method for learning and inference in sequential latent variable models based on reweighted wake-sleep (RWS). NAS-X works with both discrete and continuous latent variables, and leverages smoothing SMC to fit a broader range of models than traditional RWS methods. We test NAS-X on discrete and continuous tasks and find that it substantially outperforms previous variational and RWS-based methods in inference and parameter recovery.
Continual Learning for Generative Retrieval over Dynamic Corpora
Abstract
Generative retrieval (GR) directly predicts the identifiers of relevant documents (i.e., docids) based on a parametric model. It has achieved solid performance on many ad-hoc retrieval tasks. So far, these tasks have assumed a static document collection. In many practical scenarios, however, document collections are dynamic, where new documents are continuously added to the corpus. The ability to incrementally index new documents while preserving the ability to answer queries with both previously and newly indexed relevant documents is vital to applying GR models. In this paper, we address this practical continual learning problem for GR. We put forward a novel Continual-LEarner for generatiVE Retrieval (CLEVER) model and make two major contributions to continual learning for GR: (i) To encode new documents into docids with low computational cost, we present Incremental Product Quantization, which updates a partial quantization codebook according to two adaptive thresholds; and (ii) To memorize new documents for querying without forgetting previous knowledge, we propose a memory-augmented learning mechanism, to form meaningful connections between old and new documents. Empirical results demonstrate the effectiveness and efficiency of the proposed model.
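The incremental-codebook idea behind Incremental Product Quantization can be sketched with streaming k-means per subspace; this bare-bones stand-in omits the paper's two adaptive thresholds, which make the real partial update more selective.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

d, m, k = 32, 4, 16        # embedding dim, subspaces, codewords per subspace
sub = d // m
codebooks = [MiniBatchKMeans(n_clusters=k, random_state=0) for _ in range(m)]

def index_batch(X):
    """Update the per-subspace codebooks on a new document batch and return code-tuple docids."""
    codes = []
    for j, cb in enumerate(codebooks):
        Xj = X[:, j * sub:(j + 1) * sub]
        cb.partial_fit(Xj)             # incremental codebook update, low cost per batch
        codes.append(cb.predict(Xj))
    return np.stack(codes, axis=1)     # one m-digit docid per document

rng = np.random.default_rng(0)
print(index_batch(rng.normal(size=(100, d)))[:3])   # initial corpus
print(index_batch(rng.normal(size=(50, d)))[:3])    # newly arrived documents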
Constructive Incremental Learning for Fault Diagnosis of Rolling Bearings with Ensemble Domain Adaptation
Authors: Jiang Liu, Wei Dai
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract
Rolling bearing fault diagnosis is a practical issue across various working conditions, and the limited availability of samples compounds the challenge. Additionally, the complexity of the external environment and the structure of rolling bearings often give rise to faults characterized by randomness and fuzziness, hindering the effective extraction of fault characteristics and restricting the accuracy of fault diagnosis. To overcome these problems, this paper presents a novel approach termed constructive incremental learning-based ensemble domain adaptation (CIL-EDA). Specifically, it is implemented on stochastic configuration networks (SCN) to constructively improve their adaptive performance in multiple domains. Concretely, a cloud feature extraction method is employed in conjunction with wavelet packet decomposition (WPD) to capture the uncertainty of fault information at multiple resolutions. Subsequently, constructive incremental learning-based domain adaptation (CIL-DA) is first developed to enhance the cross-domain learning capability of each hidden node through domain matching and to construct a robust fault classifier by leveraging limited labeled data from both target and source domains. Finally, fault diagnosis results are obtained by majority voting of CIL-EDA, which integrates CIL-DA and parallel ensemble learning. Experimental results demonstrate that CIL-DA outperforms several domain adaptation methods and CIL-EDA consistently outperforms state-of-the-art fault diagnosis methods in few-shot scenarios.
Incorporating Neuro-Inspired Adaptability for Continual Learning in Artificial Intelligence
Authors: Liyuan Wang, Xingxing Zhang, Qian Li, Mingtian Zhang, Hang Su, Jun Zhu, Yi Zhong
Abstract
Continual learning aims to empower artificial intelligence (AI) with strong adaptability to the real world. For this purpose, a desirable solution should properly balance memory stability with learning plasticity, and acquire sufficient compatibility to capture the observed distributions. Existing advances mainly focus on preserving memory stability to overcome catastrophic forgetting, but remain difficult to flexibly accommodate incremental changes as biological intelligence (BI) does. By modeling a robust Drosophila learning system that actively regulates forgetting with multiple learning modules, here we propose a generic approach that appropriately attenuates old memories in parameter distributions to improve learning plasticity, and accordingly coordinates a multi-learner architecture to ensure solution compatibility. Through extensive theoretical and empirical validation, our approach not only clearly enhances the performance of continual learning, especially over synaptic regularization methods in task-incremental settings, but also potentially advances the understanding of neurological adaptive mechanisms, serving as a novel paradigm to progress AI and BI together.
Exploiting Problem Geometry in Safe Linear Bandits
Abstract
The safe linear bandit problem is a version of the classic linear bandit problem where the learner's actions must satisfy an uncertain linear constraint at all rounds. Due to its applicability to many real-world settings, this problem has received considerable attention in recent years. We find that by exploiting the geometry of the specific problem setting, we can achieve improved regret guarantees for both well-separated problem instances and action sets that are finite star convex sets. Additionally, we propose a novel algorithm for this setting that chooses problem parameters adaptively and enjoys at least as good regret guarantees as existing algorithms. Lastly, we introduce a generalization of the safe linear bandit setting where the constraints are convex and adapt our algorithms and analyses to this setting by leveraging a novel convex-analysis based approach. Simulation results show improved performance over existing algorithms for a variety of randomly sampled settings.
SALI: A Scalable Adaptive Learned Index Framework based on Probability Models
Authors: Jiake Ge, Huanchen Zhang, Boyu Shi, Yuanhui Luo, Yunda Guo, Yunpeng Chai, Yuxing Chen, Anqun Pan
Abstract
The growth in data storage capacity and the increasing demands for high performance have created several challenges for concurrent indexing structures. One promising solution is learned indexes, which use a learning-based approach to fit the distribution of stored data and predictively locate target keys, significantly improving lookup performance. Despite their advantages, prevailing learned indexes exhibit constraints and encounter issues of scalability on multi-core data storage. This paper introduces SALI, the Scalable Adaptive Learned Index framework, which incorporates two strategies aimed at achieving high scalability, improving efficiency, and enhancing the robustness of the learned index. Firstly, a set of node-evolving strategies is defined to enable the learned index to adapt to various workload skews and enhance its concurrency performance in such scenarios. Secondly, a lightweight strategy is proposed to maintain statistical information within the learned index, with the goal of further improving the scalability of the index. Furthermore, to validate their effectiveness, SALI applies the two strategies mentioned above to LIPP, a learned index structure that utilizes fine-grained write locks. The experimental results demonstrate that SALI significantly enhances the insertion throughput with 64 threads by an average of 2.04x compared to the second-best learned index. Furthermore, SALI accomplishes a lookup throughput similar to that of LIPP+.
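For intuition, a single-segment toy learned index shows the predict-then-correct lookup that learned indexes such as SALI build on; node evolution, statistics maintenance, and fine-grained locking are omitted from this sketch.

import numpy as np

keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, size=100_000))
pos = np.arange(keys.size)
a, b = np.polyfit(keys, pos, 1)                       # a one-segment linear "model"
err = int(np.max(np.abs(a * keys + b - pos))) + 1     # worst-case prediction error bound

def lookup(key):
    guess = int(a * key + b)                          # predict the position
    lo = max(0, guess - err); hi = min(keys.size, guess + err + 1)
    i = lo + np.searchsorted(keys[lo:hi], key)        # correct within the error window only
    return i if i < keys.size and keys[i] == key else -1

print(lookup(keys[1234]) == 1234)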
Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping
Abstract
Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally-activated parallel neural network modules (experts). However, serving MoE models in resource-constrained latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior pattern of MoE models in continuous inference scenarios, which leads to three key observations about the expert activations, including temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. The core of PC-MoE is a new data structure, Parameter Committee, that intelligently maintains a subset of important experts in use to reduce resource consumption. The optimal configuration of Parameter Committee is found offline by a profiling-guided committee planner, and expert swapping and request handling at runtime are managed by an adaptive committee scheduler. To evaluate the effectiveness of PC-MoE, we conduct experiments using state-of-the-art MoE models on common computer vision and natural language processing tasks. The results demonstrate optimal trade-offs between resource consumption and model accuracy achieved by PC-MoE. For instance, on object detection tasks with the Swin-MoE model, our approach can reduce memory usage and latency by 42.34% and 18.63% with only 0.10% accuracy degradation.
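The Parameter Committee can be pictured as a frequency-aware cache over experts that exploits the three observations above; the names and the eviction/substitution policy here are illustrative, not PC-MoE's actual planner and scheduler.

from collections import Counter, deque

class ParameterCommittee:
    def __init__(self, all_experts, capacity, window=256):
        self.store = all_experts             # e.g. expert weights on host memory or flash
        self.resident = dict(list(all_experts.items())[:capacity])
        self.history = deque(maxlen=window)  # sliding window of activations (temporal locality)

    def get(self, expert_id):
        self.history.append(expert_id)
        if expert_id in self.resident:
            return self.resident[expert_id]          # hit: no I/O needed
        freq = Counter(self.history)
        victim = min(self.resident, key=lambda e: freq[e])
        if freq[expert_id] <= freq[victim]:
            return self.resident[victim]             # exchangeability: substitute a resident expert
        del self.resident[victim]                    # swap: evict the coldest expert
        self.resident[expert_id] = self.store[expert_id]
        return self.resident[expert_id]

experts = {i: f"weights_{i}" for i in range(8)}
pc = ParameterCommittee(experts, capacity=3)
print(pc.get(0), pc.get(5), pc.get(5))   # hit, swap-in, then hit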
Optimization via conformal Hamiltonian systems on manifolds
Abstract
In this work we propose a method to perform optimization on manifolds. We assume to have an objective function $f$ defined on a manifold and think of it as the potential energy of a mechanical system. By adding a momentum-dependent kinetic energy we define its Hamiltonian function, which allows us to write the corresponding Hamiltonian system. We make it conformal by introducing a dissipation term: the result is the continuous model of our scheme. We solve it via splitting methods (Lie-Trotter and leapfrog): we combine the RATTLE scheme, approximating the conserved flow, with the exact dissipated flow. The result is a conformal symplectic method for constant stepsizes. We also propose an adaptive stepsize version of it. We test it on an example, the minimization of a function defined on a sphere, and compare it with the usual gradient descent method.
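A simplified stand-in for the scheme on the unit sphere: the damped (conformal) momentum flow is solved exactly, forces are projected to the tangent space, and a normalisation step takes the place of the RATTLE constraint solve. Step sizes and the test function are assumptions.

import numpy as np

def sphere_optimize(grad_f, x0, h=0.1, gamma=1.0, steps=200):
    x, p = x0 / np.linalg.norm(x0), np.zeros_like(x0)
    for _ in range(steps):
        p *= np.exp(-gamma * h)            # exact flow of the dissipation (conformal) part
        g = grad_f(x)
        p -= h * (g - (g @ x) * x)         # force projected onto the tangent space
        x = x + h * p
        x /= np.linalg.norm(x)             # retraction: enforce the constraint |x| = 1
        p -= (p @ x) * x                   # keep the momentum tangent, RATTLE-like
    return x

# minimise f(x) = x^T A x on the sphere: the minimiser is the smallest eigenvector
A = np.diag([3.0, 2.0, 0.5])
x_star = sphere_optimize(lambda x: 2 * A @ x, np.array([1.0, 1.0, 1.0]))
print(x_star)   # approximately +-e_3, the eigenvector of the smallest eigenvalue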
Group-Conditional Conformal Prediction via Quantile Regression Calibration for Crop and Weed Classification
Authors: Paul Melki (IMS), Lionel Bombrun (IMS), Boubacar Diallo, Jérôme Dias, Jean-Pierre da Costa (IMS)
Abstract
As deep learning predictive models become an integral part of a large spectrum of precision agricultural systems, a barrier to the adoption of such automated solutions is the lack of user trust in these highly complex, opaque and uncertain models. Indeed, deep neural networks are not equipped with any explicit guarantees that can be used to certify the system's performance, especially in highly varying uncontrolled environments such as the ones typically faced in computer vision for agriculture. Fortunately, certain methods developed in other communities can prove to be important for agricultural applications. This article presents the conformal prediction framework that provides valid statistical guarantees on the predictive performance of any black box prediction machine, with almost no assumptions, applied to the problem of deep visual classification of weeds and crops in real-world conditions. The framework is exposed with a focus on its practical aspects and special attention accorded to the Adaptive Prediction Sets (APS) approach that delivers marginal guarantees on the model's coverage. Marginal results are then shown to be insufficient to guarantee performance on all groups of individuals in the population as characterized by their environmental and pedo-climatic auxiliary data gathered during image acquisition. To tackle this shortcoming, group-conditional conformal approaches are presented: the "classical" method that consists of iteratively applying the APS procedure on all groups, and a proposed elegant reformulation and implementation of the procedure using quantile regression on group membership indicators. Empirical results showing the validity of the proposed approach are presented and compared to the marginal APS then discussed.
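The "classical" per-group recipe can be sketched with split-conformal calibration run separately on each group; for brevity this uses plain 1 - p scores instead of the full APS construction, and the quantile-regression reformulation is not shown.

import numpy as np

def group_thresholds(scores, labels, groups, alpha=0.1):
    """scores: (n, C) calibration softmax outputs; returns one threshold per group."""
    s_true = 1.0 - scores[np.arange(len(labels)), labels]  # nonconformity of the true class
    thr = {}
    for g in np.unique(groups):
        s_g = np.sort(s_true[groups == g])
        n = s_g.size
        q = int(np.ceil((n + 1) * (1 - alpha))) - 1        # conformal quantile index
        thr[g] = s_g[min(q, n - 1)]
    return thr

def prediction_set(probs, g, thr):
    return np.where(1.0 - probs <= thr[g])[0]   # classes whose score passes the group's bar

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=200)                 # toy calibration predictions
y = np.array([rng.choice(4, p=p) for p in P])
grp = rng.integers(0, 2, size=200)                      # e.g. a pedo-climatic group indicator
thr = group_thresholds(P, y, grp)
print(prediction_set(P[0], grp[0], thr))

Calibrating per group is exactly what restores the coverage guarantee within each group, at the cost of splitting the calibration data, which motivates the paper's quantile-regression alternative.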
Unleashing the Potential of Spiking Neural Networks for Sequential Modeling with Contextual Embedding
Authors: Xinyi Chen, Jibin Wu, Huajin Tang, Qinyuan Ren, Kay Chen Tan
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
The human brain exhibits remarkable abilities in integrating temporally distant sensory inputs for decision-making. However, existing brain-inspired spiking neural networks (SNNs) have struggled to match their biological counterpart in modeling long-term temporal relationships. To address this problem, this paper presents a novel Contextual Embedding Leaky Integrate-and-Fire (CE-LIF) spiking neuron model. Specifically, the CE-LIF model incorporates a meticulously designed contextual embedding component into the adaptive neuronal firing threshold, thereby enhancing the memory storage of spiking neurons and facilitating effective sequential modeling. Additionally, theoretical analysis is provided to elucidate how the CE-LIF model enables long-term temporal credit assignment. Remarkably, when compared to state-of-the-art recurrent SNNs, feedforward SNNs comprising the proposed CE-LIF neurons demonstrate superior performance across extensive sequential modeling tasks in terms of classification accuracy, network convergence speed, and memory capacity.
ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters
Authors: Xin Zhou, Ling Chen, Houming Wu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
As the size of models and datasets grows, it has become increasingly common to train models in parallel. However, existing distributed stochastic gradient descent (SGD) algorithms suffer from insufficient utilization of computational resources and poor convergence in heterogeneous clusters. In this paper, we propose a delayed synchronous SGD algorithm with adaptive batch size (ABS-SGD) for heterogeneous GPU clusters. In ABS-SGD, workers perform global synchronization to accumulate delayed gradients and use the accumulated delayed gradients to update parameters. While workers are performing global synchronization for delayed gradients, they perform the computation of the next batch without specifying batch size in advance, which lasts until the next global synchronization starts, realizing the full utilization of computational resources. Since the gradient delay is only one iteration, the stale gradient problem can be alleviated. We theoretically prove the convergence of ABS-SGD in heterogeneous clusters. Extensive experiments in three types of heterogeneous clusters demonstrate that ABS-SGD can make full use of computational resources and accelerate model convergence: When training ResNet18 network with 4 workers, ABS-SGD increases the convergence speed by 1.30x on average compared with the best baseline algorithm.
Knowledge-based Multiple Adaptive Spaces Fusion for Recommendation
Abstract
Since Knowledge Graphs (KGs) contain rich semantic information, recently there has been an influx of KG-enhanced recommendation methods. Most existing methods are designed entirely in Euclidean space, without considering curvature. However, recent studies have revealed that a tremendous amount of graph-structured data exhibits highly non-Euclidean properties. Motivated by these observations, in this work we propose a knowledge-based multiple adaptive spaces fusion method for recommendation, namely MCKG. Unlike existing methods that solely adopt a specific manifold, we introduce a unified space that is compatible with hyperbolic, Euclidean and spherical spaces. Furthermore, we fuse the multiple unified spaces in an attention manner to obtain high-quality embeddings for better knowledge propagation. In addition, we propose a geometry-aware optimization strategy which enables the pull and push processes to benefit from both hyperbolic and spherical spaces. Specifically, in hyperbolic space, we set smaller margins in the area near the origin, which is conducive to distinguishing between highly similar positive items and negative ones. At the same time, we set larger margins in the area far from the origin to ensure the model has sufficient error tolerance. The same applies to spherical spaces. Extensive experiments on three real-world datasets demonstrate that MCKG achieves a significant improvement over state-of-the-art recommendation methods. Further ablation experiments verify the importance of multi-space fusion and the geometry-aware optimization strategy, justifying the rationality and effectiveness of MCKG.
Bearing-based Formation with Disturbance Rejection
Authors: Haoshu Cheng, Jie Huang
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Abstract
This paper considers the problem of bearing-based formation control with disturbance rejection for a group of agents under the leader-follower structure. The disturbances are in the form of a trigonometric polynomial with arbitrary unknown amplitudes, unknown initial phases, and known or unknown frequencies. For the case of the known frequencies, we employ the canonical internal model to solve the problem, and, for the case of the unknown frequencies, we combine the canonical internal model and some distributed adaptive control technique to deal with the problem. It is noted that the existing results can only handle constant input disturbances by continuous control laws or disturbances with known bounds by discontinuous control laws. The first case is a special case of our result. The second case cannot cover our results because the bound of our disturbance is unknown. Moreover, our control law is smooth.
Adaptivity in Local Kernel Based Methods for Approximating the Action of Linear Operators
Abstract
Building on the successes of local kernel methods for approximating the solutions to partial differential equations (PDE) and the evaluation of definite integrals (quadrature/cubature), a local estimate of the error in such approximations is developed. This estimate is useful for determining locations in the solution domain where increased node density (equivalently, reduction in the spacing between nodes) can decrease the error in the solution. An adaptive procedure for adding nodes to the domain for both the approximation of derivatives and the approximate evaluation of definite integrals is described. This method efficiently computes the error estimate at a set of prescribed points and adds new nodes for approximation where the error is too large. Computational experiments demonstrate close agreement between the error estimate and actual absolute error in the approximation. Such methods are necessary or desirable when approximating solutions to PDE (or in the case of quadrature/cubature), where the initial data and subsequent solution (or integrand) exhibit localized features that require significant refinement to resolve and where uniformly increasing the density of nodes across the entire computational domain is not possible or is too burdensome.
RED: A Systematic Real-Time Scheduling Approach for Robotic Environmental Dynamics
Authors: Zexin Li, Tao Ren, Xiaoxi He, Cong Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract
Intelligent robots are designed to effectively navigate dynamic and unpredictable environments laden with moving mechanical elements and objects. Such environment-induced dynamics, including moving obstacles, can readily alter the computational demand (e.g., the creation of new tasks) and the structure of workloads (e.g., precedence constraints among tasks) during runtime, thereby adversely affecting overall system performance. This challenge is amplified when multi-task inference is expected on robots operating under stringent resource and real-time constraints. To address such a challenge, we introduce RED, a systematic real-time scheduling approach designed to support multi-task deep neural network workloads in resource-limited robotic systems. It is designed to adaptively manage the Robotic Environmental Dynamics (RED) while adhering to real-time constraints. At the core of RED lies a deadline-based scheduler that employs an intermediate deadline assignment policy, effectively managing changing workloads and asynchronous inference prompted by complex, unpredictable environments. This scheduling framework also facilitates the flexible deployment of MIMONet (multi-input multi-output neural networks), which are commonly utilized in multi-tasking robotic systems to circumvent memory bottlenecks. Building on this scheduling framework, RED recognizes and leverages a unique characteristic of MIMONet: its weight-shared architecture. To further accommodate and exploit this feature, RED devises a novel and effective workload refinement and reconstruction process. This process ensures the scheduling framework's compatibility with MIMONet and maximizes efficiency.
An Adaptive Tangent Feature Perspective of Neural Networks
Authors: Daniel LeJeune, Sina Alemohammad
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In order to better understand feature learning in neural networks, we propose a framework for understanding linear models in tangent feature space where the features are allowed to be transformed during training. We consider linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear interpolation constraint. We show that this optimization problem has an equivalent linearly constrained optimization with structured regularization that encourages approximately low rank solutions. Specializing to neural network structure, we gain insights into how the features and thus the kernel function change, providing additional nuance to the phenomenon of kernel alignment when the target function is poorly represented using tangent features. In addition to verifying our theoretical observations in real neural networks on a simple regression problem, we empirically show that an adaptive feature implementation of tangent feature classification has an order of magnitude lower sample complexity than the fixed tangent feature model on MNIST and CIFAR-10.
Keyword: quantization
MEMORY-VQ: Compression for Tractable Internet-Scale Memory
Authors: Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie
Abstract
Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.
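The core of VQ-based memory compression fits in a few lines; the toy sizes and the plain nearest-codeword quantizer below stand in for the paper's learned VQ-VAE codebook.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))         # 256 codewords -> 1 byte per stored vector

def compress(H):                              # H: (n_tokens, 64) pre-computed token memories
    d2 = (H**2).sum(1)[:, None] - 2 * H @ codebook.T + (codebook**2).sum(1)[None, :]
    return d2.argmin(axis=1).astype(np.uint8) # the stored "memory" is just these codes

def reconstruct(codes):
    return codebook[codes]                    # decode on demand at inference time

H = rng.normal(size=(1000, 64))
codes = compress(H)
print(H.nbytes / codes.nbytes, "x smaller (excluding the shared codebook)")

In practice high compression rates like the reported 16x typically come from product quantization over subvectors rather than a single codebook, but the store-codes, decode-on-read pattern is the same.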
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Authors: Samuel Horvath, Stefanos Laskaridis, Shashank Rajput, Hongyi Wang
Abstract
Deep Neural Networks (DNNs) have been a large driver and enabler for AI breakthroughs in recent years. These models have been getting larger in their attempt to become more accurate and tackle new upcoming use-cases, including AR/VR and intelligent assistants. However, training such large models is costly and time-consuming, and typically yields a single model to fit all targets. To mitigate this, various techniques have been proposed in the literature, including pruning, sparsification or quantization of the model weights and updates. While able to achieve high compression rates, these techniques often incur computational overheads or accuracy penalties. Alternatively, factorization methods have been leveraged to incorporate low-rank compression in the training process. However, such techniques (e.g., SVD) frequently rely on the computationally expensive decomposition of layers and are potentially sub-optimal for non-linear models, such as DNNs. In this work, we take a further step in designing efficient low-rank models and propose Maestro, a framework for trainable low-rank layers. Instead of regularly applying a priori decompositions such as SVD, the low-rank structure is built into the training process through a generalized variant of Ordered Dropout. This method imposes an importance ordering via sampling on the decomposed DNN structure. Our theoretical analysis demonstrates that our method recovers the SVD decomposition of a linear mapping on uniformly distributed data and PCA for linear autoencoders. We further apply our technique to DNNs and empirically illustrate that Maestro enables the extraction of lower-footprint models that preserve model performance while allowing for a graceful accuracy-latency tradeoff for deployment to devices of different capabilities.
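The central mechanism can be sketched as a factorized layer with a sampled prefix rank, an assumed simplification of the generalized Ordered Dropout described above; at deployment the layer is truncated to whatever prefix rank fits the device.

import torch
import torch.nn as nn

class OrderedLowRankLinear(nn.Module):
    def __init__(self, d_in, d_out, max_rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, max_rank) / max_rank**0.5)
        self.V = nn.Parameter(torch.randn(d_in, max_rank) / d_in**0.5)
        self.max_rank = max_rank

    def forward(self, x, r=None):
        if r is None:
            # sample a prefix rank: earlier components are trained more often,
            # which imposes the importance ordering
            r = torch.randint(1, self.max_rank + 1, (1,)).item()
        return (x @ self.V[:, :r]) @ self.U[:, :r].T

layer = OrderedLowRankLinear(32, 16, max_rank=8)
x = torch.randn(4, 32)
y_train = layer(x)        # random prefix rank during training
y_small = layer(x, r=2)   # low-footprint deployment: keep only the first 2 components
print(y_train.shape, y_small.shape)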
Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation
Abstract
Graph Neural Network (GNN) training and inference involve significant challenges of scalability with respect to both model sizes and number of layers, resulting in degradation of efficiency and accuracy for large and deep GNNs. We present an end-to-end solution that aims to address these challenges for efficient GNNs in resource constrained environments while avoiding the oversmoothing problem in deep GNNs. We introduce a quantization based approach for all stages of GNNs, from message passing in training to node classification, compressing the model and enabling efficient processing. The proposed GNN quantizer learns quantization ranges and reduces the model size with comparable accuracy even under low-bit quantization. To scale with the number of layers, we devise a message propagation mechanism in training that controls layer-wise changes of similarities between neighboring nodes. This objective is incorporated into a Lagrangian function with constraints and a differential multiplier method is utilized to iteratively find optimal embeddings. This mitigates oversmoothing and suppresses the quantization error to a bound. Significant improvements are demonstrated over state-of-the-art quantization methods and deep GNN approaches in both full-precision and quantized models. The proposed quantizer demonstrates superior performance in INT2 configurations across all stages of GNN, achieving a notable level of accuracy. In contrast, existing quantization approaches fail to generate satisfactory accuracy levels. Finally, the inference with INT2 and INT4 representations exhibits a speedup of 5.11 $\times$ and 4.70 $\times$ compared to full precision counterparts, respectively.
Continual Learning for Generative Retrieval over Dynamic Corpora
Abstract
Generative retrieval (GR) directly predicts the identifiers of relevant documents (i.e., docids) based on a parametric model. It has achieved solid performance on many ad-hoc retrieval tasks. So far, these tasks have assumed a static document collection. In many practical scenarios, however, document collections are dynamic, where new documents are continuously added to the corpus. The ability to incrementally index new documents while preserving the ability to answer queries with both previously and newly indexed relevant documents is vital to applying GR models. In this paper, we address this practical continual learning problem for GR. We put forward a novel Continual-LEarner for generatiVE Retrieval (CLEVER) model and make two major contributions to continual learning for GR: (i) To encode new documents into docids with low computational cost, we present Incremental Product Quantization, which updates a partial quantization codebook according to two adaptive thresholds; and (ii) To memorize new documents for querying without forgetting previous knowledge, we propose a memory-augmented learning mechanism, to form meaningful connections between old and new documents. Empirical results demonstrate the effectiveness and efficiency of the proposed model.
On-Device Learning with Binary Neural Networks
Authors: Lorenzo Vorabbi, Davide Maltoni, Stefano Santi
Abstract
Existing Continual Learning (CL) solutions only partially address the constraints on power, memory and computation of deep learning models when deployed on low-power embedded CPUs. In this paper, we propose a CL solution that embraces recent advancements in the CL field and the efficiency of Binary Neural Networks (BNNs), which use 1-bit weights and activations to efficiently execute deep learning models. We propose a hybrid quantization of CWR* (an effective CL approach) that treats the forward and backward passes differently in order to retain more precision during the gradient update step while minimizing the latency overhead. The choice of a binary network as backbone is essential to meet the constraints of low-power devices and, to the best of the authors' knowledge, this is the first attempt to demonstrate on-device learning with BNNs. The experimental validation carried out confirms the validity and suitability of the proposed method.
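The binary-forward, higher-precision-backward split can be sketched with a straight-through estimator; this shows only the quantization ingredient, with the CWR*-specific continual learning logic left out.

import torch
import torch.nn as nn

class BinaryLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_out, d_in) * 0.1)  # latent full-precision weights

    def forward(self, x):
        w_bin = torch.sign(self.w)                   # 1-bit weights for the forward pass
        # straight-through estimator: forward uses sign(w), backward sees identity,
        # so gradient updates retain full precision on the latent weights
        w_ste = self.w + (w_bin - self.w).detach()
        return x @ w_ste.T

layer = BinaryLinear(16, 4)
out = layer(torch.randn(8, 16))
out.sum().backward()
print(layer.w.grad.shape)   # full-precision gradients flow to the latent weights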
Keyword: efficient
Generating tabular datasets under differential privacy
CLNeRF: Continual Learning Meets NeRF
Continual Learning with Dynamic Sparse Training: Exploring Algorithms for Effective Model Updates
Graph Analytics on Evolving Data (Abstract)
Scalable and Configurable Tracking for Any Rowhammer Threshold
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
Uncertainty-driven Affordance Discovery for Efficient Robotics Manipulation
On Reward Structures of Markov Decision Processes
Optimal Economic Gas Turbine Dispatch with Deep Reinforcement Learning
Monus semantics in vector addition systems with states
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Auto-Prompting SAM for Mobile Friendly 3D Medical Image Segmentation
Entropy-based Guidance of Deep Neural Networks for Accelerated Convergence and Improved Performance
Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation
Robust topology optimisation of lattice structures with spatially correlated uncertainties
Streaming Compression of Scientific Data via weak-SINDy
offline" compression algorithms that perform compression on a readily available data set, streaming compression algorithms compress data
online" while the data generated from simulation or experiments is still flowing through the system. This feature makes streaming compression algorithms well-suited for scientific data compression, where storing the full data set offline is often infeasible. This work proposes a new streaming compression algorithm, streaming weak-SINDy, which takes advantage of the underlying data characteristics during compression. The streaming weak-SINDy algorithm constructs feature matrices and target vectors in the online stage via a streaming integration method in a memory efficient manner. The feature matrices and target vectors are then used in the offline stage to build a model through a regression process that aims to recover equations that govern the evolution of the data. For compressing high-dimensional streaming data, we adopt a streaming proper orthogonal decomposition (POD) process to reduce the data dimension and then use the streaming weak-SINDy algorithm to compress the temporal data of the POD expansion. We propose modifications to the streaming weak-SINDy algorithm to accommodate the dynamically updated POD basis. By combining the built model from the streaming weak-SINDy algorithm and a small amount of data samples, the full data flow could be reconstructed accurately at a low memory cost, as shown in the numerical tests.CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction
Reprogramming under constraints: Revisiting efficient and reliable transferability of lottery tickets
Distributed multi-agent target search and tracking with Gaussian process and reinforcement learning
Generative Model for Models: Rapid DNN Customization for Diverse Tasks and Resource Constraints
PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
Fast immersed boundary method based on weighted quadrature
R^3: On-device Real-Time Deep Reinforcement Learning for Autonomous Robotics
Motion Priority Optimization Framework towards Automated and Teleoperated Robot Cooperation in Industrial Recovery Scenarios
Better Prefix Authentication
Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli
Learning to Upsample by Learning to Sample
FedChain: An Efficient and Secure Consensus Protocol based on Proof of Useful Federated Learning for Blockchain
Probabilistic Dataset Reconstruction from Interpretable Models
Mixup-Augmented Meta-Learning for Sample-Efficient Fine-Tuning of Protein Simulators
SpikeBERT: A Language Spikformer Trained with Two-Stage Knowledge Distillation from BERT
Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
MCMS-RBM: Multi-Component Multi-State Reduced Basis Method toward Efficient Transition Pathway Identification for Crystals and Quasicrystals
Efficient Almost-Egalitarian Allocation of Goods and Bads
Benchmarking the Generation of Fact Checking Explanations
Structural Node Embeddings with Homomorphism Counts
MSFlow: Multi-Scale Flow-based Framework for Unsupervised Anomaly Detection
Compositional maps for registration in complex geometries
On-Device Learning with Binary Neural Networks
Practice of Alibaba Cloud on Elastic Resource Provisioning for Large-scale Microservices Cluster
Adaptivity in Local Kernel Based Methods for Approximating the Action of Linear Operators
Enhancing Robot Learning through Learned Human-Attention Feature Maps
IndGIC: Supervised Action Recognition under Low Illumination
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation
A Reduced-Order Model for Nonlinear Radiative Transfer Problems Based on Moment Equations and POD-Petrov-Galerkin Projection of the Normalized Boltzmann Transport Equation
Bayesian Integration of Information Using Top-Down Modulated WTA Networks
Adversarial Low Degree Testing
... "erasures" are replaced with "corruptions". We show that, in the $t$-online-erasure model, for a prime power $q$, given query access to a function $f: \mathbb{F}_q^n \to \mathbb{F}_q$, one can distinguish in $\mathrm{poly}(\log^{d+q}(t)/\delta)$ queries between the case that $f$ has degree at most $d$ and the case that $f$ is $\delta$-far from any degree-$d$ function (with respect to the fractional Hamming distance). This answers the aforementioned questions and brings the query complexity to nearly match that of low-degree testing in the classical property testing model. Our results are based on the observation that the property of low-degreeness admits a large and versatile family of query-efficient testers. Our tester operates by querying a uniformly random, sufficiently large set of points in a large enough affine subspace, and finding a tester for low-degreeness that only utilizes queries from that set of points. We believe that this tester may find other applications to algorithms in the online-erasure model or other related models, and may be of independent interest.
On the hardness of inclusion-wise minimal separators enumeration
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer
Canonical Factors for Hybrid Neural Fields
A Comparative Study of Loss Functions: Traffic Predictions in Regular and Congestion Scenarios
Graph Theory and its Uses in Graph Algorithms and Beyond
A General-Purpose Self-Supervised Model for Computational Pathology
Keyword: faster
CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
Generative Model for Models: Rapid DNN Customization for Diverse Tasks and Resource Constraints
Massively Parallel Continuous Local Search for Hybrid SAT Solving on GPUs
Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli
Best Memory Architecture Exploration under Parameters Variations accelerated with Machine Learning
CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
Enhancing Robot Learning through Learned Human-Attention Feature Maps
Keyword: mobile
SynthDistill: Face Recognition with Knowledge Distillation from Synthetic Data
Improving Reinforcement Learning Training Regimes for Social Robot Navigation
Satellite-MEC Integration for 6G Internet of Things: Minimal Structures, Advances, and Prospects
LAMBO: Large Language Model Empowered Edge Intelligence
A lightweight 3D dense facial landmark estimation model from position map data
Empowering LLM to use Smartphone for Intelligent Task Automation
Keyword: pruning
Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition