Abstract
Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing. However, they often use a restricted search space of data preprocessing pipelines, which limits the potential performance gains, and they are often too slow as they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we transform and relax the discrete, non-differentiable search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent while training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
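To make the relaxation concrete, here is a minimal sketch (ours, not the authors' code): candidate preprocessing operators for one pipeline step are mixed with softmax weights so the discrete choice becomes differentiable and trainable by gradient descent. The operators and shapes are illustrative assumptions.

```python
import torch

# Hypothetical candidate operators for one preprocessing step
# (differentiable stand-ins for, e.g., different normalizations).
ops = [
    lambda x: (x - x.mean(0)) / (x.std(0) + 1e-8),                              # standardize
    lambda x: (x - x.min(0).values) / (x.max(0).values - x.min(0).values + 1e-8),  # min-max
    lambda x: x,                                                                 # identity
]

alpha = torch.zeros(len(ops), requires_grad=True)  # pipeline (architecture) parameters

def preprocess(x):
    # Relaxation: a softmax-weighted mixture of all candidate operators,
    # so the pipeline choice is differentiable in alpha.
    w = torch.softmax(alpha, dim=0)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = torch.randn(32, 8)                      # a toy batch of tabular features
model = torch.nn.Linear(8, 2)
loss = torch.nn.functional.cross_entropy(model(preprocess(x)), torch.randint(0, 2, (32,)))
loss.backward()                             # gradients flow to both model and alpha
```

In the bi-level formulation above, the model weights would be updated on the training loss and the pipeline parameters `alpha` on a validation loss, within a single training run.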
Deep Semi-supervised Anomaly Detection with Metapath-based Context Knowledge
Authors: Hwan Kim, Junghoon Kim, Byung Suk Lee, Sungsu Lim
Abstract
Graph anomaly detection has attracted considerable attention in recent years. This paper introduces a novel approach that leverages metapath-based semi-supervised learning, addressing the limitations of previous methods. We present a new framework, Metapath-based Semi-supervised Anomaly Detection (MSAD), incorporating GCN layers in both the encoder and decoder to efficiently propagate context information between abnormal and normal nodes. The design of metapath-based context information and a specifically crafted anomaly community enhance the process of learning differences in structures and attributes, both globally and locally. Through a comprehensive set of experiments conducted on seven real-world networks, this paper demonstrates the superiority of the MSAD method compared to state-of-the-art techniques. The promising results of this study pave the way for future investigations, focusing on the optimization and analysis of metapath patterns to further enhance the effectiveness of anomaly detection on attributed networks.
Humanoid Robot Co-Design: Coupling Hardware Design with Gait Generation via Hybrid Zero Dynamics
Authors: Adrian B. Ghansah, Jeeseop Kim, Maegan Tucker, Aaron D. Ames
Subjects: Robotics (cs.RO); Optimization and Control (math.OC)
Abstract
Selecting robot design parameters can be challenging since these parameters are often coupled with the performance of the controller and, therefore, the resulting capabilities of the robot. This leads to a time-consuming and often expensive process whereby one iterates between designing the robot and manually evaluating its capabilities. This is particularly challenging for bipedal robots, where it can be difficult to evaluate the behavior of the system due to the underlying nonlinear and hybrid dynamics. Thus, in an effort to streamline the design process of bipedal robots, and maximize their performance, this paper presents a systematic framework for the co-design of humanoid robots and their associated walking gaits. To this end, we leverage the framework of hybrid zero dynamics (HZD) gait generation, which gives a formal approach to the generation of dynamic walking gaits. The key novelty of this paper is to consider the virtual constraints associated with the actuators of the robot coupled with design virtual constraints that encode the associated parameters of the robot to be designed. These virtual constraints are combined in an HZD optimization problem which simultaneously determines the design parameters while finding a stable walking gait that minimizes a given cost function. The proposed approach is demonstrated through the design of a novel humanoid robot, ADAM, wherein its thigh and shin are co-designed so as to yield energy-efficient bipedal locomotion.
Decentralized Multi-Robot Social Navigation in Constrained Environments via Game-Theoretic Control Barrier Functions
Authors: Rohan Chandra, Vrushabh Zinage, Efstathios Bakolas, Joydeep Biswas, Peter Stone
Subjects: Robotics (cs.RO); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract
We present an approach to ensure safe and deadlock-free navigation for decentralized multi-robot systems operating in constrained environments, including doorways and intersections. Although many solutions have been proposed to ensure safety, preventing deadlocks in a decentralized fashion without global consensus remains an open problem. We first formalize the objective as a non-cooperative, non-communicative, partially observable multi-robot navigation problem in constrained spaces with multiple conflicting agents, which we term social mini-games. Our approach to ensuring safety and liveness rests on two novel insights: (i) deadlock resolution is equivalent to deriving a mixed-Nash equilibrium solution to a social mini-game and (ii) this mixed-Nash strategy can be interpreted as an analogue to control barrier functions (CBFs), which can then be integrated with standard CBFs, inheriting their safety guarantees. Together, the standard CBF along with the mixed-Nash CBF analogue preserves both safety and liveness. We evaluate our proposed game-theoretic navigation algorithm in simulation as well as on physical robots using F1/10 robots, a Clearpath Jackal, and a Boston Dynamics Spot in doorway, corridor-intersection, roundabout, and hallway scenarios. We show that (i) our approach results in safer and more efficient navigation compared to local planners based on geometrical constraints, optimization, multi-agent reinforcement learning, and auctions, (ii) our deadlock resolution strategy is the smoothest in terms of smallest average change in velocity and path deviation, and most efficient in terms of makespan, and (iii) our approach yields a flow rate of 2.8-3.3 (ms)^{-1}, which is comparable to the flow rate of human navigation at 4 (ms)^{-1}.
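For context, the safety guarantee that the mixed-Nash analogue inherits comes from the standard control-barrier-function condition, sketched below; this is textbook background, not the paper's mixed-Nash construction.

```latex
% Safe set C = {x : h(x) >= 0} for control-affine dynamics x' = f(x) + g(x)u.
% Any controller satisfying the constraint below renders C forward-invariant:
\[
\sup_{u \in U} \Big[ \nabla h(x)^{\top} \big( f(x) + g(x)\,u \big) \Big]
\;\ge\; -\alpha\big(h(x)\big),
\]
% where alpha is an extended class-K function.
```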
Optimizing Sectorized Wireless Networks: Model, Analysis, and Algorithm
Abstract
Future wireless networks need to support the increasing demands for high data rates and improved coverage. One promising solution is sectorization, where an infrastructure node (e.g., a base station) is equipped with multiple sectors employing directional communication. Although the concept of sectorization is not new, it is critical to fully understand the potential of sectorized networks, such as the rate gain achieved when multiple sectors can be simultaneously activated. In this paper, we focus on sectorized wireless networks, where sectorized infrastructure nodes with beam-steering capabilities form a multi-hop mesh network for data forwarding and routing. We present a sectorized node model and characterize the capacity region of these sectorized networks. We define the flow extension ratio and the corresponding sectorization gain, which quantitatively measure the performance gain introduced by node sectorization as a function of the network flow. Our objective is to find the optimal sectorization of each node that achieves the maximum flow extension ratio, and thus the sectorization gain. Towards this goal, we formulate the corresponding optimization problem and develop an efficient distributed algorithm that obtains the node sectorization under a given network flow with an approximation ratio of 2/3. Through extensive simulations, we evaluate the sectorization gain and the performance of the proposed algorithm in various network scenarios with varying network flows. The simulation results show that the approximate sectorization gain increases sublinearly as a function of the number of sectors per node.
ERA*: Enhanced Relaxed A* algorithm for Solving the Shortest Path Problem in Regular Grid Maps
Abstract
This paper introduces a novel algorithm for solving the point-to-point shortest path problem in a static regular 8-neighbor connectivity (G8) grid. This algorithm can be seen as a generalization of Hadlock's algorithm to G8 grids, and is shown to be theoretically equivalent to the relaxed $A^*$ ($RA^*$) algorithm in terms of the provided solution's path length, but with substantial time and memory savings, due to a completely different computation strategy, based on defining a set of lookup matrices. Through an experimental study on grid maps of various types and sizes (1290 runs on 43 maps), it is shown to be, on average, 2.25 times faster than $RA^*$ and 17 times faster than the original $A^*$. Moreover, it is more memory-efficient, since it does not need to store a G score matrix.
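As background on why path lengths in G8 grids admit closed forms (the setting in which the $RA^*$ equivalence is stated), the octile-distance formula below gives the optimal length between two cells in an obstacle-free G8 grid; this is standard grid-search material, not the paper's lookup-matrix scheme.

```python
import math

def octile_distance(x1, y1, x2, y2):
    """Shortest-path length between two cells in an obstacle-free G8 grid
    (unit cost for straight moves, sqrt(2) for diagonal moves)."""
    dx, dy = abs(x2 - x1), abs(y2 - y1)
    return max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)

print(octile_distance(0, 0, 5, 3))  # 3 diagonal moves + 2 straight moves
```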
Abstract
In today's world, with the rise of numerous social platforms, it has become relatively easy for anyone to spread false information and lure people into traps. Fraudulent schemes and traps are growing rapidly in the investment world. Due to this, countries and individuals face huge financial risks. We present an awareness system that uses machine learning and gamification techniques to educate people about investment scams and traps. Our system applies machine learning techniques to provide a personalized learning experience to the user. The system chooses distinct game-design elements and scams from the knowledge pool crafted by domain experts for each individual. The objective of the research project is to reduce inequalities in all countries by educating investors via Active Learning. Our goal is to assist the regulators in assuring a conducive environment for a fair, efficient, and inclusive capital market. In the paper, we discuss the impact of the problem, provide implementation details, and showcase the potential of the system through preliminary experiments and results.
GSA to HDL: Towards principled generation of dynamically scheduled circuits
Abstract
High-level synthesis (HLS) refers to the automatic translation of a software program written in a high-level language into a hardware design. Modern HLS tools have moved away from the traditional approach of static (compile time) scheduling of operations to generating dynamic circuits that schedule operations at run time. Such circuits trade off area utilisation for increased dynamism and throughput. However, existing lowering flows in dynamically scheduled HLS tools rely on conservative assumptions about their input program, due to both the intermediate representations (IRs) utilised and the lack of formal specifications for the translation into hardware. These assumptions cause suboptimal hardware performance. In this work, we lift these assumptions by proposing a new and efficient abstraction for hardware mapping, namely h-GSA, an extension of the Gated Static Single Assignment (GSA) IR. Using this abstraction, we propose a lowering flow that transforms GSA into h-GSA and maps h-GSA into dynamically scheduled hardware circuits. We compare the schedules generated by our approach to those of the state-of-the-art dynamic-scheduling HLS tool, Dynamatic, and illustrate the potential performance improvement from hardware mapping using the proposed abstraction.
Neural Amortized Inference for Nested Multi-agent Reasoning
Authors: Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua B. Tenenbaum, Tianmin Shu
Abstract
Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy.
Matrix Completion over Finite Fields: Bounds and Belief Propagation Algorithms
Authors: Mahdi Soleymani, Qiang Liu, Hessam Mahdavifar, Laura Balzano
Abstract
We consider the low-rank matrix completion problem over finite fields. This problem has been extensively studied in the domain of real/complex numbers; however, to the best of the authors' knowledge, there exists merely one efficient algorithm to tackle the problem in the binary field, due to Saunderson et al. [1]. In this paper, we improve upon the theoretical guarantees for the algorithm provided in [1]. Furthermore, we formulate a new graphical model for the matrix completion problem over the finite field of size $q$, $\Bbb{F}_q$, and present a message passing (MP) based approach to solve this problem. The proposed algorithm is the first one for the considered matrix completion problem over finite fields of arbitrary size. Our proposed method has a significantly lower computational complexity, reducing it from $O(n^{2r+3})$ in [1] down to $O(n^2)$ (where the underlying matrix has dimension $n \times n$ and $r$ denotes its rank), while also improving the performance.
Embedded Object Detection and Mapping in Soft Materials Using Optical Tactile Sensing
Authors: Jose A. Solano-Castellanos, Won Kyung Do, Monroe Kennedy III
Abstract
In this paper, we present a methodology that uses an optical tactile sensor for efficient tactile exploration of embedded objects within soft materials. The methodology consists of an exploration phase, where a probabilistic estimate of the location of the embedded objects is built using a Bayesian approach. The exploration phase is then followed by a mapping phase, which exploits the probabilistic map to reconstruct the underlying topography of the workspace by sampling in more detail the regions where embedded objects are expected. To demonstrate the effectiveness of the method, we tested our approach on an experimental setup that consists of a series of quartz beads located underneath a polyethylene foam that prevents direct observation of the configuration and requires the use of tactile exploration to recover the location of the beads. We show the performance of our methodology using ten different configurations of the beads, where the proposed approach is able to approximate the underlying configuration. We benchmark our results against a random sampling policy.
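A minimal sketch of the kind of Bayesian map update used in such an exploration phase, assuming a simple binary tactile sensor with made-up hit/miss likelihoods; the paper's actual sensor model may differ.

```python
import numpy as np

# Probabilistic map over a discretised workspace: prior probability that
# each cell contains an embedded object.
p_obj = np.full((20, 20), 0.5)
P_HIT_GIVEN_OBJ = 0.9    # tactile "hard contact" likelihoods (assumed values)
P_HIT_GIVEN_FREE = 0.1

def update_cell(p, hit):
    # Bayes rule for one probed cell given a binary tactile observation.
    like_obj = P_HIT_GIVEN_OBJ if hit else 1 - P_HIT_GIVEN_OBJ
    like_free = P_HIT_GIVEN_FREE if hit else 1 - P_HIT_GIVEN_FREE
    return like_obj * p / (like_obj * p + like_free * (1 - p))

p_obj[5, 7] = update_cell(p_obj[5, 7], hit=True)   # probe felt something stiff
p_obj[5, 8] = update_cell(p_obj[5, 8], hit=False)  # probe felt only foam
```

The mapping phase would then probe most densely where `p_obj` is highest.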
Collaborative Route Planning of UAVs, Workers and Cars for Crowdsensing in Disaster Response
Authors: Lei Han, Chunyu Tu, Zhiwen Yu, Zhiyong Yu, Weihua Shan, Liang Wang, Bin Guo
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract
Efficiently obtaining up-to-date information in a disaster-stricken area is key to successful disaster response. Unmanned aerial vehicles (UAVs), workers and cars can collaborate to accomplish sensing tasks, such as data collection, in disaster-stricken areas. In this paper, we explicitly address the route planning for a group of agents, including UAVs, workers, and cars, with the goal of maximizing the task completion rate. We propose MANF-RL-RP, a heterogeneous multi-agent route planning algorithm that incorporates several efficient designs, including global-local dual information processing and a tailored model structure for heterogeneous multi-agent systems. Global-local dual information processing encompasses the extraction and dissemination of spatial features from global information, as well as the partitioning and filtering of local information from individual agents. Regarding the construction of the model structure for heterogeneous multi-agent systems, we perform the following work. We design the same data structure to represent the states of different agents, prove the Markovian property of the decision-making process of agents to simplify the model structure, and design a reasonable reward function to train the model. Finally, we conducted detailed experiments based on rich simulation data. In comparison to the baseline algorithms, namely Greedy-SC-RP and MANF-DNN-RP, MANF-RL-RP exhibits a significant improvement in terms of task completion rate.
Meta-Stock: Task-Difficulty-Adaptive Meta-learning for Sub-new Stock Price Prediction
Authors: Linghao Wang, Zhen Liu, Peitian Ma, Qianli Ma
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Sub-new stock price prediction, forecasting the price trends of stocks listed for less than one year, is crucial for effective quantitative trading. While deep learning methods have demonstrated effectiveness in predicting old stock prices, they require large training datasets unavailable for sub-new stocks. In this paper, we propose Meta-Stock: a task-difficulty-adaptive meta-learning approach for sub-new stock price prediction. Leveraging prediction tasks formulated from old stocks, our meta-learning method aims to acquire the fast generalization ability that can be further adapted to sub-new stock price prediction tasks, thereby solving the data scarcity of sub-new stocks. Moreover, we enhance the meta-learning process by incorporating an adaptive learning strategy sensitive to varying task difficulties. Through the wavelet transform, we extract high-frequency coefficients to manifest stock price volatility. This allows the meta-learning model to assign gradient weights based on volatility-quantified task difficulty. Extensive experiments on datasets collected from three stock markets spanning twenty-two years show that our Meta-Stock significantly outperforms previous methods and manifests strong applicability in real-world stock trading. In addition, we evaluate the soundness of the task difficulty quantification and the effectiveness of the adaptive learning strategy.
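A small sketch of the volatility signal described above, assuming a Haar wavelet on log-returns; the paper's exact wavelet and normalisation may differ.

```python
import numpy as np
import pywt

def volatility_score(prices):
    """Quantify a stock's short-term volatility from the high-frequency
    (detail) coefficients of a discrete wavelet transform -- the kind of
    task-difficulty signal used to weight gradients during meta-learning."""
    returns = np.diff(np.log(prices))
    _, detail = pywt.dwt(returns, "haar")   # high-frequency coefficients
    return float(np.mean(detail ** 2))

easy = volatility_score(np.linspace(100, 110, 64))             # smooth trend
hard = volatility_score(100 + np.cumsum(np.random.randn(64)))  # noisy walk
# A higher score marks a harder task, which would receive a larger gradient weight.
```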
Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection
Authors: Masato Tamura
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection. Tremendous efforts have been put towards developing anomaly detectors due to their potential industrial applications. Considering the difficulty in acquiring various anomalous samples for training, most existing methods train models with only normal samples and measure discrepancies from the distribution of normal samples during inference, which requires training a model for each object category. The problem of this inefficient training requirement has been tackled by designing a CLIP-based anomaly detector that applies prompt-guided classification to each part of an image in a sliding window manner. However, the method still suffers from the labor of careful prompt ensembling with known object categories. To overcome the issues above, we propose leveraging CLIP as a data source for training. Our method generates text embeddings with the text encoder in CLIP using typical prompts that include the words "normal" and "anomaly". In addition to these words, we insert several randomly generated words into the prompts, which enables the encoder to generate a diverse set of normal and anomalous samples. Using the generated embeddings as training data, a feed-forward neural network learns to extract features of the normal and anomalous states from CLIP's embeddings, and as a result, a category-agnostic anomaly detector can be obtained without any training images. Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.
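A minimal sketch of the data-generation step, assuming the Hugging Face CLIP checkpoint and an illustrative prompt template (the paper's exact template may differ): random filler words are inserted around the state words to diversify the embeddings.

```python
import random
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def random_word(n=6):
    return "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=n))

# Prompts with a state word plus random filler words; the template here
# is an illustrative assumption, not the paper's exact one.
prompts, labels = [], []
for state, label in [("normal", 0), ("anomalous", 1)]:
    for _ in range(100):
        prompts.append(f"a {random_word()} photo of a {state} {random_word()}")
        labels.append(label)

with torch.no_grad():
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**inputs)  # training data for the detector
```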
Efficient View Synthesis with Neural Radiance Distribution Field
Authors: Yushuang Wu, Xiao Li, Jinglu Wang, Xiaoguang Han, Shuguang Cui, Yan Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent work on Neural Radiance Fields (NeRF) has demonstrated significant advances in high-quality view synthesis. A major limitation of NeRF is its low rendering efficiency due to the need for multiple network forwardings to render a single pixel. Existing methods to improve NeRF either reduce the number of required samples or optimize the implementation to accelerate the network forwarding. Despite these efforts, the problem of multiple sampling persists due to the intrinsic representation of radiance fields. In contrast, Neural Light Fields (NeLF) reduce the computation cost of NeRF by querying only a single network forwarding per pixel. To achieve a close visual quality to NeRF, existing NeLF methods require significantly larger network capacities, which limits their rendering efficiency in practice. In this work, we propose a new representation called Neural Radiance Distribution Field (NeRDF) that targets efficient real-time view synthesis. Specifically, we use a small network similar to NeRF while preserving the rendering speed of a single network forwarding per pixel as in NeLF. The key is to model the radiance distribution along each ray with a frequency basis and predict the frequency weights using the network. Pixel values are then computed via volume rendering on the radiance distributions. Experiments show that our proposed method offers a better trade-off among speed, quality, and network size than existing methods: we achieve a ~254x speed-up over NeRF with similar network size, with only a marginal performance decline. Our project page is at yushuang-wu.github.io/NeRDF.
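A toy sketch of the core idea, assuming a cosine frequency basis and random stand-ins for the network-predicted weights: the ray's density/radiance profile is reconstructed from the weights, then integrated with the standard volume-rendering quadrature.

```python
import numpy as np

K, N = 8, 64                        # number of basis functions / ray samples
t = np.linspace(0.0, 1.0, N)        # normalised depths along the ray
basis = np.stack([np.cos(np.pi * k * t) for k in range(K)])  # (K, N) frequency basis

w_sigma = np.abs(np.random.randn(K))        # density weights (network output stand-in)
w_rgb = np.random.rand(K, 3)                # per-channel radiance weights

sigma = np.maximum(w_sigma @ basis, 0.0)    # density profile along the ray
rgb = np.clip(w_rgb.T @ basis, 0.0, 1.0).T  # (N, 3) radiance along the ray

# Standard volume-rendering quadrature (as in NeRF), but from a single
# network forwarding that produced the weights above.
delta = 1.0 / N
alpha = 1.0 - np.exp(-sigma * delta)
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
pixel = (trans * alpha) @ rgb               # final RGB for this pixel
```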
Exploring Unsupervised Cell Recognition with Prior Self-activation Maps
Authors: Pingyi Chen, Chenglu Zhu, Zhongyi Shui, Jiatong Cai, Sunyi Zheng, Shichuan Zhang, Lin Yang
Abstract
The success of supervised deep learning models on cell recognition tasks relies on detailed annotations. Many previous works have managed to reduce the dependency on labels. However, considering the large number of cells contained in a patch, costly and inefficient labeling is still inevitable. To this end, we explore label-free methods for cell recognition. Prior self-activation maps (PSMs) are proposed to generate pseudo masks as training targets. Specifically, an activation network is trained with self-supervised learning. The gradient information in the shallow layers of the network is aggregated to generate prior self-activation maps. Afterward, a semantic clustering module is introduced as a pipeline to transform PSMs into pixel-level semantic pseudo masks for downstream tasks. We evaluated our method on two histological datasets: MoNuSeg (cell segmentation) and BCData (multi-class cell detection). Compared with other fully-supervised and weakly-supervised methods, our method can achieve competitive performance without any manual annotations. Our simple but effective framework can also achieve multi-class cell detection, which cannot be done by existing unsupervised methods. The results show the potential of PSMs and might inspire other research to deal with the hunger for labels in the medical field.
LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (Practical Experience Report)
Authors: Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, Chun Zuo
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
The automation of code review activities, a long-standing pursuit in software engineering, has been primarily addressed by numerous domain-specific pre-trained models. Despite their success, these models frequently demand extensive resources for pre-training from scratch. In contrast, Large Language Models (LLMs) provide an intriguing alternative, given their remarkable capabilities when supplemented with domain-specific knowledge. However, their potential for automating code review tasks remains largely unexplored. In response to this research gap, we present LLaMA-Reviewer, an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review. Mindful of resource constraints, this framework employs parameter-efficient fine-tuning (PEFT) methods, delivering high performance while using less than 1% of trainable parameters. An extensive evaluation of LLaMA-Reviewer is conducted on two diverse, publicly available datasets. Notably, even with the smallest LLaMA base model consisting of 6.7B parameters and a limited number of tuning epochs, LLaMA-Reviewer equals the performance of existing code-review-focused models. The ablation experiments provide insights into the influence of various fine-tuning process components, including input representation, instruction tuning, and different PEFT methods. To foster continuous progress in this field, the code and all PEFT-weight plugins have been made open-source.
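A minimal sketch of PEFT fine-tuning in this spirit using the Hugging Face peft library; the base checkpoint and LoRA hyperparameters below are illustrative assumptions, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a LLaMA-family base model (checkpoint name is an assumption).
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```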
Energy-Efficient On-Board Radio Resource Management for Satellite Communications via Neuromorphic Computing
Authors: Flor Ortiz, Nicolas Skatchkovsky, Eva Lagunas, Wallace A. Martins, Geoffrey Eappen, Saed Daoud, Osvaldo Simeone, Bipin Rajendran, Symeon Chatzinotas
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Abstract
The latest satellite communication (SatCom) missions are characterized by a fully reconfigurable on-board software-defined payload, capable of adapting radio resources to the temporal and spatial variations of the system traffic. As pure optimization-based solutions have been shown to be computationally expensive and to lack flexibility, machine learning (ML)-based methods have emerged as promising alternatives. We investigate the application of energy-efficient brain-inspired ML models for on-board radio resource management. Apart from software simulation, we report extensive experimental results leveraging the recently released Intel Loihi 2 chip. To benchmark the performance of the proposed model, we implement conventional convolutional neural networks (CNN) on a Xilinx Versal VCK5000, and provide a detailed comparison of accuracy, precision, recall, and energy efficiency for different traffic demands. Most notably, for relevant workloads, spiking neural networks (SNNs) implemented on Loihi 2 yield higher accuracy, while reducing power consumption by more than 100$\times$ as compared to the CNN-based reference platform. Our findings point to the significant potential of neuromorphic computing and SNNs in supporting on-board SatCom operations, paving the way for enhanced efficiency and sustainability in future SatCom systems.
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
Abstract
The goal of sequential recommendation (SR) is to predict the items a user is potentially interested in based on his/her historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread use, often underperform with sparse IDs and struggle with the cold-start problem. Besides, inconsistent ID mappings hinder the model's transferability, isolating similar recommendation domains that could have been co-optimized. This paper aims to address these issues by exploring the potential of multi-modal information in learning robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal synergy while a novel interest-aware decoder is developed to grasp item-modality-interest relations for better sequence representation. On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representations, providing more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it in an efficient manner. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, promising a practical solution for real-world recommendation scenarios.
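A common form of such contrastive pre-training objectives is InfoNCE over in-batch negatives; the sketch below is illustrative, not MISSRec's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(seq_emb, item_emb, temperature=0.07):
    """Align each user-sequence representation with its positive item
    representation in the batch, treating all other items as negatives."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)
    logits = seq_emb @ item_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(seq_emb.size(0))      # diagonal entries = positives
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))  # toy embeddings
```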
MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Abstract
Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60 min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots. Our experimental results show that MEGA outperforms state-of-the-art methods on the MovieNet dataset for scene segmentation (with an Average Precision improvement of +1.19%) and on the TRIPOD dataset for act segmentation (with a Total Agreement improvement of +5.51%).
ConcatPlexer: Additional Dim1 Batching for Faster ViTs
Abstract
Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also in the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation) that greatly improves the throughput with little compromise in accuracy. We first introduce a naive adaptation of DataMUX for vision models, the Image Multiplexer, and devise novel components to overcome its weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between inference speed and accuracy. The ConcatPlexer was trained on the ImageNet1K and CIFAR100 datasets, achieving 23.5% fewer GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively.
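A toy sketch of dim1 (token-dimension) batching, assuming ViT-style patch tokens: the token sequences of several images are concatenated so a single transformer forward pass serves all of them, then de-multiplexed per image. Shapes and the pooling-based readout are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, N, D, M = 8, 197, 768, 2   # batch, tokens per image, embed dim, images multiplexed
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=12, batch_first=True), num_layers=2)

imgs_tokens = [torch.randn(B, N, D) for _ in range(M)]  # token sequences of M images
packed = torch.cat(imgs_tokens, dim=1)                  # dim-1 batching: (B, M*N, D)
out = encoder(packed)                                   # one forward pass for M images

# De-multiplex: pool each image's token span into its own prediction input.
preds = [out[:, i * N:(i + 1) * N].mean(dim=1) for i in range(M)]
```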
Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models
Abstract
Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. This paper proposes a multimodal federated learning framework that enables multiple enterprises to utilize private domain data to collaboratively train large models for vertical domains, achieving intelligent services across scenarios. The authors discuss in depth the strategic transformation of federated learning in terms of intelligence foundation and objectives in the era of big models, as well as the new challenges faced in heterogeneous data, model aggregation, performance and cost trade-offs, data privacy, and incentive mechanisms. The paper elaborates on a case study of leading enterprises contributing multimodal data and expert knowledge to city safety operation management, including distributed deployment and efficient coordination of the federated learning platform, technical innovations on data quality improvement based on large model capabilities, and efficient joint fine-tuning approaches. Preliminary experiments show that enterprises can enhance and accumulate intelligent capabilities through multimodal model federated learning, thereby jointly creating a smart city model that provides high-quality intelligent services covering energy infrastructure safety, residential community security, and urban operation management. The established federated learning cooperation ecosystem is expected to further aggregate industry, academia, and research resources, realize large models in multiple vertical domains, and promote the large-scale industrial application of artificial intelligence and cutting-edge research on multimodal federated learning.
Minwise-Independent Permutations with Insertion and Deletion of Features
Authors: Rameshwar Pratap, Raghav Kulkarni
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Abstract
In their seminal work, Broder \textit{et al.}~\citep{BroderCFM98} introduce the $\mathrm{minHash}$ algorithm that computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity. Since its invention, $\mathrm{minHash}$ has been commonly used by practitioners in various big data applications. Further, in many real-life scenarios the data is dynamic and its feature set evolves over time. We consider the case when features are dynamically inserted and deleted in the dataset. We note that a naive solution to this problem is to repeatedly recompute $\mathrm{minHash}$ with respect to the updated dimension. However, this is an expensive task as it requires generating fresh random permutations. To the best of our knowledge, no systematic study of $\mathrm{minHash}$ is recorded in the context of dynamic insertion and deletion of features. In this work, we initiate this study and suggest algorithms that make the $\mathrm{minHash}$ sketches adaptable to the dynamic insertion and deletion of features. We give a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets. Empirically we observe a significant speed-up in the running time while simultaneously offering comparable performance with respect to running $\mathrm{minHash}$ from scratch. Our proposal is efficient, accurate, and easy to implement in practice.
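For readers unfamiliar with minHash, the naive baseline the paper improves on looks roughly like this; note the closing comment, which is exactly the recompute cost under feature insertion/deletion that motivates the work.

```python
import random

def minhash_sketch(features, perms):
    """minHash of a binary feature set: for each random permutation, keep the
    smallest permuted index of any present feature. The fraction of matching
    sketch entries estimates the Jaccard similarity of two sets."""
    return [min(perm[f] for f in features) for perm in perms]

dim, k = 1000, 64
perms = [random.sample(range(dim), dim) for _ in range(k)]  # fresh random permutations

A = {1, 5, 9, 200}
B = {1, 5, 9, 300}
sa, sb = minhash_sketch(A, perms), minhash_sketch(B, perms)
est_jaccard = sum(x == y for x, y in zip(sa, sb)) / k       # ~ |A&B| / |A|B|

# If the feature dimension changes (insertion/deletion of features), this
# naive scheme must regenerate `perms` and recompute every sketch from
# scratch -- the cost the paper's adaptable sketches aim to avoid.
```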
Multi-User Modular XL-MIMO Communications: Near-Field and Beam Focusing Pattern and User Grouping
Abstract
In this paper, we investigate multi-user modular extremely large-scale multiple-input multiple-output (XL-MIMO) communication systems, where a modular extremely large-scale uniform linear array (XL-ULA) is deployed at the base station (BS) to serve multiple single-antenna users. By exploiting the unique modular array architecture and considering the potential near-field propagation, we develop sub-array based uniform spherical wave (USW) models for the cases of distinct and common angles of arrival/departure (AoAs/AoDs) across different sub-arrays/modules, respectively. Under such USW models, we analyze the beam focusing patterns at the near-field observation location by using near-field beamforming. The analysis reveals that compared to conventional XL-MIMO with collocated antenna elements, modular XL-MIMO can provide better spatial resolution by benefiting from its larger array aperture. However, it also incurs undesired grating lobes due to the large inter-module separation. Moreover, it is found that for multi-user modular XL-MIMO communications, the achievable signal-to-interference-plus-noise ratio (SINR) for users may be degraded by the grating lobes of the beam focusing pattern. To address this issue, an efficient user grouping method is proposed for multi-user transmission scheduling, so that users located within the grating lobes of each other are not allocated to the same time-frequency resource block (RB) for their communications. Numerical results are presented to verify the effectiveness of the proposed user grouping method, as well as the superior performance of modular XL-MIMO over its collocated counterpart with densely distributed users.
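For context, the near-field (spherical-wave) element response that such USW models build on contrasts with the far-field approximation as sketched below; this is standard array-processing background, not the paper's sub-array model.

```latex
% Under spherical-wave (near-field) propagation, the phase of array element n
% (spacing d, source at range r and angle theta from broadside) depends on
% its exact distance r_n to the source:
\[
a_n \;=\; e^{-j \frac{2\pi}{\lambda} r_n},
\qquad
r_n \;=\; \sqrt{\,r^2 + (nd)^2 - 2\,r\,n d \sin\theta\,},
\]
% whereas the far-field model keeps only the first-order term
% r_n ~ r - n d sin(theta), which breaks down for very large apertures.
```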
Octopus: A Heterogeneous In-network Computing Accelerator Enabling Deep Learning for Network Models
Authors: Dong Wen, Tao Li, Chenglong Li, Pengye Xia, Hui Yang, Zhigang Sun
Subjects: Hardware Architecture (cs.AR); Networking and Internet Architecture (cs.NI)
Abstract
Deep learning (DL) based network models have achieved excellent performance in the field and are becoming a promising component of future intelligent network systems. Programmable in-network computing devices have great potential for deploying DL-based network models; however, existing devices cannot afford to run a DL model. The main challenges of supporting DL-based network models in the data plane lie in computing power, task granularity, model generality, and feature extraction. To address the above problems, we propose Octopus: a heterogeneous in-network computing accelerator enabling DL for network models. A feature extractor is designed for fast and efficient feature extraction. A vector accelerator and a systolic array work in a heterogeneous, collaborative way, offering low-latency, high-throughput general computing ability for packet- and flow-based tasks. Octopus also contains an on-chip memory fabric for storage and interconnection, and a RISC-V core for global control. The proposed Octopus accelerator design is implemented on an FPGA. The functionality and performance of Octopus are validated in several use cases, achieving 31 Mpkt/s feature extraction, 207 ns packet-based computing latency, and 90 kflow/s flow-based computing throughput.
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Authors: Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han, James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data grow constantly, highlighting the importance of the ability of pre-trained models to learn from data that are continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with a fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specifically, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. A shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architectures. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the local-minimum dilemma. Compared with existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by 1.2% on the Flickr30K dataset.
DeepBurning-MixQ: An Open Source Mixed-Precision Neural Network Accelerator Design Framework for FPGAs
Abstract
Mixed-precision neural networks (MPNNs) that use just enough data width for a deep learning task promise significant advantages in both inference accuracy and computing overhead. FPGAs with fine-grained reconfiguration capability can adapt their processing to distinct data widths and models, and hence can theoretically unleash the potential of MPNNs. Nevertheless, commodity DPUs on FPGAs mostly emphasize generality and have limited support for MPNNs, especially the ones with lower data width. In addition, primitive DSPs in FPGAs usually have much larger data width than required by MPNNs and have not been sufficiently co-explored with MPNNs yet. To this end, we propose an open source MPNN accelerator design framework specifically tailored for FPGAs. In this framework, we have a systematic DSP-packing algorithm to pack multiple lower-data-width MACs into a single primitive DSP, enabling efficient implementation of MPNNs. Meanwhile, we take DSP-packing efficiency into consideration within a unified neural network architecture search (NAS) framework for MPNN quantization, such that the search is aware of the DSP overhead during quantization and optimizes MPNN performance and accuracy concurrently. Finally, we fine-tune the optimized MPNN into a fully pipelined neural network accelerator template based on HLS and make the best use of available resources for higher performance. Our experiments reveal that the accelerators produced by the proposed framework achieve overwhelming advantages in terms of performance, resource utilization, and inference accuracy for MPNNs when compared with both handcrafted counterparts and prior hardware-aware neural network accelerators on FPGAs.
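A worked sketch of the DSP-packing idea: two independent low-width multiplications share one wide multiplier by occupying disjoint bit fields of a packed operand (cf. Xilinx-style INT8 packing). The field offset and operand widths below are illustrative, not the framework's actual packing parameters.

```python
SHIFT = 18  # field offset; each partial product must fit in fewer than SHIFT bits

def packed_multiply(a1, a2, w):
    """Compute a1*w and a2*w with a single wide multiplication,
    assuming unsigned operands with a1*w < 2**SHIFT and a2*w < 2**SHIFT."""
    packed = (a1 << SHIFT) + a2   # one wide operand holding both activations
    product = packed * w          # the single "DSP" multiplication
    return product >> SHIFT, product & ((1 << SHIFT) - 1)

p1, p2 = packed_multiply(97, 113, 201)
assert (p1, p2) == (97 * 201, 113 * 201)  # both products recovered exactly
```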
Higher-order multi-scale method for high-accuracy nonlinear thermo-mechanical simulation of heterogeneous shells
Abstract
In the present work, we consider multi-scale computation and convergence for nonlinear time-dependent thermo-mechanical equations of inhomogeneous shells possessing temperature-dependent material properties and orthogonal periodic configurations. The first contribution is that a novel higher-order macro-micro coupled computational model is rigorously devised via a multi-scale asymptotic technique and the Taylor series approach for high-accuracy simulation of heterogeneous shells. Benefiting from the higher-order corrector terms, the higher-order multi-scale computational model keeps the conservation of local energy and momentum for nonlinear thermo-mechanical simulation. Moreover, a global error estimate with an explicit rate for the higher-order multi-scale solutions is derived for the first time in the energy norm. Furthermore, an efficient space-time numerical algorithm with off-line and on-line stages is presented in detail. Adequate numerical experiments are conducted to confirm the competitive advantages of the presented multi-scale approach, exhibiting not only exceptional numerical accuracy but also reduced computational expense for heterogeneous shells.
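The standard two-scale ansatz behind such models, shown for orientation (the paper's shell-specific, higher-order form is more elaborate):

```latex
% Two-scale asymptotic expansion for a periodic microstructure of period eps:
\[
u^{\varepsilon}(x) \;=\; u_0\!\left(x,\tfrac{x}{\varepsilon}\right)
 + \varepsilon\, u_1\!\left(x,\tfrac{x}{\varepsilon}\right)
 + \varepsilon^{2}\, u_2\!\left(x,\tfrac{x}{\varepsilon}\right) + \cdots
\]
% Retaining the O(eps^2) corrector is what "higher-order" refers to above;
% it is these terms that preserve local energy and momentum conservation.
```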
Enhancing Interpretable Object Abstraction via Clustering-based Slot Initialization
Authors: Ning Gao, Bernard Hohmann, Gerhard Neumann
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Object-centric representations using slots have shown advances towards efficient, flexible, and interpretable abstraction from low-level perceptual features in a compositional scene. Current approaches randomize the initial state of slots followed by an iterative refinement. As we show in this paper, the random slot initialization significantly affects the accuracy of the final slot prediction. Moreover, current approaches require a predetermined number of slots from prior knowledge of the data, which limits the applicability in the real world. In our work, we initialize the slot representations with clustering algorithms conditioned on the perceptual input features. This requires an additional layer in the architecture to initialize the slots given the identified clusters. We design permutation-invariant and permutation-equivariant versions of this layer to enable exchangeable slot representations after clustering. Additionally, we employ mean-shift clustering to automatically identify the number of slots for a given scene. We evaluate our method on object discovery and novel view synthesis tasks with various datasets. The results show that our method outperforms prior works consistently, especially for complex scenes.
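A minimal sketch of clustering-based slot initialization with mean-shift via scikit-learn, which also yields the number of slots automatically; the feature shapes and bandwidth are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Per-location encoder features of one scene (flattened feature map).
H, W, D = 16, 16, 64
features = np.random.randn(H * W, D)

ms = MeanShift(bandwidth=8.0).fit(features)
slots_init = ms.cluster_centers_   # (num_slots, D) initial slot states
num_slots = slots_init.shape[0]    # inferred from the data, not predetermined
```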
Tackling the Curse of Dimensionality in Large-scale Multi-agent LTL Task Planning via Poset Product
Abstract
Linear Temporal Logic (LTL) formulas have been used to describe complex tasks for multi-agent systems, with both spatial and temporal constraints. However, since the planning complexity grows exponentially with the number of agents and the length of the task formula, existing applications are mostly limited to small artificial cases. To address this issue, a new planning algorithm is proposed for task formulas specified as sc-LTL formulas. It avoids two common bottlenecks in the model-checking-based planning methods, i.e., (i) the direct translation of the complete task formula to the associated B\"uchi automaton; and (ii) the synchronized product between the B\"uchi automaton and the transition models of all agents. In particular, each conjoined sub-formula is first converted to its associated R-poset as an abstraction of the temporal dependencies among the subtasks. Then, an efficient algorithm is proposed to compute the product of these R-posets, which retains their dependencies and resolves potential conflicts. Furthermore, the proposed approach is applied to dynamic scenes where new tasks are generated online. It is capable of deriving the first valid plan in polynomial time and memory w.r.t. the system size and the formula length. Our method can plan for task formulas with a length of more than 60 and systems with more than 35 agents, while most existing methods fail at a formula length of 20. The proposed method is validated on large fleets of service robots in both simulation and hardware experiments.
Does society show differential attention to researchers based on gender and field?
Authors: Sara M. González-Betancor, Pablo Dorta-González
Subjects: Social and Information Networks (cs.SI); Applications (stat.AP)
Abstract
While not all researchers prioritize social impact, it is undeniably a crucial aspect that adds significance to their work. The objective of this paper is to explore potential gender differences in the social attention paid to researchers and to examine their association with specific fields of study. To achieve this goal, the paper analyzes four dimensions of social influence and examines three measures of social attention to researchers. The dimensions are media influence (mentions in mainstream news), political influence (mentions in public policy reports), social media influence (mentions on Twitter), and educational influence (mentions in Wikipedia). The measures of social attention to researchers are: proportion of publications with social mentions (social attention orientation), mentions per publication (level of social attention), and mentions per mentioned publication (intensity of social attention). By analyzing the rankings of authors -- for the four dimensions with the three measures in the 22 research fields of the Web of Science database -- and by using Spearman correlation coefficients, we conclude that: 1) significant differences are observed between fields; 2) the dimensions capture different and independent aspects of social impact. Finally, we use non-parametric means comparison tests to detect gender bias in social attention. We conclude that for most fields and dimensions with enough non-zero altmetrics data, gender differences in social attention are not predominant, but are still present and vary across fields.
Targeted Data Augmentation for bias mitigation
Authors: Agnieszka Mikołajczyk-Bareła, Maria Ferlin, Michał Grochowski
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Abstract
The development of fair and ethical AI systems requires careful consideration of bias mitigation, an area often overlooked or ignored. In this study, we introduce a novel and efficient approach for addressing biases called Targeted Data Augmentation (TDA), which leverages classical data augmentation techniques to tackle the pressing issue of bias in data and models. Unlike the laborious task of removing biases, our method proposes to insert biases instead, resulting in improved performance. To identify biases, we annotated two diverse datasets: a dataset of clinical skin lesions and a dataset of male and female faces. These bias annotations are published for the first time in this study, providing a valuable resource for future research. Through Counterfactual Bias Insertion, we discovered that biases associated with the frame, ruler, and glasses had a significant impact on models. By randomly introducing biases during training, we mitigated these biases and achieved a substantial decrease in bias measures, ranging from two-fold to more than 50-fold, while maintaining a negligible increase in the error rate.
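A minimal sketch of what targeted insertion of a known bias might look like for image data, assuming a "ruler"-like bar artifact; the geometry and probability are illustrative, not the paper's exact augmentation.

```python
import numpy as np

def insert_ruler_bias(image, p=0.5, rng=np.random.default_rng()):
    """Targeted data augmentation: with probability p, draw a dark horizontal
    bar (a ruler-like artifact) into the image, so the model learns that the
    artifact is uninformative about the label."""
    if rng.random() < p:
        image = image.copy()
        row = rng.integers(0, image.shape[0] - 4)
        image[row:row + 4, :] = 0.05  # near-black bar across the image
    return image

batch = np.random.rand(8, 224, 224)  # toy grayscale batch
augmented = np.stack([insert_ruler_bias(img) for img in batch])
```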
Statistical higher-order multi-scale method for nonlinear thermo-mechanical simulation of random composite materials with temperature-dependent properties
Abstract
Stochastic multi-scale modeling and simulation for nonlinear thermo-mechanical problems of composite materials with complicated random microstructures remains a challenging issue. In this paper, we develop a novel statistical higher-order multi-scale (SHOMS) method for nonlinear thermo-mechanical simulation of random composite materials, which is designed to overcome the prohibitive cost of computation spanning the macro-scale and the micro-scale. By virtue of statistical multi-scale asymptotic analysis and the Taylor series method, the SHOMS computational model is rigorously derived for accurately analyzing nonlinear thermo-mechanical responses of random composite materials at both the macro-scale and the micro-scale. Moreover, the local error analysis of SHOMS solutions in the point-wise sense clearly illustrates that the higher-order asymptotic correction terms in the SHOMS computational model are indispensable for keeping the conservation of local energy and momentum. Then, the corresponding space-time multi-scale numerical algorithm with off-line and on-line stages is designed to efficiently simulate nonlinear thermo-mechanical behaviors of random composite materials. Finally, extensive numerical experiments are presented to gauge the efficiency and accuracy of the proposed SHOMS approach.
A Partially Observable Deep Multi-Agent Active Inference Framework for Resource Allocation in 6G and Beyond Wireless Communications Networks
Abstract
Resource allocation is of crucial importance in wireless communications. However, it is extremely challenging to design efficient resource allocation schemes for future wireless communication networks since the formulated resource allocation problems are generally non-convex and consist of various coupled variables. Moreover, the dynamic changes of practical wireless communication environments and user service requirements call for efficient real-time resource allocation. To tackle these issues, a novel partially observable deep multi-agent active inference (PODMAI) framework is proposed for realizing intelligent resource allocation. A belief-based learning method is exploited for updating the policy by minimizing the variational free energy. A multi-agent strategy of decentralized training with decentralized execution is designed to overcome the limitations of the partially observable state information. Exploiting the proposed framework, an intelligent spectrum allocation and trajectory optimization scheme is developed for a spectrum-sharing unmanned aerial vehicle (UAV) network with dynamic transmission rate requirements as an example. Simulation results demonstrate that our proposed framework can significantly improve the sum transmission rate of the secondary network compared to various benchmark schemes. Moreover, the convergence speed of the proposed PODMAI is significantly improved compared with the conventional reinforcement learning framework. Overall, our proposed framework can enrich intelligent resource allocation frameworks and pave the way for realizing real-time resource allocation.
Achievable Sum-rate of variants of QAM over Gaussian Multiple Access Channel with and without security
Authors: Shifa Showkat, Zahid Bashir Dar, Shahid Mehraj Shah
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
The performance of next generation wireless systems (5G/6G and beyond) at the physical layer is primarily driven by the choice of digital modulation techniques that are bandwidth and power efficient, while maintaining high data rates. Achievable rates for Gaussian input and some finite constellations (BPSK/QPSK/QAM) are well studied in the literature. However, new variants of Quadrature Amplitude Modulation (QAM) such as Cross-QAM (XQAM), Star-QAM (S-QAM), Amplitude and Phase Shift Keying (APSK), and Hexagonal Quadrature Amplitude Modulation (H-QAM) have not been studied in the context of achievable rates for meeting the demand for high data rates. In this paper, we study the achievable rate region for different variants of M-QAM, namely Cross-QAM, H-QAM, Star-QAM and APSK. We also compute the mutual information corresponding to the sum rate of the Gaussian Multiple Access Channel (G-MAC) for hybrid constellation schemes, e.g., user 1 transmits using Star-QAM and user 2 using H-QAM. From the results, it is observed that S-QAM gives the maximum sum rate when both users transmit the same constellation. It has also been found that when hybrid constellations are used, the combination of Star-QAM \& H-QAM gives the maximum rate. In the next part of the paper, we consider a scenario wherein an adversary is also present at the receiver side and is trying to decode the information. We model this scenario as a Gaussian Multiple Access Wiretap Channel (G-MAC-WT). We then compute the achievable secrecy sum rate (SSR) of the two-user G-MAC-WT with discrete inputs from different variants of QAM (viz., X-QAM, H-QAM and S-QAM). It has been found that at higher values of SNR, S-QAM gives better SSR values than the other variants. For hybrid QAM inputs, at lower values of SNR the combination of APSK and S-QAM gives better results, and at higher values of SNR the combination of H-QAM and APSK gives a greater SSR.
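The single-user building block of such sum-rate computations is the mutual information of a discrete constellation over AWGN, which can be estimated by Monte Carlo as sketched below (our sketch, with QPSK as the example constellation):

```python
import numpy as np

def mi_awgn(constellation, snr_db, n_mc=20000, rng=np.random.default_rng(0)):
    """Monte Carlo mutual information (bits/symbol) of a uniform discrete
    constellation over a complex AWGN channel."""
    x = constellation / np.sqrt(np.mean(np.abs(constellation) ** 2))  # unit power
    M = len(x)
    n0 = 10 ** (-snr_db / 10)  # noise power for unit signal power
    noise = np.sqrt(n0 / 2) * (rng.standard_normal(n_mc) + 1j * rng.standard_normal(n_mc))
    y = rng.choice(x, n_mc) + noise
    log_py_x = -np.abs(y[:, None] - x[None, :]) ** 2 / n0   # up to a common constant
    log_py = np.logaddexp.reduce(log_py_x, axis=1) - np.log(M)
    log_pyx = -np.abs(noise) ** 2 / n0
    return np.mean(log_pyx - log_py) / np.log(2)

qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
print(mi_awgn(qpsk, snr_db=10))  # approaches 2 bits/symbol at high SNR
```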
Supply Function Equilibrium in Networked Electricity Markets
Abstract
We study deregulated power markets with strategic power suppliers. In deregulated markets, each supplier submits its supply function (i.e., the amount of electricity it is willing to produce at various prices) to the independent system operator (ISO), who, based on the submitted supply functions, dispatches the suppliers to clear the market with minimal total generation cost. If all suppliers reported their true marginal cost functions as supply functions, the market outcome would be efficient (i.e., the total generation cost would be minimized). However, when suppliers are strategic and aim to maximize their own profits, the reported supply functions are not necessarily the true marginal cost functions, and the resulting market outcome may be inefficient. The efficiency loss depends crucially on the topology of the underlying transmission network. This paper provides an analytical upper bound on the efficiency loss due to strategic suppliers, and proves that the bound is tight for a large class of transmission networks (i.e., weakly cyclic networks). Our upper bound sheds light on how the efficiency loss depends on the transmission network topology (e.g., the degrees of nodes, the admittances and flow limits of transmission lines).
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Authors: Alexander Wong, Saad Abbasi, Saeejith Nair
Abstract
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures has made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant recent research on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural complexity (>2.47$\times$ smaller than FasterViT-0 while achieving the same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 for the low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.
Convergence guarantee for consistency models
Authors: Junlong Lyu, Zhitang Chen, Shoubo Feng
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Probability (math.PR)
Abstract
We provide the first convergence guarantees for the Consistency Models (CMs), a newly emerging type of one-step generative models that can generate samples comparable to those generated by Diffusion Models. Our main result is that, under basic assumptions on score-matching errors, consistency errors, and smoothness of the data distribution, CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error. Our results (1) hold for $L^2$-accurate score and consistency assumptions (rather than $L^\infty$-accurate); (2) do not require strong assumptions on the data distribution such as a log-Sobolev inequality; (3) scale polynomially in all parameters; and (4) match the state-of-the-art convergence guarantee for score-based generative models (SGMs). We also show that the Multistep Consistency Sampling procedure can further reduce the error compared to one-step sampling, which supports the original claim of "Consistency Models" (Song et al., 2023). Our results further imply a total-variation (TV) error guarantee when Langevin-based modifications are applied to the output distributions.
Multitemporal analysis in Google Earth Engine for detecting urban changes using optical data and machine learning algorithms
Authors: Mariapia Rita Iandolo, Francesca Razzano, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
The aim of this work is to perform a multitemporal analysis using the Google Earth Engine (GEE) platform for the detection of changes in urban areas using optical data and specific machine learning (ML) algorithms. As a case study, the city of Cairo, Egypt, has been identified as one of the five most populous megacities in the world over the last decade. Classification and change detection analysis of the region of interest (ROI) have been carried out from July 2013 to July 2021. Results demonstrate the validity of the proposed method in identifying changed and unchanged urban areas over the selected period. Furthermore, this work aims to highlight the growing significance of GEE as an efficient cloud-based solution for managing large quantities of satellite data.
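A minimal sketch of this kind of GEE workflow, assuming the earthengine-api Python package and an authenticated account; the collection ID is real, but the ROI coordinates, date windows, and cloud threshold are illustrative, not the paper's exact configuration.

```python
import ee

ee.Initialize()  # assumes an authenticated Earth Engine account

# Region of interest around Cairo (coordinates approximate, for illustration)
roi = ee.Geometry.Point(31.2357, 30.0444).buffer(20_000)

def median_composite(start, end):
    """Cloud-filtered Landsat 8 median composite over the ROI for a date range."""
    return (ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
            .filterBounds(roi)
            .filterDate(start, end)
            .filter(ee.Filter.lt('CLOUD_COVER', 20))
            .median()
            .clip(roi))

before = median_composite('2013-07-01', '2014-07-01')
after  = median_composite('2020-07-01', '2021-07-01')

# An ML classifier trained on labelled samples (not shown) would be applied to
# both composites and the per-pixel labels differenced to flag urban change.
```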
An iterative method for Helmholtz boundary value problems arising in wave propagation
Authors: Francisco Bernal, Xingyuan Chen, Goncalo dos Reis
Subjects: Numerical Analysis (math.NA); Probability (math.PR)
Abstract
The complex Helmholtz equation $(\Delta + k^2)u=f$ (where $k\in{\mathbb R}$, $u(\cdot),f(\cdot)\in{\mathbb C}$) is a mainstay of computational wave simulation. Despite its apparent simplicity, efficient numerical methods are challenging to design and, in some applications, regarded as an open problem. Two sources of difficulty are the large number of degrees of freedom and the indefiniteness of the matrices arising after discretisation. Seeking to meet them within the novel framework of probabilistic domain decomposition, we set out to rewrite the Helmholtz equation into a form amenable to the Feynman-Kac formula for elliptic boundary value problems. We consider two typical scenarios, the scattering of a plane wave and the propagation inside a cavity, and recast them as a sequence of Poisson equations. By means of stochastic arguments, we find a sufficient and simulatable condition for the convergence of the iterations. Upon discretisation, a necessary condition for convergence can be derived by summing the iterates via the Neumann series for the matrix inverse -- we illustrate the procedure in the case of finite differences. From a practical point of view, our results are ultimately of limited scope. Nonetheless, this unexpected -- even paradoxical -- new direction of attack on the Helmholtz equation offers a fresh perspective on this classical and difficult problem. Our results show that there indeed exists a predictable range $k<k_{\max}$ in which this new ansatz works, with $k_{\max}$ being far below the challenging high-frequency regime.
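A toy 1-D, real-valued rendering of the recast (not the paper's probabilistic method): iterating Poisson solves $u_{n+1}'' = f - k^2 u_n$ converges to the Helmholtz solution whenever $k^2$ is below the smallest eigenvalue of $-\mathrm{d}^2/\mathrm{d}x^2$ (here $\pi^2$), mirroring the predictable range $k < k_{\max}$.

```python
import numpy as np

# Solve (u'' + k^2 u) = f on (0,1), u(0)=u(1)=0, via the fixed point
#   u_{n+1}'' = f - k^2 u_n,   i.e. one Poisson solve per iteration.
n, k = 200, 2.0                      # k = 2 < pi, so the iteration contracts
h = 1.0 / (n + 1)
f = np.ones(n)

# Second-order finite-difference Laplacian with Dirichlet boundary conditions
L = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2

u = np.zeros(n)
for it in range(200):
    u_next = np.linalg.solve(L, f - k**2 * u)
    if np.linalg.norm(u_next - u) < 1e-12:
        break
    u = u_next

u_direct = np.linalg.solve(L + k**2 * np.eye(n), f)   # direct Helmholtz solve
print(it, np.max(np.abs(u - u_direct)))               # iterates match the direct solve
```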
Pose2Gait: Extracting Gait Features from Monocular Video of Individuals with Dementia
Authors: Caroline Malin-Mayor, Vida Adeli, Andrea Sabo, Sergey Noritsyn, Carolina Gorodetsky, Alfonso Fasano, Andrea Iaboni, Babak Taati
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Video-based ambient monitoring of gait for older adults with dementia has the potential to detect negative changes in health and allow clinicians and caregivers to intervene early to prevent falls or hospitalizations. Computer vision-based pose tracking models can process video data automatically and extract joint locations; however, publicly available models are not optimized for gait analysis on older adults or clinical populations. In this work we train a deep neural network to map from a two dimensional pose sequence, extracted from a video of an individual walking down a hallway toward a wall-mounted camera, to a set of three-dimensional spatiotemporal gait features averaged over the walking sequence. The data of individuals with dementia used in this work was captured at two sites using a wall-mounted system to collect the video and depth information used to train and evaluate our model. Our Pose2Gait model is able to extract velocity and step length values from the video that are correlated with the features from the depth camera, with Spearman's correlation coefficients of .83 and .60 respectively, showing that three dimensional spatiotemporal features can be predicted from monocular video. Future work remains to improve the accuracy of other features, such as step time and step width, and test the utility of the predicted values for detecting meaningful changes in gait during longitudinal ambient monitoring.
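For reference, the reported agreement metric is the standard Spearman rank correlation between predicted and depth-camera features; a minimal sketch with made-up velocity values:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-walk gait features: model predictions vs. depth-camera values
predicted_velocity = np.array([0.61, 0.74, 0.52, 0.88, 0.67, 0.71])
depth_velocity     = np.array([0.58, 0.79, 0.50, 0.91, 0.70, 0.66])

rho, pval = spearmanr(predicted_velocity, depth_velocity)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```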
Four years of multi-modal odometry and mapping on rail vehicles
Abstract
Precise, seamless, and efficient train localization as well as long-term railway environment monitoring are essential for reliability, availability, maintainability, and safety (RAMS) engineering of railroad systems. Simultaneous localization and mapping (SLAM) is right at the core of solving the two problems concurrently. To this end, we propose in this paper a high-performance and versatile multi-modal framework targeted at the odometry and mapping task for various rail vehicles. Our system is built atop an inertial-centric state estimator that tightly couples light detection and ranging (LiDAR), visual, and optionally satellite navigation and map-based localization information, with the convenience and extendibility of loosely coupled methods. The inertial sensors (IMU and wheel encoder) are treated as the primary sensors, and observations from the subsystems are used to constrain the accelerometer and gyroscope biases. Compared to point-only LiDAR-inertial methods, our approach leverages more geometric information by introducing both track planes and electric power pillars into the state estimation. The visual-inertial subsystem also utilizes environmental structure information by employing both lines and points. Besides, the method is capable of handling sensor failures by automatically reconfiguring to bypass failed modules. Our proposed method has been extensively tested in railway environments over four years, covering general-speed, high-speed, and metro lines, with both passenger and freight traffic investigated. Further, we aim to share, in an open way, the experience, problems, and successes of our group with the robotics community, so that those who work in such environments can avoid our mistakes. In this view, we open-source some of the datasets to benefit the research community.
L^2R: Lifelong Learning for First-stage Retrieval with Backward-Compatible Representations
Abstract
First-stage retrieval is a critical task that aims to retrieve relevant document candidates from a large-scale collection. While existing retrieval models have achieved impressive performance, they are mostly studied on static datasets, ignoring that in the real world, data on the Web is continuously growing with potential distribution drift. Consequently, retrievers trained on static old data may not suit new-coming data well and inevitably produce sub-optimal results. In this work, we study lifelong learning for first-stage retrieval, especially focusing on the setting where the emerging documents are unlabeled, since relevance annotation is expensive and may not keep up with data emergence. Under this setting, we aim to develop model updating with two goals: (1) to effectively adapt to the evolving distribution with the unlabeled new-coming data, and (2) to avoid re-inferring all embeddings of old documents to efficiently update the index each time the model is updated. We first formalize the task and then propose a novel Lifelong Learning method for first-stage Retrieval, namely L^2R. L^2R adopts the typical memory mechanism for lifelong learning and incorporates two crucial components: (1) selecting diverse support negatives for model training and memory updating for effective model adaptation, and (2) a ranking alignment objective to ensure the backward-compatibility of representations, saving the cost of index rebuilding without hurting model performance. For evaluation, we construct two new benchmarks from the LoTTE and Multi-CPR datasets to simulate document distribution drift in realistic retrieval scenarios. Extensive experiments show that L^2R significantly outperforms competitive lifelong learning baselines.
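A hypothetical sketch of a ranking-alignment-style objective (not the exact L^2R loss): the new encoder's queries are pushed to rank cached old document embeddings the same way they rank the new ones, so the old index need not be rebuilt.

```python
import torch
import torch.nn.functional as F

def ranking_alignment_loss(q_new, d_new, d_old, tau=0.05):
    """KL divergence between the ranking distribution a query induces over
    OLD (cached) document embeddings and over NEW ones. Function and
    parameter names are illustrative, not L^2R's actual interface."""
    s_new = q_new @ d_new.T / tau   # (B, N) scores against new doc embeddings
    s_old = q_new @ d_old.T / tau   # (B, N) scores against cached old embeddings
    return F.kl_div(F.log_softmax(s_old, dim=-1),
                    F.softmax(s_new, dim=-1), reduction='batchmean')

q = F.normalize(torch.randn(8, 128), dim=-1)
d_new = F.normalize(torch.randn(32, 128), dim=-1)
d_old = F.normalize(torch.randn(32, 128), dim=-1)
print(ranking_alignment_loss(q, d_new, d_old))
```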
Abstract
AI for IT Operations (AIOps) is a powerful platform that Site Reliability Engineers (SREs) use to automate and streamline operational workflows with minimal human intervention. Automated log analysis is a critical task in AIOps as it provides key insights for SREs to identify and address ongoing faults. Tasks such as log format detection, log classification, and log parsing are key components of automated log analysis. Most of these tasks require supervised learning; however, there are multiple challenges due to limited labelled log data and the diverse nature of log data. Large Language Models (LLMs) such as BERT and GPT-3 are trained using self-supervision on a vast amount of unlabeled data. These models provide generalized representations that can be effectively used for various downstream tasks with limited labelled data. Motivated by the success of LLMs in specific domains like science and biology, this paper introduces an LLM for log data, trained on public and proprietary log data. The results of our experiments demonstrate that the proposed LLM outperforms existing models on multiple downstream tasks. In summary, AIOps powered by LLMs offers an efficient and effective solution for automating log analysis tasks and enabling SREs to focus on higher-level tasks. Our proposed LLM, trained on public and proprietary log data, offers superior performance on multiple downstream tasks, making it a valuable addition to the AIOps platform.
G3Reg: Pyramid Graph-based Global Registration using Gaussian Ellipsoid Model
Abstract
This study introduces a novel framework, G3Reg, for fast and robust global registration of LiDAR point clouds. In contrast to conventional complex keypoints and descriptors, we extract fundamental geometric primitives including planes, clusters, and lines (PCL) from the raw point cloud to obtain low-level semantic segments. Each segment is formulated as a unified Gaussian Ellipsoid Model (GEM) by employing a probability ellipsoid to ensure the ground truth centers are encompassed with a certain degree of probability. Utilizing these GEMs, we then present a distrust-and-verify scheme based on a Pyramid Compatibility Graph for Global Registration (PAGOR). Specifically, we establish an upper bound, which can be traversed based on the confidence level for compatibility testing to construct the pyramid graph. Gradually, we solve multiple maximum cliques (MAC) for each level of the graph, generating numerous transformation candidates. In the verification phase, we adopt a precise and efficient metric for point cloud alignment quality, founded on geometric primitives, to identify the optimal candidate. The performance of the algorithm is extensively validated on three publicly available datasets and a self-collected multi-session dataset, without changing any parameter settings in the experimental evaluation. The results exhibit superior robustness and real-time performance of the G3Reg framework compared to state-of-the-art methods. Furthermore, we demonstrate the potential for integrating individual GEM and PAGOR components into other algorithmic frameworks to enhance their efficacy. To advance further research and promote community understanding, we have publicly shared the source code.
Dynamically Orthogonal Approximation for Stochastic Differential Equations
Abstract
In this paper, we set the mathematical foundations of the Dynamical Low Rank Approximation (DLRA) method for high-dimensional stochastic differential equations. DLRA aims at approximating the solution as a linear combination of a small number of basis vectors with random coefficients (low rank format), with the peculiarity that both the basis vectors and the random coefficients vary in time. While the formulation and properties of DLRA are now well understood for random/parametric equations, the same cannot be said for SDEs, and this work aims to fill this gap. We start by rigorously formulating a Dynamically Orthogonal (DO) approximation (an instance of DLRA successfully used in applications) for SDEs, which we then generalize to define a parametrization-independent DLRA for SDEs. We show local well-posedness of the DO equations and their equivalence with the DLRA formulation. We also characterize the explosion time of the DO solution by a loss of linear independence of the random coefficients defining the solution expansion and give sufficient conditions for global existence.
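For readers new to DO, the ansatz commonly used in the DLRA literature constrains the time derivative of the basis to remain orthogonal to the basis itself (notation illustrative; the mean term is optional):

```latex
% Rank-r Dynamically Orthogonal (DO) ansatz for an SDE solution X_t:
% deterministic orthonormal modes U_i(t), scalar random coefficients Y_i(t)
X_t \;\approx\; \bar{X}(t) + \sum_{i=1}^{r} Y_i(t)\, U_i(t),
\qquad
\langle U_i(t), U_j(t) \rangle = \delta_{ij},
\qquad
\big\langle \partial_t U_i(t),\, U_j(t) \big\rangle = 0 .
```

The paper's explosion-time characterization corresponds to the Gram matrix of the coefficients $(Y_i)$ becoming singular, at which point the expansion above loses rank.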
Towards Universal Interaction for Extended Reality
Abstract
Extended Reality (XR) is a rapidly growing field offering unique immersive experiences, social networking, learning, and collaboration opportunities. The continuous advancements in XR technology and industry efforts are gradually moving this technology toward end consumers. However, a universal one-size-fits-all solution for seamless XR interaction still needs to be discovered. Currently, we face a diverse landscape of interaction modalities that depend on the environment, user preferences, task, and device capabilities. Commercially available input methods like handheld controllers, hand gestures, voice commands, and combinations of those need universal flexibility and expressiveness. Additionally, hybrid user interfaces, such as smartwatches and smartphones as ubiquitous input and output devices, expand this interaction design space. In this position paper, we discuss the idea of a universal interaction concept for XR. We present challenges and opportunities for implementing hybrid user interfaces, emphasizing Environment, Task, and User. We explore the potential to enhance user experiences, interaction capabilities, and the development of seamless and efficient XR interaction methods. We examine challenges and aim to stimulate a discussion on the design of generic, universal interfaces for XR.
Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models
Authors: Surya Narayanan Hari, Matt Thomson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Abstract
The introduction of the transformer architecture and the self-attention mechanism has led to an explosive production of language models trained on specific downstream tasks and data domains. With over 200,000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains while addressing computational, security, and recency concerns. There is an urgent need for machine learning frameworks that can eliminate the burden of model selection and customization and unleash the incredible power of the vast emerging model library for end users. Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library based on analysis of individual input prompts. Inspired by the thalamic router in the brain, Tryage employs a perceptive router to predict downstream model performance on prompts and then makes a routing decision using an objective function that integrates performance predictions with user goals and constraints, which are incorporated through flags (e.g., model size, model recency). Tryage allows users to explore a Pareto front and automatically trade off between task accuracy and secondary goals including minimization of model size, recency, security, verbosity, and readability. Across heterogeneous data sets that include code, text, clinical data, and patents, the Tryage framework surpasses Gorilla and GPT-3.5 Turbo in dynamic model selection, identifying the optimal model with an accuracy of 50.9%, compared to 23.6% by GPT-3.5 Turbo and 10.8% by Gorilla. Conceptually, Tryage demonstrates how routing models can be applied to program and control the behavior of multi-model LLM systems to maximize efficient use of the expanding and evolving language model ecosystem.
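A toy sketch of an objective-based router in the spirit described; the field names, weights, and model entries are invented for illustration, not Tryage's actual interface.

```python
import numpy as np

def route(prompt_scores, model_meta, weights):
    """Pick the expert model maximising predicted performance minus weighted
    penalties on user-flagged attributes (size, staleness, ...)."""
    best, best_val = None, -np.inf
    for name, perf in prompt_scores.items():
        meta = model_meta[name]
        val = perf - weights['size'] * meta['size_gb'] \
                   - weights['staleness'] * meta['months_old']
        if val > best_val:
            best, best_val = name, val
    return best

# Hypothetical per-prompt performance predictions from the perceptive router
prompt_scores = {'code-expert': 0.82, 'general-llm': 0.77, 'tiny-llm': 0.61}
model_meta = {'code-expert': {'size_gb': 13, 'months_old': 4},
              'general-llm': {'size_gb': 70, 'months_old': 10},
              'tiny-llm':    {'size_gb': 1,  'months_old': 2}}
print(route(prompt_scores, model_meta, {'size': 0.004, 'staleness': 0.01}))
```

Sweeping the penalty weights traces out the accuracy-versus-size/recency Pareto front the abstract mentions.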
Keyword: faster
ERA*: Enhanced Relaxed A* algorithm for Solving the Shortest Path Problem in Regular Grid Maps
Abstract
This paper introduces a novel algorithm for solving the point-to-point shortest path problem in a static regular 8-neighbor connectivity (G8) grid. This algorithm can be seen as a generalization of Hadlock's algorithm to G8 grids, and is shown to be theoretically equivalent to the relaxed $A^*$ ($RA^*$) algorithm in terms of the provided solution's path length, but with substantial time and memory savings, due to a completely different computation strategy based on defining a set of lookup matrices. Through an experimental study on grid maps of various types and sizes (1290 runs on 43 maps), it is shown to be 2.25 times faster than $RA^*$ and 17 times faster than the original $A^*$, on average. Moreover, it is more memory-efficient, since it does not need to store a G score matrix.
Abstract
(I) We revisit the algorithmic problem of finding all triangles in a graph $G=(V,E)$ with $n$ vertices and $m$ edges. According to a result of Chiba and Nishizeki (1985), this task can be achieved by a combinatorial algorithm running in $O(m \alpha) = O(m^{3/2})$ time, where $\alpha= \alpha(G)$ is the graph arboricity. We provide a new, very simple combinatorial algorithm for finding all triangles in a graph and show that it is amenable to the same running time analysis. We derive these worst-case bounds from first principles and with very simple proofs that do not rely on classic results due to Nash-Williams from the 1960s. (II) We extend our arguments to the problem of finding all small complete subgraphs of a given fixed size. We show that the dependency on $m$ and $\alpha$ in the running time $O(\alpha^{\ell-2} \cdot m)$ of the algorithm of Chiba and Nishizeki for listing all copies of $K_\ell$, where $\ell \geq 3$, is asymptotically tight. (III) We give improved arboricity-sensitive running times for counting and/or detection of copies of $K_\ell$, for small $\ell \geq 4$. A key ingredient in our algorithms is, once again, the algorithm of Chiba and Nishizeki. Our new algorithms are faster than all previous algorithms in certain high-range arboricity intervals for every $\ell \geq 7$.
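A compact Python rendering of the degree-ordered triangle-listing idea behind Chiba and Nishizeki's algorithm; achieving the stated $O(m\alpha)$ bound additionally requires the bucketed degree and adjacency structures of the original paper.

```python
from collections import defaultdict

def list_triangles(edges):
    """List every triangle once: process vertices in non-increasing degree
    order, mark the current vertex's neighbourhood, scan two-paths inside
    it, then delete the vertex. Vertex labels must be comparable."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    triangles = []
    for v in order:
        marked = adj[v]                       # neighbourhood of the pivot v
        for u in list(marked):
            for w in adj[u]:
                if w in marked and u < w:     # avoid reporting (v,u,w) twice
                    triangles.append((v, u, w))
        for u in adj[v]:                      # delete v so each triangle is
            adj[u].discard(v)                 # found exactly once
        adj[v] = set()
    return triangles

print(list_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]))
# -> [(2, 1, 3), (2, 3, 4)]
```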
ReFit: Recurrent Fitting Network for 3D Human Recovery
Authors: Yufu Wang, Kostas Daniilidis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We present Recurrent Fitting (ReFit), a neural network architecture for single-image, parametric 3D human reconstruction. ReFit learns a feedback-update loop that mirrors the strategy of solving an inverse problem through optimization. At each iterative step, it reprojects keypoints from the human model to feature maps to query feedback, and uses a recurrent-based updater to adjust the model to fit the image better. Because ReFit encodes strong knowledge of the inverse problem, it is faster to train than previous regression models. At the same time, ReFit improves state-of-the-art performance on standard benchmarks. Moreover, ReFit applies to other optimization settings, such as multi-view fitting and single-view shape fitting. Project website: https://yufu-wang.github.io/refit_humans/
Protect Federated Learning Against Backdoor Attacks via Data-Free Trigger Generation
Authors: Yanxin Yang, Ming Hu, Yue Cao, Jun Xia, Yihao Huang, Yang Liu, Mingsong Chen
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
As a distributed machine learning paradigm, Federated Learning (FL) enables large-scale clients to collaboratively train a model without sharing their raw data. However, due to the lack of data auditing for untrusted clients, FL is vulnerable to poisoning attacks, especially backdoor attacks. By using poisoned data for local training or directly changing the model parameters, attackers can easily inject backdoors into the model, which can trigger the model to misclassify targeted patterns in images. To address these issues, we propose a novel data-free trigger-generation-based defense approach based on two characteristics of backdoor attacks: i) triggers are learned faster than normal knowledge, and ii) trigger patterns have a greater effect on image classification than normal class patterns. Our approach generates images with newly learned knowledge by identifying the differences between the old and new global models, and filters trigger images by evaluating the effect of these generated images. By using these trigger images, our approach eliminates poisoned models to ensure the updated global model is benign. Comprehensive experiments demonstrate that our approach can defend against almost all the existing types of backdoor attacks and outperform all seven state-of-the-art defense methods in both IID and non-IID scenarios. Notably, our approach can successfully defend against the backdoor attack even when 80\% of the clients are malicious.
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Authors: Alexander Wong, Saad Abbasi, Saeejith Nair
Abstract
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures has made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant research recently on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural complexity (>2.47$\times$ smaller than FasterViT-0 while achieving the same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 for the low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.
Keyword: mobile
Addressing Knowledge Leakage Risk caused by the use of mobile devices in Australian Organizations
Authors: Carlos Andres Agudelo Serna, Rachelle Bosua, Sean B. Maynard, Atif Ahmad
Abstract
Information and knowledge leakage has become a significant security risk to Australian organizations. Each security incident costs Australian businesses an average of US\$2.8 million. Furthermore, Australian organisations spend the second most worldwide (US\$1.2 million each on average) on investigating and assessing information breaches. The leakage of sensitive organizational information occurs through different avenues, such as social media, cloud computing and mobile devices. In this study, we (1) analyze the knowledge leakage risk (KLR) caused by the use of mobile devices in knowledge-intensive Australian organizations, (2) present a conceptual research model, grounded in the literature, to explain the determinants that influence KLR through the use of mobile devices, (3) conduct interviews with security and knowledge managers to understand what strategies they use to mitigate KLR caused by the use of mobile devices, and (4) use content analysis and the conceptual model to frame the preliminary findings from the interviews. Keywords: Knowledge leakage, mobile devices, mobile contexts, knowledge leakage risk
Mobility-Aware Computation Offloading for Swarm Robotics using Deep Reinforcement Learning
Authors: Xiucheng Wang, Hongzhi Guo
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract
Swarm robotics is envisioned to automate a large number of dirty, dangerous, and dull tasks. Robots have limited energy, computation capability, and communication resources. Therefore, current swarm robotics systems comprise a small number of robots, which can only provide limited spatio-temporal information. In this paper, we propose to leverage mobile edge computing to alleviate the computation burden. We develop an effective solution based on a mobility-aware deep reinforcement learning model at the edge server side for computation scheduling and resource allocation. Our results show that the proposed approach can meet delay requirements and guarantee computation precision with minimum robot energy.
Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images
Authors: Alperen Kalay, Bahri Batuhan Bilecen, Mustafa Ayazoglu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Super-resolution (SR) networks have been investigated for a while, with their mobile and lightweight versions gaining noticeable popularity recently. Quantization, the procedure of decreasing the precision of network parameters (mostly FP32 to INT8), is also utilized in SR networks for establishing mobile compatibility. This study focuses on a very important but mostly overlooked post-training quantization (PTQ) step: the representative dataset (RD), which adjusts the quantization range for PTQ. We propose a novel pipeline (clip-free quantization pipeline, CFQP), backed by extensive experimental justification, that cleverly augments RD images using only the outputs of the FP32 model. Using the proposed pipeline for RD, we can successfully eliminate the unwanted clipped activation layers that nearly all mobile SR methods utilize to make the model more robust to PTQ, at the price of a large runtime overhead. Removing clipped activations with our method brings increased overall stability, inference runtime reduced by up to 54% on some SR models, and better visual quality than INT8 clipped models; it even outperforms some FP32 non-quantized models, in both runtime and visual quality, without the need for retraining with clipped activations.
Multi-Objective Improvement of Android Applications
Abstract
Non-functional properties, such as runtime or memory use, are important to mobile app users and developers, as they affect user experience. Previous work on automated improvement of non-functional properties in mobile apps failed to address the inherent trade-offs between such properties. We propose a practical approach and the first open-source tool, GIDroid (2023), for multi-objective automated improvement of Android apps. In particular, we use genetic improvement, a search-based technique that navigates the space of software variants to find improved software. We use a simulation-based testing framework to greatly improve the speed of the search. GIDroid contains three state-of-the-art multi-objective algorithms and two new mutation operators, which cache the results of method calls. Genetic improvement relies on testing to validate patches. Previous work showed that tests in open-source Android applications are scarce. We thus wrote tests for 21 versions of 7 Android apps, creating a new benchmark for performance improvements. We used GIDroid to improve versions of mobile apps where developers had previously found improvements to runtime, memory, and bandwidth use. Our technique automatically re-discovers 64% of existing improvements. We then applied our approach to current versions of software in which there were no known improvements. We were able to improve execution time by up to 35% and memory use by up to 33% in these apps.
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Authors: Alexander Wong, Saad Abbasi, Saeejith Nair
Abstract
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. However, the architectural and computational complexity of such network architectures has made them challenging to deploy in real-world applications with high-throughput, low-memory requirements. As such, there has been significant research recently on the design of efficient vision transformer architectures. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. Through this generative architecture search process, we create TurboViT, a highly efficient hierarchical vision transformer architecture design that is generated around mask unit attention and Q-pooling design patterns. The resulting TurboViT architecture design achieves significantly lower architectural complexity (>2.47$\times$ smaller than FasterViT-0 while achieving the same accuracy) and computational complexity (>3.4$\times$ fewer FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other state-of-the-art efficient vision transformer network architecture designs within a similar range of accuracy on the ImageNet-1K dataset. Furthermore, TurboViT demonstrated strong inference latency and throughput in both low-latency and batch processing scenarios (>3.21$\times$ lower latency and >3.18$\times$ higher throughput compared to FasterViT-0 for the low-latency scenario). These promising results demonstrate the efficacy of leveraging generative architecture search for generating efficient transformer architecture designs for high-throughput scenarios.
Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users
Authors: Peter Udo Diehl, Hannes Zilly, Felix Sattler, Yosef Singer, Kevin Kepp, Mark Berry, Henning Hasemann, Marlene Zippel, Müge Kaya, Paul Meyer-Rachner, Annett Pudszuhn, Veit M. Hofmann, Matthias Vormann, Elias Sprengel
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
The hearing loss of almost half a billion people is commonly treated with hearing aids. However, current hearing aids often do not work well in real-world noisy environments. We present a deep learning based denoising system that runs in real time on iPhone 7 and Samsung Galaxy S10 (25ms algorithmic latency). The denoised audio is streamed to the hearing aid, resulting in a total delay of around 75ms. In tests with hearing aid users having moderate to severe hearing loss, our denoising system improves audio across three tests: 1) listening for subjective audio ratings, 2) listening for objective speech intelligibility, and 3) live conversations in a noisy environment for subjective ratings. Subjective ratings increase by more than 40%, for both the listening test and the live conversation compared to a fitted hearing aid as a baseline. Speech reception thresholds, measuring speech understanding in noise, improve by 1.6 dB SRT. Ours is the first denoising system that is implemented on a mobile device, streamed directly to users' hearing aids using only a single channel as audio input while improving user satisfaction on all tested aspects, including speech intelligibility. This includes overall preference of the denoised and streamed signal over the hearing aid, thereby accepting the higher latency for the significant improvement in speech understanding.
Keyword: pruning
Graph Neural Network-Enhanced Expectation Propagation Algorithm for MIMO Turbo Receivers
Authors: Xingyu Zhou, Jing Zhang, Chao-Kai Wen, Shi Jin, Shuangfeng Han
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
Deep neural networks (NNs) are considered a powerful tool for balancing the performance and complexity of multiple-input multiple-output (MIMO) receivers due to their accurate feature extraction, high parallelism, and excellent inference ability. Graph NNs (GNNs) have recently demonstrated outstanding capability in learning enhanced message passing rules and have shown success in overcoming the drawback of inaccurate Gaussian approximation of expectation propagation (EP)-based MIMO detectors. However, the application of the GNN-enhanced EP detector to MIMO turbo receivers is underexplored and non-trivial due to the requirement of extrinsic information for iterative processing. This paper proposes a GNN-enhanced EP algorithm for MIMO turbo receivers, which realizes the turbo principle of generating extrinsic information from the MIMO detector through a specially designed training procedure. Additionally, an edge pruning strategy is designed to eliminate redundant connections in the original fully connected model of the GNN utilizing the correlation information inherently from the EP algorithm. Edge pruning reduces the computational cost dramatically and enables the network to focus more attention on the weights that are vital for performance. Simulation results and complexity analysis indicate that the proposed MIMO turbo receiver outperforms the EP turbo approaches by over 1 dB at the bit error rate of $10^{-5}$, exhibits performance equivalent to state-of-the-art receivers with 2.5 times shorter running time, and adapts to various scenarios.
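A hypothetical sketch of correlation-based edge pruning as described: edges of the fully connected GNN whose (EP-derived) correlation magnitude is smallest are dropped. The selection rule and keep-ratio are illustrative, not the paper's exact criterion.

```python
import numpy as np

def prune_edges(edge_index, corr, keep_ratio=0.5):
    """Keep only the keep_ratio fraction of edges with the largest
    correlation magnitude; edge_index has shape (2, E)."""
    scores = np.abs(corr)                       # one correlation score per edge
    k = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(scores)[-k:]              # indices of the strongest edges
    return edge_index[:, keep]

# 6 directed edges among 3 antennas with illustrative correlation values
edge_index = np.array([[0, 0, 1, 1, 2, 2],
                       [1, 2, 0, 2, 0, 1]])
corr = np.array([0.9, 0.1, 0.8, 0.05, 0.2, 0.7])
print(prune_edges(edge_index, corr, keep_ratio=0.5))   # keeps the 3 strongest edges
```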
Keyword: diffusion
Diffusion Model as Representation Learner
Authors: Xingyi Yang, Xinchao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks. Despite their promise, however, the learned representations of pre-trained DPMs have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for recognition tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance representation learning with regularizing model capacity. Building on this observation, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, in which the optimal time step is determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. The code is available at \url{https://github.com/Adamdad/Repfusion}.
Hey That's Mine Imperceptible Watermarks are Preserved in Diffusion Generated Outputs
Authors: Luke Ditria, Tom Drummond
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
Generative models have seen an explosion in popularity with the release of huge generative Diffusion models like Midjourney and Stable Diffusion to the public. Because of this new ease of access, questions surrounding the automated collection of data and issues regarding content ownership have started to build. In this paper we present new work which aims to provide ways of protecting content when it is shared with the public. We show that a generative Diffusion model trained on data that has been imperceptibly watermarked will generate new images with these watermarks present. We further show that if a given watermark is correlated with a certain feature of the training data, the generated images will also have this correlation. Using statistical tests we show that we are able to determine whether a model has been trained on marked data, and which data was marked. As a result, our system offers a solution to protect intellectual property when sharing content online.
DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment
Authors: Xujie Zhang, Binbin Yang, Michael C. Kampffmeyer, Wenqing Zhang, Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, Xiaodan Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments and modify their designs via flexible linguistic interfaces. Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate the part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (AP) and the visual garment parts, which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention to preserve the spatial structure similarities between the attention maps of attribute adjectives and part nouns in each AP. Moreover, DiffCloth allows for manipulation of the generated results by simply replacing APs in the text prompts. The manipulation-irrelevant regions are recognized by blended masks obtained from the bundled attention maps of the APs and kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
MusicJam: Visualizing Music Insights via Generated Narrative Illustrations
Authors: Chuer Chen, Nan Cao, Jiani Hou, Yi Guo, Yulei Zhang, Yang Shi
Abstract
Visualizing the insights of invisible music can bring listeners an enjoyable and immersive listening experience, and has therefore attracted much attention in the field of information visualization. Over the past decades, various music visualization techniques have been introduced. However, most of them are manually designed following visual encoding rules and are thus shown as graphical representations whose encoding schemas usually take effort to understand. Recently, some researchers have used figures or illustrations to represent music moods, lyrics, and musical features, which are more intuitive and attractive. However, in these techniques, the figures are usually pre-selected or statically generated, so they cannot precisely convey the insights of different pieces of music. To address this issue, in this paper, we introduce MusicJam, a music visualization system that is able to generate narrative illustrations to represent the insights of the input music. The system leverages a novel generation model designed based on GPT-2 to generate meaningful lyrics given the input music and then employs the stable diffusion model to transform the lyrics into coherent illustrations. Finally, the generated results are synchronized and rendered as an MP4 video accompanied by the input music. We evaluated the proposed lyric generation model by comparing it to baseline models and conducted a user study to estimate the quality of the generated illustrations and the final music videos. The results showed the power of our technique.
MatFuse: Controllable Material Generation with Diffusion Models
Authors: Giuseppe Vecchio, Renato Sortino, Simone Palazzo, Concetto Spampinato
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Abstract
Creating high quality and realistic materials in computer graphics is a challenging and time-consuming task, which requires great expertise. In this paper, we present MatFuse, a novel unified approach that harnesses the generative power of diffusion models (DM) to simplify the creation of SVBRDF maps. Our DM-based pipeline integrates multiple sources of conditioning, such as color palettes, sketches, and pictures, enabling fine-grained control and flexibility in material synthesis. This design allows for the combination of diverse information sources (e.g., sketch + image embedding), enhancing creative possibilities in line with the principle of compositionality. We demonstrate the generative capabilities of the proposed method under various conditioning settings; on the SVBRDF estimation task, we show that our method yields performance comparable to state-of-the-art approaches, both qualitatively and quantitatively.
SDeMorph: Towards Better Facial De-morphing from Single Morph
Authors: Nitish Shukla
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Face Recognition Systems (FRS) are vulnerable to morph attacks. A face morph is created by combining multiple identities with the intention of fooling the FRS into matching the morph with multiple identities. Current Morph Attack Detection (MAD) methods can detect the morph but are unable to recover the identities used to create it with satisfactory outcomes. Existing work on de-morphing is mostly reference-based, i.e., it requires the availability of one identity to recover the other. Sudipta et al. \cite{ref9} proposed a reference-free de-morphing technique, but the visual realism of the produced outputs was feeble. In this work, we propose SDeMorph (Stably Diffused De-morpher), a novel de-morphing method that is reference-free and recovers the identities of bona fides. Our method produces feature-rich outputs that are of significantly high quality in terms of definition and facial fidelity. Our method utilizes Denoising Diffusion Probabilistic Models (DDPM) by destroying the input morphed signal and then reconstructing it back using a branched UNet. Experiments on the ASML, FRLL-FaceMorph, FRLL-MorDIFF, and SMDD datasets support the effectiveness of the proposed method.
Convergence guarantee for consistency models
Authors: Junlong Lyu, Zhitang Chen, Shoubo Feng
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Probability (math.PR)
Abstract
We provide the first convergence guarantees for the Consistency Models (CMs), a newly emerging type of one-step generative models that can generate samples comparable to those generated by Diffusion Models. Our main result is that, under basic assumptions on score-matching errors, consistency errors, and smoothness of the data distribution, CMs can efficiently sample from any realistic data distribution in one step with small $W_2$ error. Our results (1) hold for $L^2$-accurate score and consistency assumptions (rather than $L^\infty$-accurate); (2) do not require strong assumptions on the data distribution such as a log-Sobolev inequality; (3) scale polynomially in all parameters; and (4) match the state-of-the-art convergence guarantee for score-based generative models (SGMs). We also show that the Multistep Consistency Sampling procedure can further reduce the error compared to one-step sampling, which supports the original claim of "Consistency Models" (Song et al., 2023). Our results further imply a total-variation (TV) error guarantee when Langevin-based modifications are applied to the output distributions.
IT3D: Improved Text-to-3D Generation with Explicit View Synthesis
Authors: Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.
Keyword: adaptive
Hierarchical Lowrank Arithmetic with Binary Compression
Authors: Ronald Kriemann
Subjects: Mathematical Software (cs.MS); Data Structures and Algorithms (cs.DS)
Abstract
With lowrank approximation, the storage requirements for dense data are reduced to linear complexity, and with the addition of hierarchy this also works for data without global lowrank properties. However, the lowrank factors themselves are often still stored using double-precision numbers. Newer approaches exploit the different IEEE754 floating-point formats available nowadays in a mixed-precision approach. However, these formats show a significant gap in storage (and accuracy), e.g. between half, single and double precision. We therefore look beyond these standard formats, use adaptive compression for storing the lowrank and dense data, and investigate how that affects the arithmetic of such matrices.
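A toy stand-in for the basic idea, using truncated SVD plus a cheaper IEEE754 format for the factors; the paper's adaptive compression goes beyond these standard formats and operates blockwise within a hierarchical matrix.

```python
import numpy as np

def lowrank_compress(A, eps=1e-6):
    """Truncated-SVD factorisation A ~ U_c @ V_c with the factors stored
    in float32 instead of float64."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s / s[0] > eps))            # rank for relative tolerance eps
    U_c = (U[:, :r] * s[:r]).astype(np.float32)
    V_c = Vt[:r].astype(np.float32)
    return U_c, V_c

x = np.linspace(0, 1, 400)
A = 1.0 / (2.0 + x[:, None] + x[None, :])      # smooth kernel -> numerically lowrank
U_c, V_c = lowrank_compress(A)
err = np.linalg.norm(A - U_c.astype(np.float64) @ V_c) / np.linalg.norm(A)
print(f"rank {U_c.shape[1]}, rel. error {err:.2e}, "
      f"compression {A.nbytes / (U_c.nbytes + V_c.nbytes):.1f}x")
```

The interplay visible even in this toy, between truncation rank and factor precision, is exactly what an adaptive scheme tunes per block.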
Personalized Event Prediction for Electronic Health Records
Abstract
Clinical event sequences consist of hundreds of clinical events that represent records of patient care in time. Developing accurate predictive models of such sequences is of great importance for supporting a variety of models for interpreting/classifying the current patient condition or predicting adverse clinical events and outcomes, all aimed at improving patient care. One important challenge in learning predictive models of clinical sequences is their patient-specific variability. Based on underlying clinical conditions, each patient's sequence may consist of different sets of clinical events (observations, lab results, medications, procedures). Hence, simple population-wide models learned from event sequences of many different patients may not accurately predict patient-specific dynamics of event sequences and their differences. To address this problem, we propose and investigate multiple new event sequence prediction models and methods that let us better adjust the prediction for individual patients and their specific conditions. The methods developed in this work pursue refinement of population-wide models to subpopulations, self-adaptation, and a meta-level model-switching mechanism able to adaptively select the model with the best chance of supporting the immediate prediction. We analyze and test the performance of these models on clinical event sequences of patients in the MIMIC-III database.
LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction
Authors: Haesoo Chung, Nam Ik Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
As demands for high-quality videos continue to rise, high-resolution and high-dynamic range (HDR) imaging techniques are drawing attention. To generate an HDR video from low dynamic range (LDR) images, one of the critical steps is the motion compensation between LDR frames, for which most existing works employed the optical flow algorithm. However, these methods suffer from flow estimation errors when saturation or complicated motions exist. In this paper, we propose an end-to-end HDR video composition framework, which aligns LDR frames in the feature space and then merges aligned features into an HDR frame, without relying on pixel-domain optical flow. Specifically, we propose a luminance-based alignment network for HDR (LAN-HDR) consisting of an alignment module and a hallucination module. The alignment module aligns a frame to the adjacent reference by evaluating luminance-based attention, excluding color information. The hallucination module generates sharp details, especially for washed-out areas due to saturation. The aligned and hallucinated features are then blended adaptively to complement each other. Finally, we merge the features to generate a final HDR frame. In training, we adopt a temporal loss, in addition to frame reconstruction losses, to enhance temporal consistency and thus reduce flickering. Extensive experiments demonstrate that our method performs better or comparable to state-of-the-art methods on several benchmarks.
Meta-Stock: Task-Difficulty-Adaptive Meta-learning for Sub-new Stock Price Prediction
Authors: Linghao Wang, Zhen Liu, Peitian Ma, Qianli Ma
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Sub-new stock price prediction, forecasting the price trends of stocks listed for less than one year, is crucial for effective quantitative trading. While deep learning methods have demonstrated effectiveness in predicting old stock prices, they require large training datasets that are unavailable for sub-new stocks. In this paper, we propose Meta-Stock, a task-difficulty-adaptive meta-learning approach for sub-new stock price prediction. Leveraging prediction tasks formulated from old stocks, our meta-learning method aims to acquire a fast generalization ability that can be further adapted to sub-new stock price prediction tasks, thereby addressing the data scarcity of sub-new stocks. Moreover, we enhance the meta-learning process by incorporating an adaptive learning strategy sensitive to varying task difficulties. Through the wavelet transform, we extract high-frequency coefficients that manifest stock price volatility. This allows the meta-learning model to assign gradient weights based on volatility-quantified task difficulty. Extensive experiments on datasets collected from three stock markets spanning twenty-two years demonstrate that Meta-Stock significantly outperforms previous methods and shows strong applicability in real-world stock trading. Besides, we evaluate the reasonability of the task difficulty quantification and the effectiveness of the adaptive learning strategy.
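A hypothetical sketch of the volatility-based difficulty signal, assuming the PyWavelets package; the energy measure and the difficulty-to-weight mapping are illustrative, not Meta-Stock's exact formulas.

```python
import numpy as np
import pywt

def task_difficulty(prices, wavelet='db4', level=2):
    """Quantify a price series' volatility by the energy of its finest-scale
    wavelet detail coefficients; higher energy -> harder prediction task."""
    coeffs = pywt.wavedec(np.asarray(prices, dtype=float), wavelet, level=level)
    detail = coeffs[-1]                 # finest-scale (highest-frequency) details
    return float(np.mean(detail ** 2))

rng = np.random.default_rng(0)
calm  = np.cumsum(0.1 * rng.standard_normal(64)) + 100   # low-volatility walk
jumpy = np.cumsum(1.5 * rng.standard_normal(64)) + 100   # high-volatility walk
d = np.array([task_difficulty(calm), task_difficulty(jumpy)])

# Softmax-style mapping from difficulty to per-task gradient weights
w = np.exp(d) / np.sum(np.exp(d))
print(d, w)                            # the jumpy task receives the larger weight
```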
High Dynamic Range Imaging of Dynamic Scenes with Saturation Compensation but without Explicit Motion Compensation
Authors: Haesoo Chung, Nam Ik Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
High dynamic range (HDR) imaging is a highly challenging task since a large amount of information is lost due to the limitations of camera sensors. For HDR imaging, some methods capture multiple low dynamic range (LDR) images with varying exposures to aggregate more information. However, these approaches introduce ghosting artifacts when significant inter-frame motions are present. Moreover, although multi-exposure images are given, we have little information in severely over-exposed areas. Most existing methods focus on motion compensation, i.e., alignment of multiple LDR shots to reduce the ghosting artifacts, but they still produce unsatisfactory results. These methods also rather overlook the need to restore the saturated areas. In this paper, we generate well-aligned multi-exposure features by reformulating the motion alignment problem into a simple brightness adjustment problem. In addition, we propose a coarse-to-fine merging strategy with explicit saturation compensation. The saturated areas are reconstructed with similar well-exposed content using adaptive contextual attention. We demonstrate that our method outperforms state-of-the-art methods in qualitative and quantitative evaluations.
TOPIC: A Parallel Association Paradigm for Multi-Object Tracking under Complex Motions and Diverse Scenes
Abstract
Video data and algorithms have been driving advances in multi-object tracking (MOT). While existing MOT datasets focus on occlusion and appearance similarity, complex motion patterns are widespread yet overlooked. To address this issue, we introduce a new dataset called BEE23 to highlight complex motions. Identity association algorithms have long been the focus of MOT research. Existing trackers can be categorized into two association paradigms: the single-feature paradigm (based on either the motion or the appearance feature) and the serial paradigm (one feature serves as secondary while the other is primary). However, these paradigms are incapable of fully utilizing different features. In this paper, we propose a parallel paradigm and present the Two rOund Parallel matchIng meChanism (TOPIC) to implement it. TOPIC leverages both motion and appearance features and can adaptively select the preferable one as the assignment metric based on the motion level. Moreover, we provide an Attention-based Appearance Reconstruction Module (AARM) to reconstruct appearance feature embeddings, thus enhancing the representation of appearance features. Comprehensive experiments show that our approach achieves state-of-the-art performance on four public datasets and BEE23. Notably, our proposed parallel paradigm surpasses the performance of existing association paradigms by a large margin, e.g., reducing false negatives by 12% to 51% compared to the single-feature association paradigm. The dataset and association paradigm introduced in this work offer a fresh perspective for advancing the MOT field. The source code and dataset are available at https://github.com/holmescao/TOPICTrack.
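An illustrative, heavily simplified sketch of choosing between motion and appearance cost matrices by motion level before assignment; TOPIC's actual two-round parallel mechanism is more involved, and the threshold and cost values below are invented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_cost, appearance_cost, motion_level, thresh=0.5):
    """Pick the assignment metric by motion level (high motion -> appearance
    tends to be more reliable), then solve the matching with the Hungarian
    algorithm. Rows are tracks, columns are detections."""
    cost = appearance_cost if motion_level > thresh else motion_cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

motion_cost = np.array([[0.2, 0.9], [0.8, 0.3]])       # IoU/Mahalanobis-style costs
appearance_cost = np.array([[0.7, 0.1], [0.2, 0.8]])   # embedding distances
print(associate(motion_cost, appearance_cost, motion_level=0.9))
# -> [(0, 1), (1, 0)]  (the appearance metric drives the match here)
```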
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
Abstract
The goal of sequential recommendation (SR) is to predict the items a user is potentially interested in based on her/his historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread use, often underperform with sparse IDs and struggle with the cold-start problem. Besides, inconsistent ID mappings hinder the model's transferability, isolating similar recommendation domains that could have been co-optimized. This paper aims to address these issues by exploring the potential of multi-modal information in learning robust and generalizable sequence representations. We propose MISSRec, a multi-modal pre-training and transfer learning framework for SR. On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal synergy while a novel interest-aware decoder is developed to grasp item-modality-interest relations for better sequence representation. On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representations, providing more precise matching between users and items. We pre-train the model with contrastive learning objectives and fine-tune it in an efficient manner. Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, promising a practical solution for real-world recommendation scenarios.
A Simple Framework for Multi-mode Spatial-Temporal Data Modeling
Abstract
Spatial-temporal data modeling aims to mine the underlying spatial relationships and temporal dependencies of objects in a system. However, most existing methods model spatial-temporal data in a single mode, lacking an understanding of multiple modes. Although a few recent methods have been proposed to learn multi-mode relationships, they are built on complicated components with high model complexity. In this paper, we propose a simple framework for multi-mode spatial-temporal data modeling that brings both effectiveness and efficiency together. Specifically, we design a general cross-mode spatial relationships learning component to adaptively establish connections between multiple modes and propagate information along the learned connections. Moreover, we employ multi-layer perceptrons to capture the temporal dependencies and channel correlations, which are conceptually and technically succinct. Experiments on three real-world datasets show that our model consistently outperforms the baselines with lower space and time complexity, opening up a promising direction for modeling spatial-temporal data. The generalizability of the cross-mode spatial relationships learning module is also validated.
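A common way to realize such adaptively learned connections is a dense adjacency derived from trainable node embeddings; the sketch below follows that generic pattern (the class and parameter names are assumptions, not the paper's component):

```python
import torch
import torch.nn as nn

class CrossModeGraph(nn.Module):
    """Learn a dense adjacency between the nodes of two modes from trainable
    node embeddings, then propagate features along it. A common adaptive-graph
    pattern, not the paper's exact component."""
    def __init__(self, n_src, n_dst, dim):
        super().__init__()
        self.src_emb = nn.Parameter(torch.randn(n_src, dim))
        self.dst_emb = nn.Parameter(torch.randn(n_dst, dim))

    def forward(self, x_dst):                    # x_dst: (batch, n_dst, feat)
        adj = torch.softmax(self.src_emb @ self.dst_emb.t(), dim=-1)  # (n_src, n_dst)
        return torch.einsum('sd,bdf->bsf', adj, x_dst)                # aggregate into src mode

out = CrossModeGraph(n_src=10, n_dst=12, dim=16)(torch.randn(4, 12, 32))
```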
Edge-Centric Space Rescaling with Redirected Walking for Dissimilar Physical-Virtual Space Registration
Abstract
We propose a novel space-rescaling technique for registering dissimilar physical-virtual spaces by utilizing the effects of adjusting physical space with redirected walking. Achieving a seamless immersive Virtual Reality (VR) experience requires overcoming the spatial heterogeneity between the physical and virtual spaces and accurately aligning the VR environment with the user's tracked physical space. However, existing space-matching algorithms that rely on one-to-one scale mapping are inadequate when dealing with highly dissimilar physical and virtual spaces, and redirected walking controllers cannot utilize basic geometric information from the physical space in the virtual space due to coordinate distortion. To address these issues, we apply relative translation gains to partitioned space grids based on the main interactable object's edge, which enables space-adaptive modification of the physical space without coordinate distortion. Our evaluation results demonstrate the effectiveness of our algorithm in aligning the main object's edge, surface, and wall, as well as securing the largest registered area compared to alternative methods under all conditions. These findings can be used to create an immersive play area for VR content where users can receive passive feedback from the planes and edges in their physical environment.
Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks
Abstract
Artificial Intelligence (AI) has found wide application but also poses risks due to unintentional or malicious tampering during deployment. Regular checks are therefore necessary to detect and prevent such risks. Fragile watermarking is a technique used to identify tampering in AI models. However, previous methods have faced challenges including risks of omission, additional information transmission, and the inability to locate tampering precisely. In this paper, we propose a method for detecting tampered parameters and bits, which can be used to detect, locate, and restore parameters that have been tampered with. We also propose an adaptive embedding method that maximizes information capacity while maintaining model accuracy. Our approach was tested on multiple neural networks subjected to attacks that modified weight parameters, and the results demonstrate strong recovery performance when the modification rate was below 20%. Furthermore, for models where watermarking significantly affected accuracy, we utilized an adaptive bit technique to recover more than 15% of the model's accuracy loss.
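To see how a fragile watermark can both detect and localize tampering, consider this toy parity scheme embedded in the mantissa LSBs of float32 weights; it is our own illustrative stand-in, not the paper's self-mutual check design:

```python
import numpy as np

def embed_check_bits(weights):
    """Embed a fragile watermark: force the mantissa LSB of each float32 weight
    to a parity bit derived from its remaining 31 bits. Flipping any higher bit
    later breaks the parity and localizes the tampered parameter.
    Toy stand-in, not the paper's self-mutual check scheme."""
    bits = weights.ravel().view(np.uint32).copy()
    parity = np.zeros_like(bits)
    for k in range(1, 32):                       # parity over all bits except the LSB
        parity ^= (bits >> k) & 1
    bits = (bits & ~np.uint32(1)) | parity       # write parity into the LSB
    return bits.view(np.float32).reshape(weights.shape)

def locate_tampering(weights):
    bits = weights.ravel().view(np.uint32)
    parity = np.zeros_like(bits)
    for k in range(1, 32):
        parity ^= (bits >> k) & 1
    return np.nonzero(parity != (bits & 1))[0]   # indices of suspicious parameters

w = embed_check_bits(np.random.randn(100).astype(np.float32))
w.view(np.uint32)[7] ^= np.uint32(1 << 20)       # simulated attack: flip one mantissa bit
print(locate_tampering(w))                       # -> [7]
```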
LEAP: Efficient and Automated Test Method for NLP Software
Authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji, Pengcheng Zhang
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Abstract
The widespread adoption of DNNs in NLP software has highlighted the need for robustness. Researchers have proposed various automated testing techniques for generating adversarial test cases. However, existing methods suffer from two limitations: weak error-discovering capabilities, with success rates ranging from 0% to 24.6% for BERT-based NLP software, and time inefficiency, taking 177.8s to 205.28s per test case, which makes them impractical for time-constrained scenarios. To address these issues, this paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases. Specifically, we adopt Levy flight for population initialization to increase the diversity of the generated test cases. We also design an inertia-weight adaptive update operator to improve the efficiency of LEAP's global optimization over high-dimensional text examples, and a mutation operator based on a greedy strategy to reduce the search time. We conducted a series of experiments to validate LEAP's ability to test NLP software and found that its average success rate in generating adversarial test cases is 79.1%, which is 6.1% higher than the next best approach (PSOattack). While ensuring high success rates, LEAP reduces time overhead by up to 147.6s compared with other heuristic-based methods. Additionally, the experimental results demonstrate that LEAP can generate more transferable test cases and significantly enhance the robustness of DNN-based systems.
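Levy-flight initialization is typically realized with Mantegna's algorithm, which draws heavy-tailed steps so that an initial population mixes local moves with occasional long jumps; a minimal sketch (LEAP's exact operators differ in detail):

```python
import numpy as np
from math import gamma, sin, pi

def levy_steps(shape, beta=1.5, rng=None):
    """Draw heavy-tailed Levy-flight steps via Mantegna's algorithm, commonly
    used to spread an initial particle population widely over the search space.
    Generic sketch, not LEAP's implementation."""
    rng = rng or np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0, sigma_u, shape)
    v = rng.normal(0, 1, shape)
    return u / np.abs(v) ** (1 / beta)

# Initialize a 30-particle swarm around a seed point with occasional long jumps.
seed = np.zeros(20)
population = seed + 0.1 * levy_steps((30, 20))
```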
ProAgent: Building Proactive Cooperative AI with Large Language Models
Abstract
Building AIs with adaptive behaviors in human-AI cooperation stands as a pivotal focus in AGI research. Current methods for developing cooperative agents predominantly rely on learning-based approaches, where policy generalization heavily hinges on past interactions with specific teammates. These approaches constrain the agent's capacity to recalibrate its strategy when confronted with novel teammates. We propose ProAgent, a novel framework that harnesses large language models (LLMs) to fashion a proactive agent empowered with the ability to anticipate teammates' forthcoming decisions and formulate enhanced plans for itself. ProAgent excels at cooperative reasoning with the capacity to dynamically adapt its behavior to enhance collaborative efforts with teammates. Moreover, the ProAgent framework exhibits a high degree of modularity and interpretability, facilitating seamless integration to address a wide array of coordination scenarios. Experimental evaluations conducted within the Overcooked-AI framework unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training in cooperation with AI agents. Further, when cooperating with human proxy models, its performance exhibits an average improvement exceeding 10% compared with the current state-of-the-art, COLE. The advancement was consistently observed across diverse scenarios involving interactions with both AI agents of varying characteristics and human counterparts. These findings inspire future research on human-robot collaboration. For a hands-on demonstration, please visit https://pku-proagent.github.io.
Adaptive Graduated Non-Convexity for Pose Graph Optimization
Authors: Seungwon Choi, Wonseok Kang, Jiseong Chung, Jaehyun Kim, Tae-wan Kim
Abstract
We present a novel approach to robust pose graph optimization based on Graduated Non-Convexity (GNC). Unlike traditional GNC-based methods, the proposed approach employs an adaptive shape function using B-splines to optimize the shape of the robust kernel. This aims to reduce the number of GNC iterations, boosting computational speed without compromising accuracy. When integrated with the open-source riSAM algorithm, the method demonstrates enhanced efficiency across diverse datasets. The accompanying open-source code, available at https://github.com/SNU-DLLAB/AGNC-PGO, aims to encourage further research in this area.
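For context, the vanilla GNC loop the paper builds on alternates weighted least-squares steps with gradual sharpening of a robust kernel; below is a toy scalar version with a Geman-McClure kernel (the adaptive B-spline kernel shaping is the paper's contribution and is not shown):

```python
import numpy as np

def gnc_gm_estimate(data, iters=30, mu_init=1e4, div=1.4):
    """Graduated non-convexity with a Geman-McClure kernel: start from a nearly
    convex surrogate (large mu), take a weighted least-squares step, then
    gradually sharpen the kernel. Toy robust-mean version for illustration."""
    x, mu = np.mean(data), mu_init
    for _ in range(iters):
        r = data - x                             # residuals under the current estimate
        w = (mu / (mu + r ** 2)) ** 2            # GM weights from the surrogate kernel
        x = np.sum(w * data) / np.sum(w)         # weighted LS update for a scalar mean
        mu = max(mu / div, 1.0)                  # graduate toward the non-convex kernel
    return x

inliers = np.random.normal(5.0, 0.1, 50)
data = np.concatenate([inliers, [60.0, 55.0]])   # two gross outliers bias the mean
print(np.mean(data), gnc_gm_estimate(data))      # ~7.0 (biased) vs ~5.0 (robust)
```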
SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation
Authors: Lixiong Qin, Mei Wang, Chao Deng, Ke Wang, Xi Chen, Jiani Hu, Weihong Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, vision transformers have been introduced into face recognition and analysis and have achieved performance breakthroughs. However, most previous methods train a single model or an ensemble of models to perform the desired task, which ignores the synergy among different tasks and fails to achieve improved prediction accuracy, increased data efficiency, and reduced training time. This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer. Our design, SwinFace, consists of a single shared backbone together with a subnet for each set of related tasks. To address the conflicts among multiple tasks and meet their different demands, a Multi-Level Channel Attention (MLCA) module is integrated into each task-specific analysis subnet, which can adaptively select features from the optimal levels and channels to perform the desired tasks. Extensive experiments show that the proposed model has a better understanding of the face and achieves excellent performance on all tasks. In particular, it achieves 90.97% accuracy on RAF-DB and 0.22 $\epsilon$-error on CLAP2015, which are state-of-the-art results on facial expression recognition and age estimation, respectively. The code and models will be made publicly available at https://github.com/lxq1000/SwinFace.
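Channel attention of the kind MLCA builds on can be sketched with a squeeze-and-excitation style gate; this generic block is illustrative, not the MLCA module itself:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate: pool each channel, score it with a
    small MLP, and rescale the feature map so a subnet can emphasize the
    channels most useful for its task. Generic block, not MLCA itself."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        gate = self.fc(x.mean(dim=(2, 3)))       # (B, C) per-channel descriptors
        return x * gate[:, :, None, None]        # rescale channels

y = ChannelAttention(64)(torch.randn(2, 64, 14, 14))
```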
Analyzing Quantization in TVM
Abstract
There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also has the ability to quantize weights and support low-bit computations. Although quantization is typically expected to improve inference time, in TVM the performance of 8-bit quantization does not meet expectations: applying 8-bit quantization to a deep learning model is usually expected to achieve around 50% of the full-precision inference time, but in this case the quantized version not only fails to achieve the desired speedup, it actually performs worse, with an inference time roughly twice as slow as the non-quantized version. In this project, we thoroughly investigate the reasons behind this underperformance and assess the compatibility and optimization opportunities of 8-bit quantization in TVM. We discuss the optimization of two different types of tasks, computation-bound and memory-bound, and provide a detailed comparison of various optimization techniques in TVM. Through the identification of performance issues, we successfully improved quantization by addressing a bug in graph building. Furthermore, we analyze multiple optimization strategies to achieve the optimal quantization result. The best experiment achieves a 163.88% inference-time improvement over the TVM-compiled baseline for the compute-bound task and 194.98% for the memory-bound task.
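The arithmetic behind int8 inference helps explain the compute- versus memory-bound distinction: the matmul itself gets cheaper, but the surrounding quantize/dequantize passes are pure memory traffic. A generic sketch of symmetric per-tensor quantization (not TVM's internal kernels):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: one scale maps float values into
    [-127, 127]. The extra quantize/dequantize passes are pure memory traffic,
    which is one reason int8 can lose on memory-bound operators."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # accumulate in int32
    return acc.astype(np.float32) * (sa * sb)        # dequantize the result

ref = np.random.randn(64, 64).astype(np.float32)
print(np.abs(int8_matmul(ref, ref) - ref @ ref).max())  # small quantization error
```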
Distributed Energy Resource Management: All-Time Resource-Demand Feasibility, Delay-Tolerance, Nonlinearity, and Beyond
Abstract
In this work, we propose distributed and networked energy management scenarios to optimize the production and reservation of energy among a set of distributed energy nodes. In other words, the idea is to optimally allocate the generated and reserved powers based on the nodes' local cost-gradient information while meeting the energy demand. One main concern is all-time (or anytime) resource-demand feasibility, meaning that at every iteration of the scheduling algorithm the balance between the produced power and the demand plus reserved power must hold. Another concern is to design algorithms that tolerate communication time-delays and changes in the network. Further, one can incorporate model nonlinearity in the algorithm to address both inherent (e.g., saturation and quantization) and purposefully added (e.g., signum-based) nonlinearities in the model. The proposed optimal allocation algorithm addresses all of the above concerns while benefiting from the features of distributed (or networked) solutions, such as having no single node of failure and distributed information processing. We show both the all-time feasibility of the proposed scheme and its convergence under a certain bound on the step rate using Lyapunov-type proofs.
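The all-time feasibility property can be illustrated with a classic Laplacian-gradient allocation update: with symmetric link weights, the per-step changes sum to zero, so a feasible start stays feasible at every iteration. A delay-free toy sketch with quadratic costs (the step size and network here are assumptions, not the paper's algorithm):

```python
import numpy as np

def allocate(grad, x, adj, demand, step=0.05, iters=300):
    """Laplacian-gradient allocation: each node nudges its output toward
    neighbors with cheaper marginal cost. With symmetric weights the update
    sums to zero, so sum(x) == demand holds at every iteration (all-time
    feasibility), provided x starts feasible. Illustrative, delay-free version."""
    assert np.isclose(x.sum(), demand)
    for _ in range(iters):
        g = grad(x)
        x = x + step * (adj * (g[None, :] - g[:, None])).sum(axis=1)
    return x

# Quadratic costs f_i(x) = a_i * x^2 on a 4-node ring; optimum equalizes gradients.
a = np.array([1.0, 2.0, 0.5, 1.5])
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
x0 = np.full(4, 25.0)                            # feasible start: sums to demand = 100
print(allocate(lambda x: 2 * a * x, x0, adj, demand=100.0))  # ~[24, 12, 48, 16]
```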
DeepBurning-MixQ: An Open Source Mixed-Precision Neural Network Accelerator Design Framework for FPGAs
Abstract
Mixed-precision neural networks (MPNNs), which use just enough data width for a deep learning task, promise significant advantages in both inference accuracy and computing overhead. FPGAs, with their fine-grained reconfiguration capability, can adapt the processing to distinct data widths and models and hence can theoretically unleash the potential of MPNNs. Nevertheless, commodity DPUs on FPGAs mostly emphasize generality and have limited support for MPNNs, especially those with lower data widths. In addition, primitive DSPs in FPGAs usually have much larger data widths than MPNNs require and have not been sufficiently co-explored with MPNNs yet. To this end, we propose an open-source MPNN accelerator design framework specifically tailored for FPGAs. In this framework, a systematic DSP-packing algorithm packs multiple lower-data-width MACs into a single primitive DSP, enabling efficient implementation of MPNNs. Meanwhile, we take DSP packing efficiency into consideration during MPNN quantization within a unified neural architecture search (NAS) framework, so that the search is aware of the DSP overhead and optimizes MPNN performance and accuracy concurrently. Finally, the optimized MPNN is fine-tuned and mapped to a fully pipelined neural network accelerator template based on HLS that makes the best use of available resources for higher performance. Our experiments reveal that the accelerators produced by the proposed framework achieve substantial advantages in performance, resource utilization, and inference accuracy for MPNNs compared with both handcrafted counterparts and prior hardware-aware neural network accelerators on FPGAs.
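The packing arithmetic can be shown with plain integers: two low-width activations placed in disjoint bit fields share one wide multiplication, and a guard band keeps the partial products from colliding. The sketch below covers the unsigned case only; real DSP packing must also handle signed operands and accumulation:

```python
# Two 4-bit activations share one wide multiplier: place them in disjoint bit
# fields, multiply by the weight once, and slice the partial products back out.
# Pure-Python illustration of the packing arithmetic, not the paper's algorithm.
GUARD = 12                                       # 4b x 4b product fits in 8 bits; 12 is safe

def packed_mul(a_hi, a_lo, w):
    packed = (a_hi << GUARD) | a_lo              # two activations in one operand
    product = packed * w                         # single wide multiplication
    return product >> GUARD, product & ((1 << GUARD) - 1)

a_hi, a_lo, w = 13, 7, 11
assert packed_mul(a_hi, a_lo, w) == (a_hi * w, a_lo * w)
```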
Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images
Authors: Alperen Kalay, Bahri Batuhan Bilecen, Mustafa Ayazoglu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Super-resolution (SR) networks have been investigated for a while, with their mobile and lightweight versions gaining noticeable popularity recently. Quantization, the procedure of decreasing the precision of network parameters (mostly FP32 to INT8), is also utilized in SR networks to establish mobile compatibility. This study focuses on a very important but mostly overlooked post-training quantization (PTQ) step: the representative dataset (RD), which sets the quantization ranges for PTQ. We propose a novel pipeline (clip-free quantization pipeline, CFQP), backed by extensive experimental justification, that augments RD images using only the outputs of the FP32 model. Using the proposed pipeline for the RD, we can successfully eliminate unwanted clipped activation layers, which nearly all mobile SR methods use to make the model more robust to PTQ at the cost of a large runtime overhead. Removing clipped activations with our method yields increased overall stability, inference runtime reduced by up to 54% on some SR models, and better visual quality than INT8 clipped models; it even outperforms some non-quantized FP32 models in both runtime and visual quality, without retraining with clipped activations.
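Why the RD matters is easy to see in a minimal PTQ calibration loop: the activation statistics collected over the RD directly determine the int8 scales, so augmenting the RD changes quantization quality without touching the model. A generic sketch (the percentile rule and names are assumptions, not CFQP itself):

```python
import numpy as np

def calibrate_scale(model_fp32, rd_images, percentile=99.9):
    """Generic PTQ calibration: run the FP32 model over the representative
    dataset, record activation magnitudes, and derive an int8 scale. The RD
    directly sets the quantization range, which is why augmenting it (as CFQP
    does) changes PTQ quality without touching the model. Illustrative sketch."""
    peaks = [np.percentile(np.abs(model_fp32(img)), percentile) for img in rd_images]
    return float(np.median(peaks)) / 127.0

# Toy stand-in for a model's activation map.
model = lambda img: img * 3.0 + 0.1
rd = [np.random.rand(8, 8).astype(np.float32) for _ in range(16)]
scale = calibrate_scale(model, rd)
```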
Keyword: efficient
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data
Deep Semi-supervised Anomaly Detection with Metapath-based Context Knowledge
Humanoid Robot Co-Design: Coupling Hardware Design with Gait Generation via Hybrid Zero Dynamics
Decentralized Multi-Robot Social Navigation in Constrained Environments via Game-Theoretic Control Barrier Functions
Optimizing Sectorized Wireless Networks: Model, Analysis, and Algorithm
ERA*: Enhanced Relaxed A* algorithm for Solving the Shortest Path Problem in Regular Grid Maps
AI For Fraud Awareness
GSA to HDL: Towards principled generation of dynamically scheduled circuits
Neural Amortized Inference for Nested Multi-agent Reasoning
Matrix Completion over Finite Fields: Bounds and Belief Propagation Algorithms
Embedded Object Detection and Mapping in Soft Materials Using Optical Tactile Sensing
Collaborative Route Planning of UAVs, Workers and Cars for Crowdsensing in Disaster Response
Meta-Stock: Task-Difficulty-Adaptive Meta-learning for Sub-new Stock Price Prediction
Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection
Efficient View Synthesis with Neural Radiance Distribution Field
Exploring Unsupervised Cell Recognition with Prior Self-activation Maps
LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (Practical Experience Report)
Energy-Efficient On-Board Radio Resource Management for Satellite Communications via Neuromorphic Computing
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
ConcatPlexer: Additional Dim1 Batching for Faster ViTs
Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models
Minwise-Independent Permutations with Insertion and Deletion of Features
Multi-User Modular XL-MIMO Communications: Near-Field and Beam Focusing Pattern and User Grouping
Octopus: A Heterogeneous In-network Computing Accelerator Enabling Deep Learning for network
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
DeepBurning-MixQ: An Open Source Mixed-Precision Neural Network Accelerator Design Framework for FPGAs
Higher-order multi-scale method for high-accuracy nonlinear thermo-mechanical simulation of heterogeneous shells
Enhancing Interpretable Object Abstraction via Clustering-based Slot Initialization
Tackling the Curse of Dimensionality in Large-scale Multi-agent LTL Task Planning via Poset Product
Does society show differential attention to researchers based on gender and field?
Targeted Data Augmentation for bias mitigation
Statistical higher-order multi-scale method for nonlinear thermo-mechanical simulation of random composite materials with temperature-dependent properties
A Partially Observable Deep Multi-Agent Active Inference Framework for Resource Allocation in 6G and Beyond Wireless Communications Networks
Achievable Sum-rate of variants of QAM over Gaussian Multiple Access Channel with and without security
Supply Function Equilibrium in Networked Electricity Markets
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Convergence guarantee for consistency models
Multitemporal analysis in Google Earth Engine for detecting urban changes using optical data and machine learning algorithms
An iterative method for Helmholtz boundary value problems arising in wave propagation
Pose2Gait: Extracting Gait Features from Monocular Video of Individuals with Dementia
Four years of multi-modal odometry and mapping on the rail vehicles
L^2R: Lifelong Learning for First-stage Retrieval with Backward-Compatible Representations
Learning Representations on Logs for AIOps
G3Reg: Pyramid Graph-based Global Registration using Gaussian Ellipsoid Model
Dynamically Orthogonal Approximation for Stochastic Differential Equations
Towards Universal Interaction for Extended Reality
Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models
Keyword: faster
ERA*: Enhanced Relaxed A* algorithm for Solving the Shortest Path Problem in Regular Grid Maps
Finding Small Complete Subgraphs Efficiently
ReFit: Recurrent Fitting Network for 3D Human Recovery
Protect Federated Learning Against Backdoor Attacks via Data-Free Trigger Generation
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Keyword: mobile
Addressing Knowledge Leakage Risk caused by the use of mobile devices in Australian Organizations
Mobility-Aware Computation Offloading for Swarm Robotics using Deep Reinforcement Learning
Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images
Multi-Objective Improvement of Android Applications
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users
Keyword: pruning
Graph Neural Network-Enhanced Expectation Propagation Algorithm for MIMO Turbo Receivers
Keyword: diffusion
Diffusion Model as Representation Learner
Hey That's Mine Imperceptible Watermarks are Preserved in Diffusion Generated Outputs
DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment
MusicJam: Visualizing Music Insights via Generated Narrative Illustrations
MatFuse: Controllable Material Generation with Diffusion Models
SDeMorph: Towards Better Facial De-morphing from Single Morph
Convergence guarantee for consistency models
IT3D: Improved Text-to-3D Generation with Explicit View Synthesis
Keyword: adaptive
Hierarchical Lowrank Arithmetic with Binary Compression
Personalized Event Prediction for Electronic Health Records
LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction
Meta-Stock: Task-Difficulty-Adaptive Meta-learning for Sub-new Stock Price Prediction
High Dynamic Range Imaging of Dynamic Scenes with Saturation Compensation but without Explicit Motion Compensation
TOPIC: A Parallel Association Paradigm for Multi-Object Tracking under Complex Motions and Diverse Scenes
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
A Simple Framework for Multi-mode Spatial-Temporal Data Modeling
Edge-Centric Space Rescaling with Redirected Walking for Dissimilar Physical-Virtual Space Registration
Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks
LEAP: Efficient and Automated Test Method for NLP Software
ProAgent: Building Proactive Cooperative AI with Large Language Models
Adaptive Graduated Non-Convexity for Pose Graph Optimization
SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation
Keyword: quantization
Analyzing Quantization in TVM
Distributed Energy Resource Management: All-Time Resource-Demand Feasibility, Delay-Tolerance, Nonlinearity, and Beyond
DeepBurning-MixQ: An Open Source Mixed-Precision Neural Network Accelerator Design Framework for FPGAs
Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images