Abstract
Recently, foundation models have exhibited remarkable advancements in multi-modal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL, whereby only a small fraction of the model parameters are optimized and communicated. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon: data heterogeneity across clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Adapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for efficient knowledge transfer. FedDAT is the first approach that enables efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms existing centralized PEFT methods adapted for FL.
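The abstract does not spell out the MKD objective, but a common form of mutual distillation is a bidirectional KL divergence between the softened predictions of two models (here, a client's local model and the DAT regularizer). A minimal sketch under that assumption, with hypothetical names:

```python
import torch.nn.functional as F

def mkd_loss(logits_a, logits_b, temperature=2.0):
    """Bidirectional KL between two models' softened predictions;
    a generic mutual-distillation form, not FedDAT's exact objective."""
    t = temperature
    kl_ab = F.kl_div(F.log_softmax(logits_a / t, dim=-1),
                     F.softmax(logits_b / t, dim=-1),
                     reduction="batchmean") * t * t
    kl_ba = F.kl_div(F.log_softmax(logits_b / t, dim=-1),
                     F.softmax(logits_a / t, dim=-1),
                     reduction="batchmean") * t * t
    return kl_ab + kl_ba
```

In a local update, such a term would be added to the task loss so the two branches regularize each other.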
Eight-input optical programmable logic array enabled by parallel spectrum modulation
Authors: Wenkai Zhang, Bo Wu, Junwei Cheng, Hailong Zhou, Jianji Dong, Dongmei Huang, P. K. A. Wai, Xinliang Zhang
Abstract
Despite over 40 years of development, optical logic computing still struggles to support more than four operands: the high parallelism of light has not been fully leveraged, blocked by the optical nonlinearity and redundant input modulation of existing methods. Here, we propose a scalable multi-input optical programmable logic array (PLA) with minimal logical input, enabled by parallel spectrum modulation. By making full use of the wavelength resource, an eight-input PLA is experimentally demonstrated, with 2^256 possible combinations of generated logic gates. Various complex logic functions, such as an 8-256 decoder, a 4-bit comparator, an adder, and a multiplier, are experimentally demonstrated by leveraging the PLA. The scale of the PLA can be further extended by fully using the dimensions of wavelength and space. As an example, a nine-input PLA is implemented to realize the two-dimensional optical cellular automaton for the first time and perform Conway's Game of Life to simulate the evolutionary process of cells. Our work significantly alleviates the extensibility challenge of optical logic devices, opening up new avenues for future large-scale, high-speed, and energy-efficient optical digital computing.
Abstract
With the performance of deep neural networks (DNNs) remarkably improving, DNNs have been widely used in many areas. Consequently, the DNN model has become a valuable asset, and its intellectual property is safeguarded by ownership verification techniques (e.g., DNN fingerprinting). However, the feasibility of the DNN fingerprint removal attack and its potential influence remain open problems. In this paper, we perform the first comprehensive investigation of DNN fingerprint removal attacks. Generally, the knowledge contained in a DNN model can be categorized into general semantic knowledge and fingerprint-specific knowledge. To this end, we propose a min-max bilevel optimization-based DNN fingerprint removal attack named RemovalNet to evade model ownership verification. The lower-level optimization is designed to remove fingerprint-specific knowledge, while the upper-level optimization distills the victim model's general semantic knowledge to maintain the surrogate model's performance. We conduct extensive experiments to evaluate the fidelity, effectiveness, and efficiency of RemovalNet against four advanced defense methods on six metrics. The empirical results demonstrate that (1) RemovalNet is effective: after our DNN fingerprint removal attack, the model distance between the target and surrogate models is 100x higher than that of the baseline attacks; (2) RemovalNet is efficient: it uses only 0.2% (400 samples) of the substitute dataset and 1,000 iterations to conduct our attack, and, compared with advanced model stealing attacks, saves up to nearly 85% of computational resources; (3) RemovalNet achieves high fidelity: the created surrogate model maintains high accuracy after the DNN fingerprint removal process. Our code is available at: https://github.com/grasses/RemovalNet.
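As a rough illustration of the min-max bilevel idea (not RemovalNet's actual losses), a single update could distill the victim's outputs while pushing the surrogate's internal features away from the victim's; the `.features(x)` accessor is a hypothetical interface:

```python
import torch
import torch.nn.functional as F

def removal_update(surrogate, victim, x, optimizer, beta=0.1):
    with torch.no_grad():
        v_logits = victim(x)
        v_feats = victim.features(x)        # hypothetical accessor
    s_logits = surrogate(x)
    s_feats = surrogate.features(x)
    # Upper level: preserve general semantic knowledge via distillation.
    distill = F.kl_div(F.log_softmax(s_logits, dim=-1),
                       F.softmax(v_logits, dim=-1), reduction="batchmean")
    # Lower level: remove fingerprint-specific knowledge by pushing the
    # surrogate's internal representation away from the victim's.
    repel = -F.mse_loss(s_feats, v_feats)
    loss = distill + beta * repel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```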
Vision Transformer Adapters for Generalizable Multitask Learning
Abstract
We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones. Our project page is at \url{https://ivrl.github.io/VTAGML}.
FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features
Authors: Yufeng Yin, Di Chang, Guoxian Song, Shen Sang, Tiancheng Zhi, Jing Liu, Linjie Luo, Mohammad Soleymani
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Automatic detection of facial Action Units (AUs) allows for objective facial expression analysis. Due to the high cost of AU labeling and the limited size of existing benchmarks, previous AU detection methods tend to overfit the dataset, resulting in a significant performance loss when evaluated across corpora. To address this problem, we propose FG-Net for generalizable facial action unit detection. Specifically, FG-Net extracts feature maps from a StyleGAN2 model pre-trained on a large and diverse face image dataset. Then, these features are used to detect AUs with a Pyramid CNN Interpreter, making the training efficient and capturing essential local features. The proposed FG-Net achieves a strong generalization ability for heatmap-based AU detection thanks to the generalizable and semantic-rich features extracted from the pre-trained generative model. Extensive experiments are conducted to evaluate within- and cross-corpus AU detection with the widely-used DISFA and BP4D datasets. Compared with the state-of-the-art, the proposed method achieves superior cross-domain performance while maintaining competitive within-domain performance. In addition, FG-Net is data-efficient and achieves competitive performance even when trained on 1000 samples. Our code will be released at \url{https://github.com/ihp-lab/FG-Net}
Advanced Simulation Method for Wheel-Terrain Interactions of Space Rovers: A Case Study on the UAE Rashid Rover
Authors: Ahmad Abubakar, Ruqqayya Alhammadi, Yahya Zweiri, Lakmal Seneviratne
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Abstract
A thorough analysis of wheel-terrain interaction is critical to ensure the safe and efficient operation of space rovers on extraterrestrial surfaces like the Moon or Mars. This paper presents an approach for developing and experimentally validating a virtual wheel-terrain interaction model for the UAE Rashid rover. The model aims to improve the fidelity and capability of current simulation methods for space rovers and facilitate the design, evaluation, and control of their locomotion systems. The proposed method considers various factors, such as wheel grouser properties, wheel slippage, loose soil properties, and interaction mechanics. The model accuracy was validated through experiments on a test-rig testbed that simulated lunar soil conditions. Specifically, a set of experiments was carried out to characterize the behavior exerted on a grousered Rashid rover wheel by the lunar soil at slip ratios of 0, 0.25, 0.50, and 0.75. The obtained results demonstrate that the proposed simulation method provides a more accurate and realistic simulation of the wheel-terrain interaction behavior and provides insight into the overall performance of the rover.
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); General Economics (econ.GN)
Abstract
Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.
Optimizing Neural Network Scale for ECG Classification
Authors: Byeong Tak Lee, Yong-Yeon Jo, Joon-Myoung Kwon
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract
We study scaling convolutional neural networks (CNNs), specifically targeting residual neural networks (ResNets), for analyzing electrocardiograms (ECGs). Although ECG signals are time-series data, CNN-based models have been shown to outperform other neural networks with different architectures in ECG analysis. However, most previous studies in ECG analysis have overlooked the importance of network scaling optimization, which significantly improves performance. We explored and demonstrated an efficient approach to scaling ResNets by examining the effects of crucial parameters, including layer depth, the number of channels, and the convolution kernel size. Through extensive experiments, we found that a shallower network, a larger number of channels, and smaller kernel sizes result in better performance for ECG classification. The optimal network scale may differ depending on the target task, but our findings provide insight into obtaining more efficient and accurate models with fewer computing resources or less time. In practice, we demonstrate that a narrower search space based on our findings leads to higher performance.
Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis
Authors: Yuqi Fang, Jinjian Wu, Qianqian Wang, Shijun Qiu, Andrea Bozoki, Huaicheng Yan, Mingxia Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis. Existing studies usually suffer from significant cross-site/domain data heterogeneity caused by site effects such as differences in scanners/protocols. Many methods have been proposed to reduce fMRI heterogeneity between source and target domains, heavily relying on the availability of source data. But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies. To this end, we design a source-free collaborative domain adaptation (SCDA) framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible. Specifically, a multi-perspective feature enrichment method (MFE) is developed for target fMRI analysis, consisting of multiple collaborative branches to dynamically capture fMRI features of unlabeled target data from multiple views. Each branch has a data-feeding module, a spatiotemporal feature encoder, and a class predictor. A mutual-consistency constraint is designed to encourage pair-wise consistency of latent features of the same input generated from these branches for robust representation learning. To facilitate efficient cross-domain knowledge transfer without source data, we initialize MFE using parameters of a pretrained source model. We also introduce an unsupervised pretraining strategy using 3,806 unlabeled fMRIs from three large-scale auxiliary databases, aiming to obtain a general feature encoder. Experimental results on three public datasets and one private dataset demonstrate the efficacy of our method in cross-scanner and cross-study prediction tasks. The model pretrained on large-scale rs-fMRI data has been released to the public.
Incentive Mechanism Design for Federated Learning and Unlearning
Abstract
To protect users' right to be forgotten in federated learning, federated unlearning aims at eliminating the impact of leaving users' data on the globally learned model. Current research in federated unlearning has mainly concentrated on developing effective and efficient unlearning techniques. However, the issue of incentivizing valuable users to remain engaged and preventing their data from being unlearned is still under-explored, yet it is important to the unlearned model's performance. This paper focuses on the incentive issue and develops an incentive mechanism for federated learning and unlearning. We first characterize the leaving users' impact on the global model accuracy and the required communication rounds for unlearning. Building on these results, we propose a four-stage game to capture the interaction and information updates during the learning and unlearning process. A key contribution is to summarize users' multi-dimensional private information into one-dimensional metrics to guide the incentive design. We show that users who incur high costs and experience significant training losses are more likely to discontinue their engagement through federated unlearning. The server tends to retain users who make substantial contributions to the model but faces a trade-off on users' training losses, as large training losses of retained users increase privacy costs but decrease unlearning costs. The numerical results demonstrate the necessity of unlearning incentives for retaining valuable leaving users, and also show that our proposed mechanisms decrease the server's cost by up to 53.91% compared to state-of-the-art benchmarks.
Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval
Authors: Yuan Yuan, Yang Zhan, Zhitong Xiong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a considerable environmental impact. Moreover, as remote sensing (RS) data is constantly being updated, full fine-tuning may not be practical for real-world applications. To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task. To this end, we make the following contributions. 1) We construct a novel and sophisticated PETL framework for the RS image-text retrieval (RSITR) task, which includes the pretrained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with the problem of high intra-modal similarity in RS data, we design a simple yet effective HMMC loss; 3) We provide comprehensive empirical studies for PETL-based RS image-text retrieval. Our results demonstrate that the proposed method is promising and has great potential for practical applications. 4) We benchmark extensive state-of-the-art PETL methods on the RSITR task. Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning, resulting in substantial savings in training costs. Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning. This work can provide new ideas and useful insights for RS vision-language tasks.
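The abstract does not define the HMMC loss; one plausible shape, sketched here purely as an assumption, combines a standard cross-modal InfoNCE objective with an intra-modal penalty that discourages distinct RS images from collapsing onto each other:

```python
import torch
import torch.nn.functional as F

def hybrid_contrastive_loss(img_emb, txt_emb, tau=0.07, lam=0.5):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    labels = torch.arange(img.size(0), device=img.device)
    logits = img @ txt.t() / tau
    # Cross-modal InfoNCE in both directions (image-to-text, text-to-image).
    cross = 0.5 * (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels))
    # Intra-modal term: penalize high similarity between distinct images,
    # targeting the high intra-modal similarity typical of RS data.
    sim = img @ img.t() / tau
    sim.fill_diagonal_(float("-inf"))
    intra = torch.logsumexp(sim, dim=-1).mean()
    return cross + lam * intra
```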
Masked Autoencoders are Efficient Class Incremental Learners
Authors: Jiang-Tian Zhai, Xialei Liu, Andrew D. Bagdanov, Ke Li, Ming-Ming Cheng
Abstract
Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge. We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally designed to learn useful representations through reconstructive unsupervised learning, and they can be easily integrated with a supervised loss for classification. Moreover, MAEs can reliably reconstruct original input images from randomly selected patches, which we use to store exemplars from past tasks more efficiently for CIL. We also propose a bilateral MAE framework to learn from image-level and embedding-level fusion, which produces better-quality reconstructed images and more stable representations. Our experiments confirm that our approach performs better than the state-of-the-art on CIFAR-100, ImageNet-Subset, and ImageNet-Full. The code is available at https://github.com/scok30/MAE-CIL.
Not Only Rewards But Also Constraints: Applications on Legged Robot Locomotion
Authors: Yunho Kim, Hyunsik Oh, Jeonghyun Lee, Jinhyeok Choi, Gwanghyeon Ji, Moonkyu Jung, Donghoon Youm, Jemin Hwangbo
Abstract
Several earlier studies have shown impressive control performance in complex robotic systems by designing the controller using a neural network and training it with model-free reinforcement learning. However, these outstanding controllers with natural motion style and high task performance are developed through extensive reward engineering, which is a highly laborious and time-consuming process of designing numerous reward terms and determining suitable reward coefficients. In this work, we propose a novel reinforcement learning framework, consisting of both rewards and constraints, for training neural network controllers for complex robotic systems. To let engineers appropriately reflect their intent in the constraints and handle them with minimal computation overhead, two constraint types and an efficient policy optimization algorithm are suggested. The learning framework is applied to train locomotion controllers for several legged robots with different morphologies and physical attributes to traverse challenging terrains. Extensive simulation and real-world experiments demonstrate that performant controllers can be trained with significantly less reward engineering, by tuning only a single reward coefficient. Furthermore, a more straightforward and intuitive engineering process can be utilized, thanks to the interpretability and generalizability of constraints. The summary video is available at https://youtu.be/KAlm3yskhvM.
SC-PSRO: A Unified Strategy Learning Method for Normal-form Games
Abstract
Solving for a Nash equilibrium is the key challenge in normal-form games with large strategy spaces, where the open-ended learning framework provides an efficient approach. Previous studies invariably employ diversity as a conduit to foster the advancement of strategies. Nevertheless, diversity-based algorithms only work in zero-sum games with cyclic dimensions, which limits their applicability. Here, we propose an innovative unified open-ended learning framework, SC-PSRO, i.e., Self-Confirming Policy Space Response Oracle, as a general framework for both zero-sum and general-sum games. In particular, we introduce the advantage function as an improved evaluation metric for strategies, allowing for a unified learning objective for agents in normal-form games. Concretely, SC-PSRO comprises three key components: 1) a Diversity Module, which prevents strategies from being constrained by the cyclic structure; 2) a LookAhead Module, devised to promote strategies along the transitive dimension, which is theoretically guaranteed to learn strategies in the direction of the Nash equilibrium; and 3) a Confirming-based Population Clipping Module, designed to tackle the equilibrium selection problem in general-sum games, which can be applied to learn equilibria with optimal rewards and, to our knowledge, is the first such improvement for general-sum games. Our experiments indicate that SC-PSRO achieves a considerable decrease in exploitability in zero-sum games and an increase in rewards in general-sum games, markedly surpassing previous methods. Code will be released upon acceptance.
Experience with Distributed Memory Delaunay-based Image-to-Mesh Conversion Implementation
Abstract
This paper presents some of our findings on the scalability of parallel 3D mesh generation on distributed memory machines. The primary objective of this study was to evaluate a distributed memory approach for implementing a 3D parallel Delaunay-based algorithm that converts images to meshes by leveraging an efficient shared memory implementation. The secondary objective was to evaluate the effectiveness of reusing existing labor (i.e., reducing development time) while introducing minimal overheads, so as to maintain the parallel efficiency of the end product, i.e., the distributed implementation. The distributed algorithm utilizes two existing and independently developed parallel Delaunay-based methods: (1) a fine-grained method that employs multi-threading and speculative execution on shared memory nodes, and (2) a loosely coupled Delaunay-refinement framework for multi-node platforms. The shared memory implementation uses a FIFO work-sharing scheme for thread scheduling, while the distributed memory implementation utilizes MPI and the Master-Worker (MW) model. The findings from the specific MPI-MW implementation we tested suggest that (1) execution on 40 cores, not necessarily in the same single node, is 2.3 times faster than execution on ten cores, and (2) the best speedup is 5.4, obtained with 180 cores, again compared with the best performance on ten cores. A closer look at the performance of the distributed memory and shared memory implementations executing on a single node (40 cores) suggests that the overheads introduced in the MPI-MW implementation are high and render it 4 times slower than the shared memory code using the same number of cores. These findings raise several questions about the potential scalability of a "black box" approach, i.e., re-using a code designed to execute efficiently on shared memory machines without considering its potential use in a distributed memory setting.
Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions
Authors: Kwan Ho Ryan Chan, Aditya Chattopadhyay, Benjamin David Haeffele, Rene Vidal
Abstract
Variational Information Pursuit (V-IP) is a framework for making predictions that are interpretable by design, by sequentially selecting a short chain of task-relevant, user-defined, and interpretable queries about the data that are most informative for the task. While this allows for built-in interpretability in predictive models, applying V-IP to any task requires data samples with dense concept labeling by domain experts, limiting its application to small-scale tasks where manual data annotation is feasible. In this work, we extend the V-IP framework with Foundational Models (FMs) to address this limitation. More specifically, we use a two-step process: we first leverage Large Language Models (LLMs) to generate a sufficiently large candidate set of task-relevant interpretable concepts, then use Large Multimodal Models to annotate each data sample by semantic similarity with each concept in the generated set. While other interpretable-by-design frameworks, such as Concept Bottleneck Models (CBMs), require an additional step of removing repetitive and non-discriminative concepts to achieve good interpretability and test performance, we mathematically and empirically justify that, with a sufficiently informative and task-relevant query (concept) set, the proposed FM+V-IP method does not require any type of concept filtering. In addition, we show that FM+V-IP with LLM-generated concepts can achieve better test performance than V-IP with human-annotated concepts, demonstrating the effectiveness of LLMs at generating efficient query sets. Finally, compared to other interpretable-by-design frameworks such as CBMs, FM+V-IP achieves competitive test performance using fewer concepts/queries, with both filtered and unfiltered concept sets.
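The annotation step reduces to scoring each sample against each generated concept in a shared embedding space. A minimal sketch, assuming sample and concept embeddings from some multimodal encoder are already available:

```python
import numpy as np

def annotate_with_concepts(sample_embs, concept_embs, threshold=0.25):
    """Binary concept labels (and raw scores) by cosine similarity;
    the threshold is an illustrative placeholder."""
    s = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    scores = s @ c.T                     # (n_samples, n_concepts)
    return (scores > threshold).astype(np.float32), scores
```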
HR-Pro: Point-supervised Temporal Action Localization via Hierarchical Reliability Propagation
Abstract
Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet level or the instance level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning, both of which explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class. We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representations. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries for predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully supervised methods. Code will be available at https://github.com/pipixin321/HR-Pro.
Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization
Authors: Songchun Zhang, Chunhui Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually and thereby exploiting only limited contextual information. As a result, the model lacks a comprehensive understanding (e.g., appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances using weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning a more structured and compact embedding space and thus reducing ambiguity in classification learning. Further, the GKSA module efficiently summarizes and propagates cross-video representative action knowledge in a learnable manner to promote a holistic understanding of action patterns, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods and can be easily plugged into other WSTAL methods.
Try with Simpler -- An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection
Authors: Lin Yang, Junjie Chen, Zhihao Gong, Shutao Gao, Hongyu Zhang, Yue Kang, Huaan Li
Abstract
The rapid growth of deep learning (DL) has spurred interest in enhancing log-based anomaly detection. This approach aims to extract meaning from log events (log message templates) and develop advanced DL models for anomaly detection. However, these DL methods face challenges like heavy reliance on training data, labels, and computational resources due to model complexity. In contrast, traditional machine learning and data mining techniques are less data-dependent and more efficient but less effective than DL. To make log-based anomaly detection more practical, the goal is to enhance traditional techniques to match DL's effectiveness. Previous research in a different domain (linking questions on Stack Overflow) suggests that optimized traditional techniques can rival state-of-the-art DL methods. Drawing inspiration from this concept, we conducted an empirical study. We optimized the unsupervised PCA (Principal Component Analysis), a traditional technique, by incorporating lightweight semantic-based log representation. This addresses the issue of unseen log events in training data, enhancing log representation. Our study compared seven log-based anomaly detection methods, including four DL-based, two traditional, and the optimized PCA technique, using public and industrial datasets. Results indicate that the optimized unsupervised PCA technique achieves similar effectiveness to advanced supervised/semi-supervised DL methods while being more stable with limited training data and resource-efficient. This demonstrates the adaptability and strength of traditional techniques through small yet impactful adaptations.
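For reference, the unsupervised PCA detector at the core of the study is classical: fit a principal subspace on (here, semantics-based) log-event vectors and flag events whose residual energy is large. A minimal sketch with an empirical threshold, not the paper's exact configuration:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_detector(train_vectors, n_components=0.95):
    """Fit the principal subspace on normal data; anomalies are points
    with large squared prediction error (SPE) outside that subspace."""
    pca = PCA(n_components=n_components).fit(train_vectors)
    recon = pca.inverse_transform(pca.transform(train_vectors))
    spe = ((train_vectors - recon) ** 2).sum(axis=1)
    threshold = np.percentile(spe, 99)   # simple empirical cutoff
    return pca, threshold

def is_anomaly(pca, threshold, vectors):
    recon = pca.inverse_transform(pca.transform(vectors))
    spe = ((vectors - recon) ** 2).sum(axis=1)
    return spe > threshold
```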
Hydrogen jet diffusion modeling by using physics-informed graph neural network and sparsely-distributed sensor data
Abstract
Efficient modeling of jet diffusion during accidental release is critical for operation and maintenance management of hydrogen facilities. Deep learning has proven effective for concentration prediction in gas jet diffusion scenarios. Nonetheless, its reliance on extensive simulations as training data and its potential disregard for physical laws limit its applicability to unseen accidental scenarios. Recently, physics-informed neural networks (PINNs) have emerged to reconstruct spatial information by using data from sparsely-distributed sensors, which are easily collected in real-world applications. However, prevailing approaches use a fully-connected neural network as the backbone without considering the spatial dependency of sensor data, which reduces the accuracy of concentration prediction. This study introduces a physics-informed graph deep learning approach (Physic_GNN) for efficient and accurate hydrogen jet diffusion prediction using sparsely-distributed sensor data. A graph neural network (GNN) is used to model the spatial dependency of the sensor data via graph nodes, at which the governing equations describing the physical law of hydrogen jet diffusion are directly solved. The computed residuals are then applied to constrain the training process. Public experimental data of hydrogen jets is used to compare the accuracy and efficiency of our proposed approach, Physic_GNN, against state-of-the-art PINN. The results demonstrate that Physic_GNN achieves higher accuracy and physical consistency in centerline concentration prediction from sparse measurements than PINN, while being more efficient than OpenFOAM. The proposed approach enables accurate and robust real-time spatial consequence reconstruction and analysis of the underlying physical mechanisms using sparse sensor data.
An EPTAS for Cardinality Constrained Multiple Knapsack via Iterative Randomized Rounding
Authors: Ilan Doron-Arad, Ariel Kulik, Hadas Shachnai
Abstract
We study the Uniform Cardinality Constrained Multiple Knapsack problem (CMK), a natural generalization of Multiple Knapsack with applications ranging from cloud computing to radio networks. The input is a set of items, each has a value and a weight, and a set of uniform capacity bins. The goal is to assign a subset of the items of maximum total value to the bins such that $(i)$ the capacity of any bin is not exceeded, and $(ii)$ the number of items assigned to each bin satisfies a given cardinality constraint. The best known approximation ratio for CMK is $1-\frac{\ln (2)}{2} -\epsilon \approx 0.653$, which follows from a result for a generalization of the problem. Our main contribution is an efficient polynomial time approximation scheme (EPTAS) for CMK. This essentially resolves the complexity status of the problem, since the existence of a fully polynomial time approximation scheme (FPTAS) is ruled out. Our technique is based on the following simple algorithm: in each iteration, solve a configuration linear program (LP) of the problem; then, sample configurations (i.e., feasible subsets of items for a single bin) according to a distribution specified by the LP solution. The algorithm terminates once each bin is assigned a configuration. We believe that our generic technique may lead to efficient approximations for other assignment problems.
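The abstract's algorithm is simple enough to sketch directly; the configuration LP itself is abstracted behind a hypothetical `solve_config_lp` that returns fractional values for feasible single-bin configurations of the residual instance:

```python
import random

def iterative_randomized_rounding(solve_config_lp, bins, seed=0):
    """Rounding loop from the abstract. solve_config_lp(remaining_bins,
    assignment) is assumed to solve the configuration LP on the residual
    instance and return {configuration: fractional value}."""
    rng = random.Random(seed)
    assignment, remaining = {}, set(bins)
    while remaining:
        x = solve_config_lp(remaining, assignment)
        r, acc = rng.uniform(0, sum(x.values())), 0.0
        for config, val in x.items():    # sample proportionally to x
            acc += val
            if acc >= r:
                assignment[remaining.pop()] = config
                break
    return assignment
```

Removing the sampled configuration's items from the residual instance is left inside `solve_config_lp` in this sketch.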
Capacity Analysis and Throughput Maximization of NOMA with Nonlinear Power Amplifier Distortion
Abstract
In future B5G/6G broadband communication systems, non-linear signal distortion caused by the impairment of the transmit power amplifier (PA) can severely degrade communication performance, especially when uplink users share the wireless medium using non-orthogonal multiple access (NOMA) schemes. This is because the successive interference cancellation (SIC) decoding technique used in NOMA is incapable of eliminating the interference caused by PA distortion. Consequently, each user's decoding process suffers from the cumulative distortion noise of all uplink users. In this paper, we establish a new and tractable PA distortion signal model based on real-world measurements, where the distortion noise power is a polynomial function of the PA transmit power, diverging from the oversimplified linear function commonly employed in existing studies. Applying the proposed signal model, we characterize the capacity rate region of multi-user uplink NOMA by optimizing the user transmit power. Our findings reveal a significant contraction in the capacity region of NOMA, attributable to polynomial distortion noise power. For practical engineering applications, we formulate a general weighted sum rate maximization (WSRMax) problem under individual user rate constraints. We further propose an efficient power control algorithm to attain the optimal performance. Numerical results show that the optimal power control policy under the proposed non-linear PA model achieves on average 13\% higher throughput compared to policies assuming an ideal linear PA model. Overall, our findings demonstrate the importance of accurate PA distortion modeling to the performance of NOMA and provide an efficient optimal power control method accordingly.
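To make the modeling distinction concrete, the contrast between the common linear distortion model and a polynomial one of the kind described here can be written as follows; the coefficients and the approximate SINR form are illustrative placeholders, not expressions from the paper:

```latex
% Linear model common in prior studies vs. a polynomial form;
% \kappa and \beta_k are placeholder coefficients fitted from PA
% measurements.
P_{\mathrm{dist}}^{\mathrm{lin}}(P_t) = \kappa\,P_t,
\qquad
P_{\mathrm{dist}}^{\mathrm{poly}}(P_t) = \sum_{k=1}^{K} \beta_k\,P_t^{\,k},
\qquad
\mathrm{SINR}_i \approx
  \frac{g_i P_i}{\sigma^2 + \sum_{j} g_j\,P_{\mathrm{dist}}^{\mathrm{poly}}(P_j)}.
```

The sum over all users in the denominator reflects the abstract's point that SIC cancels data signals but leaves every user's distortion noise uncancelled.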
Master-slave Deep Architecture for Top-K Multi-armed Bandits with Non-linear Bandit Feedback and Diversity Constraints
Authors: Hanchi Huang, Li Shen, Deheng Ye, Wei Liu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract
We propose a novel master-slave architecture to solve the top-$K$ combinatorial multi-armed bandits problem with non-linear bandit feedback and diversity constraints, which, to the best of our knowledge, is the first combinatorial bandits setting to consider diversity constraints under bandit feedback. Specifically, to efficiently explore the combinatorial and constrained action space, we introduce six slave models with distinct merits to generate diversified samples that balance rewards and constraints as well as efficiency. Moreover, we propose teacher-learning-based optimization and a policy co-training technique to boost the performance of the multiple slave models. The master model then collects the elite samples provided by the slave models and selects the best sample, as estimated by a neural contextual UCB-based network, to make a decision with a trade-off between exploration and exploitation. Thanks to the elaborate design of the slave models, the co-training mechanism among them, and the novel interactions between the master and slave models, our approach significantly surpasses existing state-of-the-art algorithms on both synthetic and real datasets for recommendation tasks. The code is available at: \url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
The key to the enhanced performance of slab-like topologically interlocked structures with non-planar blocks
Authors: Ioannis Koureas, Mohit Pundir, Shai Feldfogel, David S. Kammer
Abstract
Topologically interlocked structures are assemblies of interlocking blocks that hold together solely through contact. Such structures have been shown to exhibit high strength, energy dissipation, and crack arrest properties. Recent studies on beam-like topologically interlocked structures have shown that, with non-planar blocks, it is possible to reach levels of strength and work-to-failure which are otherwise possible only with unrealistically high friction coefficients. While non-planar blocks have been extensively used for slab-like assemblies, many questions in that context are still not fully understood. Specifically, it is unclear which characteristics of non-planar surface morphologies produce the enhanced mechanical response of slab-like assemblies. In addition, it is unclear whether slab-like structures with non-planar-surface blocks can reach a saturated response with realistic friction coefficient values, as is the case with beam-like ones. Here, we investigate these fundamental questions using numerical simulations. We show that, by using non-planar blocks, it is possible to reach saturation of the response capacity of the structure with a realistic friction coefficient. Furthermore, we show that the key morphology parameter responsible for the enhanced performance is the local angle of inclination at the top of the loaded block. Lastly, we show that non-planar morphologies lead to improved work-to-failure and ultimate deflection, which cannot be attained with planar-faced blocks. These findings shed new light on topologically interlocked structures with non-planar blocks, allowing for a better understanding of their strengths and potential applications.
An Efficient Data Analysis Method for Big Data using Multiple-Model Linear Regression
Abstract
This paper introduces a new data analysis method for big data using a newly defined regression model named multiple-model linear regression (MMLR), which separates input datasets into subsets and constructs local linear regression models for them. The proposed data analysis method is shown to be more efficient and flexible than other regression-based methods. This paper also proposes an approximate algorithm to construct MMLR models based on the $(\epsilon,\delta)$-estimator, and gives mathematical proofs of the correctness and efficiency of the MMLR algorithm, whose time complexity is linear with respect to the size of the input dataset. This paper also empirically evaluates the method on both synthetic and real-world datasets; the algorithm shows comparable performance to existing regression methods in many cases, while taking almost the shortest time to provide high prediction accuracy.
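A toy rendition of the MMLR idea, assuming a k-means partition of the input space; the paper's $(\epsilon,\delta)$-estimator-based construction is not reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class MMLRSketch:
    """Partition the input space, then fit one local linear model per
    subset; prediction routes each point to its subset's model."""
    def __init__(self, n_models=4):
        self.n_models = n_models

    def fit(self, X, y):
        self.parts = KMeans(n_clusters=self.n_models, n_init=10).fit(X)
        self.models = [
            LinearRegression().fit(X[self.parts.labels_ == k],
                                   y[self.parts.labels_ == k])
            for k in range(self.n_models)
        ]
        return self

    def predict(self, X):
        labels = self.parts.predict(X)
        out = np.empty(len(X))
        for k, model in enumerate(self.models):
            mask = labels == k
            if mask.any():
                out[mask] = model.predict(X[mask])
        return out
```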
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models
Abstract
Instruction tuning is instrumental in enabling Large Language Models~(LLMs) to follow user instructions to complete various open-domain tasks. The success of instruction tuning depends on the availability of high-quality instruction data. Owing to the exorbitant cost and substandard quality of human annotation, recent works have been deeply engaged in the exploration of the utilization of powerful closed-source models to generate instruction data automatically. However, these methods carry potential risks arising from the usage requirements of powerful closed-source models, which strictly forbid the utilization of their outputs to develop machine learning models. To deal with this problem, in this work, we explore alternative approaches to generate high-quality instruction data that do not rely on closed-source models. Our exploration includes an investigation of various existing instruction generation methods, culminating in the integration of the most efficient variant with two novel strategies to enhance the quality further. Evaluation results from two benchmarks and the GPT-4 model demonstrate the effectiveness of our generated instruction data, which can outperform Alpaca, a method reliant on closed-source models. We hope that more progress can be achieved in generating high-quality instruction data without using closed-source models.
DeepLOC: Deep Learning-based Bone Pathology Localization and Classification in Wrist X-ray Images
Authors: Razan Dibo, Andrey Galichin, Pavel Astashev, Dmitry V. Dylov, Oleg Y. Rogov
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
In recent years, computer-aided diagnosis systems have shown great potential in assisting radiologists with accurate and efficient medical image analysis. This paper presents a novel approach for bone pathology localization and classification in wrist X-ray images using a combination of YOLO (You Only Look Once) and the Shifted Window Transformer (Swin) with a newly proposed block. The proposed methodology addresses two critical challenges in wrist X-ray analysis: accurate localization of bone pathologies and precise classification of abnormalities. The YOLO framework is employed to detect and localize bone pathologies, leveraging its real-time object detection capabilities. Additionally, the Swin, a transformer-based module, is utilized to extract contextual information from the localized regions of interest (ROIs) for accurate classification.
FastSurfer-HypVINN: Automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI
Authors: Santiago Estrada, David Kügler, Emad Bahrami, Peng Xu, Dilshad Mousa, Monique M.B. Breteler, N. Ahmad Aziz, Martin Reuter
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The hypothalamus plays a crucial role in the regulation of a broad range of physiological, behavioural, and cognitive functions. However, despite its importance, only a few small-scale neuroimaging studies have investigated its substructures, likely due to the lack of fully automated segmentation tools to address scalability and reproducibility issues of manual segmentation. While the only previous attempt to automatically sub-segment the hypothalamus with a neural network showed promise for 1.0 mm isotropic T1-weighted (T1w) MRI, there is a need for an automated tool that also sub-segments high-resolutional (HiRes) MR scans, as these are becoming widely available and include structural detail from multi-modal MRI as well. We therefore introduce a novel, fast, and fully automated deep learning method named HypVINN for sub-segmentation of the hypothalamus and adjacent structures on 0.8 mm isotropic T1w and T2w brain MR images that is robust to missing modalities. We extensively validate our model with respect to segmentation accuracy, generalizability, in-session test-retest reliability, and sensitivity to replicate hypothalamic volume effects (e.g., sex differences). The proposed method exhibits high segmentation performance both for standalone T1w images and for T1w/T2w image pairs. Even with the additional capability to accept flexible inputs, our model matches or exceeds the performance of state-of-the-art methods with fixed inputs. We further demonstrate the generalizability of our method in experiments with 1.0 mm MR scans from both the Rhineland Study and the UK Biobank. Finally, HypVINN can perform the segmentation in less than a minute (GPU) and will be available in the open source FastSurfer neuroimaging software suite, offering a validated, efficient, and scalable solution for evaluating imaging-derived phenotypes of the hypothalamus.
Human Comprehensible Active Learning of Genome-Scale Metabolic Networks
Authors: Lun Ai, Shi-Shun Liang, Wang-Zhou Dai, Liam Hallett, Stephen H. Muggleton, Geoff S. Baldwin
Abstract
An important application of Synthetic Biology is the engineering of the host cell system to yield useful products. However, an increase in the scale of the host system leads to huge design space and requires a large number of validation trials with high experimental costs. A comprehensible machine learning approach that efficiently explores the hypothesis space and guides experimental design is urgently needed for the Design-Build-Test-Learn (DBTL) cycle of the host cell system. We introduce a novel machine learning framework ILP-iML1515 based on Inductive Logic Programming (ILP) that performs abductive logical reasoning and actively learns from training examples. In contrast to numerical models, ILP-iML1515 is built on comprehensible logical representations of a genome-scale metabolic model and can update the model by learning new logical structures from auxotrophic mutant trials. The ILP-iML1515 framework 1) allows high-throughput simulations and 2) actively selects experiments that reduce the experimental cost of learning gene functions in comparison to randomly selected experiments.
Constructive Interference based Block-Level Precoding for Scene Expansion: Closed-Form Solutions
Authors: Yiran Wang, Ang Li, Yunsi Wen, Xiaoyan Hu
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
We study closed-form constructive interference based block-level precoding (CI-BLP) for scene expansion in the downlink of multi-user multiple-input single-output (MU-MISO) systems. We extend the analysis of CI-BLP to the case where the number of symbol slots in a block is smaller than the number of users. To this end, we mathematically prove the feasibility of using the pseudo-inverse to express the closed-form expression of the CI-BLP optimal precoding matrix. Building upon this, a quadratic programming (QP) optimization over the simplex is obtained without being limited by the relationship between the number of symbol slots in a block and the number of users. We then study low-complexity algorithms for the resulting large-scale QP problem. We first mathematically derive the rank of the quadratic coefficient matrix in the QP problem. Although the iterative closed-form algorithm for QP problems in CI-based symbol-level precoding (CI-SLP) can be used in certain scenarios, its complexity is impractical for large-scale QP problems. We therefore design a low-complexity algorithm based on the alternating direction method of multipliers (ADMM), which can efficiently solve large-scale QP problems. We further analyze the convergence and complexity of the proposed algorithm. Numerical results validate our analysis and the optimality of the proposed algorithm, and further show that the proposed algorithm offers a flexible performance-complexity tradeoff by limiting the maximum number of iterations, which motivates the use of CI-BLP in practical wireless systems.
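A generic ADMM sketch for a QP over the probability simplex, the problem class described here; it splits the quadratic objective from the simplex constraint and is not the paper's exact algorithm:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x >= 0, sum x = 1} (Duchi et al.)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def admm_qp_simplex(Q, q, rho=1.0, iters=200):
    """min 0.5 x'Qx + q'x  s.t.  x in simplex, via the split x = z."""
    n = len(q)
    x = z = np.ones(n) / n
    u = np.zeros(n)
    A = Q + rho * np.eye(n)      # system matrix is fixed across iterations
    for _ in range(iters):
        x = np.linalg.solve(A, rho * (z - u) - q)   # quadratic step
        z = project_simplex(x + u)                  # constraint step
        u = u + x - z                               # dual update
    return z
```

Capping `iters` gives exactly the performance-complexity tradeoff mentioned above.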
Reinforcement learning informed evolutionary search for autonomous systems testing
Abstract
Evolutionary search-based techniques are commonly used for testing autonomous robotic systems. However, these approaches often rely on computationally expensive simulator-based models for test scenario evaluation. To improve the computational efficiency of the search-based testing, we propose augmenting the evolutionary search (ES) with a reinforcement learning (RL) agent trained using surrogate rewards derived from domain knowledge. In our approach, known as RIGAA (Reinforcement learning Informed Genetic Algorithm for Autonomous systems testing), we first train an RL agent to learn useful constraints of the problem and then use it to produce a certain part of the initial population of the search algorithm. By incorporating an RL agent into the search process, we aim to guide the algorithm towards promising regions of the search space from the start, enabling more efficient exploration of the solution space. We evaluate RIGAA on two case studies: maze generation for an autonomous ant robot and road topology generation for an autonomous vehicle lane keeping assist system. In both case studies, RIGAA converges faster to fitter solutions and produces a better test suite (in terms of average test scenario fitness and diversity). RIGAA also outperforms the state-of-the-art tools for vehicle lane keeping assist system testing, such as AmbieGen and Frenetic.
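The seeding step is easy to illustrate: a fraction of the initial GA population comes from the trained RL agent, while the rest stays random to preserve diversity. `rl_agent.generate` and `random_scenario` are hypothetical stand-ins:

```python
import random

def initial_population(rl_agent, pop_size, rl_fraction=0.3,
                       random_scenario=None, seed=0):
    """Mix RL-generated test scenarios into the GA's initial population;
    the fraction is an illustrative knob, not RIGAA's actual setting."""
    rng = random.Random(seed)
    n_rl = int(rl_fraction * pop_size)
    population = [rl_agent.generate() for _ in range(n_rl)]
    population += [random_scenario(rng) for _ in range(pop_size - n_rl)]
    rng.shuffle(population)
    return population
```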
Towards Communication-Efficient Model Updating for On-Device Session-Based Recommendation
Abstract
On-device recommender systems recently have garnered increasing attention due to their advantages of providing prompt response and securing privacy. To stay current with evolving user interests, cloud-based recommender systems are periodically updated with new interaction data. However, on-device models struggle to retrain themselves because of limited onboard computing resources. As a solution, we consider the scenario where the model retraining occurs on the server side and then the updated parameters are transferred to edge devices via network communication. While this eliminates the need for local retraining, it incurs a regular transfer of parameters that significantly taxes network bandwidth. To mitigate this issue, we develop an efficient approach based on compositional codes to compress the model update. This approach ensures the on-device model is updated flexibly with minimal additional parameters whilst utilizing previous knowledge. The extensive experiments conducted on multiple session-based recommendation models with distinctive architectures demonstrate that the on-device model can achieve comparable accuracy to the retrained server-side counterpart through transferring an update 60x smaller in size. The codes are available at \url{https://github.com/xiaxin1998/ODUpdate}.
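Compositional codes in the spirit described here can be sketched as residual vector quantization: each parameter (or embedding) row is transmitted as a handful of byte-sized codebook indices rather than floats. Codebook learning, the substantive part, is assumed to happen server-side:

```python
import numpy as np

def encode_update(vectors, codebooks):
    """Greedy residual encoding: each d-dim row becomes M small integer
    codes (one per codebook), so the device downloads indices plus the
    shared codebooks instead of full-precision parameters."""
    residual = vectors.astype(np.float64).copy()
    codes = []
    for C in codebooks:                        # C: (K, d), K <= 256
        d2 = ((residual[:, None, :] - C[None]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        codes.append(idx.astype(np.uint8))     # one byte per code
        residual -= C[idx]
    return np.stack(codes, axis=1)             # (n, M) byte matrix

def decode_update(codes, codebooks):
    return sum(C[codes[:, m]] for m, C in enumerate(codebooks))
```

With, say, M=8 codebooks over 512-dim float32 rows, the per-row payload drops from 2048 bytes to 8, which is the kind of reduction that makes a 60x smaller update plausible.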
A Riemannian optimization method to compute the nearest singular pencil
Authors: Froilán Dopico, Vanni Noferini, Lauri Nyman
Abstract
Given a square pencil $A+ \lambda B$, where $A$ and $B$ are complex matrices, we consider the problem of finding the singular pencil nearest to it in the Frobenius distance. This problem is known to be very difficult, and the few algorithms available in the literature can only deal efficiently with pencils of very small size. We show that the problem is equivalent to minimizing a certain objective function over the Riemannian manifold $SU(n) \times SU(n)$, where $SU(n)$ denotes the special unitary group. With minor modifications, the same approach extends to the case of finding a nearest singular pencil with a specified minimal index. This novel perspective is based on the generalized Schur form of pencils, and yields a competitive numerical method, by pairing it with an algorithm capable of doing optimization on a Riemannian manifold. We provide numerical experiments that show that the resulting method allows us to deal with pencils of much larger size than alternative techniques, yielding candidate minimizers of comparable or better quality. In the course of our analysis, we also obtain a number of new theoretical results related to the generalized Schur form of a (regular or singular) square pencil and to the minimal index of a singular square pencil whose nullity is $1$.
DiCA: A Hardware-Software Co-Design for Differential Check-Pointing in Intermittently Powered Devices
Authors: Antonio Joia Neto, Adam Caulfield, Chistabelle Alvares, Ivan De Oliveira Nunes
Abstract
Intermittently powered devices rely on opportunistic energy-harvesting to function, leading to recurrent power interruptions. This paper introduces DiCA, a proposal for a hardware/software co-design to create differential check-points in intermittent devices. DiCA leverages an affordable hardware module that simplifies the check-pointing process, reducing the check-point generation time and energy consumption. This hardware module continuously monitors volatile memory, efficiently tracking modifications and determining optimal check-point times. To minimize energy waste, the module dynamically estimates the energy required to create and store the check-point based on tracked memory modifications, triggering the check-pointing routine optimally via a nonmaskable interrupt. Experimental results show the cost-effectiveness and energy efficiency of DiCA, enabling extended application activity cycles in intermittently powered embedded devices.
Short Run Transit Route Planning Decision Support System Using a Deep Learning-Based Weighted Graph
Authors: Nadav Shalit, Michael Fire, Dima Kagan, Eran Ben-Elia
Abstract
Public transport routing plays a crucial role in transit network design, ensuring a satisfactory level of service for passengers. However, current routing solutions rely on traditional operational research heuristics, which can be time-consuming to implement and lack the ability to provide quick solutions. Here, we propose a novel deep learning-based methodology for a decision support system that enables public transport (PT) planners to rapidly identify short-term route improvements. By seamlessly adjusting specific sections of routes between two stops during specific times of the day, our method effectively reduces travel times and enhances PT services. Leveraging diverse data sources such as GTFS and smart card data, we extract features and model the transportation network as a directed graph. Using self-supervision, we train a deep learning model to predict lateness values for road segments. These lateness values are then utilized as edge weights in the transportation graph, enabling efficient path searching. Through an evaluation of the method on Tel Aviv, we were able to reduce times on more than 9\% of the routes. The improved routes included both intraurban and suburban routes, highlighting the model's versatility. The findings emphasize the potential of our data-driven decision support system to enhance public transport and city logistics, promoting greater efficiency and reliability in PT services.
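The path-search stage reduces to a weighted shortest-path query over the lateness-weighted graph. A minimal sketch using networkx, with a hypothetical `lateness_model.predict` interface:

```python
import networkx as nx

def fastest_route(segments, lateness_model, origin, destination):
    """Build a directed graph whose edge weights are the model's
    predicted lateness per road segment, then search for the best path."""
    G = nx.DiGraph()
    for u, v, feats in segments:               # (stop, stop, features)
        G.add_edge(u, v, weight=float(lateness_model.predict(feats)))
    return nx.shortest_path(G, origin, destination, weight="weight")
```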
Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques
Abstract
Plagiarism detection is one of the most researched areas in the Natural Language Processing (NLP) community. A good plagiarism detector covers all the NLP methods, including semantics, named entities, and paraphrases, and produces detailed plagiarism reports. Detecting cross-lingual plagiarism requires deep knowledge of various advanced methods and algorithms to perform effective text similarity checking. Nowadays, plagiarists are also finding new ways to avoid being caught, evading detection with techniques such as paraphrasing, synonym replacement, mismatched citations, and translation from one language to another. Image Content Plagiarism Detection (ICPD) has gained importance, utilizing advanced image content processing to identify instances of plagiarism and ensure the integrity of image content. The issue of plagiarism extends beyond textual content, as images such as figures, graphs, and tables can also be plagiarized. However, image content plagiarism detection remains an unaddressed challenge, so there is a critical need to develop methods and systems for detecting plagiarism in image content. In this paper, a system has been implemented to detect plagiarism from the contents of images such as figures, graphs, and tables. Alongside statistical algorithms such as Jaccard and cosine similarity, the introduced semantic algorithms such as LSA, BERT, and WordNet outperform them in detecting plagiarism efficiently and accurately.
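For reference, a minimal sketch of the two statistical similarity measures named above (Jaccard over token sets, cosine over term-count vectors); the semantic components (LSA, BERT, WordNet) would replace these sparse vectors with learned representations.

```python
# Bag-of-words Jaccard and cosine similarity between two text snippets.
from collections import Counter
import math

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

print(jaccard("the results in table one", "results shown in table two"))  # 3/7
print(cosine("the results in table one", "results shown in table two"))
```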
Abstract
Fast adversarial training (FAT) is beneficial for improving the adversarial robustness of neural networks. However, previous FAT work has encountered a significant issue known as catastrophic overfitting when dealing with large perturbation budgets, i.e., the adversarial robustness of models declines to near zero during training. To address this, we analyze the training process of prior FAT work and observe that catastrophic overfitting is accompanied by the appearance of loss convergence outliers. We therefore argue that a moderately smooth loss convergence process yields a stable FAT process that resolves catastrophic overfitting. To obtain a smooth loss convergence process, we propose a novel oscillatory constraint (dubbed ConvergeSmooth) to limit the loss difference between adjacent epochs. The convergence stride of ConvergeSmooth is introduced to balance convergence and smoothing. Likewise, we design weight centralization without introducing additional hyperparameters other than the loss balance coefficient. Our proposed methods are attack-agnostic and thus can improve the training stability of various FAT techniques. Extensive experiments on popular datasets show that the proposed methods efficiently avoid catastrophic overfitting and outperform all previous FAT methods. Code is available at \url{https://github.com/FAT-CS/ConvergeSmooth}.
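One plausible reading of the oscillatory constraint, sketched below under our own assumptions (the paper's exact mechanism may differ): bound how far the current loss may drift above the previous epoch's average loss by the convergence stride, rescaling rather than clamping so gradients still flow.

```python
import torch

def converge_smooth(loss: torch.Tensor, prev_epoch_loss: float, stride: float) -> torch.Tensor:
    # Hypothetical illustration: if the adversarial loss exceeds the previous
    # epoch's average by more than the stride, rescale it back to the bound.
    bound = prev_epoch_loss + stride
    if loss.item() > bound:
        loss = loss * (bound / loss.item())
    return loss
```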
Auto-weighted Bayesian Physics-Informed Neural Networks and robust estimations for multitask inverse problems in pore-scale imaging of dissolution
Abstract
In this article, we present a novel data assimilation strategy in pore-scale imaging and demonstrate that this makes it possible to robustly address reactive inverse problems incorporating Uncertainty Quantification (UQ). Pore-scale modeling of reactive flow offers a valuable opportunity to investigate the evolution of macro-scale properties subject to dynamic processes. Yet, such models suffer from imaging limitations arising from the associated X-ray microtomography (X-ray microCT) process, which induces discrepancies in the property estimates. Assessment of the kinetic parameters also raises challenges, as reactive coefficients are critical parameters that can cover a wide range of values. We account for these two issues and ensure reliable calibration of pore-scale modeling, based on dynamical microCT images, by integrating uncertainty quantification in the workflow. The present method is based on a multitasking formulation of reactive inverse problems combining data-driven and physics-informed techniques in calcite dissolution. This allows quantifying morphological uncertainties on the porosity field and estimating reactive parameter ranges through prescribed PDE models with a latent concentration field and dynamical microCT. The data assimilation strategy relies on sequential reinforcement incorporating successively additional PDE constraints. We guarantee robust and unbiased uncertainty quantification by straightforward adaptive weighting of Bayesian Physics-Informed Neural Networks (BPINNs), ensuring reliable micro-porosity changes during geochemical transformations. We demonstrate successful Bayesian Inference in 1D+Time and 2D+Time calcite dissolution based on synthetic microCT images with meaningful posterior distribution on the reactive parameters and dimensionless numbers.
A highly efficient and accurate divergence-free spectral method for curl-curl equation in two and three dimensions
Abstract
In this paper, we present a fast divergence-free spectral algorithm (FDSA) for the curl-curl problem. Divergence-free bases in two and three dimensions are constructed by using the generalized Jacobi polynomials. An accurate spectral method with exact point-wise preservation of the divergence-free constraint is then proposed, and its corresponding error estimate is established. We then present a highly efficient solution algorithm based on a combination of a matrix-free preconditioned Krylov subspace iterative method and a fully diagonalizable auxiliary problem, which is derived from the spectral discretisations of generalized eigenvalue problems of the Laplace and biharmonic operators. We rigorously prove that the dimensions of the invariant subspace of the preconditioned linear system resulting from the divergence-free spectral method, with respect to the dominant eigenvalue $1$, are $(N-3)^2$ and $2(N-3)^3$ for two- and three-dimensional problems with $(N-1)^2$ and $2(N-1)^3$ unknowns, respectively. Thus, the proposed method usually takes only several iterations to converge, and remarkably, as the problem size (polynomial order) increases, the number of iterations decreases, even for highly indefinite systems and oscillatory solutions. As a result, the computational cost of the solution algorithm is only a small multiple of $N^3$ and $N^4$ floating-point operations for 2D and 3D problems, respectively. Numerous numerical examples for solving the curl-curl problem with both constant and variable coefficients in two and three dimensions are presented to demonstrate the accuracy and efficiency of the proposed method.
IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
Authors: Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi
Abstract
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of accuracy and cost in inference pipelines, providers frequently opt to consider only one of them. However, the challenge lies in reconciling accuracy and cost trade-offs. To address this challenge and efficiently manage model variants in inference pipelines, we present IPA, an online deep-learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35% with a minimal cost increase of less than 5%.
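A hedged sketch of the kind of Integer Programming formulation the abstract describes, reduced to a single pipeline stage: pick exactly one model variant to maximize accuracy under a cost budget. The accuracy/cost numbers are illustrative, and IPA itself additionally decides batch sizes and replication and enforces latency SLAs.

```python
# Toy variant-selection ILP with PuLP.
import pulp

variants = {"resnet18": (0.70, 1.0), "resnet50": (0.76, 2.5), "resnet152": (0.78, 6.0)}
budget = 3.0  # illustrative cost budget

prob = pulp.LpProblem("variant_selection", pulp.LpMaximize)
x = {name: pulp.LpVariable(f"x_{name}", cat="Binary") for name in variants}

prob += pulp.lpSum(acc * x[name] for name, (acc, _) in variants.items())          # objective
prob += pulp.lpSum(cost * x[name] for name, (_, cost) in variants.items()) <= budget
prob += pulp.lpSum(x.values()) == 1  # exactly one variant serves the stage

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([name for name in variants if x[name].value() == 1])  # ['resnet50']
```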
A second-order length-preserving and unconditionally energy stable rotational discrete gradient method for Oseen-Frank gradient flows
Abstract
We present a second-order strictly length-preserving and unconditionally energy-stable rotational discrete gradient (Rdg) scheme for the numerical approximation of the Oseen-Frank gradient flows with anisotropic elastic energy functional. Two essential ingredients of the Rdg method are reformulation of the length-constrained gradient flow into an unconstrained rotational form and discrete gradient discretization for the energy variation. Besides the well-known mean-value and Gonzalez discrete gradients, we propose a novel Oseen-Frank discrete gradient, specifically designed for the solution of Oseen-Frank gradient flow. We prove that the proposed Oseen-Frank discrete gradient satisfies the energy difference relation, thus the resultant Rdg scheme is energy stable. Numerical experiments demonstrate the efficiency and accuracy of the proposed Rdg method and its capability for providing reliable simulation results with highly disparate elastic coefficients.
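For context, the discrete gradient machinery the abstract relies on can be stated compactly; these are the standard defining identities and the Gonzalez (midpoint) discrete gradient, not formulas taken from the paper. A discrete gradient $\overline{\nabla}E$ is any map satisfying
$$\overline{\nabla}E(u,v)\cdot(v-u)=E(v)-E(u),\qquad \overline{\nabla}E(u,u)=\nabla E(u),$$
and the Gonzalez choice is
$$\overline{\nabla}E(u,v)=\nabla E\Big(\frac{u+v}{2}\Big)+\frac{E(v)-E(u)-\nabla E\big(\frac{u+v}{2}\big)\cdot(v-u)}{\|v-u\|^{2}}\,(v-u).$$
The first identity is exactly the energy difference relation: advancing the flow with $\overline{\nabla}E$ in place of $\nabla E$ gives $E(u_{n+1})-E(u_n)=-\tau\|\overline{\nabla}E\|^2\le 0$, hence unconditional energy stability.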
Linear implicit approximations of invariant measures of semi-linear SDEs with non-globally Lipschitz coefficients
Authors: Chenxu Pang, Xiaojie Wang, Yue Wu
Subjects: Numerical Analysis (math.NA); Probability (math.PR)
Abstract
This article investigates weak approximations of the invariant measure of semi-linear stochastic differential equations (SDEs) under non-globally Lipschitz coefficients. For this purpose, we propose a linear-theta-projected Euler (LTPE) scheme, which also admits an invariant measure, to handle the potential influence of the linear stiffness. Under certain assumptions, both the SDE and the corresponding LTPE method are shown to converge exponentially to the underlying invariant measures, respectively. Moreover, with time-independent regularity estimates for the corresponding Kolmogorov equation, the weak error between the numerical invariant measure and the original one can be guaranteed with order one. Numerical experiments are provided to verify our theoretical findings.
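To illustrate the linear-theta idea, here is a minimal NumPy sketch for a semi-linear SDE $dX = (AX + f(X))\,dt + g\,dW$: only the stiff linear drift $AX$ is treated implicitly, so each step solves one fixed linear system. The projection step that the LTPE scheme applies to control the nonlinearity is omitted here, and the coefficients are illustrative.

```python
# Semi-implicit (linear-theta) Euler step for a stiff semi-linear SDE.
import numpy as np

def linear_theta_step(x, h, A, f, g, dW, theta=0.5):
    I = np.eye(len(x))
    rhs = x + h * ((1 - theta) * A @ x + f(x)) + g @ dW
    return np.linalg.solve(I - h * theta * A, rhs)

rng = np.random.default_rng(1)
A = np.array([[-50.0, 0.0], [0.0, -1.0]])   # stiff linear part
f = lambda x: -x**3                          # non-globally Lipschitz drift
g = np.eye(2)
x, h = np.array([1.0, 1.0]), 1e-2
for _ in range(10_000):
    x = linear_theta_step(x, h, A, f, g, rng.normal(0.0, np.sqrt(h), 2))
# Long-time samples of x approximate draws from the invariant measure.
```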
Large Language Models Vote: Prompting for Rare Disease Identification
Authors: David Oniani, Jordan Hilsman, Hang Dong, Fengyi Gao, Shiven Verma, Yanshan Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
The emergence of generative Large Language Models (LLMs) emphasizes the need for accurate and efficient prompting approaches. LLMs are often applied in Few-Shot Learning (FSL) contexts, where tasks are executed with minimal training data. FSL has become popular in many Artificial Intelligence (AI) subdomains, including AI for health. Rare diseases, affecting a small fraction of the population, inherently require FSL techniques due to limited data availability, though manual data collection and annotation is costly and time-consuming. In this paper, we propose Models-Vote Prompting (MVP), a flexible prompting approach for improving the performance of LLM queries in FSL settings. MVP works by prompting numerous LLMs to perform the same tasks and then conducting a majority vote on the resulting outputs. This method achieves improved results over any single model in the ensemble on one-shot rare disease identification and classification tasks. We also release a novel rare disease dataset for FSL, available to those who have agreed to the MIMIC-IV Data Use Agreement (DUA). Furthermore, MVP prompts each model multiple times, substantially increasing the time needed for manual annotation; to address this, we assess the feasibility of using JSON to automate generative LLM evaluation.
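The voting scheme itself is simple enough to sketch in a few lines. Below, `query_model` is a hypothetical stub standing in for whatever API each model exposes; MVP as described simply aggregates the per-model answers by majority vote.

```python
# Minimal majority-vote prompting sketch.
from collections import Counter

def query_model(model: str, prompt: str) -> str:
    # Placeholder: call the actual model endpoint here.
    return {"model_a": "yes", "model_b": "yes", "model_c": "no"}[model]

def models_vote(models: list[str], prompt: str) -> str:
    votes = Counter(query_model(m, prompt) for m in models)
    return votes.most_common(1)[0][0]

print(models_vote(["model_a", "model_b", "model_c"],
                  "Does this note mention a rare disease? Answer yes or no."))  # "yes"
```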
Beyond Document Page Classification: Design, Datasets, and Challenges
Authors: Jordy Van Landeghem, Sanket Biswas, Matthew B. Blaschko, Marie-Francine Moens
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested ($X$: multi-channel, multi-paged, multi-industry; $Y$: class distributions and label set variety) and in classification tasks considered ($f$: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.
Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark]
Authors: Jiawei Jiang, Chengkai Han, Wayne Xin Zhao, Jingyuan Wang
Abstract
The field of urban spatial-temporal prediction is advancing rapidly with the development of deep learning techniques and the availability of large-scale datasets. However, challenges persist in accessing and utilizing diverse urban spatial-temporal datasets from different sources and stored in different formats, as well as determining effective model structures and components with the proliferation of deep learning models. This work addresses these challenges and provides three significant contributions. Firstly, we introduce "atomic files", a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets, simplifying data management. Secondly, we present a comprehensive overview of technological advances in urban spatial-temporal prediction models, guiding the development of robust models. Thirdly, we conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions. Overall, this work effectively manages urban spatial-temporal data, guides future efforts, and facilitates the development of accurate and efficient urban spatial-temporal prediction models. It can potentially make long-term contributions to urban spatial-temporal data management and prediction, ultimately leading to improved urban living standards.
CDAN: Convolutional Dense Attention-guided Network for Low-light Image Enhancement
Authors: Hossein Shakibania, Sina Raoufi, Hassan Khotanlou
Abstract
Low-light images, characterized by inadequate illumination, pose challenges of diminished clarity, muted colors, and reduced details. Low-light image enhancement, an essential task in computer vision, aims to rectify these issues by improving brightness, contrast, and overall perceptual quality, thereby facilitating accurate analysis and interpretation. This paper introduces the Convolutional Dense Attention-guided Network (CDAN), a novel solution for enhancing low-light images. CDAN integrates an autoencoder-based architecture with convolutional and dense blocks, complemented by an attention mechanism and skip connections. This architecture ensures efficient information propagation and feature learning. Furthermore, a dedicated post-processing phase refines color balance and contrast. Our approach demonstrates notable progress compared to state-of-the-art results in low-light image enhancement, showcasing its robustness across a wide range of challenging scenarios. Our model performs remarkably on benchmark datasets, effectively mitigating under-exposure and proficiently restoring textures and colors in diverse low-light scenarios. This achievement underscores CDAN's potential for diverse computer vision tasks, notably enabling robust object detection and recognition in challenging low-light conditions.
New time domain decomposition methods for parabolic control problems I: Dirichlet-Neumann and Neumann-Dirichlet algorithms
Authors: Martin Jakob Gander, Liu-Di Lu
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
Abstract
We present new Dirichlet-Neumann and Neumann-Dirichlet algorithms with a time domain decomposition applied to unconstrained parabolic optimal control problems. After a spatial semi-discretization, we use the Lagrange multiplier approach to derive a coupled forward-backward optimality system, which can then be solved using a time domain decomposition. Due to the forward-backward structure of the optimality system, three variants can be found for the Dirichlet-Neumann and Neumann-Dirichlet algorithms. We analyze their convergence behavior and determine the optimal relaxation parameter for each algorithm. Our analysis reveals that the most natural algorithms are actually only good smoothers, and there are better choices which lead to efficient solvers. We illustrate our analysis with numerical experiments.
Efficient assessment of window views in high-rise, high-density urban areas using 3D color City Information Models
Authors: Maosu Li, Fan Xue, Anthony G.O. Yeh
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Urban-scale quantification of window views can inform housing selection and valuation, landscape management, and urban planning. However, window views are numerous in high-rise, high-density urban areas, and current automatic assessments of window views are inaccurate and time-consuming. Thus, accurate and efficient assessment of window views is essential for improving automation in urban-scale window view applications. The paper presents an automatic, accurate, and efficient assessment of window view indices (WVIs) of greenery, sky, waterbody, and construction using 3D color City Information Models (CIMs). The workflow includes: i) 3D semantic segmentation of photorealistic CIM and Digital Surface Model (DSM), and ii) batch computation of WVIs. Experimental results showed the estimated WVIs were more accurate (RMSE < 0.01), and the proposed method was more efficient (3.68 times faster) than Li et al.'s (2022) 2D semantic segmentation. Thus, the proposed method can facilitate large-scale WVI assessment and update in healthy high-rise, high-density urban development.
Towards Realistic Unsupervised Fine-tuning with CLIP
Authors: Jian Liang, Lijun Sheng, Zhengbo Wang, Ran He, Tieniu Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
The emergence of vision-language models (VLMs), such as CLIP, has spurred a significant research effort towards their application for downstream supervised learning tasks. Although some previous studies have explored the unsupervised fine-tuning of CLIP, they often rely on prior knowledge in the form of class names associated with ground truth labels. In this paper, we delve into a realistic unsupervised fine-tuning scenario by assuming that the unlabeled data might contain out-of-distribution samples from unknown classes. Furthermore, we emphasize the importance of simultaneously enhancing out-of-distribution detection capabilities alongside the recognition of instances associated with predefined class labels. To tackle this problem, we present a simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO). UEO leverages sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Apart from optimizing the textual prompts, UEO also incorporates optimization of channel-wise affine transformations within the visual branch of CLIP. Through extensive experiments conducted across 15 domains and 4 different types of prior knowledge, we demonstrate that UEO surpasses baseline methods in terms of both generalization and out-of-distribution detection.
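A hedged sketch of the entropy objective as the abstract describes it: confidence-weighted conditional entropy is minimized for confident samples, while the marginal entropy of down-weighted, less confident samples is maximized. The exact weighting below is our assumption; the paper's loss may differ.

```python
# Illustrative UEO-style objective over a batch of CLIP logits.
import torch

def ueo_loss(logits: torch.Tensor) -> torch.Tensor:
    probs = logits.softmax(dim=1)                             # (N, C)
    conf = probs.max(dim=1).values.detach()                   # sample-level confidence
    w = conf / conf.sum()                                     # confident samples weigh more
    cond_ent = -(probs * probs.clamp_min(1e-8).log()).sum(1)  # per-sample entropy
    # Marginal distribution weighted toward the *less* confident samples.
    marginal = (probs * ((1 - conf) / (1 - conf).sum()).unsqueeze(1)).sum(0)
    marg_ent = -(marginal * marginal.clamp_min(1e-8).log()).sum()
    return (w * cond_ent).sum() - marg_ent  # minimize conditional, maximize marginal
```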
DLIP: Distilling Language-Image Pre-training
Authors: Huafeng Kuang, Jie Wu, Xiawu Zheng, Ming Li, Xuefeng Xiao, Rui Wang, Min Zheng, Rongrong Ji
Abstract
Vision-Language Pre-training (VLP) shows remarkable progress with the assistance of extremely heavy parameters, which challenges its deployment in real applications. Knowledge distillation is well recognized as an essential procedure in model compression. However, existing knowledge distillation techniques lack an in-depth investigation and analysis of VLP, and practical guidelines for VLP-oriented distillation have yet to be explored. In this paper, we present DLIP, a simple yet efficient Distilling Language-Image Pre-training framework, through which we investigate how to distill a light VLP model. Specifically, we dissect model distillation from multiple dimensions, such as the architecture characteristics of different modules and the information transfer of different modalities. We conduct comprehensive experiments and provide insights on distilling a light but performant VLP model. Experimental results reveal that DLIP can achieve a state-of-the-art accuracy/efficiency trade-off across diverse cross-modal tasks, e.g., image-text retrieval, image captioning and visual question answering. For example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, while achieving comparable or better performance. Furthermore, DLIP succeeds in retaining more than 95% of the performance with 22.4% of the parameters and 24.8% of the FLOPs compared to the teacher model, and accelerates inference by 2.7x.
Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks
Abstract
To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot semantic segmentation methods first pre-train the models on `seen' classes, and then evaluate their generalization performance on `unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead, but also incurs a significant domain gap on `unseen' classes. To tackle these issues, we propose an efficient Training-free Few-shot 3D Segmentation network, TFS3D, and a further training-based variant, TFS3D-T. Without any learnable parameters, TFS3D extracts dense representations by trigonometric positional encodings, and achieves comparable performance to previous training-based methods. Due to the elimination of pre-training, TFS3D can alleviate the domain gap issue and save a substantial amount of time. Building upon TFS3D, TFS3D-T only requires training a lightweight query-support transferring attention module (QUEST), which enhances the interaction between the few-shot query and support data. Experiments demonstrate that TFS3D-T improves previous state-of-the-art methods by +6.93% and +17.96% mIoU on S3DIS and ScanNet, respectively, while reducing the training time by 90%, indicating superior effectiveness and efficiency.
Motion-Guided Masking for Spatiotemporal Representation Learning
Authors: David Fan, Jue Wang, Shuai Liao, Yi Zhu, Vimal Bhat, Hector Santos-Villalobos, Rohith MV, Xinyu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Several recent works have directly extended the image masked autoencoder (MAE) with random masking into the video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy inherited from the image MAE is less effective for video MAE, motivating the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +$1.3\%$ improvement compared to previous state-of-the-art methods. Additionally, our MGM achieves equivalent performance to previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +$4.9\%$ improvement compared to baseline methods.
NeuralClothSim: Neural Deformation Fields Meet the Kirchhoff-Love Thin Shell Theory
Authors: Navami Kairanda, Marc Habermann, Christian Theobalt, Vladislav Golyanik
Abstract
Cloth simulation is an extensively studied problem, with a plethora of solutions available in computer graphics literature. Existing cloth simulators produce realistic cloth deformations that obey different types of boundary conditions. Nevertheless, their operational principle remains limited in several ways: They operate on explicit surface representations with a fixed spatial resolution, perform a series of discretised updates (which bounds their temporal resolution), and require comparably large amounts of storage. Moreover, back-propagating gradients through the existing solvers is often not straightforward, which poses additional challenges when integrating them into modern neural architectures. In response to the limitations mentioned above, this paper takes a fundamentally different perspective on physically-plausible cloth simulation and re-thinks this long-standing problem: We propose NeuralClothSim, i.e., a new cloth simulation approach using thin shells, in which surface evolution is encoded in neural network weights. Our memory-efficient and differentiable solver operates on a new continuous coordinate-based representation of dynamic surfaces, i.e., neural deformation fields (NDFs); it supervises NDF evolution with the rules of the non-linear Kirchhoff-Love shell theory. NDFs are adaptive in the sense that they 1) allocate their capacity to the deformation details as the latter arise during the cloth evolution and 2) allow surface state queries at arbitrary spatial and temporal resolutions without retraining. We show how to train our NeuralClothSim solver while imposing hard boundary conditions and demonstrate multiple applications, such as material interpolation and simulation editing. The experimental results highlight the effectiveness of our formulation and its potential impact.
Keyword: faster
Experience with Distributed Memory Delaunay-based Image-to-Mesh Conversion Implementation
Abstract
This paper presents some of our findings on the scalability of parallel 3D mesh generation on distributed memory machines. The primary objective of this study was to evaluate a distributed memory approach for implementing a 3D parallel Delaunay-based algorithm that converts images to meshes by leveraging an efficient shared memory implementation. The secondary objective was to evaluate the effectiveness of this approach in reducing development time while introducing minimal overheads, so as to maintain the parallel efficiency of the end product, i.e., the distributed implementation. The distributed algorithm utilizes two existing and independently developed parallel Delaunay-based methods: (1) a fine-grained method that employs multi-threading and speculative execution on shared memory nodes, and (2) a loosely coupled Delaunay-refinement framework for multi-node platforms. The shared memory implementation uses a FIFO work-sharing scheme for thread scheduling, while the distributed memory implementation utilizes MPI and the Master-Worker (MW) model. The findings from the specific MPI-MW implementation we tested suggest that (1) execution on 40 cores, not necessarily on the same node, is 2.3 times faster than execution on ten cores, and (2) the best speedup is 5.4, attained with 180 cores, again compared with the best performance on ten cores. A closer look at the performance of the distributed memory and shared memory implementations executing on a single node (40 cores) suggests that the overheads introduced in the MPI-MW implementation are high and render it 4 times slower than the shared memory code using the same number of cores. These findings raise several questions about the potential scalability of a "black box" approach, i.e., re-using a code designed to execute efficiently on shared memory machines without considering its potential use in a distributed memory setting.
Motion In-Betweening with Phase Manifolds
Authors: Paul Starke, Sebastian Starke, Taku Komura, Frank Steinicke
Abstract
This paper introduces a novel data-driven motion in-betweening system to reach target poses of characters by making use of phase variables learned by a Periodic Autoencoder. Our approach utilizes a mixture-of-experts neural network model, in which the phases cluster movements in both space and time with different expert weights. Each generated set of weights then produces a sequence of poses in an autoregressive manner between the current and target state of the character. In addition, a learned bi-directional control scheme is implemented to satisfy poses that are manually modified by animators, as well as constraints where certain end effectors must be reached by the animation. The results demonstrate that using phases for motion in-betweening tasks sharpens the interpolated movements and stabilizes the learning process. Moreover, using phases for motion in-betweening can also synthesize more challenging movements beyond locomotion behaviors. Additionally, style control is enabled between given target keyframes. Our proposed framework can compete with popular state-of-the-art methods for motion in-betweening in terms of motion quality and generalization, especially in the presence of long transition durations. Our framework contributes to faster prototyping workflows for creating animated character sequences, which is of enormous interest to the game and film industry.
Reinforcement learning informed evolutionary search for autonomous systems testing
Abstract
Evolutionary search-based techniques are commonly used for testing autonomous robotic systems. However, these approaches often rely on computationally expensive simulator-based models for test scenario evaluation. To improve the computational efficiency of the search-based testing, we propose augmenting the evolutionary search (ES) with a reinforcement learning (RL) agent trained using surrogate rewards derived from domain knowledge. In our approach, known as RIGAA (Reinforcement learning Informed Genetic Algorithm for Autonomous systems testing), we first train an RL agent to learn useful constraints of the problem and then use it to produce a certain part of the initial population of the search algorithm. By incorporating an RL agent into the search process, we aim to guide the algorithm towards promising regions of the search space from the start, enabling more efficient exploration of the solution space. We evaluate RIGAA on two case studies: maze generation for an autonomous ant robot and road topology generation for an autonomous vehicle lane keeping assist system. In both case studies, RIGAA converges faster to fitter solutions and produces a better test suite (in terms of average test scenario fitness and diversity). RIGAA also outperforms the state-of-the-art tools for vehicle lane keeping assist system testing, such as AmbieGen and Frenetic.
A note on improving the search of optimal prices in envy-free perfect matchings
Authors: Marcos Salvatierra, Juan G. Colonna, Mario Salvatierra Jr., Alcides de C. Amorim Neto
Subjects: Computer Science and Game Theory (cs.GT)
Abstract
We present a method for finding envy-free prices in a combinatorial auction where the number of consumers $n$ coincides with that of distinct items for sale, each consumer can buy one single item, and each item has only one unit available. This is a particular case of the {\it unit-demand envy-free pricing problem}, and was recently revisited by Arbib et al. (2019). These authors proved that, using a Fibonacci heap for solving the maximum weight perfect matching and the Bellman-Ford algorithm for getting the envy-free prices, the overall time complexity for solving the problem is $O(n^3)$. We propose a method based on a dynamic programming design strategy that seeks the optimal envy-free prices by increasing the consumers' utilities. It has the same cubic time complexity as the aforementioned approach, but theoretical and empirical results indicate that our method performs faster than the shortest-path strategy, obtaining an average time reduction of approximately 48\% in determining optimal envy-free prices.
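As a point of reference, the maximum-weight perfect matching step of this pipeline is a standard assignment problem; the sketch below solves it with SciPy on illustrative valuations. The subsequent price search (Bellman-Ford in Arbib et al., the utility-raising dynamic program here) is the part the paper improves and is not shown.

```python
# Welfare-maximizing assignment of n items to n unit-demand consumers.
import numpy as np
from scipy.optimize import linear_sum_assignment

# valuations[i, j] = value consumer i assigns to item j (illustrative numbers)
valuations = np.array([[8.0, 4.0, 2.0],
                       [6.0, 5.0, 1.0],
                       [3.0, 2.0, 9.0]])

rows, cols = linear_sum_assignment(valuations, maximize=True)
print(list(zip(rows, cols)))  # [(0, 0), (1, 1), (2, 2)]: total welfare 22
```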
Efficient assessment of window views in high-rise, high-density urban areas using 3D color City Information Models
Authors: Maosu Li, Fan Xue, Anthony G.O. Yeh
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Urban-scale quantification of window views can inform housing selection and valuation, landscape management, and urban planning. However, window views are numerous in high-rise, high-density urban areas, and current automatic assessments of window views are inaccurate and time-consuming. Thus, accurate and efficient assessment of window views is essential for improving automation in urban-scale window view applications. The paper presents an automatic, accurate, and efficient assessment of window view indices (WVIs) of greenery, sky, waterbody, and construction using 3D color City Information Models (CIMs). The workflow includes: i) 3D semantic segmentation of photorealistic CIM and Digital Surface Model (DSM), and ii) batch computation of WVIs. Experimental results showed the estimated WVIs were more accurate (RMSE < 0.01), and the proposed method was more efficient (3.68 times faster) than Li et al.'s (2022) 2D semantic segmentation. Thus, the proposed method can facilitate large-scale WVI assessment and update in healthy high-rise, high-density urban development.
Keyword: mobile
Fine-grained Spatio-Temporal Distribution Prediction of Mobile Content Delivery in 5G Ultra-Dense Networks
Authors: Shaoyuan Huang, Heng Zhang, Xiaofei Wang, Min Chen, Jianxin Li, Victor C. M. Leung
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
The 5G networks have extensively promoted the growth of mobile users and novel applications, and with the skyrocketing user requests for a large amount of popular content, the consequent content delivery services (CDSs) have been bringing a heavy load to mobile service providers. As a key mission in intelligent network management, understanding and predicting the distribution of CDSs benefits many tasks of modern network services such as resource provisioning and proactive content caching for content delivery networks. However, the revolutions in novel ubiquitous network architectures led by ultra-dense networks (UDNs) make the task extremely challenging. Specifically, conventional methods face the challenges of insufficient spatial precision, lacking generalizability, and complex multi-feature dependencies of user requests, making their effectiveness unreliable in CDSs prediction under 5G UDNs. In this paper, we propose to adopt a series of encoding and sampling methods to model CDSs of known and unknown areas at a tailored fine-grained level. Moreover, we design a spatio-temporal-social multi-feature extraction framework for CDSs hotspot prediction, in which a novel edge-enhanced graph convolution block is proposed to encode dynamic CDSs networks based on social relationships and spatial features. Besides, we introduce Long Short-Term Memory (LSTM) to further capture the temporal dependency. Extensive performance evaluations with real-world measurement data collected in two mobile content applications demonstrate the effectiveness of our proposed solution, which can improve the prediction area under the curve (AUC) by 40.5% compared to the state-of-the-art proposals at a spatial granularity of 76 m, with up to 80% of the unknown areas.
Deploying Deep Reinforcement Learning Systems: A Taxonomy of Challenges
Authors: Ahmed Haj Yahmed, Altaf Allah Abbassi, Amin Nikanjam, Heng Li, Foutse Khomh
Abstract
Deep reinforcement learning (DRL), leveraging Deep Learning (DL) in reinforcement learning, has shown significant potential in achieving human-level autonomy in a wide range of domains, including robotics, computer vision, and computer games. This potential justifies the enthusiasm and growing interest in DRL in both academia and industry. However, the community currently focuses mostly on the development phase of DRL systems, with little attention devoted to DRL deployment. In this paper, we conduct an empirical study on Stack Overflow (SO), the most popular Q&A forum for developers, to uncover and understand the challenges practitioners faced when deploying DRL systems. Specifically, we categorized relevant SO posts by deployment platforms: server/cloud, mobile/embedded system, browser, and game engine. After filtering and manual analysis, we examined 357 SO posts about DRL deployment, investigated the current state, and identified the challenges related to deploying DRL systems. Then, we investigate the prevalence and difficulty of these challenges. Results show that the general interest in DRL deployment is growing, confirming the study's relevance and importance. Results also show that DRL deployment is more difficult than other DRL issues. Additionally, we built a taxonomy of 31 unique challenges in deploying DRL to different platforms. On all platforms, RL environment-related challenges are the most popular, and communication-related challenges are the most difficult among practitioners. We hope our study inspires future research and helps the community overcome the most common and difficult challenges practitioners face when deploying DRL systems.
BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
Abstract
We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), in which adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract the backdoor functionality of a given backdoored model into a backdoor expert model. The approach is straightforward: finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, results in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).
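The extraction step lends itself to a short sketch. The following is a hedged PyTorch illustration of the idea as stated (finetune a copy of the backdoored model on deliberately mislabeled clean samples); hyperparameters and the label-shuffling rule are our assumptions, not the paper's recipe.

```python
# Hedged sketch: turn a backdoored model into a "backdoor expert" by unlearning
# its normal functionality on intentionally mislabeled clean data.
import copy, itertools, torch

def extract_backdoor_expert(model, clean_loader, num_classes, steps=100, lr=1e-4):
    expert = copy.deepcopy(model)
    opt = torch.optim.SGD(expert.parameters(), lr=lr)
    for x, y in itertools.islice(itertools.cycle(clean_loader), steps):
        # Shift each label by a random nonzero offset so it is always wrong.
        wrong = (y + torch.randint(1, num_classes, y.shape)) % num_classes
        loss = torch.nn.functional.cross_entropy(expert(x), wrong)
        opt.zero_grad(); loss.backward(); opt.step()
    return expert  # empirically retains backdoor behavior, loses clean accuracy
```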
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); General Economics (econ.GN)
Abstract
Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.
MOFA: A Model Simplification Roadmap for Image Restoration on Mobile Devices
Abstract
Image restoration aims to restore high-quality images from degraded counterparts and has seen significant advancements through deep learning techniques. The technique has been widely applied to mobile devices for tasks such as mobile photography. Given the resource limitations on mobile devices, such as memory constraints and runtime requirements, the efficiency of models during deployment becomes paramount. Nevertheless, most previous works have primarily concentrated on analyzing the efficiency of single modules and improving them individually. This paper examines the efficiency across different layers. We propose a roadmap that can be applied to further accelerate image restoration models prior to deployment while simultaneously increasing PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). The roadmap first increases the model capacity by adding more parameters to partial convolutions on FLOPs non-sensitive layers. Then, it applies partial depthwise convolution coupled with decoupling upsampling/downsampling layers to accelerate the model speed. Extensive experiments demonstrate that our approach decreases runtime by up to 13% and reduces the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets. The source code of our method is available at \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA}.
Exploiting Time-Frequency Conformers for Music Audio Enhancement
Authors: Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee
Abstract
With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work.
Out of the Box Thinking: Improving Customer Lifetime Value Modelling via Expert Routing and Game Whale Detection
Abstract
Customer lifetime value (LTV) prediction is essential for mobile game publishers trying to optimize the advertising investment for each user acquisition based on the estimated worth. In mobile games, deploying microtransactions is a simple yet effective monetization strategy, which attracts a tiny group of game whales who splurge on in-game purchases. The presence of such game whales may impede the practicality of existing LTV prediction models, since game whales' purchase behaviours exhibit a distribution different from that of general users. Consequently, identifying game whales can open up new opportunities to improve the accuracy of LTV prediction models. However, little attention has been paid to applying game whale detection in LTV prediction, and existing works are mainly specialized for long-term LTV prediction under the assumption that high-quality user features are available, which is not applicable in the user-acquisition (UA) stage. In this paper, we propose ExpLTV, a novel multi-task framework to perform LTV prediction and game whale detection in a unified way. In ExpLTV, we first innovatively design a deep neural network-based game whale detector that can not only infer the intrinsic order in accordance with monetary value, but also precisely identify high spenders (i.e., game whales) and low spenders. Then, by treating the game whale detector as a gating network that decides the mixture patterns of assembled LTV experts, we can thoroughly leverage the shared information and scenario-specific information (i.e., game whale modelling and low spender modelling). Finally, instead of separately designing a purchase rate estimator for the two tasks, we design a shared estimator that preserves the relationships between tasks. The superiority of ExpLTV is further validated via extensive experiments on three industrial datasets.
SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Authors: Xinyu Zhang, Yan Gong, Zhiwei Li, Xin Gao, Dafeng Jin, Jun Li, Huaping Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multi-modal fusion is increasingly being used for autonomous driving tasks, as images from different modalities provide unique information for feature extraction. However, existing two-stream networks are only fused at a specific network layer, which requires many manual attempts to set up. As the CNN goes deeper, the two modal features become more and more advanced and abstract, and the fusion occurs at the feature level with a large gap, which can easily hurt the performance. In this study, we propose a novel fusion architecture called skip-cross networks (SkipcrossNets), which adaptively combines LiDAR point clouds and camera images without being bound to a certain fusion epoch. Specifically, skip-cross connects each layer to each layer in a feed-forward manner: for each layer, the feature maps of all preceding layers of the other modality are used as input, and its own feature maps are fed as input to all subsequent layers of the other modality, enhancing feature propagation and multi-modal feature fusion. This strategy facilitates selection of the most similar feature layers from two data pipelines, providing a complementary effect for sparse point cloud features during fusion processes. The network is also divided into several blocks to reduce the complexity of feature fusion and the number of model parameters. The advantages of skip-cross fusion were demonstrated through application to the KITTI and A2D2 datasets, achieving a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters required only 2.33 MB of memory at a speed of 68.24 FPS, which could be viable for mobile terminals and embedded devices.
Keyword: pruning
There is no result
Keyword: diffusion
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
Authors: Duo Peng, Ping Hu, Qiuhong Ke, Jun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods.
Augmenting medical image classifiers with synthetic data from latent diffusion models
Authors: Luke W. Sagers, James A. Diao, Luke Melas-Kyriazi, Matthew Groh, Pranav Rajpurkar, Adewole S. Adamson, Veronica Rotemberg, Roxana Daneshjou, Arjun K. Manrai
Abstract
While hundreds of artificial intelligence (AI) algorithms are now approved or cleared by the US Food and Drug Administration (FDA), many studies have shown inconsistent generalization or latent bias, particularly for underrepresented populations. Some have proposed that generative AI could reduce the need for real data, but its utility in model development remains unclear. Skin disease serves as a useful case study in synthetic image generation due to the diversity of disease appearance, particularly across the protected attribute of skin tone. Here we show that latent diffusion models can scalably generate images of skin disease and that augmenting model training with these data improves performance in data-limited settings. These performance gains saturate at synthetic-to-real image ratios above 10:1 and are substantially smaller than the gains obtained from adding real images. As part of our analysis, we generate and analyze a new dataset of 458,920 synthetic images produced using several generation strategies. Our results suggest that synthetic data could serve as a force-multiplier for model development, but the collection of diverse real-world data remains the most important step to improve medical AI algorithms.
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
Authors: Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.
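A hedged sketch of the merging rule as described: treat each attention map as a probability distribution and iteratively merge the pair with the smallest symmetric KL divergence until every remaining pair exceeds a threshold. The averaging-as-merge step and the threshold value are our assumptions, not details from the paper.

```python
# Iterative KL-based merging of attention maps into candidate masks.
import numpy as np

def sym_kl(p, q, eps=1e-8):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def merge_attention_maps(maps, threshold=0.5):
    maps = [m / m.sum() for m in maps]  # normalize each map to a distribution
    while len(maps) > 1:
        pairs = [(sym_kl(maps[i], maps[j]), i, j)
                 for i in range(len(maps)) for j in range(i + 1, len(maps))]
        d, i, j = min(pairs)
        if d > threshold:      # all remaining maps are mutually distinct
            break
        merged = (maps[i] + maps[j]) / 2
        maps = [m for k, m in enumerate(maps) if k not in (i, j)] + [merged]
    return maps  # each surviving map ~ one segmentation mask
```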
False Information, Bots and Malicious Campaigns: Demystifying Elements of Social Media Manipulations
Authors: Mohammad Majid Akhtar, Rahat Masood, Muhammad Ikram, Salil S. Kanhere
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract
The rapid spread of false information and persistent manipulation attacks on online social networks (OSNs), often for political, ideological, or financial gain, has affected the openness of OSNs. While researchers from various disciplines have investigated different manipulation-triggering elements of OSNs (such as understanding information diffusion on OSNs or detecting automated behavior of accounts), these works have not been consolidated to present a comprehensive overview of the interconnections among these elements. Notably, user psychology, the prevalence of bots, and their tactics in relation to false information detection have been overlooked in previous research. To address this research gap, this paper synthesizes insights from various disciplines to provide a comprehensive analysis of the manipulation landscape. By integrating the primary elements of social media manipulation (SMM), including false information, bots, and malicious campaigns, we extensively examine each SMM element. Through a systematic investigation of prior research, we identify commonalities, highlight existing gaps, and extract valuable insights in the field. Our findings underscore the urgent need for interdisciplinary research to effectively combat social media manipulations, and our systematization can guide future research efforts and assist OSN providers in ensuring the safety and integrity of their platforms.
DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action Recognition
Authors: Chang Li, Qian Huang, Yingchi Mao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Graph Convolutional Networks (GCNs) have been widely used in skeleton-based human action recognition. In GCN-based methods, the spatio-temporal graph is fundamental for capturing motion patterns. However, existing approaches ignore the physical dependency and synchronized spatio-temporal correlations between joints, which limits the representation capability of GCNs. To solve these problems, we construct the directed diffusion graph for action modeling and introduce the activity partition strategy to optimize the weight sharing mechanism of graph convolution kernels. In addition, we present the spatio-temporal synchronization encoder to embed synchronized spatio-temporal semantics. Finally, we propose Directed Diffusion Graph Convolutional Network (DD-GCN) for action recognition, and the experiments on three public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA, demonstrate the state-of-the-art performance of our method.
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
Authors: Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively.
Hydrogen jet diffusion modeling by using physics-informed graph neural network and sparsely-distributed sensor data
Abstract
Efficient modeling of jet diffusion during accidental release is critical for the operation and maintenance management of hydrogen facilities. Deep learning has proven effective for concentration prediction in gas jet diffusion scenarios. Nonetheless, its reliance on extensive simulations as training data and its potential disregard for physical laws limit its applicability to unseen accidental scenarios. Recently, physics-informed neural networks (PINNs) have emerged to reconstruct spatial information from data collected at sparsely-distributed sensors, which are easy to deploy in real-world applications. However, prevailing approaches use a fully-connected neural network as the backbone without considering the spatial dependency of sensor data, which reduces the accuracy of concentration prediction. This study introduces a physics-informed graph deep learning approach (Physic_GNN) for efficient and accurate hydrogen jet diffusion prediction using sparsely-distributed sensor data. A graph neural network (GNN) is used to model the spatial dependency of the sensor data via graph nodes, at which the governing equations describing the physics of hydrogen jet diffusion are solved. The computed residuals are then applied to constrain the training process. Public experimental data of hydrogen jets are used to compare the accuracy and efficiency of our proposed Physic_GNN against state-of-the-art PINN. The results demonstrate that Physic_GNN achieves higher accuracy and physical consistency in centerline concentration prediction from sparse concentration measurements than PINN, and higher efficiency than OpenFOAM. The proposed approach enables accurate and robust real-time spatial consequence reconstruction and analysis of the underlying physical mechanisms from sparse sensor data.
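To make the "residuals at graph nodes" idea concrete, here is a minimal sketch in which a steady 1D advection-diffusion equation is evaluated at sensor nodes arranged as a chain graph, and the mean-squared residual becomes the physics term of the loss. The PDE, the constants u and D, and the chain topology are assumptions for illustration only, not the paper's governing equations.

```python
import numpy as np

# Sensor positions along the jet centerline and a candidate
# concentration field predicted at those nodes.
x = np.linspace(0.0, 1.0, 11)
c = np.exp(-x)           # pretend GNN output
u, D = 1.0, 0.1          # advection speed, diffusivity (assumed values)

# Central differences over graph neighbors (here: a chain graph).
dc = (c[2:] - c[:-2]) / (x[2:] - x[:-2])
d2c = (c[2:] - 2 * c[1:-1] + c[:-2]) / ((x[1] - x[0]) ** 2)

# Residual of the steady advection-diffusion equation at interior nodes;
# its mean square would be added to the data loss during training.
residual = u * dc - D * d2c
physics_loss = float(np.mean(residual ** 2))
print(physics_loss)
```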
Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights using Generative AI
Authors: Yuqian Sun, Zhouyi Li, Ke Fang, Chang Hee Lee, Ali Asadipour
Abstract
In this paper, we present "1001 Nights", an AI-native game that allows players to shape the in-game reality through co-created storytelling with a character driven by a large language model. The concept is inspired by Wittgenstein's idea that the limits of one's world are determined by the bounds of one's language. Using advanced AI tools like GPT-4 and Stable Diffusion, the second iteration of the game enables the protagonist, Shahrzad, to realize words and stories in her world. The player can steer the conversation with the AI King towards specific keywords, which then become battle equipment in the game. This blend of interactive narrative and text-to-image transformation challenges the conventional border between the game world and reality through a dual perspective. We focus on Shahrzad, who, unlike in the original folktale, seeks to alter her fate, and on the player, who collaborates with the AI to craft narratives and shape the game world. We explore the technical and design elements of implementing such a game, with the objective of enhancing the narrative game genre with AI-generated content and exploring AI-native gameplay possibilities.
Dense Text-to-Image Generation with Attention Modulation
Authors: Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu
Abstract
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between the layouts of generated images and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation from dense captions on both automatic and human evaluation scores. In addition, we achieve visual quality comparable to models specifically trained with layout conditions.
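A minimal sketch of attention modulation: bias the pre-softmax cross-attention scores so each text token gains probability mass inside its target region and loses it elsewhere. The additive-bias form and the `strength` constant are assumptions; the paper's actual modulation schedule is more refined.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def modulate_attention(scores, region_mask, strength=2.0):
    # scores: (num_pixels, num_tokens) pre-softmax cross-attention.
    # region_mask: same shape, 1 where a token should attend.
    # Adding a signed bias pushes each token toward its region.
    return softmax(scores + strength * (2 * region_mask - 1))

rng = np.random.default_rng(0)
scores = rng.normal(size=(16, 4))              # 4x4 image, 4 tokens
mask = np.zeros((16, 4)); mask[:8, 0] = 1      # token 0 -> top half
attn = modulate_attention(scores, mask)
print(attn[:8, 0].mean(), attn[8:, 0].mean())  # top half now dominates
```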
Keyword: adaptive
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
Authors: Duo Peng, Ping Hu, Qiuhong Ke, Jun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods.
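The label-conditioned guidance can be pictured as a classifier-guidance-style update: one denoising step followed by a gradient nudge toward semantic agreement with the source labels. The formulas below follow generic diffusion guidance under assumed constants (`alpha`, scale `s`), not the paper's exact SGG derivation, and `label_grad` is a placeholder for the gradient of a semantic-consistency score.

```python
import numpy as np

def guided_denoise_step(x, eps_hat, label_grad, alpha=0.9, s=0.5):
    # Recover the denoised estimate from the noise prediction, then
    # nudge it along the gradient of a semantic-consistency score
    # w.r.t. x (a placeholder array in this sketch).
    x_denoised = (x - np.sqrt(1 - alpha) * eps_hat) / np.sqrt(alpha)
    return x_denoised + s * label_grad

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
out = guided_denoise_step(x, eps_hat=0.1 * x,
                          label_grad=0.01 * rng.normal(size=(8, 8)))
print(out.shape)  # (8, 8)
```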
Mutual-Guided Dynamic Network for Image Fusion
Abstract
Image fusion aims to generate a high-quality image from multiple images captured under varying conditions. The key problem of this task is to preserve complementary information while filtering out irrelevant information for the fused result. However, existing methods address this problem with static convolutional neural networks (CNNs), which suffer from two inherent limitations during feature extraction: they cannot handle spatially-variant content, and they lack guidance from the multiple inputs. In this paper, we propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs. Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive feature extraction, composed of a mutual-guided cross-attention (MGCA) module and a dynamic filter predictor, where the former incorporates additional guidance from the other inputs and the latter generates spatially-variant kernels for different locations. In addition, we introduce a parallel feature fusion (PFF) module to effectively fuse the local and global information of the extracted features. To further reduce the redundancy among the extracted features while preserving their shared structural information, we devise a novel loss function that combines the minimization of normalized mutual information (NMI) with an estimated gradient mask. Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks. The code and model are publicly available at: https://github.com/Guanys-dar/MGDN.
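The "spatially-variant kernels" idea can be made concrete: a filter predictor emits a different 3x3 kernel at every pixel, and the output is the per-pixel weighted sum of the local neighborhood. The sketch below uses random kernels purely to demonstrate the mechanics; MGDF additionally conditions the predictor on the other input via cross-attention.

```python
import numpy as np

def apply_dynamic_filters(img, kernels):
    # img: (H, W); kernels: (H, W, 3, 3) -- one kernel per pixel.
    H, W = img.shape
    pad = np.pad(img, 1, mode="edge")
    out = np.empty_like(img)
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3]
            out[i, j] = np.sum(patch * kernels[i, j])
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8))
# In MGDN these kernels come from a predictor network conditioned on
# both inputs; random softmax-normalized kernels stand in here.
raw = rng.normal(size=(8, 8, 3, 3))
kernels = np.exp(raw) / np.exp(raw).sum(axis=(-2, -1), keepdims=True)
print(apply_dynamic_filters(img, kernels).shape)  # (8, 8)
```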
Hyperbolic Audio-visual Zero-shot Learning
Authors: Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, Lars Petersson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Audio-visual zero-shot learning aims to classify samples consisting of a pair of corresponding audio and video sequences from classes that are not present during training. An analysis of audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of a hyperbolic transformation for curvature-aware geometric learning, with the aim of exploring more complex hierarchical data structures for this task. The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in hyperbolic space. Additionally, we explore the use of multiple adaptive curvatures for hyperbolic projections. On this very challenging task, our proposed hyperbolic approach to zero-shot learning outperforms the SOTA methods on three datasets (VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL), achieving harmonic mean (HM) improvements of around 3.0%, 7.0%, and 5.3%, respectively.
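The geometric core is standard: map Euclidean embeddings onto the Poincaré ball with the exponential map at the origin and penalize the geodesic distance between paired audio and video features. The sketch below fixes the curvature at c = 1, whereas the paper learns multiple adaptive curvatures; the mean-distance loss is a minimal stand-in for the full objective.

```python
import numpy as np

def expmap0(v, c=1.0):
    # Map a tangent vector at the origin onto the Poincare ball of
    # curvature -c (standard formula).
    norm = np.linalg.norm(v, axis=-1, keepdims=True).clip(1e-9)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_dist(x, y, c=1.0):
    # Geodesic distance on the Poincare ball.
    diff2 = np.sum((x - y) ** 2, axis=-1)
    den = (1 - c * np.sum(x ** 2, -1)) * (1 - c * np.sum(y ** 2, -1))
    arg = 1 + 2 * c * diff2 / den
    return np.arccosh(np.clip(arg, 1.0, None)) / np.sqrt(c)

# Cross-modal alignment: project paired audio and video embeddings
# into the ball and penalize their geodesic distance.
rng = np.random.default_rng(0)
audio = expmap0(0.1 * rng.normal(size=(4, 16)))
video = expmap0(0.1 * rng.normal(size=(4, 16)))
align_loss = poincare_dist(audio, video).mean()
print(align_loss)
```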
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
Abstract
Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or single sub-instruction to the corresponding trajectory. However, another critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To achieve the adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and dialogue instructions (CVDN). The comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
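Of the three objectives, the entity-landmark semantic alignment is the easiest to sketch: an InfoNCE-style contrastive loss that pulls each entity-phrase embedding toward its grounded landmark embedding. Treating the alignment as InfoNCE with temperature 0.1 is an assumption; the paper's exact loss forms are not given in this digest.

```python
import numpy as np

def info_nce(phrase, landmark, tau=0.1):
    # Rows are embeddings of matched entity phrases and landmarks;
    # diagonal pairs are positives, everything else negatives.
    phrase = phrase / np.linalg.norm(phrase, axis=1, keepdims=True)
    landmark = landmark / np.linalg.norm(landmark, axis=1, keepdims=True)
    logits = phrase @ landmark.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```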
PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation
Authors: Haibo Jin, Haoxuan Che, Yi Lin, Hao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Abstract
Automatic medical report generation (MRG) is of great research value, as it has the potential to relieve radiologists of the heavy burden of report writing. Despite recent advances, accurate MRG remains challenging due to the need for precise clinical understanding and identification of clinical findings. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnostic performance unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on an encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts that explicitly guide the generation process. To further improve diagnostic accuracy, we design a cross-modal feature enhancement module, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, overcoming the text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, which obtains state-of-the-art clinical efficacy performance on both datasets.
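The logit-adjustment mechanic can be shown in a few lines: add a per-class offset (here, the log class prior, a common simplification) to the logits before cross-entropy, so rare diseases are not drowned out. The paper instead adapts the adjustment to each disease's learning status; `tau` and the frequency prior below are illustrative assumptions.

```python
import numpy as np

def logit_adjusted_ce(logits, labels, class_freq, tau=1.0):
    # Cross-entropy with per-class logit offsets: frequent classes get
    # a larger additive prior, shifting decisions toward rare classes.
    prior = class_freq / class_freq.sum()
    z = logits + tau * np.log(prior)
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))
labels = np.array([0, 1, 2, 0, 0, 0])
freq = np.array([100.0, 10.0, 1.0])   # imbalanced disease counts
print(logit_adjusted_ce(logits, labels, freq))
```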
Multitasking Evolutionary Algorithm Based on Adaptive Seed Transfer for Combinatorial Problem
Authors: Haoyuan Lv, Ruochen Liu
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Evolutionary computing (EC) is widely used for combinatorial optimization problems (COPs). Traditional EC methods can solve only a single task in a single run, while real-life scenarios often require solving multiple COPs simultaneously. In recent years, evolutionary multitasking optimization (EMTO) has become an emerging topic in the EC community, and many methods have been designed to handle multiple COPs concurrently by exchanging knowledge. However, many-task optimization, cross-domain knowledge transfer, and negative transfer remain significant challenges in this field. In this work, a new evolutionary multitasking algorithm based on adaptive seed transfer (MTEA-AST) is developed for multitasking COPs. First, a dimension unification strategy is proposed to unify the dimensions of different tasks. Then, an adaptive task selection strategy is designed to capture the similarity between the target task and other online optimization tasks; the calculated similarity is used to select suitable source tasks for the target one and to determine the transfer strength. Next, a task transfer strategy is established to select seeds from source tasks and to correct unsuitable knowledge in the seeds, suppressing negative transfer. Finally, the experimental results indicate that MTEA-AST can adaptively transfer knowledge in both same-domain and cross-domain many-task environments, and the proposed method shows competitive performance compared to other state-of-the-art EMTO methods in experiments consisting of four COPs.
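A toy sketch of adaptive seed transfer: estimate task similarity from population statistics in a unified search space, pick the most similar source task, and transfer its best individuals as seeds, with the number of seeds scaled by the similarity. The centroid-distance similarity and the seed-count rule are assumptions, not the paper's actual measures.

```python
import numpy as np

def task_similarity(pop_a, pop_b):
    # Similarity from the distance between population centroids in a
    # unified search space (illustrative; the paper's measure is richer).
    d = np.linalg.norm(pop_a.mean(axis=0) - pop_b.mean(axis=0))
    return 1.0 / (1.0 + d)

def transfer_seeds(source_pop, source_fit, k):
    # Take the k best (lowest-cost) source individuals as seeds.
    return source_pop[np.argsort(source_fit)[:k]]

rng = np.random.default_rng(0)
target = rng.random((10, 5))
sources = [rng.random((10, 5)) for _ in range(3)]
fits = [rng.random(10) for _ in range(3)]

sims = [task_similarity(target, s) for s in sources]
best = int(np.argmax(sims))                # adaptive source selection
k = max(1, round(4 * sims[best]))          # transfer strength ~ similarity
seeds = transfer_seeds(sources[best], fits[best], k)
print(best, seeds.shape)
```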
SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Authors: Xinyu Zhang, Yan Gong, Zhiwei Li, Xin Gao, Dafeng Jin, Jun Li, Huaping Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multi-modal fusion is increasingly being used for autonomous driving tasks, as images from different modalities provide unique information for feature extraction. However, existing two-stream networks fuse only at a specific network layer, which requires extensive manual tuning to set up. As the CNN deepens, the features of the two modalities become increasingly advanced and abstract, and fusing feature levels separated by a large gap can easily hurt performance. In this study, we propose a novel fusion architecture called skip-cross networks (SkipcrossNets), which adaptively combines LiDAR point clouds and camera images without being bound to a certain fusion epoch. Specifically, skip-cross connects each layer to each layer in a feed-forward manner: each layer takes the feature maps of all preceding layers of the other modality as input, and its own feature maps feed all subsequent layers of the other modality, enhancing feature propagation and multi-modal feature fusion. This strategy facilitates selection of the most similar feature layers from the two data pipelines, providing a complementary effect for sparse point cloud features during fusion. The network is also divided into several blocks to reduce the complexity of feature fusion and the number of model parameters. The advantages of skip-cross fusion are demonstrated on the KITTI and A2D2 datasets, with a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters require only 2.33 MB of memory at a speed of 68.24 FPS, which makes the model viable for mobile terminals and embedded devices.
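The dense cross-modal wiring can be sketched as follows: at every depth, each stream consumes its own latest features concatenated with all previous features of the other stream. The random-projection `layer` stand-in, feature sizes, and depth are assumptions; real SkipcrossNets blocks are convolutional.

```python
import numpy as np

def layer(x, seed):
    # Stand-in for one conv block: fixed random projection + ReLU.
    W = 0.1 * np.random.default_rng(seed).normal(size=(x.shape[-1], 8))
    return np.maximum(x @ W, 0.0)

def skip_cross_forward(lidar, camera, depth=3):
    feats = {"lidar": [lidar], "camera": [camera]}
    for d in range(depth):
        new = {}
        for me, other in (("lidar", "camera"), ("camera", "lidar")):
            # Own latest features + every previous layer of the other stream.
            inp = np.concatenate([feats[me][-1]] + feats[other], axis=-1)
            new[me] = layer(inp, seed=2 * d + (me == "camera"))
        for k in new:
            feats[k].append(new[k])
    return np.concatenate([feats["lidar"][-1], feats["camera"][-1]], axis=-1)

rng = np.random.default_rng(0)
fused = skip_cross_forward(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(fused.shape)  # (4, 16)
```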
Auto-weighted Bayesian Physics-Informed Neural Networks and robust estimations for multitask inverse problems in pore-scale imaging of dissolution
Abstract
In this article, we present a novel data assimilation strategy in pore-scale imaging and demonstrate that it makes it possible to robustly address reactive inverse problems incorporating Uncertainty Quantification (UQ). Pore-scale modeling of reactive flow offers a valuable opportunity to investigate the evolution of macro-scale properties subject to dynamic processes. Yet it suffers from imaging limitations arising from the associated X-ray microtomography (X-ray microCT) process, which induces discrepancies in the property estimates. Assessment of the kinetic parameters also raises challenges, as reactive coefficients are critical parameters that can span a wide range of values. We account for these two issues and ensure reliable calibration of pore-scale modeling, based on dynamical microCT images, by integrating uncertainty quantification into the workflow. The present method is based on a multitasking formulation of reactive inverse problems combining data-driven and physics-informed techniques in calcite dissolution. This allows quantifying morphological uncertainties on the porosity field and estimating reactive parameter ranges through prescribed PDE models with a latent concentration field and dynamical microCT data. The data assimilation strategy relies on sequential reinforcement, incorporating additional PDE constraints successively. We guarantee robust and unbiased uncertainty quantification through straightforward adaptive weighting of Bayesian Physics-Informed Neural Networks (BPINNs), ensuring reliable estimation of micro-porosity changes during geochemical transformations. We demonstrate successful Bayesian inference in 1D+time and 2D+time calcite dissolution based on synthetic microCT images, with meaningful posterior distributions on the reactive parameters and dimensionless numbers.
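The adaptive weighting can be illustrated with a crude inverse-variance rule: each loss term (data misfit, PDE residuals, ...) is weighted by the reciprocal of its current magnitude, so no single task dominates the multitask objective. This is a simplified stand-in for the paper's Bayesian treatment, where the weights arise from latent noise parameters.

```python
import numpy as np

def auto_weighted_loss(residuals):
    # Combine several residual terms with weights set adaptively from
    # their observed scales: terms that currently fluctuate more are
    # down-weighted (inverse-variance style), then normalized.
    losses = np.array([np.mean(r ** 2) for r in residuals])
    weights = 1.0 / np.maximum(losses, 1e-12)
    weights = weights / weights.sum()
    return float(np.sum(weights * losses)), weights

rng = np.random.default_rng(0)
data_res = rng.normal(scale=0.1, size=100)   # well-fit data term
pde_res = rng.normal(scale=1.0, size=100)    # poorly satisfied PDE term
total, w = auto_weighted_loss([data_res, pde_res])
print(total, w)
```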
Label Budget Allocation in Multi-Task Learning
Abstract
The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e. the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to propose and formally define the label budget allocation problem in multi-task learning and to empirically show that different budget allocation strategies make a big difference to its performance. We propose a Task-Adaptive Budget Allocation algorithm to robustly generate the optimal budget allocation adaptive to different multi-task learning settings. Specifically, we estimate and then maximize the extent of new information obtained from the allocated budget as a proxy for multi-task learning performance. Experiments on PASCAL VOC and Taskonomy demonstrate the efficacy of our approach over other widely used heuristic labeling strategies.
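The allocation problem itself is easy to state in code: given per-task label costs and an estimate of per-label information value, spend the budget greedily on the task with the best marginal information per unit cost. The diminishing-returns model below is an assumption standing in for the learned information estimate in the paper.

```python
import numpy as np

def allocate_budget(budget, cost, info_gain, step=1):
    # Greedily give the next `step` labels to the task with the best
    # estimated new-information-per-dollar, with marginal value
    # shrinking as a task accumulates labels.
    labels = np.zeros(len(cost), dtype=int)
    spent = 0.0
    while True:
        marginal = info_gain / (1.0 + labels) / cost
        t = int(np.argmax(marginal))
        if spent + cost[t] * step > budget:
            break
        labels[t] += step
        spent += cost[t] * step
    return labels, spent

labels, spent = allocate_budget(budget=100.0,
                                cost=np.array([1.0, 5.0]),      # per-label cost
                                info_gain=np.array([1.0, 3.0])) # per-label value
print(labels, spent)
```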
NeuralClothSim: Neural Deformation Fields Meet the Kirchhoff-Love Thin Shell Theory
Authors: Navami Kairanda, Marc Habermann, Christian Theobalt, Vladislav Golyanik
Abstract
Cloth simulation is an extensively studied problem, with a plethora of solutions available in computer graphics literature. Existing cloth simulators produce realistic cloth deformations that obey different types of boundary conditions. Nevertheless, their operational principle remains limited in several ways: They operate on explicit surface representations with a fixed spatial resolution, perform a series of discretised updates (which bounds their temporal resolution), and require comparably large amounts of storage. Moreover, back-propagating gradients through the existing solvers is often not straightforward, which poses additional challenges when integrating them into modern neural architectures. In response to the limitations mentioned above, this paper takes a fundamentally different perspective on physically-plausible cloth simulation and re-thinks this long-standing problem: We propose NeuralClothSim, i.e., a new cloth simulation approach using thin shells, in which surface evolution is encoded in neural network weights. Our memory-efficient and differentiable solver operates on a new continuous coordinate-based representation of dynamic surfaces, i.e., neural deformation fields (NDFs); it supervises NDF evolution with the rules of the non-linear Kirchhoff-Love shell theory. NDFs are adaptive in the sense that they 1) allocate their capacity to the deformation details as the latter arise during the cloth evolution and 2) allow surface state queries at arbitrary spatial and temporal resolutions without retraining. We show how to train our NeuralClothSim solver while imposing hard boundary conditions and demonstrate multiple applications, such as material interpolation and simulation editing. The experimental results highlight the effectiveness of our formulation and its potential impact.
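The neural-deformation-field interface is the easiest part to show: an MLP maps material coordinates (u, v) and time t to a displacement, so the same trained weights can be queried at any spatial or temporal resolution. The random weights and tiny architecture below are assumptions that only illustrate the continuous query interface, not a trained Kirchhoff-Love solver.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.5 * rng.normal(size=(3, 64)), np.zeros(64)
W2, b2 = 0.1 * rng.normal(size=(64, 3)), np.zeros(3)

def deformation_field(uvt):
    # Neural deformation field: (u, v, t) -> 3D displacement. A real
    # solver would train W1/W2 against thin-shell energies.
    h = np.tanh(uvt @ W1 + b1)
    return h @ W2 + b2

# Query the same surface at two resolutions without any retraining.
for n in (4, 32):
    u, v = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    uvt = np.stack([u.ravel(), v.ravel(), np.full(n * n, 0.5)], axis=-1)
    print(n, deformation_field(uvt).shape)
```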
Keyword: quantization
Quantized distributed Nash equilibrium seeking under DoS attacks: A quantized consensus based approach
Authors: Shuai Feng, Maojiao Ye, Lihua Xie, Shengyuan Xu
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Abstract
This paper studies distributed Nash equilibrium (NE) seeking under Denial-of-Service (DoS) attacks and quantization. The players can only exchange information with their own direct neighbors. The transmitted information is subject to quantization and packet losses induced by malicious DoS attacks. We propose a quantized distributed NE seeking strategy based on the approach of dynamic quantized consensus. To solve the quantizer saturation problem caused by DoS attacks, the quantization mechanism is equipped to have zooming-in and holding capabilities, in which the holding capability is consistent with the results in quantized consensus under DoS. A sufficient condition on the number of quantizer levels is provided, under which the quantizers are free from saturation under DoS attacks. The proposed distributed quantized NE seeking strategy is shown to have the so-called maximum resilience to DoS attacks. Namely, if the bound characterizing the maximum resilience is violated, an attacker can deny all the transmissions and hence distributed NE seeking is impossible.
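The zooming-in-with-holding mechanism can be sketched in a few lines: a uniform quantizer whose range shrinks after every successful transmission and is held (here, mildly grown) while DoS packet losses last, which is what prevents saturation. The zoom rates, attack probability, and level count below are illustrative assumptions, not the paper's derived bounds.

```python
import numpy as np

def quantize(x, center, L, levels):
    # Uniform quantizer with 2*levels+1 steps covering [center-L, center+L].
    q = np.round((x - center) / L * levels)
    return center + np.clip(q, -levels, levels) / levels * L

x_true, est, L = 0.8, 0.0, 4.0   # true value, decoder estimate, range
rng = np.random.default_rng(1)
for step in range(20):
    under_attack = rng.random() < 0.3   # packet lost to DoS
    if under_attack:
        L *= 1.05                       # hold/grow range: no saturation
    else:
        est = quantize(x_true, est, L, levels=8)
        L *= 0.7                        # zoom in after a success
print(round(est, 4), round(L, 4))       # estimate converges to x_true
```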
Keyword: efficient
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Eight-input optical programmable logic array enabled by parallel spectrum modulation
RemovalNet: DNN Fingerprint Removal Attacks
Vision Transformer Adapters for Generalizable Multitask Learning
FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features
Advance Simulation Method for Wheel-Terrain Interactions of Space Rovers: A Case Study on the UAE Rashid Rover
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Optimizing Neural Network Scale for ECG Classification
Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis
Incentive Mechanism Design for Federated Learning and Unlearning
Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval
Masked Autoencoders are Efficient Class Incremental Learners
Not Only Rewards But Also Constraints: Applications on Legged Robot Locomotion
SC-PSRO: A Unified Strategy Learning Method for Normal-form Games
Experience with Distributed Memory Delaunay-based Image-to-Mesh Conversion Implementation
Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions
HR-Pro: Point-supervised Temporal Action Localization via Hierarchical Reliability Propagation
Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization
Try with Simpler -- An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection
Hydrogen jet diffusion modeling by using physics-informed graph neural network and sparsely-distributed sensor data
An EPTAS for Cardinality Constrained Multiple Knapsack via Iterative Randomized Rounding
Capacity Analysis and Throughput Maximization of NOMA with Nonlinear Power Amplifier Distortion
Master-slave Deep Architecture for Top-K Multi-armed Bandits with Non-linear Bandit Feedback and Diversity Constraints
The key to the enhanced performance of slab-like topologically interlocked structures with non-planar blocks
An Efficient Data Analysis Method for Big Data using Multiple-Model Linear Regression
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models
DeepLOC: Deep Learning-based Bone Pathology Localization and Classification in Wrist X-ray Images
FastSurfer-HypVINN: Automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain MRI
Human Comprehensible Active Learning of Genome-Scale Metabolic Networks
Constructive Interference based Block-Level Precoding for Scene Expansion: Closed-Form Solutions
Reinforcement learning informed evolutionary search for autonomous systems testing
Towards Communication-Efficient Model Updating for On-Device Session-Based Recommendation
A Riemannian optimization method to compute the nearest singular pencil
DiCA: A Hardware-Software Co-Design for Differential Check-Pointing in Intermittently Powered Devices
Short Run Transit Route Planning Decision Support System Using a Deep Learning-Based Weighted Graph
Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques
Fast Adversarial Training with Smooth Convergence
Auto-weighted Bayesian Physics-Informed Neural Networks and robust estimations for multitask inverse problems in pore-scale imaging of dissolution
A highly efficient and accurate divergence-free spectral method for curl-curl equation in two and three dimensions
IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
A second-order length-preserving and unconditionally energy stable rotational discrete gradient method for Oseen-Frank gradient flows
Linear implicit approximations of invariant measures of semi-linear SDEs with non-globally Lipschitz coefficients
Large Language Models Vote: Prompting for Rare Disease Identification
Beyond Document Page Classification: Design, Datasets, and Challenges
Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark]
CDAN: Convolutional Dense Attention-guided Network for Low-light Image Enhancement
New time domain decomposition methods for parabolic control problems I: Dirichlet-Neumann and Neumann-Dirichlet algorithms
Efficient assessment of window views in high-rise, high-density urban areas using 3D color City Information Models
Towards Realistic Unsupervised Fine-tuning with CLIP
DLIP: Distilling Language-Image Pre-training
Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks
Motion-Guided Masking for Spatiotemporal Representation Learning
NeuralClothSim: Neural Deformation Fields Meet the Kirchhoff-Love Thin Shell Theory
Keyword: faster
Experience with Distributed Memory Delaunay-based Image-to-Mesh Conversion Implementation
Motion In-Betweening with Phase Manifolds
Reinforcement learning informed evolutionary search for autonomous systems testing
A note on improving the search of optimal prices in envy-free perfect matchings
Efficient assessment of window views in high-rise, high-density urban areas using 3D color City Information Models
Keyword: mobile
Fine-grained Spatio-Temporal Distribution Prediction of Mobile Content Delivery in 5G Ultra-Dense Networks
Deploying Deep Reinforcement Learning Systems: A Taxonomy of Challenges
BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
MOFA: A Model Simplification Roadmap for Image Restoration on Mobile Devices
Exploiting Time-Frequency Conformers for Music Audio Enhancement
Out of the Box Thinking: Improving Customer Lifetime Value Modelling via Expert Routing and Game Whale Detection
SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Keyword: pruning
There is no result
Keyword: diffusion
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
Augmenting medical image classifiers with synthetic data from latent diffusion models
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
False Information, Bots and Malicious Campaigns: Demystifying Elements of Social Media Manipulations
DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action Recognition
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
Hydrogen jet diffusion modeling by using physics-informed graph neural network and sparsely-distributed sensor data
Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights using Generative AI
Dense Text-to-Image Generation with Attention Modulation
Keyword: adaptive
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation
Mutual-Guided Dynamic Network for Image Fusion
Hyperbolic Audio-visual Zero-shot Learning
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation
Multitasking Evolutionary Algorithm Based on Adaptive Seed Transfer for Combinatorial Problem
SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Auto-weighted Bayesian Physics-Informed Neural Networks and robust estimations for multitask inverse problems in pore-scale imaging of dissolution
Label Budget Allocation in Multi-Task Learning
NeuralClothSim: Neural Deformation Fields Meet the Kirchhoff-Love Thin Shell Theory
Keyword: quantization
Quantized distributed Nash equilibrium seeking under DoS attacks: A quantized consensus based approach