New submissions for Fri, 8 Sep 23

Keyword: efficient

A 9 Transistor SRAM Featuring Array-level XOR Parallelism with Secure Data Toggling Operation

Authors: Zihan Yin, Annewsha Datta, Shwetha Vijayakumar, Ajey Jacob, Akhilesh Jaiswal
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2309.03204
Pdf link: https://arxiv.org/pdf/2309.03204
Abstract Security and energy-efficiency are critical for computing applications in general and for edge applications in particular. Digital in-Memory Computing (IMC) in SRAM cells have widely been studied to accelerate inference tasks to maximize both throughput and energy efficiency for intelligent computing at the edge. XOR operations have been of particular interest due to their wide applicability in numerous applications that include binary neural networks and encryption. However, existing IMC circuits for XOR acceleration are limited to two rows in a memory array and extending the XOR parallelism to multiple rows in an SRAM array has remained elusive. Further, SRAM is prone to both data imprinting and data remanence security issues, which poses limitations on security . Based on commerical Globalfoundries 22nm mode, we are proposing a novel 9T SRAM cell such that multiple rows of data (entire array) can be XORed in a massively parallel single cycle fashion. The new cell also supports data-toggling within the SRAM cell efficiently to circumvent imprinting attacks and erase the SRAM value in case of remanence attack.
Explainable and Trustworthy Traffic Sign Detection for Safe Autonomous Driving: An Inductive Logic Programming Approach
Authors: Zahra Chaghazardi (University of Surrey), Saber Fallah (University of Surrey), Alireza Tamaddoni-Nezhad (University of Surrey)
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2309.03215
Pdf link: https://arxiv.org/pdf/2309.03215
Abstract Traffic sign detection is a critical task in the operation of Autonomous Vehicles (AV), as it ensures the safety of all road users. Current DNN-based sign classification systems rely on pixel-level features to detect traffic signs and can be susceptible to adversarial attacks. These attacks involve small, imperceptible changes to a sign that can cause traditional classifiers to misidentify the sign. We propose an Inductive Logic Programming (ILP) based approach for stop sign detection in AVs to address this issue. This method utilises high-level features of a sign, such as its shape, colour, and text, to detect categories of traffic signs. This approach is more robust against adversarial attacks, as it mimics human-like perception and is less susceptible to the limitations of current DNN classifiers. We consider two adversarial attacking methods to evaluate our approach: Robust Physical Perturbation (PR2) and Adversarial Camouflage (AdvCam). These attacks are able to deceive DNN classifiers, causing them to misidentify stop signs as other signs with high confidence. The results show that the proposed ILP-based technique is able to correctly identify all targeted stop signs, even in the presence of PR2 and ADvCam attacks. The proposed learning method is also efficient as it requires minimal training data. Moreover, it is fully explainable, making it possible to debug AVs.
Companion Animal Disease Diagnostics based on Literal-aware Medical Knowledge Graph Representation Learning
Authors: Van Thuy Hoang, Sang Thanh Nguyen, Sangmyeong Lee, Jooho Lee, Luong Vuong Nguyen, O-Joun Lee
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03219
Pdf link: https://arxiv.org/pdf/2309.03219
Abstract Knowledge graph (KG) embedding has been used to benefit the diagnosis of animal diseases by analyzing electronic medical records (EMRs), such as notes and veterinary records. However, learning representations to capture entities and relations with literal information in KGs is challenging as the KGs show heterogeneous properties and various types of literal information. Meanwhile, the existing methods mostly aim to preserve graph structures surrounding target nodes without considering different types of literals, which could also carry significant information. In this paper, we propose a knowledge graph embedding model for the efficient diagnosis of animal diseases, which could learn various types of literal information and graph structure and fuse them into unified representations, namely LiteralKG. Specifically, we construct a knowledge graph that is built from EMRs along with literal information collected from various animal hospitals. We then fuse different types of entities and node feature information into unified vector representations through gate networks. Finally, we propose a self-supervised learning task to learn graph structure in pretext tasks and then towards various downstream tasks. Experimental results on link prediction tasks demonstrate that our model outperforms the baselines that consist of state-of-the-art models. The source code is available at https://github.com/NSLab-CUK/LiteralKG.
SPAIC: A sub-$μ$W/Channel, 16-Channel General-Purpose Event-Based Analog Front-End with Dual-Mode Encoders
Authors: Shyam Narayanan, Matteo Cartiglia, Arianna Rubino, Charles Lego, Charlotte Frenkel, Giacomo Indiveri
Subjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2309.03221
Pdf link: https://arxiv.org/pdf/2309.03221
Abstract Low-power event-based analog front-ends (AFE) are a crucial component required to build efficient end-to-end neuromorphic processing systems for edge computing. Although several neuromorphic chips have been developed for implementing spiking neural networks (SNNs) and solving a wide range of sensory processing tasks, there are only a few general-purpose analog front-end devices that can be used to convert analog sensory signals into spikes and interfaced to neuromorphic processors. In this work, we present a novel, highly configurable analog front-end chip, denoted as SPAIC (signal-to-spike converter for analog AI computation), that offers a general-purpose dual-mode analog signal-to-spike encoding with delta modulation and pulse frequency modulation, with tunable frequency bands. The ASIC is designed in a 180 nm process. It supports and encodes a wide variety of signals spanning 4 orders of magnitude in frequency, and provides an event-based output that is compatible with existing neuromorphic processors. We validated the ASIC for its functions and present initial silicon measurement results characterizing the basic building blocks of the chip.
Retail store customer behavior analysis system: Design and Implementation
Authors: Tuan Dinh Nguyen, Keisuke Hihara, Tung Cao Hoang, Yumeka Utada, Akihiko Torii, Naoki Izumi, Nguyen Thanh Thuy, Long Quoc Tran
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2309.03232
Pdf link: https://arxiv.org/pdf/2309.03232
Abstract Understanding customer behavior in retail stores plays a crucial role in improving customer satisfaction by adding personalized value to services. Behavior analysis reveals both general and detailed patterns in the interaction of customers with a store items and other people, providing store managers with insight into customer preferences. Several solutions aim to utilize this data by recognizing specific behaviors through statistical visualization. However, current approaches are limited to the analysis of small customer behavior sets, utilizing conventional methods to detect behaviors. They do not use deep learning techniques such as deep neural networks, which are powerful methods in the field of computer vision. Furthermore, these methods provide limited figures when visualizing the behavioral data acquired by the system. In this study, we propose a framework that includes three primary parts: mathematical modeling of customer behaviors, behavior analysis using an efficient deep learning based system, and individual and group behavior visualization. Each module and the entire system were validated using data from actual situations in a retail store.
RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation
Authors: Hengyue Liu, Bir Bhanu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03240
Pdf link: https://arxiv.org/pdf/2309.03240
Abstract Scene Graph Generation (SGG) has achieved significant progress recently. However, most previous works rely heavily on fixed-size entity representations based on bounding box proposals, anchors, or learnable queries. As each representation's cardinality has different trade-offs between performance and computation overhead, extracting highly representative features efficiently and dynamically is both challenging and crucial for SGG. In this work, a novel architecture called RepSGG is proposed to address the aforementioned challenges, formulating a subject as queries, an object as keys, and their relationship as the maximum attention weight between pairwise queries and keys. With more fine-grained and flexible representation power for entities and relationships, RepSGG learns to sample semantically discriminative and representative points for relationship inference. Moreover, the long-tailed distribution also poses a significant challenge for generalization of SGG. A run-time performance-guided logit adjustment (PGLA) strategy is proposed such that the relationship logits are modified via affine transformations based on run-time performance during training. This strategy encourages a more balanced performance between dominant and rare classes. Experimental results show that RepSGG achieves the state-of-the-art or comparable performance on the Visual Genome and Open Images V6 datasets with fast inference speed, demonstrating the efficacy and efficiency of the proposed methods.
Testing properties of distributions in the streaming model
Authors: Sampriti Roy, Yadu Vasudev
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03245
Pdf link: https://arxiv.org/pdf/2309.03245
Abstract We study distribution testing in the standard access model and the conditional access model when the memory available to the testing algorithm is bounded. In both scenarios, the samples appear in an online fashion and the goal is to test the properties of distribution using an optimal number of samples subject to a memory constraint on how many samples can be stored at a given time. First, we provide a trade-off between the sample complexity and the space complexity for testing identity when the samples are drawn according to the conditional access oracle. We then show that we can learn a succinct representation of a monotone distribution efficiently with a memory constraint on the number of samples that are stored that is almost optimal. We also show that the algorithm for monotone distributions can be extended to a larger class of decomposable distributions.
Graph Theory Applications in Advanced Geospatial Research
Authors: Surajit Ghosh, Archita Mallick, Anuva Chowdhury, Kounik De Sarkar
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Geophysics (physics.geo-ph)
Arxiv link: https://arxiv.org/abs/2309.03249
Pdf link: https://arxiv.org/pdf/2309.03249
Abstract Geospatial sciences include a wide range of applications, from environmental monitoring transportation to infrastructure planning, as well as location-based analysis and services. Graph theory algorithms in mathematics have emerged as indispensable tools in these domains due to their capability to model and analyse spatial relationships efficiently. This technical report explores the applications of graph theory algorithms in geospatial sciences, highlighting their role in network analysis, spatial connectivity, geographic information systems, and various other spatial problem-solving scenarios. It provides a comprehensive idea about the key concepts and algorithms of graph theory that assist the modelling processes. The report provides insights into the practical significance of graph theory in addressing real-world geospatial challenges and opportunities. It lists the extensive research, innovative technologies and methodologies implemented in this field.
Scalable Learning of Intrusion Responses through Recursive Decomposition
Authors: Kim Hammar, Rolf Stadler
Subjects: Systems and Control (eess.SY); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03292
Pdf link: https://arxiv.org/pdf/2309.03292
Abstract We study automated intrusion response for an IT infrastructure and formulate the interaction between an attacker and a defender as a partially observed stochastic game. To solve the game we follow an approach where attack and defense strategies co-evolve through reinforcement learning and self-play toward an equilibrium. Solutions proposed in previous work prove the feasibility of this approach for small infrastructures but do not scale to realistic scenarios due to the exponential growth in computational complexity with the infrastructure size. We address this problem by introducing a method that recursively decomposes the game into subgames which can be solved in parallel. Applying optimal stopping theory we show that the best response strategies in these subgames exhibit threshold structures, which allows us to compute them efficiently. To solve the decomposed game we introduce an algorithm called Decompositional Fictitious Self-Play (DFSP), which learns Nash equilibria through stochastic approximation. We evaluate the learned strategies in an emulation environment where real intrusions and response actions can be executed. The results show that the learned strategies approximate an equilibrium and that DFSP significantly outperforms a state-of-the-art algorithm for a realistic infrastructure configuration.
A Novel Approach for Invoice Management using Blockchain
Authors: Nikhil Sontakke, Shivansh Rastogi, Sejal Utekar, Shriraj Sonawane
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03303
Pdf link: https://arxiv.org/pdf/2309.03303
Abstract Electronic invoicing is another area where blockchain technology is being used. Additionally, it has the power to alter how payments are made, invoices are issued, and transactions are validated. Using a blockchain-based invoicing system will enable smooth payments from a customer's digital wallet to a business's digital wallet. Transactions are simple to track and monitor, and the blockchain may be used to retrieve an exchange's full history. Sometimes shopkeepers create fake bills and submit them to the higher tax-paying authorities. To bring transparency to this billing system between customers, shopkeepers, and tax-paying authorities billing system using blockchain is to be implemented using the concept of Blockchain and make the billing system in our country work smoothly. Blockchain technology can revolutionize the invoicing and payment process by providing a secure, transparent and tamper-proof system. A blockchain-based billing system can facilitate smooth payments, allow for easy tracking and monitoring of transactions, and provide a tamper-proof history of all exchanges. The use of blockchain can prevent fraud and increase transparency among customers, shopkeepers, and tax-paying authorities. Furthermore, it can streamline the process by using digital wallets for both customers and businesses, reducing time and resources for traditional invoicing methods. Overall, blockchain technology can bring greater efficiency and trust to the billing system, benefiting all parties involved. It can prevent fraud, increase transparency and streamline the invoicing and payment process. This technology can create a more secure and efficient billing system ultimately benefiting all parties involved.
Adaptive Sampling of 3D Spatial Correlations for Focus+Context Visualization
Authors: Christoph Neuhauser, Josef Stumpfegger, Rüdiger Westermann
Subjects: Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2309.03308
Pdf link: https://arxiv.org/pdf/2309.03308
Abstract Visualizing spatial structures in 3D ensembles is challenging due to the vast amounts of information that need to be conveyed. Memory and time constraints make it unfeasible to pre-compute and store the correlations between all pairs of domain points. We propose the embedding of adaptive correlation sampling into chord diagrams with hierarchical edge bundling to alleviate these constraints. Entities representing spatial regions are arranged along the circular chord layout via a space-filling curve, and Bayesian optimal sampling is used to efficiently estimate the maximum occurring correlation between any two points from different regions. Hierarchical edge bundling reduces visual clutter and emphasizes the major correlation structures. By selecting an edge, the user triggers a focus diagram in which only the two regions connected via this edge are refined and arranged in a specific way in a second chord layout. For visualizing correlations between two different variables, which are not symmetric anymore, we switch to showing a full correlation matrix. This avoids drawing the same edges twice with different correlation values. We introduce GPU implementations of both linear and non-linear correlation measures to further reduce the time that is required to generate the context and focus views, and to even enable the analysis of correlations in a 1000-member ensemble.
REBOOT: Reuse Data for Bootstrapping Efficient Real-World Dexterous Manipulation
Authors: Zheyuan Hu, Aaron Rovinsky, Jianlan Luo, Vikash Kumar, Abhishek Gupta, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2309.03322
Pdf link: https://arxiv.org/pdf/2309.03322
Abstract Dexterous manipulation tasks involving contact-rich interactions pose a significant challenge for both model-based control systems and imitation learning algorithms. The complexity arises from the need for multi-fingered robotic hands to dynamically establish and break contacts, balance non-prehensile forces, and control large degrees of freedom. Reinforcement learning (RL) offers a promising approach due to its general applicability and capacity to autonomously acquire optimal manipulation strategies. However, its real-world application is often hindered by the necessity to generate a large number of samples, reset the environment, and obtain reward signals. In this work, we introduce an efficient system for learning dexterous manipulation skills with RL to alleviate these challenges. The main idea of our approach is the integration of recent advances in sample-efficient RL and replay buffer bootstrapping. This combination allows us to utilize data from different tasks or objects as a starting point for training new tasks, significantly improving learning efficiency. Additionally, our system completes the real-world training cycle by incorporating learned resets via an imitation-based pickup policy as well as learned reward functions, eliminating the need for manual resets and reward engineering. We demonstrate the benefits of reusing past data as replay buffer initialization for new tasks, for instance, the fast acquisition of intricate manipulation skills in the real world on a four-fingered robotic hand. (Videos: https://sites.google.com/view/reboot-dexterous)
MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation
Authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Quang-Thuc Nguyen, Minh-Triet Tran, Ngan Le
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03329
Pdf link: https://arxiv.org/pdf/2309.03329
Abstract Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e. polyp itself) and the background (surrounding tissue) is difficult. To mitigate these challenges, we propose Multi-Scale Edge-Guided Attention Network (MEGANet) tailored specifically for polyp segmentation within colonoscopy images. This network draws inspiration from the fusion of a classical edge detection technique with an attention mechanism. By combining these techniques, MEGANet effectively preserves high-frequency information, notably edges and boundaries, which tend to erode as neural networks deepen. MEGANet is designed as an end-to-end framework, encompassing three key modules: an encoder, which is responsible for capturing and abstracting the features from the input image, a decoder, which focuses on salient features, and the Edge-Guided Attention module (EGA) that employs the Laplacian Operator to accentuate polyp boundaries. Extensive experiments, both qualitative and quantitative, on five benchmark datasets, demonstrate that our EGANet outperforms other existing SOTA methods under six evaluation metrics. Our code is available at \url{https://github.com/DinhHieuHoang/MEGANet}
Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation
Authors: Arvind Krishna Sridhar, Yinyi Guo, Erik Visser, Rehana Mahfuz
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2309.03340
Pdf link: https://arxiv.org/pdf/2309.03340
Abstract There has been significant research on developing pretrained transformer architectures for multimodal-to-text generation tasks. Albeit performance improvements, such models are frequently overparameterized, hence suffer from hallucination and large memory footprint making them challenging to deploy on edge devices. In this paper, we address both these issues for the application of automated audio captioning. First, we propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination. Then, we propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data. During the beam decoding step, the smaller model utilizes an audio-text shared latent representation to semantically align the generated text with corresponding input audio. Faithful guidance is introduced into the beam probability by incorporating the cosine similarity between latent representation projections of greedy rolled out intermediate beams and audio clip. We show the efficacy of our algorithm on benchmark datasets and evaluate the proposed scheme against baselines using conventional audio captioning and semantic similarity metrics while illustrating tradeoffs between performance and complexity.
ViewMix: Augmentation for Robust Representation in Self-Supervised Learning
Authors: Arjon Das, Xin Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03360
Pdf link: https://arxiv.org/pdf/2309.03360
Abstract Joint Embedding Architecture-based self-supervised learning methods have attributed the composition of data augmentations as a crucial factor for their strong representation learning capabilities. While regional dropout strategies have proven to guide models to focus on lesser indicative parts of the objects in supervised methods, it hasn't been adopted by self-supervised methods for generating positive pairs. This is because the regional dropout methods are not suitable for the input sampling process of the self-supervised methodology. Whereas dropping informative pixels from the positive pairs can result in inefficient training, replacing patches of a specific object with a different one can steer the model from maximizing the agreement between different positive pairs. Moreover, joint embedding representation learning methods have not made robustness their primary training outcome. To this end, we propose the ViewMix augmentation policy, specially designed for self-supervised learning, upon generating different views of the same image, patches are cut and pasted from one view to another. By leveraging the different views created by this augmentation strategy, multiple joint embedding-based self-supervised methodologies obtained better localization capability and consistently outperformed their corresponding baseline methods. It is also demonstrated that incorporating ViewMix augmentation policy promotes robustness of the representations in the state-of-the-art methods. Furthermore, our experimentation and analysis of compute times suggest that ViewMix augmentation doesn't introduce any additional overhead compared to other counterparts.
Self-Supervised Masked Digital Elevation Models Encoding for Low-Resource Downstream Tasks
Authors: Priyam Mazumdar, Aiman Soliman, Volodymyr Kindratenko, Luigi Marini, Kenton McHenry
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2309.03367
Pdf link: https://arxiv.org/pdf/2309.03367
Abstract The lack of quality labeled data is one of the main bottlenecks for training Deep Learning models. As the task increases in complexity, there is a higher penalty for overfitting and unstable learning. The typical paradigm employed today is Self-Supervised learning, where the model attempts to learn from a large corpus of unstructured and unlabeled data and then transfer that knowledge to the required task. Some notable examples of self-supervision in other modalities are BERT for Large Language Models, Wav2Vec for Speech Recognition, and the Masked AutoEncoder for Vision, which all utilize Transformers to solve a masked prediction task. GeoAI is uniquely poised to take advantage of the self-supervised methodology due to the decades of data collected, little of which is precisely and dependably annotated. Our goal is to extract building and road segmentations from Digital Elevation Models (DEM) that provide a detailed topography of the earths surface. The proposed architecture is the Masked Autoencoder pre-trained on ImageNet (with the limitation that there is a large domain discrepancy between ImageNet and DEM) with an UperNet Head for decoding segmentations. We tested this model with 450 and 50 training images only, utilizing roughly 5% and 0.5% of the original data respectively. On the building segmentation task, this model obtains an 82.1% Intersection over Union (IoU) with 450 Images and 69.1% IoU with only 50 images. On the more challenging road detection task the model obtains an 82.7% IoU with 450 images and 73.2% IoU with only 50 images. Any hand-labeled dataset made today about the earths surface will be immediately obsolete due to the constantly changing nature of the landscape. This motivates the clear necessity for data-efficient learners that can be used for a wide variety of downstream tasks.
Towards Solving Industry-Grade Surrogate Modeling Problems using Physics Informed Machine Learning
Authors: Saakaar Bhatnagar, Andrew Comerford, Araz Banaeizadeh
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2309.03374
Pdf link: https://arxiv.org/pdf/2309.03374
Abstract Deep learning combined with physics-based modeling represents an attractive and efficient approach for producing accurate and robust surrogate modeling. In this paper, a new framework that utilizes Physics Informed Neural Networks (PINN) to solve PDE-based problems for the creation of surrogate models for steady-state flow-thermal engineering design applications is introduced. The surrogate models developed through this framework are demonstrated on several use cases from electronics cooling to biomechanics. Additionally, it is demonstrated how these trained surrogate models can be combined with design optimization methods to improve the efficiency and reduced the cost of the design process. The former is shown through several realistic 3D examples and the latter via a detailed cost-benefit trade off. Overall, the findings of this paper demonstrate that hybrid data-PINN surrogate models combined with optimization algorithms can solve realistic design optimization and have potential in a wide variety of application areas.
A New Proper Orthogonal Decomposition Method with Second Difference Quotients for the Wave Equation
Authors: Andrew Janes, John R. Singler
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2309.03375
Pdf link: https://arxiv.org/pdf/2309.03375
Abstract Recently, researchers have investigated the relationship between proper orthogonal decomposition (POD), difference quotients (DQs), and pointwise in time error bounds for POD reduced order models of partial differential equations. In a recent work (Eskew and Singler, Adv. Comput. Math., 49, 2023, no. 2, Paper No. 13), a new approach to POD with DQs was developed that is more computationally efficient than the standard DQ POD approach and it also retains the guaranteed pointwise in time error bounds of the standard method. In this work, we extend this new DQ POD approach to the case of second difference quotients (DDQs). Specifically, a new POD method utilizing DDQs and only one snapshot and one DQ is developed and used to prove ROM error bounds for the damped wave equation. This new approach eliminates data redundancy in the standard DDQ POD approach that uses all of the snapshots, DQs, and DDQs. We show that this new DDQ approach also has pointwise in time data error bounds similar to DQ POD and use it to prove pointwise and energy ROM error bounds. We provide numerical results for the POD errors and ROM errors to demonstrate the theoretical results. We also explore an application of POD to simulating ROMs past the training interval for collecting the snapshot data for the standard POD approach and the DDQ POD method.
Efficient Baselines for Motion Prediction in Autonomous Driving
Authors: Carlos Gómez-Huélamo, Marcos V. Conde, Rafael Barea, Manuel Ocaña, Luis M. Bergasa
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2309.03387
Pdf link: https://arxiv.org/pdf/2309.03387
Abstract Motion Prediction (MP) of multiple surroundings agents is a crucial task in arbitrarily complex environments, from simple robots to Autonomous Driving Stacks (ADS). Current techniques tackle this problem using end-to-end pipelines, where the input data is usually a rendered top-view of the physical information and the past trajectories of the most relevant agents; leveraging this information is a must to obtain optimal performance. In that sense, a reliable ADS must produce reasonable predictions on time. However, despite many approaches use simple ConvNets and LSTMs to obtain the social latent features, State-Of-The-Art (SOTA) models might be too complex for real-time applications when using both sources of information (map and past trajectories) as well as little interpretable, specially considering the physical information. Moreover, the performance of such models highly depends on the number of available inputs for each particular traffic scenario, which are expensive to obtain, particularly, annotated High-Definition (HD) maps. In this work, we propose several efficient baselines for the well-known Argoverse 1 Motion Forecasting Benchmark. We aim to develop compact models using SOTA techniques for MP, including attention mechanisms and GNNs. Our lightweight models use standard social information and interpretable map information such as points from the driveable area and plausible centerlines by means of a novel preprocessing step based on kinematic constraints, in opposition to black-box CNN-based or too-complex graphs methods for map encoding, to generate plausible multimodal trajectories achieving up-to-pair accuracy with less operations and parameters than other SOTA methods. Our code is publicly available at https://github.com/Cram3r95/mapfe4mp .
Are SNNs Truly Energy-efficient? $-$ A Hardware Perspective
Authors: Abhiroop Bhattacharjee, Ruokai Yin, Abhishek Moitra, Priyadarshini Panda
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2309.03388
Pdf link: https://arxiv.org/pdf/2309.03388
Abstract Spiking Neural Networks (SNNs) have gained attention for their energy-efficient machine learning capabilities, utilizing bio-inspired activation functions and sparse binary spike-data representations. While recent SNN algorithmic advances achieve high accuracy on large-scale computer vision tasks, their energy-efficiency claims rely on certain impractical estimation metrics. This work studies two hardware benchmarking platforms for large-scale SNN inference, namely SATA and SpikeSim. SATA is a sparsity-aware systolic-array accelerator, while SpikeSim evaluates SNNs implemented on In-Memory Computing (IMC) based analog crossbars. Using these tools, we find that the actual energy-efficiency improvements of recent SNN algorithmic works differ significantly from their estimated values due to various hardware bottlenecks. We identify and address key roadblocks to efficient SNN deployment on hardware, including repeated computations & data movements over timesteps, neuronal module overhead, and vulnerability of SNNs towards crossbar non-idealities.
Requirements Analysis of Variability Constraints in a Configurable Flight Software System
Authors: Chin Khor, Robyn Lutz
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2309.03392
Pdf link: https://arxiv.org/pdf/2309.03392
Abstract Variability constraints are an integral part of the requirements for a configurable system. The constraints specified in the requirements on the legal combinations of options define the space of potential valid configurations for the system-to-be. This paper reports on our experience with the variability-related requirements constraints of a flight software framework used by multiple space missions. A challenge that we saw for practitioners using the current framework, now open-sourced, is that the specifications of its variability-related requirements and constraints are dispersed across several documents, rather than being centralized in the software requirements specification. Such dispersion can contribute to misunderstandings of the side-effects of design choices, increased effort for developers, and bugs during operations. Based on our experience, we propose a new software variability model, similar to a product-line feature model, in the flight software framework. We describe the structured technique by which our model is developed, demonstrate its use, and evaluate it on a key service module of the flight software. Results show that our lightweight modeling technique helped find missing and inconsistent variability-related requirements and constraints. More generally, we suggest that a variability modeling technique such as this can be an efficient way for developers to centralize the specification and improve the analysis of dispersed variability-related requirements and constraints in other configurable systems.
Predicting Defective Visual Code Changes in a Multi-Language AAA Video Game Project
Authors: Kalvin Eng, Abram Hindle, Alexander Senchenko
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2309.03414
Pdf link: https://arxiv.org/pdf/2309.03414
Abstract Video game development increasingly relies on using visual programming languages as the primary way to build video game features. The aim of using visual programming is to move game logic into the hands of game designers, who may not be as well versed in textual coding. In this paper, we empirically observe that there are more defect-inducing commits containing visual code than textual code in a AAA video game project codebase. This indicates that the existing textual code Just-in-Time (JIT) defect prediction models under evaluation by Electronic Arts (EA) may be ineffective as they do not account for changes in visual code. Thus, we focus our research on constructing visual code defect prediction models that encompass visual code metrics and evaluate the models against defect prediction models that use language agnostic features, and textual code metrics. We test our models using features extracted from the historical codebase of a AAA video game project, as well as the historical codebases of 70 open source projects that use textual and visual code. We find that defect prediction models have better performance overall in terms of the area under the ROC curve (AUC), and Mathews Correlation Coefficient (MCC) when incorporating visual code features for projects that contain more commits with visual code than textual code.
RIS-Assisted Wireless Communications: Long-Term versus Short-Term Phase Shift Designs
Authors: Trinh Van Chien, Lam Thanh Tu, Waqas Khalid, Heejung Yu, Symeon Chatzinotas, Marco Di Renzo
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2309.03436
Pdf link: https://arxiv.org/pdf/2309.03436
Abstract Reconfigurable intelligent surface (RIS) has recently gained significant interest as an emerging technology for future wireless networks thanks to its potential for improving the coverage probability in challenging propagation environments. This paper studies an RIS-assisted propagation environment, where a source transmits data to a destination in the presence of a weak direct link. We analyze and compare RIS designs based on long-term and short-term channel statistics in terms of coverage probability and ergodic rate. For the considered optimization designs, we derive closed-form expressions for the coverage probability and ergodic rate, which explicitly unveil the impact of both the propagation environment and the RIS on the system performance. Besides the optimization of the RIS phase profile, we formulate an RIS placement optimization problem with the aim of maximizing the coverage probability by relying only on partial channel state information. An efficient algorithm is proposed based on the gradient ascent method. Simulation results are illustrated in order to corroborate the analytical framework and findings. The proposed RIS phase profile is shown to outperform several heuristic benchmarks in terms of outage probability and ergodic rate. In addition, the proposed RIS placement strategy provides an extra degree of freedom that remarkably improves system performance.
Perceptual Quality Assessment of 360$^\circ$ Images Based on Generative Scanpath Representation
Authors: Xiangjie Sui, Hanwei Zhu, Xuelin Liu, Yuming Fang, Shiqi Wang, Zhou Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2309.03472
Pdf link: https://arxiv.org/pdf/2309.03472
Abstract Despite substantial efforts dedicated to the design of heuristic models for omnidirectional (i.e., 360$^\circ$) image quality assessment (OIQA), a conspicuous gap remains due to the lack of consideration for the diversity of viewing behaviors that leads to the varying perceptual quality of 360$^\circ$ images. Two critical aspects underline this oversight: the neglect of viewing conditions that significantly sway user gaze patterns and the overreliance on a single viewport sequence from the 360$^\circ$ image for quality inference. To address these issues, we introduce a unique generative scanpath representation (GSR) for effective quality inference of 360$^\circ$ images, which aggregates varied perceptual experiences of multi-hypothesis users under a predefined viewing condition. More specifically, given a viewing condition characterized by the starting point of viewing and exploration time, a set of scanpaths consisting of dynamic visual fixations can be produced using an apt scanpath generator. Following this vein, we use the scanpaths to convert the 360$^\circ$ image into the unique GSR, which provides a global overview of gazed-focused contents derived from scanpaths. As such, the quality inference of the 360$^\circ$ image is swiftly transformed to that of GSR. We then propose an efficient OIQA computational framework by learning the quality maps of GSR. Comprehensive experimental results validate that the predictions of the proposed framework are highly consistent with human perception in the spatiotemporal domain, especially in the challenging context of locally distorted 360$^\circ$ images under varied viewing conditions. The code will be released at https://github.com/xiangjieSui/GSR
Temporal Collection and Distribution for Referring Video Object Segmentation
Authors: Jiajin Tang, Ge Zheng, Sibei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03473
Pdf link: https://arxiv.org/pdf/2309.03473
Abstract Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing video-level referent according to the language expression, while the latter serves to better locate and segment objects with each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries. Specifically, the temporal collection mechanism collects global information for the referent token from object queries to the temporal motions to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and object queries in every frame. Experimental results show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.
HOPPER: Interpretative Fuzzing for Libraries
Authors: Peng Chen, Yuxuan Xie, Yunlong Lyu, Yuxiao Wang, Hao Chen
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03496
Pdf link: https://arxiv.org/pdf/2309.03496
Abstract Despite the fact that the state-of-the-art fuzzers can generate inputs efficiently, existing fuzz drivers still cannot adequately cover entries in libraries. Most of these fuzz drivers are crafted manually by developers, and their quality depends on the developers' understanding of the code. Existing works have attempted to automate the generation of fuzz drivers by learning API usage from code and execution traces. However, the generated fuzz drivers are limited to a few specific call sequences by the code being learned. To address these challenges, we present HOPPER, which can fuzz libraries without requiring any domain knowledge to craft fuzz drivers. It transforms the problem of library fuzzing into the problem of interpreter fuzzing. The interpreters linked against libraries under test can interpret the inputs that describe arbitrary API usage. To generate semantically correct inputs for the interpreter, HOPPER learns the intra- and inter-API constraints in the libraries and mutates the program with grammar awareness. We implemented HOPPER and evaluated its effectiveness on 11 real-world libraries against manually crafted fuzzers and other automatic solutions. Our results show that HOPPER greatly outperformed the other fuzzers in both code coverage and bug finding, having uncovered 25 previously unknown bugs that other fuzzers couldn't. Moreover, we have demonstrated that the proposed intra- and inter-API constraint learning methods can correctly learn constraints implied by the library and, therefore, significantly improve the fuzzing efficiency. The experiment results indicate that HOPPER is able to explore a vast range of API usages for library fuzzing out of the box.
Dynamic Frame Interpolation in Wavelet Domain
Authors: Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Ying Tai, Chengjie Wang, Jie Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03508
Pdf link: https://arxiv.org/pdf/2309.03508
Abstract Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, that can result in lots of inefficient computation. On the other hand, the computation compression degree in frame interpolation is highly dependent on both texture distribution and scene motion, which demands to understand the spatial-temporal information of each input frame pair for a better compression degree selection. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation in each scale are determined by a crucial threshold ratio. Instead of setting a fixed value like previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On the common high resolution and animation frame interpolation benchmarks, proposed WaveletVFI can reduce computation up to 40% while maintaining similar accuracy, making it perform more efficiently against other state-of-the-arts. Code is available at https://github.com/ltkong218/WaveletVFI.
Learning Compact Compositional Embeddings via Regularized Pruning for Recommendation
Authors: Xurong Liang, Tong Chen, Quoc Viet Hung Nguyen, Jianxin Li, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2309.03518
Pdf link: https://arxiv.org/pdf/2309.03518
Abstract Latent factor models are the dominant backbones of contemporary recommender systems (RSs) given their performance advantages, where a unique vector embedding with a fixed dimensionality (e.g., 128) is required to represent each entity (commonly a user/item). Due to the large number of users and items on e-commerce sites, the embedding table is arguably the least memory-efficient component of RSs. For any lightweight recommender that aims to efficiently scale with the growing size of users/items or to remain applicable in resource-constrained settings, existing solutions either reduce the number of embeddings needed via hashing, or sparsify the full embedding table to switch off selected embedding dimensions. However, as hash collision arises or embeddings become overly sparse, especially when adapting to a tighter memory budget, those lightweight recommenders inevitably have to compromise their accuracy. To this end, we propose a novel compact embedding framework for RSs, namely Compositional Embedding with Regularized Pruning (CERP). Specifically, CERP represents each entity by combining a pair of embeddings from two independent, substantially smaller meta-embedding tables, which are then jointly pruned via a learnable element-wise threshold. In addition, we innovatively design a regularized pruning mechanism in CERP, such that the two sparsified meta-embedding tables are encouraged to encode information that is mutually complementary. Given the compatibility with agnostic latent factor models, we pair CERP with two popular recommendation models for extensive experiments, where results on two real-world datasets under different memory budgets demonstrate its superiority against state-of-the-art baselines. The codebase of CERP is available in https://github.com/xurong-liang/CERP.
DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph Partitioning by Chunks
Authors: Fahao Chen, Peng Li, Celimuge Wu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2309.03523
Pdf link: https://arxiv.org/pdf/2309.03523
Abstract Dynamic Graph Neural Network (DGNN) has shown a strong capability of learning dynamic graphs by exploiting both spatial and temporal features. Although DGNN has recently received considerable attention by AI community and various DGNN models have been proposed, building a distributed system for efficient DGNN training is still challenging. It has been well recognized that how to partition the dynamic graph and assign workloads to multiple GPUs plays a critical role in training acceleration. Existing works partition a dynamic graph into snapshots or temporal sequences, which only work well when the graph has uniform spatio-temporal structures. However, dynamic graphs in practice are not uniformly structured, with some snapshots being very dense while others are sparse. To address this issue, we propose DGC, a distributed DGNN training system that achieves a 1.25x - 7.52x speedup over the state-of-the-art in our testbed. DGC's success stems from a new graph partitioning method that partitions dynamic graphs into chunks, which are essentially subgraphs with modest training workloads and few inter connections. This partitioning algorithm is based on graph coarsening, which can run very fast on large graphs. In addition, DGC has a highly efficient run-time, powered by the proposed chunk fusion and adaptive stale aggregation techniques. Extensive experimental results on 3 typical DGNN models and 4 popular dynamic graph datasets are presented to show the effectiveness of DGC.
Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs
Authors: Arne Moos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2309.03530
Pdf link: https://arxiv.org/pdf/2309.03530
Abstract This paper proposes a novel approach for detecting objects using mobile robots in the context of the RoboCup Standard Platform League, with a primary focus on detecting the ball. The challenge lies in detecting a dynamic object in varying lighting conditions and blurred images caused by fast movements. To address this challenge, the paper presents a convolutional neural network architecture designed specifically for computationally constrained robotic platforms. The proposed CNN is trained to achieve high precision classification of single objects in image patches and to determine their precise spatial positions. The paper further integrates Early Exits into the existing high-precision CNN architecture to reduce the computational cost of easily rejectable cases in the background class. The training process involves a composite loss function based on confidence and positional losses with dynamic weighting and data augmentation. The proposed approach achieves a precision of 100% on the validation dataset and a recall of almost 87%, while maintaining an execution time of around 170 $\mu$s per hypotheses. By combining the proposed approach with an Early Exit, a runtime optimization of more than 28%, on average, can be achieved compared to the original CNN. Overall, this paper provides an efficient solution for an enhanced detection of objects, especially the ball, in computationally constrained robotic platforms.
A new numerical mesoscopic scale one-domain approach solver for free fluid/porous medium interaction
Authors: Costanza Arico, Rainer Helmig, Daniele Puleo, Martin Schneider
Subjects: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2309.03543
Pdf link: https://arxiv.org/pdf/2309.03543
Abstract A new numerical continuum \textit{one-domain} approach (ODA) solver is presented for the simulation of the transfer processes between a free fluid and a porous medium. The solver is developed in the \textit{mesoscopic} scale framework, where a continuous variation of the physical parameters of the porous medium (e.g., porosity and permeability) is assumed. The Navier-Stokes-Brinkman equations are solved along with the continuity equation, under the hypothesis of incompressible fluid. The porous medium is assumed to be fully saturated and can potentially be anisotropic. The domain is discretized with unstructured meshes allowing local refinements. A fractional time step procedure is applied, where one predictor and two corrector steps are solved within each time iteration. The predictor step is solved in the framework of a marching in space and time procedure, with some important numerical advantages. The two corrector steps require the solution of large linear systems, whose matrices are sparse, symmetric and positive definite, with $\mathcal{M}$-matrix property over Delaunay-meshes. A fast and efficient solution is obtained using a preconditioned conjugate gradient method. The discretization adopted for the two corrector steps can be regarded as a Two-Point-Flux-Approximation (TPFA) scheme, which, unlike the standard TPFA schemes, does not require the grid mesh to be $\mathbf{K}$-orthogonal, (with $\mathbf{K}$ the anisotropy tensor). As demonstrated with the provided test cases, the proposed scheme correctly retains the anisotropy effects within the porous medium. Furthermore, it overcomes the restrictions of existing mesoscopic scale one-domain approachs proposed in the literature.
MVD:A Novel Methodology and Dataset for Acoustic Vehicle Type Classification
Authors: Mohd Ashhad, Omar Ahmed, Sooraj K. Ambat, Zeeshan Ali Haq, Mansaf Alam
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2309.03544
Pdf link: https://arxiv.org/pdf/2309.03544
Abstract Rising urban populations have led to a surge in vehicle use and made traffic monitoring and management indispensable. Acoustic traffic monitoring (ATM) offers a cost-effective and efficient alternative to more computationally expensive methods of monitoring traffic such as those involving computer vision technologies. In this paper, we present MVD and MVDA: two open datasets for the development of acoustic traffic monitoring and vehicle-type classification algorithms, which contain audio recordings of moving vehicles. The dataset contain four classes- Trucks, Cars, Motorbikes, and a No-vehicle class. Additionally, we propose a novel and efficient way to accurately classify these acoustic signals using cepstrum and spectrum based local and global audio features, and a multi-input neural network. Experimental results show that our methodology improves upon the established baselines of previous works and achieves an accuracy of 91.98% and 96.66% on MVD and MVDA Datasets, respectively. Finally, the proposed model was deployed through an Android application to make it accessible for testing and demonstrate its efficacy.
Region Generation and Assessment Network for Occluded Person Re-Identification
Authors: Shuting He, Weihua Chen, Kai Wang, Hao Luo, Fan Wang, Wei Jiang, Henghui Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03558
Pdf link: https://arxiv.org/pdf/2309.03558
Abstract Person Re-identification (ReID) plays a more and more crucial role in recent years with a wide range of applications. Existing ReID methods are suffering from the challenges of misalignment and occlusions, which degrade the performance dramatically. Most methods tackle such challenges by utilizing external tools to locate body parts or exploiting matching strategies. Nevertheless, the inevitable domain gap between the datasets utilized for external tools and the ReID datasets and the complicated matching process make these methods unreliable and sensitive to noises. In this paper, we propose a Region Generation and Assessment Network (RGANet) to effectively and efficiently detect the human body regions and highlight the important regions. In the proposed RGANet, we first devise a Region Generation Module (RGM) which utilizes the pre-trained CLIP to locate the human body regions using semantic prototypes extracted from text descriptions. Learnable prompt is designed to eliminate domain gap between CLIP datasets and ReID datasets. Then, to measure the importance of each generated region, we introduce a Region Assessment Module (RAM) that assigns confidence scores to different regions and reduces the negative impact of the occlusion regions by lower scores. The RAM consists of a discrimination-aware indicator and an invariance-aware indicator, where the former indicates the capability to distinguish from different identities and the latter represents consistency among the images of the same class of human body regions. Extensive experimental results for six widely-used benchmarks including three tasks (occluded, partial, and holistic) demonstrate the superiority of RGANet against state-of-the-art methods.
Enhancing 5G Radio Planning with Graph Representations and Deep Learning
Authors: Paul Almasan, José Suárez-Varela, Andra Lutu, Albert Cabellos-Aparicio, Pere Barlet-Ros
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2309.03603
Pdf link: https://arxiv.org/pdf/2309.03603
Abstract The roll out of new mobile network generations poses hard challenges due to various factors such as cost-benefit tradeoffs, existing infrastructure, and new technology aspects. In particular, one of the main challenges for the 5G deployment lies in optimal 5G radio coverage while accounting for diverse service performance metrics. This paper introduces a Deep Learning-based approach to assist in 5G radio planning by utilizing data from previous-generation cells. Our solution relies on a custom graph representation to leverage the information available from existing cells, and employs a Graph Neural Network (GNN) model to process such data efficiently. In our evaluation, we test its potential to model the transition from 4G to 5G NSA using real-world data from a UK mobile network operator. The experimental results show that our solution achieves high accuracy in predicting key performance indicators in new 5G cells, with a Mean Absolute Percentage Error (MAPE)~<17\% when evaluated on samples from the same area where it was trained. Moreover, we test its generalization capability over various geographical areas not included in the training, achieving a MAPE~<19\%. This suggests beneficial properties for achieving robust solutions applicable to 5G planning in new areas without the need of retraining.
Spiking Structured State Space Model for Monaural Speech Enhancement
Authors: Yu Du, Xu Liu, Yansong Chua
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2309.03641
Pdf link: https://arxiv.org/pdf/2309.03641
Abstract Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: efficiently using information in long speech sequences and high computational costs. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4). This approach merges the energy efficiency of Spiking Neural Networks (SNN) with the long-range sequence modeling capabilities of Structured State Space Models (S4), offering a compelling solution. Evaluation on the DNS Challenge and VoiceBank+Demand Datasets confirms that Spiking-S4 rivals existing Artificial Neural Network (ANN) methods but with fewer computational resources, as evidenced by reduced parameters and Floating Point Operations (FLOPs).
Formal Verification of Chase-Lev Deque in Concurrent Separation Logic
Authors: Jaemin Choi
Subjects: Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2309.03642
Pdf link: https://arxiv.org/pdf/2309.03642
Abstract Chase-Lev deque is a concurrent data structure designed for efficient load balancing in multiprocessor scheduling. It employs a work-stealing strategy, where each thread possesses its own work-stealing deque to store tasks, and idle threads steal tasks from other threads. However, given the inherent risk of bugs in software, particularly in a multiprocessor environment, it is crucial to formally establish the correctness of programs and data structures. To our knowledge, no formal verification work for the Chase-Lev deque has met three key criteria: (1) utilizing a minimal trusted computing base, (2) using a realistic and unrestricted implementation, and (3) proving a strong specification. In this thesis, we address this gap by presenting the formal verification of the Chase-Lev deque using a concurrent separation logic. Our work is mechanized in the Coq proof assistant, and our verified implementation is both realistic and unbounded in terms of the number of tasks it can handle. Also, we adopt linearizability as the specification, as it is widely recognized as a strong specification for concurrent data structures. Consequently, our work satisfies all three aforementioned criteria for formal verification. Additionally, we extend our verification to support safe memory reclamation, and provide a basis for verifying the Chase-Lev deque in the relaxed memory model.
Characterizing Lipschitz Stability of GNN for Fairness
Authors: Yaning Jia, Chunhui Zhang, Jundong Li, Chuxu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2309.03648
Pdf link: https://arxiv.org/pdf/2309.03648
Abstract The Lipschitz bound, a technique from robust statistics, can limit the maximum changes in the output concerning the input, taking into account associated irrelevant biased factors. It is an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. Recently, Graph Neural Networks (GNNs), which operate on non-Euclidean data, have gained significant attention. However, no previous research has investigated the GNN Lipschitz bounds to shed light on stabilizing model outputs, especially when working on non-Euclidean data with inherent biases. Given the inherent biases in common graph data used for GNN training, it poses a serious challenge to constraining the GNN output perturbations induced by input biases, thereby safeguarding fairness during training. Recently, despite the Lipschitz constant's use in controlling the stability of Euclideanneural networks, the calculation of the precise Lipschitz constant remains elusive for non-Euclidean neural networks like GNNs, especially within fairness contexts. To narrow this gap, we begin with the general GNNs operating on an attributed graph, and formulate a Lipschitz bound to limit the changes in the output regarding biases associated with the input. Additionally, we theoretically analyze how the Lipschitz constant of a GNN model could constrain the output perturbations induced by biases learned from data for fairness training. We experimentally validate the Lipschitz bound's effectiveness in limiting biases of the model output. Finally, from a training dynamics perspective, we demonstrate why the theoretical Lipschitz bound can effectively guide the GNN training to better trade-off between accuracy and fairness.
Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains
Authors: Peiyang Li, Ye Wang, Qi Li, Zhuotao Liu, Ke Xu, Ju Ren, Zhiying Liu, Ruilin Lin
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03660
Pdf link: https://arxiv.org/pdf/2309.03660
Abstract Recently unsupervised machine learning based systems have been developed to detect zero-day Web attacks, which can effectively enhance existing Web Application Firewalls (WAFs). However, prior arts only consider detecting attacks on specific domains by training particular detection models for the domains. These systems require a large amount of training data, which causes a long period of time for model training and deployment. In this paper, we propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains in an organization with limited training data. Specifically, it utilizes meta-learning to share knowledge across these domains, e.g., the relationship between HTTP requests in heterogeneous domains, to efficiently train detection models. Moreover, we develop an adaptive preprocessing module to facilitate semantic analysis of Web requests across different domains and design a multi-domain representation method to capture semantic correlations between different domains for cross-domain model training. We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests. The experimental results demonstrate that RETSINA outperforms the existing unsupervised Web attack detection methods with limited training data, e.g., RETSINA needs only 5-minute training data to achieve comparable detection performance to the existing methods that train separate models for different domains using 1-day training data. We also conduct real-world deployment in an Internet company. RETSINA captures on average 126 and 218 zero-day attack requests per day in two domains, respectively, in one month.
How adversarial attacks can disrupt seemingly stable accurate classifiers
Authors: Oliver J. Sutton, Qinghua Zhou, Ivan Y. Tyukin, Alexander N. Gorban, Alexander Bastounis, Desmond J. Higham
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2309.03665
Pdf link: https://arxiv.org/pdf/2309.03665
Abstract Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple generic and generalisable framework for which key behaviours observed in practical systems arise with high probability -- notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
Short-Term Load Forecasting Using A Particle-Swarm Optimized Multi-Head Attention-Augmented CNN-LSTM Network
Authors: Paapa Kwesi Quansah
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2309.03694
Pdf link: https://arxiv.org/pdf/2309.03694
Abstract Short-term load forecasting is of paramount importance in the efficient operation and planning of power systems, given its inherent non-linear and dynamic nature. Recent strides in deep learning have shown promise in addressing this challenge. However, these methods often grapple with hyperparameter sensitivity, opaqueness in interpretability, and high computational overhead for real-time deployment. In this paper, I propose a novel solution that surmounts these obstacles. Our approach harnesses the power of the Particle-Swarm Optimization algorithm to autonomously explore and optimize hyperparameters, a Multi-Head Attention mechanism to discern the salient features crucial for accurate forecasting, and a streamlined framework for computational efficiency. Our method undergoes rigorous evaluation using a genuine electricity demand dataset. The results underscore its superiority in terms of accuracy, robustness, and computational efficiency. Notably, our Mean Absolute Percentage Error of 1.9376 marks a significant advancement over existing state-of-the-art approaches, heralding a new era in short-term load forecasting.
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory
Authors: Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, Yang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03696
Pdf link: https://arxiv.org/pdf/2309.03696
Abstract Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and the high computational cost and time required to handle long-tailed distributions of HOIs in complex HOI scenes in realistic settings. This observation motivates us to design an HOI detector that can be trained even with long-tailed labeled data and can leverage existing knowledge from pre-trained models. Inspired by the powerful generalization ability of the large Vision-Language Models (VLM) on classification and retrieval tasks, we propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Its second mode incorporates an instance-aware adapter mechanism that can further efficiently boost performance if updating a lightweight set of parameters can be afforded. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time. Code can be found at https://github.com/ltttpku/ADA-CM.
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption
Authors: Teng Hu, Jiangning Zhang, Liang Liu, Ran Yi, Siqi Kou, Haokun Zhu, Xu Chen, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03729
Pdf link: https://arxiv.org/pdf/2309.03729
Abstract Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion.
Medoid Silhouette clustering with automatic cluster number selection
Authors: Lars Lenssen, Erich Schubert
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2309.03751
Pdf link: https://arxiv.org/pdf/2309.03751
Abstract The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, provide two fast versions for the direct optimization, and discuss the use to choose the optimal number of clusters. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvements FasterPAM. One of the versions guarantees equal results to the original variant and provides a run speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k$=100, we observed a 10464$\times$ speedup compared to the original PAMMEDSIL algorithm. Additionally, we provide a variant to choose the optimal number of clusters directly.
Extending Transductive Knowledge Graph Embedding Models for Inductive Logical Relational Inference
Authors: Thomas Gebhart, John Cobb
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2309.03773
Pdf link: https://arxiv.org/pdf/2309.03773
Abstract Many downstream inference tasks for knowledge graphs, such as relation prediction, have been handled successfully by knowledge graph embedding techniques in the transductive setting. To address the inductive setting wherein new entities are introduced into the knowledge graph at inference time, more recent work opts for models which learn implicit representations of the knowledge graph through a complex function of a network's subgraph structure, often parametrized by graph neural network architectures. These come at the cost of increased parametrization, reduced interpretability and limited generalization to other downstream inference tasks. In this work, we bridge the gap between traditional transductive knowledge graph embedding approaches and more recent inductive relation prediction models by introducing a generalized form of harmonic extension which leverages representations learned through transductive embedding methods to infer representations of new entities introduced at inference time as in the inductive setting. This harmonic extension technique provides the best such approximation, can be implemented via an efficient iterative scheme, and can be employed to answer a family of conjunctive logical queries over the knowledge graph, further expanding the capabilities of transductive embedding methods. In experiments on a number of large-scale knowledge graph embedding benchmarks, we find that this approach for extending the functionality of transductive knowledge graph embedding models to perform knowledge graph completion and answer logical queries in the inductive setting is competitive with--and in some scenarios outperforms--several state-of-the-art models derived explicitly for such inductive tasks.
CPU frequency scheduling of real-time applications on embedded devices with temporal encoding-based deep reinforcement learning
Authors: Ti Zhou, Man Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Operating Systems (cs.OS); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2309.03779
Pdf link: https://arxiv.org/pdf/2309.03779
Abstract Small devices are frequently used in IoT and smart-city applications to perform periodic dedicated tasks with soft deadlines. This work focuses on developing methods to derive efficient power-management methods for periodic tasks on small devices. We first study the limitations of the existing Linux built-in methods used in small devices. We illustrate three typical workload/system patterns that are challenging to manage with Linux's built-in solutions. We develop a reinforcement-learning-based technique with temporal encoding to derive an effective DVFS governor even with the presence of the three system patterns. The derived governor uses only one performance counter, the same as the built-in Linux mechanism, and does not require an explicit task model for the workload. We implemented a prototype system on the Nvidia Jetson Nano Board and experimented with it with six applications, including two self-designed and four benchmark applications. Under different deadline constraints, our approach can quickly derive a DVFS governor that can adapt to performance requirements and outperform the built-in Linux approach in energy saving. On Mibench workloads, with performance slack ranging from 0.04 s to 0.4 s, the proposed method can save 3% - 11% more energy compared to Ondemand. AudioReg and FaceReg applications tested have 5%- 14% energy-saving improvement. We have open-sourced the implementation of our in-kernel quantized neural network engine. The codebase can be found at: https://github.com/coladog/tinyagent.
Managing the Uncertainty in System Dynamics Through Distributionally Robust Stability-Constrained Optimization
Authors: Zhongda Chu, Fei Teng
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2309.03798
Pdf link: https://arxiv.org/pdf/2309.03798
Abstract With the increasing penetration of Inverter-Based Resources (IBRs) and their impact on power system stability and operation, the concept of stability-constrained optimization has drawn significant attentions from researchers. In order to manage the parametric uncertainty due to inaccurate modeling that influences the system dynamics, this work proposes a distributionally robust stability constraint formulation, where the propagation mechanism from uncertainty of the system dynamic parameters to the stability constraint coefficients is established and managed. Since these coefficients are connected to the uncertain parameters through highly nonlinear and implicit functions, an approximation approach utilizing Taylor expansion and the Delta method is developed to estimate the statistical moments of the stability constraint coefficients based on the first and second-order derivatives, with which an ambiguity set for the distributionally robust optimization can be formulated. The accuracy of the uncertainty propagation as well as the effectiveness of the distributionally robust stability constraints are demonstrated through detailed case studies in the modified IEEE 39-bus system.
Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck
Authors: Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2309.03800
Pdf link: https://arxiv.org/pdf/2309.03800
Abstract This work investigates the nuanced algorithm design choices for deep learning in the presence of computational-statistical gaps. We begin by considering offline sparse parity learning, a supervised classification problem which admits a statistical query lower bound for gradient-based training of a multilayer perceptron. This lower bound can be interpreted as a multi-resource tradeoff frontier: successful learning can only occur if one is sufficiently rich (large model), knowledgeable (large dataset), patient (many training iterations), or lucky (many random guesses). We show, theoretically and experimentally, that sparse initialization and increasing network width yield significant improvements in sample efficiency in this setting. Here, width plays the role of parallel search: it amplifies the probability of finding "lottery ticket" neurons, which learn sparse features more sample-efficiently. Finally, we show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning. We demonstrate improved sample efficiency on tabular classification benchmarks by using wide, sparsely-initialized MLP models; these networks sometimes outperform tuned random forests.
Mapping of CNNs on multi-core RRAM-based CIM architectures
Authors: Rebecca Pelke, Nils Bosbach, Jose Cubero, Felix Staudigl, Rainer Leupers, Jan Moritz Joseph
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2309.03805
Pdf link: https://arxiv.org/pdf/2309.03805
Abstract RRAM-based multi-core systems improve the energy efficiency and performance of CNNs. Thereby, the distributed parallel execution of convolutional layers causes critical data dependencies that limit the potential speedup. This paper presents synchronization techniques for parallel inference of convolutional layers on RRAM-based CIM architectures. We propose an architecture optimization that enables efficient data exchange and discuss the impact of different architecture setups on the performance. The corresponding compiler algorithms are optimized for high speedup and low memory consumption during CNN inference. We achieve more than 99% of the theoretical acceleration limit with a marginal data transmission overhead of less than 4% for state-of-the-art CNN benchmarks.
On the Reduction of the Spherical Point-in-Polygon Problem for Antipode-Excluding Spherical Polygons
Authors: Ziqiang Li, Jindi Sun
Subjects: Computational Geometry (cs.CG); General Topology (math.GN)
Arxiv link: https://arxiv.org/abs/2309.03822
Pdf link: https://arxiv.org/pdf/2309.03822
Abstract Spherical polygons used in practice are nice, but the spherical point-in-polygon problem (SPiP) has long eluded solutions based on the winding number (wn). That a punctured sphere is simply connected is to blame. As a workaround, we prove that requiring the boundary of a spherical polygon to never intersect its antipode is sufficient to reduce its SPiP problem to the planar, point-in-polygon (PiP) problem, whose state-of-the-art solution uses wn and does not utilize known interior points (KIP). We refer to such spherical polygons as boundary antipode-excluding (BAE) and show that all spherical polygons fully contained within an open hemisphere is BAE. We document two successful reduction methods, one based on rotation and the other on shearing, and address a common concern. Both reduction algorithms, when combined with a wn-PiP algorithm, solve SPiP correctly and efficiently for BAE spherical polygons. The MATLAB code provided demonstrates scenarios that are problematic for previous work.
On Large Language Models' Selection Bias in Multi-Choice Questions
Authors: Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2309.03882
Pdf link: https://arxiv.org/pdf/2309.03882
Abstract Multi-choice questions (MCQs) serve as a common yet important task format in the research of large language models (LLMs). Our work shows that LLMs exhibit an inherent "selection bias" in MCQs, which refers to LLMs' preferences to select options located at specific positions (like "Option C"). This bias is prevalent across various LLMs, making their performance vulnerable to option position changes in MCQs. We identify that one primary cause resulting in selection bias is option numbering, i.e., the ID symbols A/B/C/D associated with the options. To mitigate selection bias, we propose a new method called PriDe. PriDe first decomposes the observed model prediction distribution into an intrinsic prediction over option contents and a prior distribution over option IDs. It then estimates the prior by permutating option contents on a small number of test samples, which is used to debias the subsequent test samples. We demonstrate that, as a label-free, inference-time method, PriDe achieves a more effective and computation-efficient debiasing than strong baselines. We further show that the priors estimated by PriDe generalize well across different domains, highlighting its practical potential in broader scenarios.
ProPainter: Improving Propagation and Transformer for Video Inpainting
Authors: Shangchen Zhou, Chongyi Li, Kelvin C.K. Chan, Chen Change Loy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03897
Pdf link: https://arxiv.org/pdf/2309.03897
Abstract Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
Keyword: faster

A Circuit Domain Generalization Framework for Efficient Logic Synthesis in Chip Design
Authors: Zhihai Wang, Lei Chen, Jie Wang, Xing Li, Yinqi Bai, Xijun Li, Mingxuan Yuan, Jianye Hao, Yongdong Zhang, Feng Wu
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2309.03208
Pdf link: https://arxiv.org/pdf/2309.03208
Abstract Logic Synthesis (LS) plays a vital role in chip design -- a cornerstone of the semiconductor industry. A key task in LS is to transform circuits -- modeled by directed acyclic graphs (DAGs) -- into simplified circuits with equivalent functionalities. To tackle this task, many LS operators apply transformations to subgraphs -- rooted at each node on an input DAG -- sequentially. However, we found that a large number of transformations are ineffective, which makes applying these operators highly time-consuming. In particular, we notice that the runtime of the Resub and Mfs2 operators often dominates the overall runtime of LS optimization processes. To address this challenge, we propose a novel data-driven LS operator paradigm, namely PruneX, to reduce ineffective transformations. The major challenge of developing PruneX is to learn models that well generalize to unseen circuits, i.e., the out-of-distribution (OOD) generalization problem. Thus, the major technical contribution of PruneX is the novel circuit domain generalization framework, which learns domain-invariant representations based on the transformation-invariant domain-knowledge. To the best of our knowledge, PruneX is the first approach to tackle the OOD problem in LS operators. We integrate PruneX with the aforementioned Resub and Mfs2 operators. Experiments demonstrate that PruneX significantly improves their efficiency while keeping comparable optimization performance on industrial and very large-scale circuits, achieving up to $3.1\times$ faster runtime.
ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles
Authors: Irfan Tito Kurniawan, Bambang Riyanto Trilaksono
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03734
Pdf link: https://arxiv.org/pdf/2309.03734
Abstract Thanks to the complementary nature of millimeter wave radar and camera, deep learning-based radar-camera 3D object detection methods may reliably produce accurate detections even in low-visibility conditions. This makes them preferable to use in autonomous vehicles' perception systems, especially as the combined cost of both sensors is cheaper than the cost of a lidar. Recent radar-camera methods commonly perform feature-level fusion which often involves projecting the radar points onto the same plane as the image features and fusing the extracted features from both modalities. While performing fusion on the image plane is generally simpler and faster, projecting radar points onto the image plane flattens the depth dimension of the point cloud which might lead to information loss and makes extracting the spatial features of the point cloud harder. We proposed ClusterFusion, an architecture that leverages the local spatial features of the radar point cloud by clustering the point cloud and performing feature extraction directly on the point cloud clusters before projecting the features onto the image plane. ClusterFusion achieved the state-of-the-art performance among all radar-monocular camera methods on the test slice of the nuScenes dataset with 48.7% nuScenes detection score (NDS). We also investigated the performance of different radar feature extraction strategies on point cloud clusters: a handcrafted strategy, a learning-based strategy, and a combination of both, and found that the handcrafted strategy yielded the best performance. The main goal of this work is to explore the use of radar's local spatial and point-wise features by extracting them directly from radar point cloud clusters for a radar-monocular camera 3D object detection method that performs cross-modal feature fusion on the image plane.
Medoid Silhouette clustering with automatic cluster number selection
Authors: Lars Lenssen, Erich Schubert
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2309.03751
Pdf link: https://arxiv.org/pdf/2309.03751
Abstract The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, provide two fast versions for the direct optimization, and discuss the use to choose the optimal number of clusters. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvements FasterPAM. One of the versions guarantees equal results to the original variant and provides a run speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k$=100, we observed a 10464$\times$ speedup compared to the original PAMMEDSIL algorithm. Additionally, we provide a variant to choose the optimal number of clusters directly.
Convergence Analysis of Decentralized ASGD
Authors: Mauro DL Tosi, Martin Theobald
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2309.03754
Pdf link: https://arxiv.org/pdf/2309.03754
Abstract Over the last decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, the optimization of large models via SGD still is a time-consuming task. To reduce training time, it is common to distribute the training process across multiple devices. Recently, it has been shown that the convergence of asynchronous SGD (ASGD) will always be faster than mini-batch SGD. However, despite these improvements in the theoretical bounds, most ASGD convergence-rate proofs still rely on a centralized parameter server, which is prone to become a bottleneck when scaling out the gradient computations across many distributed processes. In this paper, we present a novel convergence-rate analysis for decentralized and asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor restrictive network topologies. Specifically, we provide a bound of $\mathcal{O}(\sigma\epsilon^{-2}) + \mathcal{O}(QS{avg}\epsilon^{-3/2}) + \mathcal{O}(S{avg}\epsilon^{-1})$ for the convergence rate of DASGD, where $S{avg}$ is the average staleness between models, $Q$ is a constant that bounds the norm of the gradients, and $\epsilon$ is a (small) error that is allowed within the bound. Furthermore, when gradients are not bounded, we prove the convergence rate of DASGD to be $\mathcal{O}(\sigma\epsilon^{-2}) + \mathcal{O}(\sqrt{\hat{S}{avg}\hat{S}{max}}\epsilon^{-1})$, with $\hat{S}{max}$ and $\hat{S}_{avg}$ representing a loose version of the average and maximum staleness, respectively. Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, and L-smooth objective function. We anticipate that our results will be of high relevance for the adoption of DASGD by a broad community of researchers and developers.
Keyword: mobile

MALITE: Lightweight Malware Detection and Classification for Constrained Devices
Authors: Sidharth Anand, Barsha Mitra, Soumyadeep Dey, Abhinav Rao, Rupsa Dhar, Jaideep Vaidya
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03294
Pdf link: https://arxiv.org/pdf/2309.03294
Abstract Today, malware is one of the primary cyberthreats to organizations. Malware has pervaded almost every type of computing device including the ones having limited memory, battery and computation power such as mobile phones, tablets and embedded devices like Internet-of-Things (IoT) devices. Consequently, the privacy and security of the malware infected systems and devices have been heavily jeopardized. In recent years, researchers have leveraged machine learning based strategies for malware detection and classification. Malware analysis approaches can only be employed in resource constrained environments if the methods are lightweight in nature. In this paper, we present MALITE, a lightweight malware analysis system, that can classify various malware families and distinguish between benign and malicious binaries. MALITE converts a binary into a gray scale or an RGB image and employs low memory and battery power consuming as well as computationally inexpensive malware analysis strategies. We have designed MALITE-MN, a lightweight neural network based architecture and MALITE-HRF, an ultra lightweight random forest based method that uses histogram features extracted by a sliding window. We evaluate the performance of both on six publicly available datasets (Malimg, Microsoft BIG, Dumpware10, MOTIF, Drebin and CICAndMal2017), and compare them to four state-of-the-art malware classification techniques. The results show that MALITE-MN and MALITE-HRF not only accurately identify and classify malware but also respectively consume several orders of magnitude lower resources (in terms of both memory as well as computation capabilities), making them much more suitable for resource constrained environments.
Resource Management for IRS-assisted WP-MEC Networks with Practical Phase Shift Model
Authors: Nana Li, Wanming Hao, Fuhui Zhou, Zheng Chu, Shouyi Yang, Pei Xiao
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2309.03471
Pdf link: https://arxiv.org/pdf/2309.03471
Abstract Wireless powered mobile edge computing (WP-MEC) has been recognized as a promising solution to enhance the computational capability and sustainable energy supply for low-power wireless devices (WDs). However, when the communication links between the hybrid access point (HAP) and WDs are hostile, the energy transfer efficiency and task offloading rate are compromised. To tackle this problem, we propose to employ multiple intelligent reflecting surfaces (IRSs) to WP-MEC networks. Based on the practical IRS phase shift model, we formulate a total computation rate maximization problem by jointly optimizing downlink/uplink IRSs passive beamforming, downlink energy beamforming and uplink multi-user detection (MUD) vector at HAPs, task offloading power and local computing frequency of WDs, and the time slot allocation. Specifically, we first derive the optimal time allocation for downlink wireless energy transmission (WET) to IRSs and the corresponding energy beamforming. Next, with fixed time allocation for the downlink WET to WDs, the original optimization problem can be divided into two independent subproblems. For the WD charging subproblem, the optimal IRSs passive beamforming is derived by utilizing the successive convex approximation (SCA) method and the penalty-based optimization technique, and for the offloading computing subproblem, we propose a joint optimization framework based on the fractional programming (FP) method. Finally, simulation results validate that our proposed optimization method based on the practical phase shift model can achieve a higher total computation rate compared to the baseline schemes.
Password-Stealing without Hacking: Wi-Fi Enabled Practical Keystroke Eavesdropping
Authors: Jingyang Hu, Hongbo Wang, Tianyue Zheng, Jingzhi Hu, Zhe Chen, Hongbo Jiang, Jun Luo
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03492
Pdf link: https://arxiv.org/pdf/2309.03492
Abstract The contact-free sensing nature of Wi-Fi has been leveraged to achieve privacy breaches, yet existing attacks relying on Wi-Fi CSI (channel state information) demand hacking Wi-Fi hardware to obtain desired CSIs. Since such hacking has proven prohibitively hard due to compact hardware, its feasibility in keeping up with fast-developing Wi-Fi technology becomes very questionable. To this end, we propose WiKI-Eve to eavesdrop keystrokes on smartphones without the need for hacking. WiKI-Eve exploits a new feature, BFI (beamforming feedback information), offered by latest Wi-Fi hardware: since BFI is transmitted from a smartphone to an AP in clear-text, it can be overheard (hence eavesdropped) by any other Wi-Fi devices switching to monitor mode. As existing keystroke inference methods offer very limited generalizability, WiKI-Eve further innovates in an adversarial learning scheme to enable its inference generalizable towards unseen scenarios. We implement WiKI-Eve and conduct extensive evaluation on it; the results demonstrate that WiKI-Eve achieves 88.9% inference accuracy for individual keystrokes and up to 65.8% top-10 accuracy for stealing passwords of mobile applications (e.g., WeChat).
Deep Reinforcement Learning Enabled Joint Deployment and Beamforming in STAR-RIS Assisted Networks
Authors: Zhuoyuan Ma, Qi Zhao, Bai Yan, Jin Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2309.03520
Pdf link: https://arxiv.org/pdf/2309.03520
Abstract In the new generation of wireless communication systems, reconfigurable intelligent surfaces (RIS) and simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) have become competitive network components to achieve intelligent and reconfigurable network environments. However, existing work has not fully studied the deployment freedom of STAR-RIS, which limits further improvements in network communication performance. Therefore, this paper proposes a solution based on a deep reinforcement learning algorithm to dynamically deploy STAR-RIS and hybrid beamforming to improve the total communication rate of users in mobile wireless networks. The paper constructs a STAR-RIS assisted multi-user multiple-input single-output (MU-MISO) mobile wireless network and jointly optimizes the dynamic deployment strategy of STAR-RIS and the hybrid beamforming strategy to maximize the long-term total communication rate of users. To solve this problem, the paper uses the Proximal Policy Optimization (PPO) algorithm to optimize the deployment of STAR-RIS and the joint beamforming strategy of STAR-RIS and the base station. The trained policy can maximize the downlink transmission rate of the system and meet the real-time decision-making needs of the system. Numerical simulation results show that compared with the traditional scheme without using STAR-RIS and fixed STAR-RIS deployment, the PPO method proposed in this paper can effectively improve the total communication rate of wireless network users in the service area.
ReuNify: A Step Towards Whole Program Analysis for React Native Android Apps
Authors: Yonghui Liu, Xiao Chen, Pei Liu, John Grundy, Chunyang Chen, Li Li
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2309.03524
Pdf link: https://arxiv.org/pdf/2309.03524
Abstract React Native is a widely-used open-source framework that facilitates the development of cross-platform mobile apps. The framework enables JavaScript code to interact with native-side code, such as Objective-C/Swift for iOS and Java/Kotlin for Android, via a communication mechanism provided by React Native. However, previous research and tools have overlooked this mechanism, resulting in incomplete analysis of React Native app code. To address this limitation, we have developed REUNIFY, a prototype tool that integrates the JavaScript and native-side code of React Native apps into an intermediate language that can be processed by the Soot static analysis framework. By doing so, REUNIFY enables the generation of a comprehensive model of the app's behavior. Our evaluation indicates that, by leveraging REUNIFY, the Soot-based framework can improve its coverage of static analysis for the 1,007 most popular React Native Android apps, augmenting the number of lines of Jimple code by 70%. Additionally, we observed an average increase of 84% in new nodes reached in the callgraph for these apps, after integrating REUNIFY. When REUNIFY is used for taint flow analysis, an average of two additional privacy leaks were identified. Overall, our results demonstrate that REUNIFY significantly enhances the Soot-based framework's capability to analyze React Native Android apps.
Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs
Authors: Arne Moos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2309.03530
Pdf link: https://arxiv.org/pdf/2309.03530
Abstract This paper proposes a novel approach for detecting objects using mobile robots in the context of the RoboCup Standard Platform League, with a primary focus on detecting the ball. The challenge lies in detecting a dynamic object in varying lighting conditions and blurred images caused by fast movements. To address this challenge, the paper presents a convolutional neural network architecture designed specifically for computationally constrained robotic platforms. The proposed CNN is trained to achieve high precision classification of single objects in image patches and to determine their precise spatial positions. The paper further integrates Early Exits into the existing high-precision CNN architecture to reduce the computational cost of easily rejectable cases in the background class. The training process involves a composite loss function based on confidence and positional losses with dynamic weighting and data augmentation. The proposed approach achieves a precision of 100% on the validation dataset and a recall of almost 87%, while maintaining an execution time of around 170 $\mu$s per hypotheses. By combining the proposed approach with an Early Exit, a runtime optimization of more than 28%, on average, can be achieved compared to the original CNN. Overall, this paper provides an efficient solution for an enhanced detection of objects, especially the ball, in computationally constrained robotic platforms.
Enhancing 5G Radio Planning with Graph Representations and Deep Learning
Authors: Paul Almasan, José Suárez-Varela, Andra Lutu, Albert Cabellos-Aparicio, Pere Barlet-Ros
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2309.03603
Pdf link: https://arxiv.org/pdf/2309.03603
Abstract The roll out of new mobile network generations poses hard challenges due to various factors such as cost-benefit tradeoffs, existing infrastructure, and new technology aspects. In particular, one of the main challenges for the 5G deployment lies in optimal 5G radio coverage while accounting for diverse service performance metrics. This paper introduces a Deep Learning-based approach to assist in 5G radio planning by utilizing data from previous-generation cells. Our solution relies on a custom graph representation to leverage the information available from existing cells, and employs a Graph Neural Network (GNN) model to process such data efficiently. In our evaluation, we test its potential to model the transition from 4G to 5G NSA using real-world data from a UK mobile network operator. The experimental results show that our solution achieves high accuracy in predicting key performance indicators in new 5G cells, with a Mean Absolute Percentage Error (MAPE)~<17\% when evaluated on samples from the same area where it was trained. Moreover, we test its generalization capability over various geographical areas not included in the training, achieving a MAPE~<19\%. This suggests beneficial properties for achieving robust solutions applicable to 5G planning in new areas without the need of retraining.
Multivariate, Multi-step, and Spatiotemporal Traffic Prediction for NextG Network Slicing under SLA Constraints
Authors: Evren Tuna, Alkan Soysal
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2309.03898
Pdf link: https://arxiv.org/pdf/2309.03898
Abstract This study presents a spatiotemporal traffic prediction approach for NextG mobile networks, ensuring the service-level agreements (SLAs) of each network slice. Our approach is multivariate, multi-step, and spatiotemporal. Leveraging 20 radio access network (RAN) features, peak traffic hour data, and mobility-based clustering, we propose a parametric SLA-based loss function to guarantee an SLA violation rate. We focus on single-cell, multi-cell, and slice-based prediction approaches and present a detailed comparative analysis of their performances, strengths, and limitations. First, we address the application of single-cell and multi-cell training architectures. While single-cell training offers individual cell-level prediction, multi-cell training involves training a model using traffic from multiple cells from the same or different base stations. We show that the single-cell approach outperforms the multi-cell approach and results in test loss improvements of 11.4% and 38.1% compared to baseline SLA-based and MAE-based models, respectively. Next, we explore slice-based traffic prediction. We present single-slice and multi-slice methods for slice-based downlink traffic volume prediction, arguing that multi-slice prediction offers a more accurate forecast. The slice-based model we introduce offers substantial test loss improvements of 28.2%, 36.4%, and 55.6% compared to our cell-based model, the baseline SLA-based model, and the baseline MAE-based model, respectively.
Keyword: pruning

Learning Compact Compositional Embeddings via Regularized Pruning for Recommendation
Authors: Xurong Liang, Tong Chen, Quoc Viet Hung Nguyen, Jianxin Li, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2309.03518
Pdf link: https://arxiv.org/pdf/2309.03518
Abstract Latent factor models are the dominant backbones of contemporary recommender systems (RSs) given their performance advantages, where a unique vector embedding with a fixed dimensionality (e.g., 128) is required to represent each entity (commonly a user/item). Due to the large number of users and items on e-commerce sites, the embedding table is arguably the least memory-efficient component of RSs. For any lightweight recommender that aims to efficiently scale with the growing size of users/items or to remain applicable in resource-constrained settings, existing solutions either reduce the number of embeddings needed via hashing, or sparsify the full embedding table to switch off selected embedding dimensions. However, as hash collision arises or embeddings become overly sparse, especially when adapting to a tighter memory budget, those lightweight recommenders inevitably have to compromise their accuracy. To this end, we propose a novel compact embedding framework for RSs, namely Compositional Embedding with Regularized Pruning (CERP). Specifically, CERP represents each entity by combining a pair of embeddings from two independent, substantially smaller meta-embedding tables, which are then jointly pruned via a learnable element-wise threshold. In addition, we innovatively design a regularized pruning mechanism in CERP, such that the two sparsified meta-embedding tables are encouraged to encode information that is mutually complementary. Given the compatibility with agnostic latent factor models, we pair CERP with two popular recommendation models for extensive experiments, where results on two real-world datasets under different memory budgets demonstrate its superiority against state-of-the-art baselines. The codebase of CERP is available in https://github.com/xurong-liang/CERP.
Keyword: diffusion

SADIR: Shape-Aware Diffusion Models for 3D Image Reconstruction
Authors: Nivetha Jayakumar, Tonmoy Hossain, Miaomiao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2309.03335
Pdf link: https://arxiv.org/pdf/2309.03335
Abstract 3D image reconstruction from a limited number of 2D images has been a long-standing challenge in computer vision and image analysis. While deep learning-based approaches have achieved impressive performance in this area, existing deep networks often fail to effectively utilize the shape structures of objects presented in images. As a result, the topology of reconstructed objects may not be well preserved, leading to the presence of artifacts such as discontinuities, holes, or mismatched connections between different parts. In this paper, we propose a shape-aware network based on diffusion models for 3D image reconstruction, named SADIR, to address these issues. In contrast to previous methods that primarily rely on spatial correlations of image intensities for 3D reconstruction, our model leverages shape priors learned from the training data to guide the reconstruction process. To achieve this, we develop a joint learning network that simultaneously learns a mean shape under deformation models. Each reconstructed image is then considered as a deformed variant of the mean shape. We validate our model, SADIR, on both brain and cardiac magnetic resonance images (MRIs). Experimental results show that our method outperforms the baselines with lower reconstruction error and better preservation of the shape structure of objects within the images.
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
Authors: Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03350
Pdf link: https://arxiv.org/pdf/2309.03350
Abstract Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that \emph{the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain}. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at \url{https://github.com/THUDM/RelayDiffusion}.
Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature
Authors: Kyungguen Byun, Sunkuk Moon, Erik Visser
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2309.03364
Pdf link: https://arxiv.org/pdf/2309.03364
Abstract We propose a highly controllable voice manipulation system that can perform any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker, emotion, and speaking style. However, manipulating the frame-level prosody, such as pitch, energy and speaking rate, still remains challenging. Our proposed model utilizes a frame-level prosody feature to effectively transfer such properties. Specifically, pitch and energy trajectories are integrated in a prosody conditioning module and then fed alongside speaker and contents embeddings to a diffusion-based decoder generating a converted speech mel-spectrogram. To adjust the speaking rate, our system includes a self-supervised model based post-processing step which allows improved controllability. The proposed model showed comparable speech quality and improved intelligibility compared to a SOTA approach. It can cover a varying range of fundamental frequency (F0), energy and speed modulation while maintaining converted speech quality.
Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy
Authors: Yi Tang, Takafumi Iwaguchi, Hiroshi Kawasaki
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03445
Pdf link: https://arxiv.org/pdf/2309.03445
Abstract In this paper, we present an approach to image enhancement with diffusion model in underwater scenes. Our method adapts conditional denoising diffusion probabilistic models to generate the corresponding enhanced images by using the underwater images and the Gaussian noise as the inputs. Additionally, in order to improve the efficiency of the reverse process in the diffusion model, we adopt two different ways. We firstly propose a lightweight transformer-based denoising network, which can effectively promote the time of network forward per iteration. On the other hand, we introduce a skip sampling strategy to reduce the number of iterations. Besides, based on the skip sampling strategy, we propose two different non-uniform sampling methods for the sequence of the time step, namely piecewise sampling and searching with the evolutionary algorithm. Both of them are effective and can further improve performance by using the same steps against the previous uniform sampling. In the end, we conduct a relative evaluation of the widely used underwater enhancement datasets between the recent state-of-the-art methods and the proposed approach. The experimental results prove that our approach can achieve both competitive performance and high efficiency. Our code is available at \href{mailto:https://github.com/piggy2009/DM_underwater}{\color{blue}{https://github.com/piggy2009/DM\_underwater}}.
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image
Authors: Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, Wenping Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2309.03453
Pdf link: https://arxiv.org/pdf/2309.03453
Abstract In this paper, we present a novel diffusion model called that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Authors: Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2309.03549
Pdf link: https://arxiv.org/pdf/2309.03549
Abstract Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDM for text-to-video generation, which is a formidable challenge due to the computational and memory constraints during both model training and inference. A single LDM is usually only capable of generating a very limited number of video frames. Some existing works focus on separate prediction models for generating more video frames, which suffer from additional training cost and frame-level jittering, however. In this paper, we propose a framework called "Reuse and Diffuse" dubbed $\textit{VidRD}$ to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. Besides, for the autoencoder used for translation between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available $\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}$.
Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model
Authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03550
Pdf link: https://arxiv.org/pdf/2309.03550
Abstract Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses the question of adding such controllability to text-to-3D generation. In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method whose facial expression is controllable given a monocular video casually captured with hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we utilize cross-reference attention to inject well-controlled, referential facial expression and appearance via cross attention. We also conduct low-pass filtering of Gaussian latent of the diffusion model in order to ameliorate the viewpoint-agnostic texture problem we observed from our empirical analysis, where the viewpoint-aware images contain identical textures on identical pixel positions that are incomprehensible in 3D. Finally, to train NeRF with the images that are viewpoint-aware yet are not strictly consistent in geometry, our approach considers per-image geometric variation as a view of deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in a canonical space of deformable NeRF by learning a set of per-image deformation via deformation field table. We demonstrate the empirical results and discuss the effectiveness of our method.
DiffDefense: Defending against Adversarial Attacks via Diffusion Models
Authors: Hondamunige Prasanna Silva, Lorenzo Seidenari, Alberto Del Bimbo
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03702
Pdf link: https://arxiv.org/pdf/2309.03702
Abstract This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption
Authors: Teng Hu, Jiangning Zhang, Liang Liu, Ran Yi, Siqi Kou, Haokun Zhu, Xu Chen, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03729
Pdf link: https://arxiv.org/pdf/2309.03729
Abstract Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion.
Text-to-feature diffusion for audio-visual few-shot learning
Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03869
Pdf link: https://arxiv.org/pdf/2309.03869
Abstract Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.
DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection
Authors: Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03893
Pdf link: https://arxiv.org/pdf/2309.03893
Abstract Data is the cornerstone of deep learning. This paper reveals that the recently developed Diffusion Model is a scalable data engine for object detection. Existing methods for scaling up detection-oriented data often require manual collection or generative models to obtain target images, followed by data augmentation and labeling to produce training pairs, which are costly, complex, or lacking diversity. To address these issues, we presentDiffusionEngine (DE), a data scaling-up engine that provides high-quality detection-oriented training pairs in a single stage. DE consists of a pre-trained diffusion model and an effective Detection-Adapter, contributing to generating scalable, diverse and generalizable detection data in a plug-and-play manner. Detection-Adapter is learned to align the implicit semantic and location knowledge in off-the-shelf diffusion models with detection-aware signals to make better bounding-box predictions. Additionally, we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing detection benchmarks for facilitating follow-up research. Extensive experiments demonstrate that data scaling-up via DE can achieve significant improvements in diverse scenarios, such as various detection algorithms, self-supervised pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised learning. For example, when using DE with a DINO-based adapter to scale up data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Authors: Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03895
Pdf link: https://arxiv.org/pdf/2309.03895
Abstract We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.
Keyword: adaptive

A Human-Machine Joint Learning Framework to Boost Endogenous BCI Training
Authors: Hanwen Wang, Yu Qi, Lin Yao, Yueming Wang, Dario Farina, Gang Pan
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2309.03209
Pdf link: https://arxiv.org/pdf/2309.03209
Abstract Brain-computer interfaces (BCIs) provide a direct pathway from the brain to external devices and have demonstrated great potential for assistive and rehabilitation technologies. Endogenous BCIs based on electroencephalogram (EEG) signals, such as motor imagery (MI) BCIs, can provide some level of control. However, mastering spontaneous BCI control requires the users to generate discriminative and stable brain signal patterns by imagery, which is challenging and is usually achieved over a very long training time (weeks/months). Here, we propose a human-machine joint learning framework to boost the learning process in endogenous BCIs, by guiding the user to generate brain signals towards an optimal distribution estimated by the decoder, given the historical brain signals of the user. To this end, we firstly model the human-machine joint learning process in a uniform formulation. Then a human-machine joint learning framework is proposed: 1) for the human side, we model the learning process in a sequential trial-and-error scenario and propose a novel copy/new'' feedback paradigm to help shape the signal generation of the subject toward the optimal distribution; 2) for the machine side, we propose a novel adaptive learning algorithm to learn an optimal signal distribution along with the subject's learning process. Specifically, the decoder reweighs the brain signals generated by the subject to focus more ongood'' samples to cope with the learning process of the subject. Online and psuedo-online BCI experiments with 18 healthy subjects demonstrated the advantages of the proposed joint learning process over co-adaptive approaches in both learning efficiency and effectiveness.
Adaptive Sampling of 3D Spatial Correlations for Focus+Context Visualization
Authors: Christoph Neuhauser, Josef Stumpfegger, Rüdiger Westermann
Subjects: Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2309.03308
Pdf link: https://arxiv.org/pdf/2309.03308
Abstract Visualizing spatial structures in 3D ensembles is challenging due to the vast amounts of information that need to be conveyed. Memory and time constraints make it unfeasible to pre-compute and store the correlations between all pairs of domain points. We propose the embedding of adaptive correlation sampling into chord diagrams with hierarchical edge bundling to alleviate these constraints. Entities representing spatial regions are arranged along the circular chord layout via a space-filling curve, and Bayesian optimal sampling is used to efficiently estimate the maximum occurring correlation between any two points from different regions. Hierarchical edge bundling reduces visual clutter and emphasizes the major correlation structures. By selecting an edge, the user triggers a focus diagram in which only the two regions connected via this edge are refined and arranged in a specific way in a second chord layout. For visualizing correlations between two different variables, which are not symmetric anymore, we switch to showing a full correlation matrix. This avoids drawing the same edges twice with different correlation values. We introduce GPU implementations of both linear and non-linear correlation measures to further reduce the time that is required to generate the context and focus views, and to even enable the analysis of correlations in a 1000-member ensemble.
Cyber Recovery from Dynamic Load Altering Attacks: Linking Electricity, Transportation, and Cyber Networks
Authors: Mengxiang Liu, Zhongda Chu, Fei Teng
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2309.03380
Pdf link: https://arxiv.org/pdf/2309.03380
Abstract To address the increasing vulnerability of power grids, significant attention has been focused on the attack detection and impact mitigation. However, it is still unclear how to effectively and quickly recover the cyber and physical networks from a cyberattack. In this context, this paper presents the first investigation of the Cyber Recovery from Dynamic load altering Attack (CRDA). Considering the interconnection among electricity, transportation, and cyber networks, two essential sub-tasks are formulated for the CRDA: i) Optimal design of repair crew routes to remove installed malware and ii) Adaptive adjustment of system operation to eliminate the mitigation costs while guaranteeing stability. To achieve this, linear stability constraints are obtained by estimating the related eigenvalues under the variation of multiple IBR droop gains based on the sensitivity information of strategically selected sampling points. Moreover, to obtain the robust recovery strategy, the potential counter-measures from the adversary during the recovery process are modeled as maximizing the attack impact of remaining compromised resources in each step. A Mixed-Integer Linear Programming (MILP) problem can be finally formulated for the CRDA with the primary objective to reset involved droop gains and secondarily to repair all compromised loads. Case studies are performed in the modified IEEE 39-bus power system to illustrate the effectiveness of the proposed CRDA compared to the benchmark case.
Privacy-preserving Continual Federated Clustering via Adaptive Resonance Theory
Authors: Naoki Masuyama, Yusuke Nojima, Yuichiro Toda, Chu Kiong Loo, Hisao Ishibuchi, Naoyuki Kubota
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2309.03487
Pdf link: https://arxiv.org/pdf/2309.03487
Abstract With the increasing importance of data privacy protection, various privacy-preserving machine learning methods have been proposed. In the clustering domain, various algorithms with a federated learning framework (i.e., federated clustering) have been actively studied and showed high clustering performance while preserving data privacy. However, most of the base clusterers (i.e., clustering algorithms) used in existing federated clustering algorithms need to specify the number of clusters in advance. These algorithms, therefore, are unable to deal with data whose distributions are unknown or continually changing. To tackle this problem, this paper proposes a privacy-preserving continual federated clustering algorithm. In the proposed algorithm, an adaptive resonance theory-based clustering algorithm capable of continual learning is used as a base clusterer. Therefore, the proposed algorithm inherits the ability of continual learning. Experimental results with synthetic and real-world datasets show that the proposed algorithm has superior clustering performance to state-of-the-art federated clustering algorithms while realizing data privacy protection and continual learning ability. The source code is available at \url{https://github.com/Masuyama-lab/FCAC}.
An Adaptive and Modular Blockchain Enabled Architecture for a Decentralized Metaverse
Authors: Ye Cheng, Yihao Guo, Minghui Xu, Qin Hu, Dongxiao Yu, Xiuzhen Cheng
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03502
Pdf link: https://arxiv.org/pdf/2309.03502
Abstract A metaverse breaks the boundaries of time and space between people, realizing a more realistic virtual experience, improving work efficiency, and creating a new business model. Blockchain, as one of the key supporting technologies for a metaverse design, provides a trusted interactive environment. However, the rich and varied scenes of a metaverse have led to excessive consumption of on-chain resources, raising the threshold for ordinary users to join, thereby losing the human-centered design. Therefore, we propose an adaptive and modular blockchain-enabled architecture for a decentralized metaverse to address these issues. The solution includes an adaptive consensus/ledger protocol based on a modular blockchain, which can effectively adapt to the ever-changing scenarios of the metaverse, reduce resource consumption, and provide a secure and reliable interactive environment. In addition, we propose the concept of Non-Fungible Resource (NFR) to virtualize idle resources. Users can establish a temporary trusted environment and rent others' NFR to meet their computing needs. Finally, we simulate and test our solution based on XuperChain, and the experimental results prove the feasibility of our design.
DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph Partitioning by Chunks
Authors: Fahao Chen, Peng Li, Celimuge Wu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2309.03523
Pdf link: https://arxiv.org/pdf/2309.03523
Abstract Dynamic Graph Neural Network (DGNN) has shown a strong capability of learning dynamic graphs by exploiting both spatial and temporal features. Although DGNN has recently received considerable attention by AI community and various DGNN models have been proposed, building a distributed system for efficient DGNN training is still challenging. It has been well recognized that how to partition the dynamic graph and assign workloads to multiple GPUs plays a critical role in training acceleration. Existing works partition a dynamic graph into snapshots or temporal sequences, which only work well when the graph has uniform spatio-temporal structures. However, dynamic graphs in practice are not uniformly structured, with some snapshots being very dense while others are sparse. To address this issue, we propose DGC, a distributed DGNN training system that achieves a 1.25x - 7.52x speedup over the state-of-the-art in our testbed. DGC's success stems from a new graph partitioning method that partitions dynamic graphs into chunks, which are essentially subgraphs with modest training workloads and few inter connections. This partitioning algorithm is based on graph coarsening, which can run very fast on large graphs. In addition, DGC has a highly efficient run-time, powered by the proposed chunk fusion and adaptive stale aggregation techniques. Extensive experimental results on 3 typical DGNN models and 4 popular dynamic graph datasets are presented to show the effectiveness of DGC.
Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning
Authors: Guan Gui, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03598
Pdf link: https://arxiv.org/pdf/2309.03598
Abstract In semi-supervised learning, unlabeled samples can be utilized through augmentation and consistency regularization. However, we observed certain samples, even undergoing strong augmentation, are still correctly classified with high confidence, resulting in a loss close to zero. It indicates that these samples have been already learned well and do not provide any additional optimization benefits to the model. We refer to these samples as ``naive samples". Unfortunately, existing SSL models overlook the characteristics of naive samples, and they just apply the same learning strategy to all samples. To further optimize the SSL model, we emphasize the importance of giving attention to naive samples and augmenting them in a more diverse manner. Sample adaptive augmentation (SAA) is proposed for this stated purpose and consists of two modules: 1) sample selection module; 2) sample augmentation module. Specifically, the sample selection module picks out {naive samples} based on historical training information at each epoch, then the naive samples will be augmented in a more diverse manner in the sample augmentation module. Thanks to the extreme ease of implementation of the above modules, SAA is advantageous for being simple and lightweight. We add SAA on top of FixMatch and FlexMatch respectively, and experiments demonstrate SAA can significantly improve the models. For example, SAA helped improve the accuracy of FixMatch from 92.50% to 94.76% and that of FlexMatch from 95.01% to 95.31% on CIFAR-10 with 40 labels.
Towards Comparable Knowledge Distillation in Semantic Image Segmentation
Authors: Onno Niemann, Christopher Vox, Thorben Werner
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03659
Pdf link: https://arxiv.org/pdf/2309.03659
Abstract Knowledge Distillation (KD) is one proposed solution to large model sizes and slow inference speed in semantic segmentation. In our research we identify 25 proposed distillation loss terms from 14 publications in the last 4 years. Unfortunately, a comparison of terms based on published results is often impossible, because of differences in training configurations. A good illustration of this problem is the comparison of two publications from 2022. Using the same models and dataset, Structural and Statistical Texture Distillation (SSTKD) reports an increase of student mIoU of 4.54 and a final performance of 29.19, while Adaptive Perspective Distillation (APD) only improves student performance by 2.06 percentage points, but achieves a final performance of 39.25. The reason for such extreme differences is often a suboptimal choice of hyperparameters and a resulting underperformance of the student model used as reference point. In our work, we reveal problems of insufficient hyperparameter tuning by showing that distillation improvements of two widely accepted frameworks, SKD and IFVD, vanish when hyperparameters are optimized sufficiently. To improve comparability of future research in the field, we establish a solid baseline for three datasets and two student models and provide extensive information on hyperparameter tuning. We find that only two out of eight techniques can compete with our simple baseline on the ADE20K dataset.
Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains
Authors: Peiyang Li, Ye Wang, Qi Li, Zhuotao Liu, Ke Xu, Ju Ren, Zhiying Liu, Ruilin Lin
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03660
Pdf link: https://arxiv.org/pdf/2309.03660
Abstract Recently unsupervised machine learning based systems have been developed to detect zero-day Web attacks, which can effectively enhance existing Web Application Firewalls (WAFs). However, prior arts only consider detecting attacks on specific domains by training particular detection models for the domains. These systems require a large amount of training data, which causes a long period of time for model training and deployment. In this paper, we propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains in an organization with limited training data. Specifically, it utilizes meta-learning to share knowledge across these domains, e.g., the relationship between HTTP requests in heterogeneous domains, to efficiently train detection models. Moreover, we develop an adaptive preprocessing module to facilitate semantic analysis of Web requests across different domains and design a multi-domain representation method to capture semantic correlations between different domains for cross-domain model training. We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests. The experimental results demonstrate that RETSINA outperforms the existing unsupervised Web attack detection methods with limited training data, e.g., RETSINA needs only 5-minute training data to achieve comparable detection performance to the existing methods that train separate models for different domains using 1-day training data. We also conduct real-world deployment in an Internet company. RETSINA captures on average 126 and 218 zero-day attack requests per day in two domains, respectively, in one month.
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory
Authors: Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, Yang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03696
Pdf link: https://arxiv.org/pdf/2309.03696
Abstract Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and the high computational cost and time required to handle long-tailed distributions of HOIs in complex HOI scenes in realistic settings. This observation motivates us to design an HOI detector that can be trained even with long-tailed labeled data and can leverage existing knowledge from pre-trained models. Inspired by the powerful generalization ability of the large Vision-Language Models (VLM) on classification and retrieval tasks, we propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Its second mode incorporates an instance-aware adapter mechanism that can further efficiently boost performance if updating a lightweight set of parameters can be afforded. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time. Code can be found at https://github.com/ltttpku/ADA-CM.
Adjacency Sketches in Adversarial Environments
Authors: Moni Naor, Eugene Pekel
Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2309.03728
Pdf link: https://arxiv.org/pdf/2309.03728
Abstract An adjacency sketching or implicit labeling scheme for a family $\cal F$ of graphs is a method that defines for any $n$ vertex $G \in \cal F$ an assignment of labels to each vertex in $G$, so that the labels of two vertices tell you whether or not they are adjacent. The goal is to come up with labeling schemes that use as few bits as possible to represent the labels. By using randomness when assigning labels, it is sometimes possible to produce adjacency sketches with much smaller label sizes, but this comes at the cost of introducing some probability of error. Both deterministic and randomized labeling schemes have been extensively studied, as they have applications for distributed data structures and deeper connections to universal graphs and communication complexity. The main question of interest is which graph families have schemes using short labels, usually $O(\log n)$ in the deterministic case or constant for randomized sketches. In this work we consider the resilience of probabilistic adjacency sketches against an adversary making adaptive queries to the labels. This differs from the previously analyzed probabilistic setting which is ``one shot". We show that in the adaptive adversarial case the size of the labels is tightly related to the maximal degree of the graphs in $\cal F$. This results in a stronger characterization compared to what is known in the non-adversarial setting. In more detail, we construct sketches that fail with probability $\varepsilon$ for graphs with maximal degree $d$ using $2d\log (1/\varepsilon)$ bit labels and show that this is roughly the best that can be done for any specific graph of maximal degree $d$, e.g.\ a $d$-ary tree.
Bootstrapping Adaptive Human-Machine Interfaces with Offline Reinforcement Learning
Authors: Jensen Gao, Siddharth Reddy, Glen Berseth, Anca D. Dragan, Sergey Levine
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2309.03839
Pdf link: https://arxiv.org/pdf/2309.03839
Abstract Adaptive interfaces can help users perform sequential decision-making tasks like robotic teleoperation given noisy, high-dimensional command signals (e.g., from a brain-computer interface). Recent advances in human-in-the-loop machine learning enable such systems to improve by interacting with users, but tend to be limited by the amount of data that they can collect from individual users in practice. In this paper, we propose a reinforcement learning algorithm to address this by training an interface to map raw command signals to actions using a combination of offline pre-training and online fine-tuning. To address the challenges posed by noisy command signals and sparse rewards, we develop a novel method for representing and inferring the user's long-term intent for a given trajectory. We primarily evaluate our method's ability to assist users who can only communicate through noisy, high-dimensional input channels through a user study in which 12 participants performed a simulated navigation task by using their eye gaze to modulate a 128-dimensional command signal from their webcam. The results show that our method enables successful goal navigation more often than a baseline directional interface, by learning to denoise user commands signals and provide shared autonomy assistance. We further evaluate on a simulated Sawyer pushing task with eye gaze control, and the Lunar Lander game with simulated user commands, and find that our method improves over baseline interfaces in these domains as well. Extensive ablation experiments with simulated user commands empirically motivate each component of our method.
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis
Authors: Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2309.03904
Pdf link: https://arxiv.org/pdf/2309.03904
Abstract Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development.
Keyword: quantization

There is no result

A-suozhang / GetArxivDaily

New submissions for Fri, 8 Sep 23 #149

Keyword: efficient

A 9 Transistor SRAM Featuring Array-level XOR Parallelism with Secure Data Toggling Operation

Explainable and Trustworthy Traffic Sign Detection for Safe Autonomous Driving: An Inductive Logic Programming Approach

Companion Animal Disease Diagnostics based on Literal-aware Medical Knowledge Graph Representation Learning

SPAIC: A sub-$μ$W/Channel, 16-Channel General-Purpose Event-Based Analog Front-End with Dual-Mode Encoders

Retail store customer behavior analysis system: Design and Implementation

RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation

Testing properties of distributions in the streaming model

Graph Theory Applications in Advanced Geospatial Research

Scalable Learning of Intrusion Responses through Recursive Decomposition

A Novel Approach for Invoice Management using Blockchain

Adaptive Sampling of 3D Spatial Correlations for Focus+Context Visualization

REBOOT: Reuse Data for Bootstrapping Efficient Real-World Dexterous Manipulation

MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

ViewMix: Augmentation for Robust Representation in Self-Supervised Learning

Self-Supervised Masked Digital Elevation Models Encoding for Low-Resource Downstream Tasks

Towards Solving Industry-Grade Surrogate Modeling Problems using Physics Informed Machine Learning

A New Proper Orthogonal Decomposition Method with Second Difference Quotients for the Wave Equation

Efficient Baselines for Motion Prediction in Autonomous Driving

Are SNNs Truly Energy-efficient? $-$ A Hardware Perspective

Requirements Analysis of Variability Constraints in a Configurable Flight Software System

Predicting Defective Visual Code Changes in a Multi-Language AAA Video Game Project

RIS-Assisted Wireless Communications: Long-Term versus Short-Term Phase Shift Designs

Perceptual Quality Assessment of 360$^\circ$ Images Based on Generative Scanpath Representation

Temporal Collection and Distribution for Referring Video Object Segmentation

HOPPER: Interpretative Fuzzing for Libraries

Dynamic Frame Interpolation in Wavelet Domain

Learning Compact Compositional Embeddings via Regularized Pruning for Recommendation

DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph Partitioning by Chunks

Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs

A new numerical mesoscopic scale one-domain approach solver for free fluid/porous medium interaction

MVD:A Novel Methodology and Dataset for Acoustic Vehicle Type Classification

Region Generation and Assessment Network for Occluded Person Re-Identification

Enhancing 5G Radio Planning with Graph Representations and Deep Learning

Spiking Structured State Space Model for Monaural Speech Enhancement

Formal Verification of Chase-Lev Deque in Concurrent Separation Logic

Characterizing Lipschitz Stability of GNN for Fairness

Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains

How adversarial attacks can disrupt seemingly stable accurate classifiers

Short-Term Load Forecasting Using A Particle-Swarm Optimized Multi-Head Attention-Augmented CNN-LSTM Network

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption

Medoid Silhouette clustering with automatic cluster number selection

Extending Transductive Knowledge Graph Embedding Models for Inductive Logical Relational Inference

CPU frequency scheduling of real-time applications on embedded devices with temporal encoding-based deep reinforcement learning

Managing the Uncertainty in System Dynamics Through Distributionally Robust Stability-Constrained Optimization

Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

Mapping of CNNs on multi-core RRAM-based CIM architectures

On the Reduction of the Spherical Point-in-Polygon Problem for Antipode-Excluding Spherical Polygons

On Large Language Models' Selection Bias in Multi-Choice Questions

ProPainter: Improving Propagation and Transformer for Video Inpainting

Keyword: faster

A Circuit Domain Generalization Framework for Efficient Logic Synthesis in Chip Design

ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles

Medoid Silhouette clustering with automatic cluster number selection

Convergence Analysis of Decentralized ASGD

Keyword: mobile

MALITE: Lightweight Malware Detection and Classification for Constrained Devices

Resource Management for IRS-assisted WP-MEC Networks with Practical Phase Shift Model

Password-Stealing without Hacking: Wi-Fi Enabled Practical Keystroke Eavesdropping

Deep Reinforcement Learning Enabled Joint Deployment and Beamforming in STAR-RIS Assisted Networks

ReuNify: A Step Towards Whole Program Analysis for React Native Android Apps

Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs

Enhancing 5G Radio Planning with Graph Representations and Deep Learning

Multivariate, Multi-step, and Spatiotemporal Traffic Prediction for NextG Network Slicing under SLA Constraints

Keyword: pruning

Learning Compact Compositional Embeddings via Regularized Pruning for Recommendation

Keyword: diffusion

SADIR: Shape-Aware Diffusion Models for 3D Image Reconstruction

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

DiffDefense: Defending against Adversarial Attacks via Diffusion Models

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption