Abstract
Digital forensics and cloud forensics are increasingly important fields that face a range of challenges. This study aims to assess the general challenges faced in these fields. A literature review was conducted to identify the major challenges in digital and cloud forensics, including data acquisition, data analysis, data preservation, privacy concerns, and legal issues. The challenges were analyzed in detail, considering the reasons why they are challenges, the impact they have on digital and cloud forensics, and any potential solutions. The study concludes that the challenges faced in digital and cloud forensics are significant and varied, and that addressing these challenges is critical for the effective and efficient use of digital and cloud forensics in investigations. This study provides a valuable overview of the current state of digital and cloud forensic challenges and can help guide future research in this important field.
Inverse scattering of periodic surfaces with a superlens
Authors: Peijun Li, Yuliang Wang
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
Abstract
We propose a scheme for imaging periodic surfaces using a superlens. By employing an inverse scattering model and the transformed field expansion method, we derive an approximate reconstruction formula for the surface profile, assuming small amplitude. This formula suggests that unlimited resolution can be achieved for the linearized inverse problem with perfectly matched parameters. Our method requires only a single incident wave at a fixed frequency and can be efficiently implemented using fast Fourier transform. Through numerical experiments, we demonstrate that our method achieves resolution significantly surpassing the resolution limit for both smooth and non-smooth surface profiles with either perfect or marginally imperfect parameters.
Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
Authors: Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Chenhao Ma, Kevin C.C. Chang, Fei Huang, Reynold Cheng, Yongbin Li
Abstract
Text-to-SQL parsing, which aims at converting natural language instructions into executable SQLs, has gained increasing attention in recent years. In particular, Codex and ChatGPT have shown impressive results in this task. However, most of the prevalent benchmarks, i.e., Spider, and WikiSQL, focus on database schema with few rows of database contents leaving the gap between academic study and real-world applications. To mitigate this gap, we present Bird, a big benchmark for large-scale database grounded in text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains. Our emphasis on database values highlights the new challenges of dirty database contents, external knowledge between NL questions and database contents, and SQL efficiency, particularly in the context of massive databases. To solve these problems, text-to-SQL models must feature database value comprehension in addition to semantic parsing. The experimental results demonstrate the significance of database values in generating accurate text-to-SQLs for big databases. Furthermore, even the most effective text-to-SQL models, i.e. ChatGPT, only achieves 40.08% in execution accuracy, which is still far from the human result of 92.96%, proving that challenges still stand. Besides, we also provide an efficiency analysis to offer insights into generating text-to-efficient-SQLs that are beneficial to industries. We believe that BIRD will contribute to advancing real-world applications of text-to-SQL research. The leaderboard and source code are available: https://bird-bench.github.io/.
A Subjective Dataset for Multi-Screen Video Streaming Applications
Authors: Nabajeet Barman, Yuriy Reznik, Maria G. Martini
Abstract
In modern-era video streaming systems, videos are streamed and displayed on a wide range of devices. Such devices vary from large-screen UHD and HDTVs to medium-screen Desktop PCs and Laptops to smaller-screen devices such as mobile phones and tablets. It is well known that a video is perceived differently when displayed on different devices. The viewing experience for a particular video on smaller screen devices such as smartphones and tablets, which have high pixel density, will be different with respect to the case where the same video is played on a large screen device such as a TV or PC monitor. Being able to model such relative differences in perception effectively can help in the design of better quality metrics and in the design of more efficient and optimized encoding profiles, leading to lower storage, encoding, and transmission costs. This paper presents a new, open-source dataset consisting of subjective ratings for various encoded video sequences of different resolutions and bitrates (quality) when viewed on three devices of varying screen sizes: TV, Tablet, and Mobile. Along with the subjective scores, an evaluation of some of the most famous and commonly used open-source objective quality metrics is also presented. It is observed that the performance of the metrics varies a lot across different device types, with the recently standardized ITU-T P.1204.3 Model, on average, outperforming their full-reference counterparts. The dataset consisting of the videos, along with their subjective and objective scores, is available freely on Github at https://github.com/NabajeetBarman/Multiscreen-Dataset.
NEUROPULS: NEUROmorphic energy-efficient secure accelerators based on Phase change materials aUgmented siLicon photonicS
Authors: Fabio Pavanello, Cedric Marchand, Ian O'Connor, Regis Orobtchouk, Fabien Mandorlo, Xavier Letartre, Sebastien Cueff, Elena Ioana Vatajelu, Giorgio Di Natale, Benoit Cluzel, Aurelien Coillet, Benoit Charbonnier, Pierre Noe, Frantisek Kavan, Martin Zoldak, Michal Szaj, Peter Bienstman, Thomas Van Vaerenbergh, Ulrich Ruhrmair, Paulo Flores, Luis Guerra e Silva, Ricardo Chaves, Luis-Miguel Silveira, Mariano Ceccato, Dimitris Gizopoulos, George Papadimitriou, Vasileios Karakostas, Axel Brando, Francisco J.Cazorla, Ramon Canal, Pau Closas, Adria Gusi-Amigo, Paolo Crovetti, Alessio Carpegna, Tzamn Melendez Carmona, Stefano Di Carlo, Alessandro Savino
Subjects: Hardware Architecture (cs.AR); Signal Processing (eess.SP)
Abstract
This special session paper introduces the Horizon Europe NEUROPULS project, which targets the development of secure and energy-efficient RISC-V interfaced neuromorphic accelerators using augmented silicon photonics technology. Our approach aims to develop an augmented silicon photonics platform, an FPGA-powered RISC-V-connected computing platform, and a complete simulation platform to demonstrate the neuromorphic accelerator capabilities. In particular, their main advantages and limitations will be addressed concerning the underpinning technology for each platform. Then, we will discuss three targeted use cases for edge-computing applications: Global National Satellite System (GNSS) anti-jamming, autonomous driving, and anomaly detection in edge devices. Finally, we will address the reliability and security aspects of the stand-alone accelerator implementation and the project use cases.
Testing Convex Truncation
Authors: Anindya De, Shivam Nadimpalli, Rocco A. Servedio
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Probability (math.PR); Statistics Theory (math.ST)
Abstract
We study the basic statistical problem of testing whether normally distributed $n$-dimensional data has been truncated, i.e. altered by only retaining points that lie in some unknown truncation set $S \subseteq \mathbb{R}^n$. As our main algorithmic results, (1) We give a computationally efficient $O(n)$-sample algorithm that can distinguish the standard normal distribution $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary convex set $S$. (2) We give a different computationally efficient $O(n)$-sample algorithm that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary mixture of symmetric convex sets. These results stand in sharp contrast with known results for learning or testing convex bodies with respect to the normal distribution or learning convex-truncated normal distributions, where state-of-the-art algorithms require essentially $n^{\sqrt{n}}$ samples. An easy argument shows that no finite number of samples suffices to distinguish $N(0,I_n)$ from an unknown and arbitrary mixture of general (not necessarily symmetric) convex sets, so no common generalization of results (1) and (2) above is possible. We also prove that any algorithm (computationally efficient or otherwise) that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown symmetric convex set must use $\Omega(n)$ samples. This shows that the sample complexity of each of our algorithms is optimal up to a constant factor.
CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning
Authors: Sai Qian Zhang, Thierry Tambe, Nestor Cuevas, Gu-Yeon Wei, David Brooks
Abstract
The emergence of the Internet of Things (IoT) has resulted in a remarkable amount of data generated on edge devices, which are often processed using AI algorithms. On-device learning enables edge platforms to continually adapt the AI models to user personal data and further allows for a better service quality. However, AI training on resource-limited devices is extremely difficult because of the intensive computing workload and the significant amount of on-chip memory consumption exacted by deep neural networks (DNNs). To mitigate this, we propose to use embedded dynamic random-access memory (eDRAM) as the main storage medium of training data. Compared with static random-access memory (SRAM), eDRAM introduces more than $2\times$ improvement on storage density, enabling reduced off-chip memory traffic. However, to keep the stored data intact, eDRAM is required to perform the power-hungry data refresh operations. eDRAM refresh can be eliminated if the data is stored for a period of time that is shorter than the eDRAM retention time. To achieve this, we design a novel reversible DNN architecture that enables a significantly reduced data lifetime during the training process and removes the need for eDRAM refresh. We further design an efficient on-device training engine, termed~\textit{CAMEL}, that uses eDRAM as the main on-chip memory. CAMEL enables the intermediate results during training to fit fully in on-chip eDRAM arrays and completely eliminates the off-chip DRAM traffic during the training process. We evaluate our CAMEL system on multiple DNNs with different datasets, demonstrating a more than $3\times$ saving on total DNN training energy consumption than the other baselines, while achieving a similar (even better) performance in validation accuracy.
Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records
Authors: Kai Zhang, Xiaoqian Jiang
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
In the era of big data, there is an increasing need for healthcare providers, communities, and researchers to share data and collaborate to improve health outcomes, generate valuable insights, and advance research. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information by defining regulations for protected health information (PHI). However, it does not provide efficient tools for detecting or removing PHI before data sharing. One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties. This variability makes rule-based sensitive variable identification systems that work on one database fail on another. To address this issue, our paper explores the use of machine learning algorithms to identify sensitive variables in structured data, thus facilitating the de-identification process. We made a key observation that the distributions of metadata of PHI fields and non-PHI fields are very different. Based on this novel finding, we engineered over 30 features from the metadata of the original features and used machine learning to build classification models to automatically identify PHI fields in structured Electronic Health Record (EHR) data. We trained the model on a variety of large EHR databases from different data sources and found that our algorithm achieves 99% accuracy when detecting PHI-related fields for unseen datasets. The implications of our study are significant and can benefit industries that handle sensitive data.
Abstract
We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter $\varepsilon$ and a set $P$ of $n$ points in $\mathbb{R}^d$ where each point is assigned a color from a set $C$, and we want to build a structure s.t. given any geometric range $\gamma$, we can efficiently find a list of approximate heavy hitters in $\gamma\cap P$, i.e., colors that appear at least $\varepsilon |\gamma \cap P|$ times in $\gamma \cap P$, as well as their frequencies with an additive error of $\varepsilon |\gamma \cap P|$. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence $S$ of $1+1/\varepsilon$ weights s.t. the $i$-th weight in $S$ has approximate rank $i\varepsilon|\gamma\cap P|$, meaning, rank $i\varepsilon|\gamma\cap P|$ up to an additive error of $\varepsilon|\gamma\cap P|$. Previously, optimal results were only known in 1D [WY11] but a few sub-optimal methods were available in higher dimensions [AW17, ACH+12]. We study the problems for 3D halfspace and dominance queries. We consider the real RAM model with integer registers of size $w=\Theta(\log n)$ bits. For dominance queries, we show optimal solutions for both heavy hitter and quantile problems: using linear space, we can answer both queries in time $O(\log n + 1/\varepsilon)$. Note that as the output size is $\frac{1}{\varepsilon}$, after investing the initial $O(\log n)$ searching time, our structure takes on average $O(1)$ time to find a heavy hitter or a quantile! For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra $\log_w\frac{1}{\varepsilon}$ (resp. $\log\log_w\frac{1}{\varepsilon}$) factor in 3D (resp. 2D). By spending extra $\log^{O(1)}\frac{1}{\varepsilon}$ factors in time and space, we can also support quantile queries.
A Generative Modeling Framework for Inferring Families of Biomechanical Constitutive Laws in Data-Sparse Regimes
Authors: Minglang Yin, Zongren Zou, Enrui Zhang, Cristina Cavinato, Jay D. Humphrey, George Em Karniadakis
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Abstract
Quantifying biomechanical properties of the human vasculature could deepen our understanding of cardiovascular diseases. Standard nonlinear regression in constitutive modeling requires considerable high-quality data and an explicit form of the constitutive model as prior knowledge. By contrast, we propose a novel approach that combines generative deep learning with Bayesian inference to efficiently infer families of constitutive relationships in data-sparse regimes. Inspired by the concept of functional priors, we develop a generative adversarial network (GAN) that incorporates a neural operator as the generator and a fully-connected neural network as the discriminator. The generator takes a vector of noise conditioned on measurement data as input and yields the predicted constitutive relationship, which is scrutinized by the discriminator in the following step. We demonstrate that this framework can accurately estimate means and standard deviations of the constitutive relationships of the murine aorta using data collected either from model-generated synthetic data or ex vivo experiments for mice with genetic deficiencies. In addition, the framework learns priors of constitutive models without explicitly knowing their functional form, providing a new model-agnostic approach to learning hidden constitutive behaviors from data.
Abstract
We consider a multi-agent delegated search without money, which is the first to study the multi-agent extension of Kleinberg and Kleinberg (EC'18). In our model, given a set of agents, each agent samples a fixed number of solutions, and privately sends a signal, e.g., a subset of solutions, to the principal. Then, the principal selects a final solution based on the agents' signals. Our model captures a variety of real-world scenarios, spanning classical economical applications to modern intelligent system. In stark contrast to single-agent setting by Kleinberg and Kleinberg (EC'18) with an approximate Bayesian mechanism, we show that there exist efficient approximate prior-independent mechanisms with both information and performance gain, thanks to the competitive tension between the agents. Interestingly, however, the amount of such a compelling power significantly varies with respect to the information available to the agents, and the degree of correlation between the principal's and the agent's utility. Technically, we conduct a comprehensive study on the multi-agent delegated search problem and derive several results on the approximation factors of Bayesian/prior-independent mechanisms in complete/incomplete information settings. As a special case of independent interest, we obtain comparative statics regarding the number of agents which implies the dominance of the multi-agent setting ($n \ge 2$) over the single-agent setting ($n=1$) in terms of the principal's utility. We further extend our problem by considering an examination cost of the mechanism and derive some analogous results in the complete information setting.
Near-realtime Facial Animation by Deep 3D Simulation Super-Resolution
Authors: Hyojoon Park, Sangeetha Grama Srinivasan, Matthew Cong, Doyub Kim, Byungsoo Kim, Jonathan Swartz, Ken Museth, Eftychios Sifakis
Abstract
We present a neural network-based simulation super-resolution framework that can efficiently and realistically enhance a facial performance produced by a low-cost, realtime physics-based simulation to a level of detail that closely approximates that of a reference-quality off-line simulator with much higher resolution (26x element count in our examples) and accurate physical modeling. Our approach is rooted in our ability to construct - via simulation - a training set of paired frames, from the low- and high-resolution simulators respectively, that are in semantic correspondence with each other. We use face animation as an exemplar of such a simulation domain, where creating this semantic congruence is achieved by simply dialing in the same muscle actuation controls and skeletal pose in the two simulators. Our proposed neural network super-resolution framework generalizes from this training set to unseen expressions, compensates for modeling discrepancies between the two simulations due to limited resolution or cost-cutting approximations in the real-time variant, and does not require any semantic descriptors or parameters to be provided as input, other than the result of the real-time simulation. We evaluate the efficacy of our pipeline on a variety of expressive performances and provide comparisons and ablation experiments for plausible variations and alternatives to our proposed scheme.
HeteroEdge: Addressing Asymmetry in Heterogeneous Collaborative Autonomous Systems
Authors: Mohammad Saeid Anwar, Emon Dey, Maloy Kumar Devnath, Indrajeet Ghosh, Naima Khan, Jade Freeman, Timothy Gregory, Niranjan Suri, Kasthuri Jayaraja, Sreenivasan Ramasamy Ramamurthy, Nirmalya Roy
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Gathering knowledge about surroundings and generating situational awareness for IoT devices is of utmost importance for systems developed for smart urban and uncontested environments. For example, a large-area surveillance system is typically equipped with multi-modal sensors such as cameras and LIDARs and is required to execute deep learning algorithms for action, face, behavior, and object recognition. However, these systems face power and memory constraints due to their ubiquitous nature, making it crucial to optimize data processing, deep learning algorithm input, and model inference communication. In this paper, we propose a self-adaptive optimization framework for a testbed comprising two Unmanned Ground Vehicles (UGVs) and two NVIDIA Jetson devices. This framework efficiently manages multiple tasks (storage, processing, computation, transmission, inference) on heterogeneous nodes concurrently. It involves compressing and masking input image frames, identifying similar frames, and profiling devices to obtain boundary conditions for optimization.. Finally, we propose and optimize a novel parameter split-ratio, which indicates the proportion of the data required to be offloaded to another device while considering the networking bandwidth, busy factor, memory (CPU, GPU, RAM), and power constraints of the devices in the testbed. Our evaluations captured while executing multiple tasks (e.g., PoseNet, SegNet, ImageNet, DetectNet, DepthNet) simultaneously, reveal that executing 70% (split-ratio=70%) of the data on the auxiliary node minimizes the offloading latency by approx. 33% (18.7 ms/image to 12.5 ms/image) and the total operation time by approx. 47% (69.32s to 36.43s) compared to the baseline configuration (executing on the primary node).
Rescue Conversations from Dead-ends: Efficient Exploration for Task-oriented Dialogue Policy Optimization
Authors: Yangyang Zhao, Zhenyu Wang, Mehdi Dastani, Shihan Wang
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Abstract
Training a dialogue policy using deep reinforcement learning requires a lot of exploration of the environment. The amount of wasted invalid exploration makes their learning inefficient. In this paper, we find and define an important reason for the invalid exploration: dead-ends. When a conversation enters a dead-end state, regardless of the actions taken afterward, it will continue in a dead-end trajectory until the agent reaches a termination state or maximum turn. We propose a dead-end resurrection (DDR) algorithm that detects the initial dead-end state in a timely and efficient manner and provides a rescue action to guide and correct the exploration direction. To prevent dialogue policies from repeatedly making the same mistake, DDR also performs dialogue data augmentation by adding relevant experiences containing dead-end states. We first validate the dead-end detection reliability and then demonstrate the effectiveness and generality of the method by reporting experimental results on several dialogue datasets from different domains.
Semantic Segmentation using Vision Transformers: A survey
Abstract
Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNN) and Vision Transformers (ViTs) provide the architecture models for semantic segmentation. Even though ViTs have proven success in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection since ViT is not a general purpose backbone due to its patch partitioning scheme. In this survey, we discuss some of the different ViT architectures that can be used for semantic segmentation and how their evolution managed the above-stated challenge. The rise of ViT and its performance with a high success rate motivated the community to slowly replace the traditional convolutional neural networks in various computer vision tasks. This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets. This will be worthwhile for the community to yield knowledge regarding the implementations carried out in semantic segmentation and to discover more efficient methodologies using ViTs.
Numerical stability analysis of shock-capturing methods for strong shocks I: second-order MUSCL schemes
Authors: Weijie Ren, Wenjia Xie, Ye Zhang, Hang Yu, Zhengyu Tian
Abstract
Modern shock-capturing schemes often suffer from numerical shock anomalies if the flow field contains strong shocks, which may limit their further application in hypersonic flow computations. In the current study, we devote our efforts to exploring the primary numerical characteristics and the underlying mechanism of shock instability for second-order finite-volume schemes. To this end, we, for the first time, develop the matrix stability analysis method for the finite-volume MUSCL approach. Such a linearized analysis method allows to investigate the shock instability problem of the finite-volume shock-capturing schemes in a quantitative and efficient manner. Results of the stability analysis demonstrate that the shock stability of second-order scheme is strongly related to the Riemann solver, Mach number, limiter function, numerical shock structure, and computational grid. Unique stability characteristics associated with these factors for second-order methods are revealed quantitatively with the established method. Source location of instability is also clarified by the matrix stability analysis method. Results show that the shock instability originates from the numerical shock structure. Such conclusions pave the way to better understand the shock instability problem and may shed new light on developing more reliable shock-capturing methods for compressible flows with high Mach number.
Composite Motion Learning with Task Control
Authors: Pei Xu, Xiumin Shang, Victor Zordan, Ioannis Karamouzas
Abstract
We present a deep learning method for composite and task-driven motion control for physically simulated characters. In contrast to existing data-driven approaches using reinforcement learning that imitate full-body motions, we learn decoupled motions for specific body parts from multiple reference motions simultaneously and directly by leveraging the use of multiple discriminators in a GAN-like setup. In this process, there is no need of any manual work to produce composite reference motions for learning. Instead, the control policy explores by itself how the composite motions can be combined automatically. We further account for multiple task-specific rewards and train a single, multi-objective control policy. To this end, we propose a novel framework for multi-objective learning that adaptively balances the learning of disparate motions from multiple sources and multiple goal-directed control objectives. In addition, as composite motions are typically augmentations of simpler behaviors, we introduce a sample-efficient method for training composite control policies in an incremental manner, where we reuse a pre-trained policy as the meta policy and train a cooperative policy that adapts the meta one for new composite tasks. We show the applicability of our approach on a variety of challenging multi-objective tasks involving both composite motion imitation and multiple goal-directed control.
StarPlat: A Versatile DSL for Graph Analytics
Authors: Nibedita Behera, Ashwina Kumar, Ebenezer Rajadurai T, Sai Nitish, Rajesh Pandian M, Rupesh Nasre
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Graphs model several real-world phenomena. With the growth of unstructured and semi-structured data, parallelization of graph algorithms is inevitable. Unfortunately, due to inherent irregularity of computation, memory access, and communication, graph algorithms are traditionally challenging to parallelize. To tame this challenge, several libraries, frameworks, and domain-specific languages (DSLs) have been proposed to reduce the parallel programming burden of the users, who are often domain experts. However, existing frameworks to model graph algorithms typically target a single architecture. In this paper, we present a graph DSL, named StarPlat, that allows programmers to specify graph algorithms in a high-level format, but generates code for three different backends from the same algorithmic specification. In particular, the DSL compiler generates OpenMP for multi-core, MPI for distributed, and CUDA for many-core GPUs. Since these three are completely different parallel programming paradigms, binding them together under the same language is challenging. We share our experience with the language design. Central to our compiler is an intermediate representation which allows a common representation of the high-level program, from which individual backend code generations begin. We demonstrate the expressiveness of StarPlat by specifying four graph algorithms: betweenness centrality computation, page rank computation, single-source shortest paths, and triangle counting. We illustrate the effectiveness of our approach by comparing the performance of the generated codes with that obtained with hand-crafted library codes. We find that the generated code is competitive to library-based codes in many cases. More importantly, we show the feasibility to generate efficient codes for different target architectures from the same algorithmic specification of graph algorithms.
Solution existence, uniqueness, and stability of discrete basis sinograms in multispectral CT
Authors: Yu Gao, Xiaochuan Pan, Chong Chen
Subjects: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Abstract
This work investigates conditions for quantitative image reconstruction in multispectral computed tomography (MSCT), which remains a topic of active research. In MSCT, one seeks to obtain from data the spatial distribution of linear attenuation coefficient, referred to as a virtual monochromatic image (VMI), at a given X-ray energy, within the subject imaged. As a VMI is decomposed often into a linear combination of basis images with known decomposition coefficients, the reconstruction of a VMI is thus tantamount to that of the basis images. An empirical, but highly effective, two-step data-domain-decomposition (DDD) method has been developed and used widely for quantitative image reconstruction in MSCT. In the two-step DDD method, step (1) estimates the so-called basis sinogram from data through solving a nonlinear transform, whereas step (2) reconstructs basis images from their basis sinograms estimated. Subsequently, a VMI can readily be obtained from the linear combination of basis images reconstructed. As step (2) involves the inversion of a straightforward linear system, step (1) is the key component of the DDD method in which a nonlinear system needs to be inverted for estimating the basis sinograms from data. In this work, we consider a {\it discrete} form of the nonlinear system in step (1), and then carry out theoretical and numerical analyses of conditions on the existence, uniqueness, and stability of a solution to the discrete nonlinear system for accurately estimating the discrete basis sinograms, leading to quantitative reconstruction of VMIs in MSCT.
The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation
Authors: Lukas Christ, Shahin Amiriparian, Alice Baird, Alexander Kathan, Niklas Müller, Steffen Klug, Chris Gagne, Panagiotis Tzirakis, Eva-Maria Meßner, Andreas König, Alan Cowen, Erik Cambria, Björn W. Schuller
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract
The MuSe 2023 is a set of shared tasks addressing three different contemporary multimodal affect and sentiment analysis problems: In the Mimicked Emotions Sub-Challenge (MuSe-Mimic), participants predict three continuous emotion targets. This sub-challenge utilises the Hume-Vidmimic dataset comprising of user-generated videos. For the Cross-Cultural Humour Detection Sub-Challenge (MuSe-Humour), an extension of the Passau Spontaneous Football Coach Humour (Passau-SFCH) dataset is provided. Participants predict the presence of spontaneous humour in a cross-cultural setting. The Personalisation Sub-Challenge (MuSe-Personalisation) is based on the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, featuring recordings of subjects in a stressed situation. Here, arousal and valence signals are to be predicted, whereas parts of the test labels are made available in order to facilitate personalisation. MuSe 2023 seeks to bring together a broad audience from different research communities such as audio-visual emotion recognition, natural language processing, signal processing, and health informatics. In this baseline paper, we introduce the datasets, sub-challenges, and provided feature sets. As a competitive baseline system, a Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) is employed. On the respective sub-challenges' test datasets, it achieves a mean (across three continuous intensity targets) Pearson's Correlation Coefficient of .4727 for MuSe-Mimic, an Area Under the Curve (AUC) value of .8310 for MuSe-Humor and Concordance Correlation Coefficient (CCC) values of .7482 for arousal and .7827 for valence in the MuSe-Personalisation sub-challenge.
DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation
Abstract
Given a small set of images of a specific subject, subject-driven text-to-image generation aims to generate customized images of the subject according to new text descriptions, which has attracted increasing attention in the community recently. Current subject-driven text-to-image generation methods are mainly based on finetuning a pretrained large-scale text-to-image generation model. However, these finetuning methods map the images of the subject into an embedding highly entangled with subject-identity-unrelated information, which may result in the inconsistency between the generated images and the text descriptions and the changes in the subject identity. To tackle the problem, we propose DisenBooth, a disentangled parameter-efficient tuning framework for subject-driven text-to-image generation. DisenBooth enables generating new images that simultaneously preserve the subject identity and conform to the text descriptions, by disentangling the embedding into an identity-related and an identity-unrelated part. Specifically, DisenBooth is based on the pretrained diffusion models and conducts finetuning in the diffusion denoising process, where a shared identity embedding and an image-specific identity-unrelated embedding are utilized jointly for denoising each image. To make the two embeddings disentangled, two auxiliary objectives are proposed. Additionally, to improve the finetuning efficiency, a parameter-efficient finetuning strategy is adopted. Extensive experiments show that our DisenBooth can faithfully learn well-disentangled identity-related and identity-unrelated embeddings. With the shared identity embedding, DisenBooth demonstrates superior subject-driven text-to-image generation ability. Additionally, DisenBooth provides a more flexible and controllable framework with different combinations of the disentangled embeddings.
Compressing audio CNNs with graph centrality based filter pruning
Authors: James A King, Arshdeep Singh, Mark D. Plumbley
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Abstract
Convolutional neural networks (CNNs) are commonplace in high-performing solutions to many real-world problems, such as audio classification. CNNs have many parameters and filters, with some having a larger impact on the performance than others. This means that networks may contain many unnecessary filters, increasing a CNN's computation and memory requirements while providing limited performance benefits. To make CNNs more efficient, we propose a pruning framework that eliminates filters with the highest "commonality". We measure this commonality using the graph-theoretic concept of "centrality". We hypothesise that a filter with a high centrality should be eliminated as it represents commonality and can be replaced by other filters without affecting the performance of a network much. An experimental evaluation of the proposed framework is performed on acoustic scene classification and audio tagging. On the DCASE 2021 Task 1A baseline network, our proposed method reduces computations per inference by 71\% with 50\% fewer parameters at less than a two percentage point drop in accuracy compared to the original network. For large-scale CNNs such as PANNs designed for audio tagging, our method reduces 24\% computations per inference with 41\% fewer parameters at a slight improvement in performance.
Adaptive Graph Convolutional Subspace Clustering
Authors: Lai Wei, Zhengwei Chen, Jun Yin, Changming Zhu, Rigui Zhou, Jin Liu
Abstract
Spectral-type subspace clustering algorithms have shown excellent performance in many subspace clustering applications. The existing spectral-type subspace clustering algorithms either focus on designing constraints for the reconstruction coefficient matrix or feature extraction methods for finding latent features of original data samples. In this paper, inspired by graph convolutional networks, we use the graph convolution technique to develop a feature extraction method and a coefficient matrix constraint simultaneously. And the graph-convolutional operator is updated iteratively and adaptively in our proposed algorithm. Hence, we call the proposed method adaptive graph convolutional subspace clustering (AGCSC). We claim that by using AGCSC, the aggregated feature representation of original data samples is suitable for subspace clustering, and the coefficient matrix could reveal the subspace structure of the original data set more faithfully. Finally, plenty of subspace clustering experiments prove our conclusions and show that AGCSC outperforms some related methods as well as some deep models.
Abstract
Entity Matching is the task of deciding if two entity descriptions refer to the same real-world entity. State-of-the-art entity matching methods often rely on fine-tuning Transformer models such as BERT or RoBERTa. Two major drawbacks of using these models for entity matching are that (i) the models require significant amounts of fine-tuning data for reaching a good performance and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. In this paper, we investigate using ChatGPT for entity matching as a more robust, training data-efficient alternative to traditional Transformer models. We perform experiments along three dimensions: (i) general prompt design, (ii) in-context learning, and (iii) provision of higher-level matching knowledge. We show that ChatGPT is competitive with a fine-tuned RoBERTa model, reaching an average zero-shot performance of 83% F1 on a challenging matching task on which RoBERTa requires 2000 training examples for reaching a similar performance. Adding in-context demonstrations to the prompts further improves the F1 by up to 5% even using only a small set of 20 handpicked examples. Finally, we show that guiding the zero-shot model by stating higher-level matching rules leads to similar gains as providing in-context examples.
Repair of Reed-Solomon Codes in the Presence of Erroneous Nodes
Authors: Stanislav Kruglik, Gaojun Luo, Wilton Kim, Shubhransh Singhvi, Han Mao Kiah, San Ling, Huaxiong Wang
Abstract
We consider the repair scheme of Guruswami-Wootters for the Reed-Solomon code and ask: can we correctly repair a failed node in the presence of erroneous nodes? Equivalently, we consider the collection of downloaded traces as a code and investigate its code-distance properties. We propose three lower bounds on its minimum distance and study methods to efficiently correct errors close to these bounds.
Descend: A Safe GPU Systems Programming Language
Authors: Bastian Köpcke, Sergei Gorlatch, Michel Steuwer
Abstract
Graphics Processing Units (GPU) offer tremendous computational power by following a throughput oriented computing paradigm where many thousand computational units operate in parallel. Programming this massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU programming languages, such as CUDA and OpenCL, are based on C/C++ inheriting their fundamentally unsafe ways to access memory via raw pointers. This facilitates easy to make, but hard to detect bugs such as data races and deadlocks. In this paper, we present Descend: a safe GPU systems programming language. In the spirit of Rust, Descend's type system enforces safe CPU and GPU memory management by tracking Ownership and Lifetimes. Descend introduces a new holistic GPU programming model where computations are hierarchically scheduled over the GPU's execution resources: grid, blocks, and threads. Descend's extended Borrow checking ensures that execution resources safely access memory regions without introducing data races. For this, we introduced views describing safe parallel access patterns of memory regions. We discuss the memory safety guarantees offered by Descend's type system and evaluate our implementation of Descend using a number of benchmarks, showing that no significant runtime overhead is introduced compared to manually written CUDA programs lacking Descend's safety guarantees.
Interactive Acquisition of Fine-grained Visual Concepts by Exploiting Semantics of Generic Characterizations in Discourse
Authors: Jonghyuk Park, Alex Lascarides, Subramanian Ramamoorthy
Abstract
Interactive Task Learning (ITL) concerns learning about unforeseen domain concepts via natural interactions with human users. The learner faces a number of significant constraints: learning should be online, incremental and few-shot, as it is expected to perform tangible belief updates right after novel words denoting unforeseen concepts are introduced. In this work, we explore a challenging symbol grounding task--discriminating among object classes that look very similar--within the constraints imposed by ITL. We demonstrate empirically that more data-efficient grounding results from exploiting the truth-conditions of the teacher's generic statements (e.g., "Xs have attribute Z.") and their implicatures in context (e.g., as an answer to "How are Xs and Ys different?", one infers Y lacks attribute Z).
Modular Polynomial Codes for Secure and Robust Distributed Matrix Multiplication
Abstract
We present Modular Polynomial (MP) Codes for Secure Distributed Matrix Multiplication (SDMM). The construction is based on the observation that one can decode certain proper subsets of the coefficients of a polynomial with fewer evaluations than is necessary to interpolate the entire polynomial. These codes are proven to outperform, in terms of recovery threshold, the currently best-known polynomial codes for the inner product partition. We also present Generalized Gap Additive Secure Polynomial (GGASP) codes for the grid partition. These two families of codes are shown experimentally to perform favorably in terms of recovery threshold when compared to other comparable polynomials codes for SDMM. Both MP and GGASP codes achieve the recovery threshold of Entangled Polynomial Codes for robustness against stragglers, but MP codes can decode below this recovery threshold depending on the set of worker nodes which fails. The decoding complexity of MP codes is shown to be lower than other approaches in the literature, due to the user not being tasked with interpolating an entire polynomial.
Abstract
Generative steganography (GS) is an emerging technique that generates stego images directly from secret data. Various GS methods based on GANs or Flow have been developed recently. However, existing GAN-based GS methods cannot completely recover the hidden secret data due to the lack of network invertibility, while Flow-based methods produce poor image quality due to the stringent reversibility restriction in each module. To address this issue, we propose a novel GS scheme called "Generative Steganography Diffusion" (GSD) by devising an invertible diffusion model named "StegoDiffusion". It not only generates realistic stego images but also allows for 100\% recovery of the hidden secret data. The proposed StegoDiffusion model leverages a non-Markov chain with a fast sampling technique to achieve efficient stego image generation. By constructing an ordinary differential equation (ODE) based on the transition probability of the generation process in StegoDiffusion, secret data and stego images can be converted to each other through the approximate solver of ODE -- Euler iteration formula, enabling the use of irreversible but more expressive network structures to achieve model invertibility. Our proposed GSD has the advantages of both reversibility and high performance, significantly outperforming existing GS methods in all metrics.
Zoo Guide to Network Embedding
Authors: Anthony Baptista, Rubén J. Sánchez-García, Anaïs Baudot, Ginestra Bianconi
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Mathematical Physics (math-ph)
Abstract
Networks have provided extremely successful models of data and complex systems. Yet, as combinatorial objects, networks do not have in general intrinsic coordinates and do not typically lie in an ambient space. The process of assigning an embedding space to a network has attracted lots of interest in the past few decades, and has been efficiently applied to fundamental problems in network inference, such as link prediction, node classification, and community detection. In this review, we provide a user-friendly guide to the network embedding literature and current trends in this field which will allow the reader to navigate through the complex landscape of methods and approaches emerging from the vibrant research activity on these subjects.
High-Level Context Representation for Emotion Recognition in Images
Authors: Willams de Lima Costa, Estefania Talavera Martinez, Lucas Silva Figueiredo, Veronica Teichrieb
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract
Emotion recognition is the task of classifying perceived emotions in people. Previous works have utilized various nonverbal cues to extract features from images and correlate them to emotions. Of these cues, situational context is particularly crucial in emotion perception since it can directly influence the emotion of a person. In this paper, we propose an approach for high-level context representation extraction from images. The model relies on a single cue and a single encoding stream to correlate this representation with emotions. Our model competes with the state-of-the-art, achieving an mAP of 0.3002 on the EMOTIC dataset while also being capable of execution on consumer-grade hardware at approximately 90 frames per second. Overall, our approach is more efficient than previous models and can be easily deployed to address real-world problems related to emotion recognition.
Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment
Authors: Zhen Zhang, Jialu Wang, Xin Eric Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Pre-trained vision and language models such as CLIP have witnessed remarkable success in connecting images and texts with a primary focus on English texts. Despite recent efforts to extend CLIP to support other languages, disparities in performance among different languages have been observed due to uneven resource availability. Additionally, current cross-lingual transfer methods of those pre-trained models would consume excessive resources for a large number of languages. Therefore, we propose a new parameter-efficient cross-lingual transfer learning framework that utilizes a translation-based alignment method to mitigate multilingual disparities and explores parameter-efficient fine-tuning methods for parameter-efficient cross-lingual transfer. Extensive experiments on XTD and Multi30K datasets, covering 11 languages under zero-shot, few-shot, and full-dataset learning scenarios, show that our framework significantly reduces the multilingual disparities among languages and improves cross-lingual transfer results, especially in low-resource scenarios, while only keeping and fine-tuning an extremely small number of parameters compared to the full model (e.g., Our framework only requires 0.16\% additional parameters of a full-model for each language in the few-shot learning scenario).
Over-the-Air Federated Averaging with Limited Power and Privacy Budgets
Authors: Na Yan, Kezhi Wang, Cunhua Pan, Kok Keong Chai, Feng Shu, Jiangzhou Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Abstract
To jointly overcome the communication bottleneck and privacy leakage of wireless federated learning (FL), this paper studies a differentially private over-the-air federated averaging (DP-OTA-FedAvg) system with a limited sum power budget. With DP-OTA-FedAvg, the gradients are aligned by an alignment coefficient and aggregated over the air, and channel noise is employed to protect privacy. We aim to improve the learning performance by jointly designing the device scheduling, alignment coefficient, and the number of aggregation rounds of federated averaging (FedAvg) subject to sum power and privacy constraints. We first present the privacy analysis based on differential privacy (DP) to quantify the impact of the alignment coefficient on privacy preservation in each communication round. Furthermore, to study how the device scheduling, alignment coefficient, and the number of the global aggregation affect the learning process, we conduct the convergence analysis of DP-OTA-FedAvg in the cases of convex and non-convex loss functions. Based on these analytical results, we formulate an optimization problem to minimize the optimality gap of the DP-OTA-FedAvg subject to limited sum power and privacy budgets. The problem is solved by decoupling it into two sub-problems. Given the number of communication rounds, we conclude the relationship between the number of scheduled devices and the alignment coefficient, which offers a set of potential optimal solution pairs of device scheduling and the alignment coefficient. Thanks to the reduced search space, the optimal solution can be efficiently obtained. The effectiveness of the proposed policy is validated through simulations.
Energy-Latency Aware Intelligent Reflecting Surface Aided Multi-cell Mobile Edge Computing
Authors: Wenhan Xu, Jiadong Yu, Yuan Wu, Danny H.K. Tsang
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
The explosive development of the Internet of Things (IoT) has led to increased interest in mobile edge computing (MEC), which provides computational resources at network edges to accommodate computation-intensive and latency-sensitive applications. Intelligent reflecting surfaces (IRSs) have gained attention as a solution to overcome blockage problems during the offloading uplink transmission in MEC systems. This paper explores IRS-aided multi-cell networks that enable servers to serve neighboring cells and cooperate to handle resource exhaustion. We aim to minimize the joint energy and latency cost, by jointly optimizing computation tasks, edge computing resources, user beamforming, and IRS phase shifts. The problem is decomposed into two subproblems--the MEC subproblem and the IRS communication subproblem--using the block coordinate descent (BCD) technique. The MEC subproblem is reformulated as a nonconvex quadratic constrained problem (QCP), while the IRS communication subproblem is transformed into a weight-sum-rate problem with auxiliary variables. We propose an efficient algorithm to iteratively optimize MEC resources and IRS communication until convergence. Numerical results show that our algorithm outperforms benchmarks and that multi-cell MEC systems achieve additional performance gains when supported by IRS.
Initial Steps Towards Tackling High-dimensional Surrogate Modeling for Neuroevolution Using Kriging Partial Least Squares
Authors: Fergal Stapleton, Edgar Galván
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Surrogate-assisted evolutionary algorithms (SAEAs) aim to use efficient computational models with the goal of approximating the fitness function in evolutionary computation systems. This area of research has been active for over two decades and has received significant attention from the specialised research community in different areas, for example, single and many objective optimisation or dynamic and stationary optimisation problems. An emergent and exciting area that has received little attention from the SAEAs community is in neuroevolution. This refers to the use of evolutionary algorithms in the automatic configuration of artificial neural network (ANN) architectures, hyper-parameters and/or the training of ANNs. However, ANNs suffer from two major issues: (a) the use of highly-intense computational power for their correct training, and (b) the highly specialised human expertise required to correctly configure ANNs necessary to get a well-performing network. This work aims to fill this important research gap in SAEAs in neuroevolution by addressing these two issues. We demonstrate how one can use a Kriging Partial Least Squares method that allows efficient computation of good approximate surrogate models compared to the well-known Kriging method, which normally cannot be used in neuroevolution due to the high dimensionality of the data.
Uncertainty Quantification for Fisher-Kolmogorov Equation on Graphs with Application to Patient-Specific Alzheimer Disease
Authors: Mattia Corti, Francesca Bonizzoni, Paola F. Antonietti, Alfio M. Quarteroni
Abstract
The Fisher-Kolmogorov equation is a diffusion-reaction PDE that is used to model the accumulation of prionic proteins, which are responsible for many different neurological disorders. Likely, the most important and studied misfolded protein in literature is the Amyloid-$\beta$, responsible for the onset of Alzheimer disease. Starting from medical images we construct a reduced-order model based on a graph brain connectome. The reaction coefficient of the proteins is modelled as a stochastic random field, taking into account all the many different underlying physical processes, which can hardly be measured. Its probability distribution is inferred by means of the Monte Carlo Markov Chain method applied to clinical data. The resulting model is patient-specific and can be employed for predicting the disease's future development. Forward uncertainty quantification techniques (Monte Carlo and sparse grid stochastic collocation) are applied with the aim of quantifying the impact of the variability of the reaction coefficient on the progression of protein accumulation within the next 20 years.
Verifiable Learning for Robust Tree Ensembles
Authors: Stefano Calzavara (1), Lorenzo Cazzaro (1), Giulio Ermanno Pibiri (1), Nicola Prezza (1) ((1) Università Ca' Foscari Venezia, Italy)
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
Abstract
Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security verification algorithm running in polynomial time. We then propose a new approach called verifiable learning, which advocates the training of such restricted model classes which are amenable for efficient verification. We show the benefits of this idea by designing a new training algorithm that automatically learns a large-spread decision tree ensemble from labelled data, thus enabling its security verification in polynomial time. Experimental results on publicly available datasets confirm that large-spread ensembles trained using our algorithm can be verified in a matter of seconds, using standard commercial hardware. Moreover, large-spread ensembles are more robust than traditional ensembles against evasion attacks, while incurring in just a relatively small loss of accuracy in the non-adversarial setting.
On Preimage Approximation for Neural Networks
Authors: Xiyue Zhang, Benjie Wang, Marta Kwiatkowska
Abstract
Neural network verification mainly focuses on local robustness properties. However, often it is important to know whether a given property holds globally for the whole input domain, and if not then for what proportion of the input the property is true. While exact preimage generation can construct an equivalent representation of neural networks that can aid such (quantitative) global robustness verification, it is intractable at scale. In this work, we propose an efficient and practical anytime algorithm for generating symbolic under-approximations of the preimage of neural networks based on linear relaxation. Our algorithm iteratively minimizes the volume approximation error by partitioning the input region into subregions, where the neural network relaxation bounds become tighter. We further employ sampling and differentiable approximations to the volume in order to prioritize regions to split and optimize the parameters of the relaxation, leading to faster improvement and more compact under-approximations. Evaluation results demonstrate that our approach is able to generate preimage approximations significantly faster than exact methods and scales to neural network controllers for which exact preimage generation is intractable. We also demonstrate an application of our approach to quantitative global verification.
Keyword: faster
Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching
Authors: Tim Kaler, Alexandros-Stavros Iliopoulos, Philip Murzynowski, Tao B. Schardl, Charles E. Leiserson, Jie Chen
Abstract
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs, owing to the widespread use and success of GNNs in applications such as recommendation systems and financial forensics. This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings, where the necessary partitioning of vertex features across distributed storage causes feature communication to become a major bottleneck that hampers scalability. To significantly reduce the communication volume without compromising prediction accuracy, we propose a policy for caching data associated with frequently accessed vertices in remote partitions. The proposed policy is based on an analysis of vertex-wise inclusion probabilities (VIP) during multi-hop neighborhood sampling, which may expand the neighborhood far beyond the partition boundaries of the graph. VIP analysis not only enables the elimination of the communication bottleneck, but it also offers a means to organize in-memory data by prioritizing GPU storage for the most frequently accessed vertex features. We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data and leverages the VIP-driven caching policy. SALIENT++ retains the local training efficiency and scalability of SALIENT by using a deep pipeline and drastically reducing communication volume while consuming only a fraction of the storage required by SALIENT. We provide experimental results with the Open Graph Benchmark data sets and demonstrate that training a 3-layer GraphSAGE model with SALIENT++ on 8 single-GPU machines is 7.1 faster than with SALIENT on 1 single-GPU machine, and 12.7 faster than with DistDGL on 8 single-GPU machines.
Online Gesture Recognition using Transformer and Natural Language Processing
Authors: G.C.M. Silvestre, F. Balado, O. Akinremi, M. Ramo
Abstract
The Transformer architecture is shown to provide a powerful machine transduction framework for online handwritten gestures corresponding to glyph strokes of natural language sentences. The attention mechanism is successfully used to create latent representations of an end-to-end encoder-decoder model, solving multi-level segmentation while also learning some language features and syntax rules. The additional use of a large decoding space with some learned Byte-Pair-Encoding (BPE) is shown to provide robustness to ablated inputs and syntax rules. The encoder stack was directly fed with spatio-temporal data tokens potentially forming an infinitely large input vocabulary, an approach that finds applications beyond that of this work. Encoder transfer learning capabilities is also demonstrated on several languages resulting in faster optimisation and shared parameters. A new supervised dataset of online handwriting gestures suitable for generic handwriting recognition tasks was used to successfully train a small transformer model to an average normalised Levenshtein accuracy of 96% on English or German sentences and 94% in French.
On Preimage Approximation for Neural Networks
Authors: Xiyue Zhang, Benjie Wang, Marta Kwiatkowska
Abstract
Neural network verification mainly focuses on local robustness properties. However, often it is important to know whether a given property holds globally for the whole input domain, and if not then for what proportion of the input the property is true. While exact preimage generation can construct an equivalent representation of neural networks that can aid such (quantitative) global robustness verification, it is intractable at scale. In this work, we propose an efficient and practical anytime algorithm for generating symbolic under-approximations of the preimage of neural networks based on linear relaxation. Our algorithm iteratively minimizes the volume approximation error by partitioning the input region into subregions, where the neural network relaxation bounds become tighter. We further employ sampling and differentiable approximations to the volume in order to prioritize regions to split and optimize the parameters of the relaxation, leading to faster improvement and more compact under-approximations. Evaluation results demonstrate that our approach is able to generate preimage approximations significantly faster than exact methods and scales to neural network controllers for which exact preimage generation is intractable. We also demonstrate an application of our approach to quantitative global verification.
Fast Dynamic Programming in Trees in the MPC Model
Authors: Chetan Gupta, Rustam Latypov, Yannic Maus, Shreyas Pai, Simo Särkkä, Jan Studený, Jukka Suomela, Jara Uitto, Hossein Vahidi
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Abstract
We present a deterministic algorithm for solving a wide range of dynamic programming problems in trees in $O(\log D)$ rounds in the massively parallel computation model (MPC), with $O(n^\delta)$ words of local memory per machine, for any given constant $0 < \delta < 1$. Here $D$ is the diameter of the tree and $n$ is the number of nodes--we emphasize that our running time is independent of $n$. Our algorithm can solve many classical graph optimization problems such as maximum weight independent set, maximum weight matching, minimum weight dominating set, and minimum weight vertex cover. It can also be used to solve many accumulation tasks in which some aggregate information is propagated upwards or downwards in the tree--this includes, for example, computing the sum, minimum, or maximum of the input labels in each subtree, as well as many inference tasks commonly solved with belief propagation. Our algorithm can also solve any locally checkable labeling problem (LCLs) in trees. Our algorithm works for any reasonable representation of the input tree; for example, the tree can be represented as a list of edges or as a string with nested parentheses or tags. The running time of $O(\log D)$ rounds is also known to be necessary, assuming the widely-believed $2$-cycle conjecture. Our algorithm strictly improves on two prior algorithms: (i) Bateni, Behnezhad, Derakhshan, Hajiaghayi, and Mirrokni [ICALP'18] solve problems of these flavors in $O(\log n)$ rounds, while our algorithm is much faster in low-diameter trees. Furthermore, their algorithm also uses randomness, while our algorithm is deterministic. (ii) Balliu, Latypov, Maus, Olivetti, and Uitto [SODA'23] solve only locally checkable labeling problems in $O(\log D)$ rounds, while our algorithm can be applied to a much broader family of problems.
On the Benefits of Semi-Supervised Test Case Generation for Cyber-Physical Systems
Abstract
Testing complex Cyber Physical Systems (CPSs) can be expensive and time consuming. Current state-of-the-art methods that explore this problem are fully-supervised; i.e. they require that all examples are labeled. On the other hand, the GenClu system (introduced in this paper) takes a semi-supervised approach; i.e. (a) only a small subset of information is actually labeled (via simulation) and (b) those labels are then spread across the rest of the data. When applied to five open-source CPSs, GenClu's test generation can be multiple orders of magnitude faster than the prior state of the art. Further, when assessed via mutation testing, tests generated by GenClu were as good or better than anything else tested here. Hence, we recommend semi-supervised methods over prior methods (evolutionary search and fully-supervised learning).
Abstract
We consider the problem of federated offline reinforcement learning (RL), a scenario under which distributed learning agents must collaboratively learn a high-quality control policy only using small pre-collected datasets generated according to different unknown behavior policies. Naively combining a standard offline RL approach with a standard federated learning approach to solve this problem can lead to poorly performing policies. In response, we develop the Federated Ensemble-Directed Offline Reinforcement Learning Algorithm (FEDORA), which distills the collective wisdom of the clients using an ensemble learning approach. We develop the FEDORA codebase to utilize distributed compute resources on a federated learning platform. We show that FEDORA significantly outperforms other approaches, including offline RL over the combined data pool, in various complex continuous control environments and real world datasets. Finally, we demonstrate the performance of FEDORA in the real-world on a mobile robot.
A Subjective Dataset for Multi-Screen Video Streaming Applications
Authors: Nabajeet Barman, Yuriy Reznik, Maria G. Martini
Abstract
In modern-era video streaming systems, videos are streamed and displayed on a wide range of devices. Such devices vary from large-screen UHD and HDTVs to medium-screen Desktop PCs and Laptops to smaller-screen devices such as mobile phones and tablets. It is well known that a video is perceived differently when displayed on different devices. The viewing experience for a particular video on smaller screen devices such as smartphones and tablets, which have high pixel density, will be different with respect to the case where the same video is played on a large screen device such as a TV or PC monitor. Being able to model such relative differences in perception effectively can help in the design of better quality metrics and in the design of more efficient and optimized encoding profiles, leading to lower storage, encoding, and transmission costs. This paper presents a new, open-source dataset consisting of subjective ratings for various encoded video sequences of different resolutions and bitrates (quality) when viewed on three devices of varying screen sizes: TV, Tablet, and Mobile. Along with the subjective scores, an evaluation of some of the most famous and commonly used open-source objective quality metrics is also presented. It is observed that the performance of the metrics varies a lot across different device types, with the recently standardized ITU-T P.1204.3 Model, on average, outperforming their full-reference counterparts. The dataset consisting of the videos, along with their subjective and objective scores, is available freely on Github at https://github.com/NabajeetBarman/Multiscreen-Dataset.
A Large-scale Empirical Study of Online Automated Privacy Policy Generators for Mobile Apps
Authors: Shidong Pan, Dawen Zhang, Mark Staples, Zhenchang Xing, Jieshan Chen, Xiwei Xu, James Hoang
Abstract
Mobile phones and apps have become a ubiquitous part of digital life. There is a large variety and volume of personal data sent to and used by mobile apps, leading to various privacy issues. Privacy regulations protect and promote the privacy of individuals by requiring mobile apps to provide a privacy policy that explains what personal information is gathered and how these apps process and safely keep this information. However, developers often do not have sufficient legal knowledge to create such privacy policies. Online Automated Privacy Policy Generators (APPGs) can create privacy policies, but their quality and other characteristics can vary. In this paper, we conduct the first large-scale, comprehensive empirical study of APPGs for mobile apps. Specifically, we collected and analyzed 46,472 Android app privacy policies from the Google Play Store and systematically evaluated 10 APPGs on multiple dimensions. We reported analyses on how widely APPGs are used and whether policies are consistent with app permissions. We found that nearly 20.1% of privacy policies could be generated by APPGs and summarized the potential and limitations of APPGs.
Enhanced Low-Complexity FDD System Feedback with Variable Bit Lengths via Generative Modeling
Authors: Nurettin Turan, Benedikt Fesl, Wolfgang Utschick
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
Recently, a versatile limited feedback scheme based on a Gaussian mixture model (GMM) was proposed for frequency division duplex (FDD) systems. This scheme provides high flexibility regarding various system parameters and is applicable to both point-to-point multiple-input multiple-output (MIMO) and multi-user MIMO (MU-MIMO) communications. The GMM is learned to cover the operation of all mobile terminals (MTs) located inside the base station (BS) cell, and each MT only needs to evaluate its strongest mixture component as feedback, eliminating the need for channel estimation at the MT. In this work, we extend the GMM-based feedback scheme to variable feedback lengths by leveraging a single learned GMM through merging or pruning of dispensable mixture components. Additionally, the GMM covariances are restricted to Toeplitz or circulant structure through model-based insights. These extensions significantly reduce the offloading amount and enhance the clustering ability of the GMM which, in turn, leads to an improved system performance. Simulation results for both point-to-point and multi-user systems demonstrate the effectiveness of the proposed extensions.
Abstract
State-of-the-art research of traditional computer vision is increasingly leveraged in the surgical domain. A particular focus in computer-assisted surgery is to replace marker-based tracking systems for instrument localization with pure image-based 6DoF pose estimation. However, the state of the art has not yet met the accuracy required for surgical navigation. In this context, we propose a high-fidelity marker-less optical tracking system for surgical instrument localization. We developed a multi-view camera setup consisting of static and mobile cameras and collected a large-scale RGB-D video dataset with dedicated synchronization and data fusions methods. Different state-of-the-art pose estimation methods were integrated into a deep learning pipeline and evaluated on multiple camera configurations. Furthermore, the performance impacts of different input modalities and camera positions, as well as training on purely synthetic data, were compared. The best model achieved an average position and orientation error of 1.3 mm and 1.0{\deg} for a surgical drill as well as 3.8 mm and 5.2{\deg} for a screwdriver. These results significantly outperform related methods in the literature and are close to clinical-grade accuracy, demonstrating that marker-less tracking of surgical instruments is becoming a feasible alternative to existing marker-based systems.
Energy-Latency Aware Intelligent Reflecting Surface Aided Multi-cell Mobile Edge Computing
Authors: Wenhan Xu, Jiadong Yu, Yuan Wu, Danny H.K. Tsang
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
The explosive development of the Internet of Things (IoT) has led to increased interest in mobile edge computing (MEC), which provides computational resources at network edges to accommodate computation-intensive and latency-sensitive applications. Intelligent reflecting surfaces (IRSs) have gained attention as a solution to overcome blockage problems during the offloading uplink transmission in MEC systems. This paper explores IRS-aided multi-cell networks that enable servers to serve neighboring cells and cooperate to handle resource exhaustion. We aim to minimize the joint energy and latency cost, by jointly optimizing computation tasks, edge computing resources, user beamforming, and IRS phase shifts. The problem is decomposed into two subproblems--the MEC subproblem and the IRS communication subproblem--using the block coordinate descent (BCD) technique. The MEC subproblem is reformulated as a nonconvex quadratic constrained problem (QCP), while the IRS communication subproblem is transformed into a weight-sum-rate problem with auxiliary variables. We propose an efficient algorithm to iteratively optimize MEC resources and IRS communication until convergence. Numerical results show that our algorithm outperforms benchmarks and that multi-cell MEC systems achieve additional performance gains when supported by IRS.
Keyword: pruning
Program Synthesis for Robot Learning from Demonstrations
Abstract
This paper presents a new synthesis-based approach for solving the Learning from Demonstration (LfD) problem in robotics. Given a set of user demonstrations, the goal of programmatic LfD is to learn a policy in a programming language that can be used to control a robot's behavior. We address this problem through a novel program synthesis algorithm that leverages two key ideas: First, to perform fast and effective generalization from user demonstrations, our synthesis algorithm views these demonstrations as strings over a finite alphabet and abstracts programs in our DSL as regular expressions over the same alphabet. This regex abstraction facilitates synthesis by helping infer useful program sketches and pruning infeasible parts of the search space. Second, to deal with the large number of object types in the environment, our method leverages a Large Language Model (LLM) to guide search. We have implemented our approach in a tool called Prolex and present the results of a comprehensive experimental evaluation on 120 benchmarks involving 40 unique tasks in three different environments. We show that, given a 120 second time limit, Prolex can find a program consistent with the demonstrations in 80% of the cases. Furthermore, for 81% of the tasks for which a solution is returned, Prolex is able to find the ground truth program with just one demonstration. To put these results in perspective, we conduct a comparison against two baselines and show that both perform much worse.
Compressing audio CNNs with graph centrality based filter pruning
Authors: James A King, Arshdeep Singh, Mark D. Plumbley
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Abstract
Convolutional neural networks (CNNs) are commonplace in high-performing solutions to many real-world problems, such as audio classification. CNNs have many parameters and filters, with some having a larger impact on the performance than others. This means that networks may contain many unnecessary filters, increasing a CNN's computation and memory requirements while providing limited performance benefits. To make CNNs more efficient, we propose a pruning framework that eliminates filters with the highest "commonality". We measure this commonality using the graph-theoretic concept of "centrality". We hypothesise that a filter with a high centrality should be eliminated as it represents commonality and can be replaced by other filters without affecting the performance of a network much. An experimental evaluation of the proposed framework is performed on acoustic scene classification and audio tagging. On the DCASE 2021 Task 1A baseline network, our proposed method reduces computations per inference by 71\% with 50\% fewer parameters at less than a two percentage point drop in accuracy compared to the original network. For large-scale CNNs such as PANNs designed for audio tagging, our method reduces 24\% computations per inference with 41\% fewer parameters at a slight improvement in performance.
Enhanced Low-Complexity FDD System Feedback with Variable Bit Lengths via Generative Modeling
Authors: Nurettin Turan, Benedikt Fesl, Wolfgang Utschick
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
Recently, a versatile limited feedback scheme based on a Gaussian mixture model (GMM) was proposed for frequency division duplex (FDD) systems. This scheme provides high flexibility regarding various system parameters and is applicable to both point-to-point multiple-input multiple-output (MIMO) and multi-user MIMO (MU-MIMO) communications. The GMM is learned to cover the operation of all mobile terminals (MTs) located inside the base station (BS) cell, and each MT only needs to evaluate its strongest mixture component as feedback, eliminating the need for channel estimation at the MT. In this work, we extend the GMM-based feedback scheme to variable feedback lengths by leveraging a single learned GMM through merging or pruning of dispensable mixture components. Additionally, the GMM covariances are restricted to Toeplitz or circulant structure through model-based insights. These extensions significantly reduce the offloading amount and enhance the clustering ability of the GMM which, in turn, leads to an improved system performance. Simulation results for both point-to-point and multi-user systems demonstrate the effectiveness of the proposed extensions.
Regular Methods for Operator Precedence Languages
Authors: Thomas A. Henzinger, Pavol Kebis, Nicolas Mazzocchi, N. Ege Saraç
Subjects: Formal Languages and Automata Theory (cs.FL)
Abstract
The operator precedence languages (OPLs) represent the largest known subclass of the context-free languages which enjoys all desirable closure and decidability properties. This includes the decidability of language inclusion, which is the ultimate verification problem. Operator precedence grammars, automata, and logics have been investigated and used, for example, to verify programs with arithmetic expressions and exceptions (both of which are deterministic pushdown but lie outside the scope of the visibly pushdown languages). In this paper, we complete the picture and give, for the first time, an algebraic characterization of the class of OPLs in the form of a syntactic congruence that has finitely many equivalence classes exactly for the operator precedence languages. This is a generalization of the celebrated Myhill-Nerode theorem for the regular languages to OPLs. As one of the consequences, we show that universality and language inclusion for nondeterministic operator precedence automata can be solved by an antichain algorithm. Antichain algorithms avoid determinization and complementation through an explicit subset construction, by leveraging a quasi-order on words, which allows the pruning of the search space for counterexample words without sacrificing completeness. Antichain algorithms can be implemented symbolically, and these implementations are today the best-performing algorithms in practice for the inclusion of finite automata. We give a generic construction of the quasi-order needed for antichain algorithms from a finite syntactic congruence. This yields the first antichain algorithm for OPLs, an algorithm that solves the \textsc{ExpTime}-hard language inclusion problem for OPLs in exponential time.
Learn how to Prune Pixels for Multi-view Neural Image-based Synthesis
Authors: Marta Milovanović, Enzo Tartaglione, Marco Cagnazzo, Félix Henry
Abstract
Image-based rendering techniques stand at the core of an immersive experience for the user, as they generate novel views given a set of multiple input images. Since they have shown good performance in terms of objective and subjective quality, the research community devotes great effort to their improvement. However, the large volume of data necessary to render at the receiver's side hinders applications in limited bandwidth environments or prevents their employment in real-time applications. We present LeHoPP, a method for input pixel pruning, where we examine the importance of each input pixel concerning the rendered view, and we avoid the use of irrelevant pixels. Even without retraining the image-based rendering network, our approach shows a good trade-off between synthesis quality and pixel rate. When tested in the general neural rendering framework, compared to other pruning baselines, LeHoPP gains between $0.9$ dB and $3.6$ dB on average.
DSPDet3D: Dynamic Spatial Pruning for 3D Small Object Detection
Authors: Xiuwei Xu, Zhihao Sun, Ziwei Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we propose a new detection framework for 3D small object detection. Although deep learning-based 3D object detection methods have achieved great success in recent years, current methods still struggle on small objects due to weak geometric information. With in-depth study, we find increasing the spatial resolution of the feature maps significantly boosts the performance of 3D small object detection. And more interestingly, though the computational overhead increases dramatically with resolution, the growth mainly comes from the upsampling operation of the decoder. Inspired by this, we present a high-resolution multi-level detector with dynamic spatial pruning named DSPDet3D, which detects objects from large to small by iterative upsampling and meanwhile prunes the spatial representation of the scene at regions where there is no smaller object to be detected in higher resolution. As the 3D detector only needs to predict sparse bounding boxes, pruning a large amount of uninformative features does not degrade the detection performance but significantly reduces the computational cost of upsampling. In this way, our DSPDet3D achieves high accuracy on small object detection while requiring even less memory footprint and inference time. On ScanNet and TO-SCENE dataset, our method improves the detection performance of small objects to a new level while achieving leading inference speed among all mainstream indoor 3D object detection methods.
Keyword: voxel
Smaller3d: Smaller Models for 3D Semantic Segmentation Using Minkowski Engine and Knowledge Distillation Methods
Authors: Alen Adamyan, Erik Harutyunyan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
There are various optimization techniques in the realm of 3D, including point cloud-based approaches that use mesh, texture, and voxels which optimize how you store, and how do calculate in 3D. These techniques employ methods such as feed-forward networks, 3D convolutions, graph neural networks, transformers, and sparse tensors. However, the field of 3D is one of the most computationally expensive fields, and these methods have yet to achieve their full potential due to their large capacity, complexity, and computation limits. This paper proposes the application of knowledge distillation techniques, especially for sparse tensors in 3D deep learning, to reduce model sizes while maintaining performance. We analyze and purpose different loss functions, including standard methods and combinations of various losses, to simulate the performance of state-of-the-art models of different Sparse Convolutional NNs. Our experiments are done on the standard ScanNet V2 dataset, and we achieved around 2.6\% mIoU difference with a 4 times smaller model and around 8\% with a 16 times smaller model on the latest state-of-the-art spacio-temporal convents based models.
Keyword: lidar
HeteroEdge: Addressing Asymmetry in Heterogeneous Collaborative Autonomous Systems
Authors: Mohammad Saeid Anwar, Emon Dey, Maloy Kumar Devnath, Indrajeet Ghosh, Naima Khan, Jade Freeman, Timothy Gregory, Niranjan Suri, Kasthuri Jayaraja, Sreenivasan Ramasamy Ramamurthy, Nirmalya Roy
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Gathering knowledge about surroundings and generating situational awareness for IoT devices is of utmost importance for systems developed for smart urban and uncontested environments. For example, a large-area surveillance system is typically equipped with multi-modal sensors such as cameras and LIDARs and is required to execute deep learning algorithms for action, face, behavior, and object recognition. However, these systems face power and memory constraints due to their ubiquitous nature, making it crucial to optimize data processing, deep learning algorithm input, and model inference communication. In this paper, we propose a self-adaptive optimization framework for a testbed comprising two Unmanned Ground Vehicles (UGVs) and two NVIDIA Jetson devices. This framework efficiently manages multiple tasks (storage, processing, computation, transmission, inference) on heterogeneous nodes concurrently. It involves compressing and masking input image frames, identifying similar frames, and profiling devices to obtain boundary conditions for optimization.. Finally, we propose and optimize a novel parameter split-ratio, which indicates the proportion of the data required to be offloaded to another device while considering the networking bandwidth, busy factor, memory (CPU, GPU, RAM), and power constraints of the devices in the testbed. Our evaluations captured while executing multiple tasks (e.g., PoseNet, SegNet, ImageNet, DetectNet, DepthNet) simultaneously, reveal that executing 70% (split-ratio=70%) of the data on the auxiliary node minimizes the offloading latency by approx. 33% (18.7 ms/image to 12.5 ms/image) and the total operation time by approx. 47% (69.32s to 36.43s) compared to the baseline configuration (executing on the primary node).
Multi S-graphs: A Collaborative Semantic SLAM architecture
Authors: Miguel Fernandez-Cortizas, Hriday Bavle, Jose Luis Sanchez-Lopez, Pascual Campoy, Holger Voos
Abstract
Collaborative Simultaneous Localization and Mapping (CSLAM) is a critical capability for enabling multiple robots to operate in complex environments. Most CSLAM techniques rely on the transmission of low-level features for visual and LiDAR-based approaches, which are used for pose graph optimization. However, these low-level features can lead to incorrect loop closures, negatively impacting map generation.Recent approaches have proposed the use of high-level semantic information in the form of Hierarchical Semantic Graphs to improve the loop closure procedures and overall precision of SLAM algorithms. In this work, we present Multi S-Graphs, an S-graphs [1] based distributed CSLAM algorithm that utilizes high-level semantic information for cooperative map generation while minimizing the amount of information exchanged between robots. Experimental results demonstrate the promising performance of the proposed algorithm in map generation tasks.
DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception
Abstract
Closing the domain gap between training and deployment and incorporating multiple sensor modalities are two challenging yet critical topics for self-driving. Existing work only focuses on single one of the above topics, overlooking the simultaneous domain and modality shift which pervasively exists in real-world scenarios. A model trained with multi-sensor data collected in Europe may need to run in Asia with a subset of input sensors available. In this work, we propose DualCross, a cross-modality cross-domain adaptation framework to facilitate the learning of a more robust monocular bird's-eye-view (BEV) perception model, which transfers the point cloud knowledge from a LiDAR sensor in one domain during the training phase to the camera-only testing scenario in a different domain. This work results in the first open analysis of cross-domain cross-sensor perception and adaptation for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under a wide range of domain shifts and show state-of-the-art results against various baselines.
Keyword: diffusion
DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation
Abstract
Given a small set of images of a specific subject, subject-driven text-to-image generation aims to generate customized images of the subject according to new text descriptions, which has attracted increasing attention in the community recently. Current subject-driven text-to-image generation methods are mainly based on finetuning a pretrained large-scale text-to-image generation model. However, these finetuning methods map the images of the subject into an embedding highly entangled with subject-identity-unrelated information, which may result in the inconsistency between the generated images and the text descriptions and the changes in the subject identity. To tackle the problem, we propose DisenBooth, a disentangled parameter-efficient tuning framework for subject-driven text-to-image generation. DisenBooth enables generating new images that simultaneously preserve the subject identity and conform to the text descriptions, by disentangling the embedding into an identity-related and an identity-unrelated part. Specifically, DisenBooth is based on the pretrained diffusion models and conducts finetuning in the diffusion denoising process, where a shared identity embedding and an image-specific identity-unrelated embedding are utilized jointly for denoising each image. To make the two embeddings disentangled, two auxiliary objectives are proposed. Additionally, to improve the finetuning efficiency, a parameter-efficient finetuning strategy is adopted. Extensive experiments show that our DisenBooth can faithfully learn well-disentangled identity-related and identity-unrelated embeddings. With the shared identity embedding, DisenBooth demonstrates superior subject-driven text-to-image generation ability. Additionally, DisenBooth provides a more flexible and controllable framework with different combinations of the disentangled embeddings.
Guided Image Synthesis via Initial Image Editing in Diffusion Model
Abstract
Diffusion models have the ability to generate high quality images by denoising pure Gaussian noise images. While previous research has primarily focused on improving the control of image generation through adjusting the denoising process, we propose a novel direction of manipulating the initial noise to control the generated image. Through experiments on stable diffusion, we show that blocks of pixels in the initial latent images have a preference for generating specific content, and that modifying these blocks can significantly influence the generated image. In particular, we show that modifying a part of the initial image affects the corresponding region of the generated image while leaving other regions unaffected, which is useful for repainting tasks. Furthermore, we find that the generation preferences of pixel blocks are primarily determined by their values, rather than their position. By moving pixel blocks with a tendency to generate user-desired content to user-specified regions, our approach achieves state-of-the-art performance in layout-to-image generation. Our results highlight the flexibility and power of initial image manipulation in controlling the generated image.
High-order BDF convolution quadrature for subdiffusion models with a singular source term
Abstract
Anomalous diffusion is often modelled in terms of the subdiffusion equation, which can involve a weakly singular source term. For this case, many predominant time stepping methods, including the correction of high-order BDF schemes [{\sc Jin, Li, and Zhou}, SIAM J. Sci. Comput., 39 (2017), A3129--A3152], may suffer from a severe order reduction. To fill in this gap, we propose a smoothing method for time stepping schemes, where the singular term is regularized by using a $m$-fold integral-differential calculus and the equation is discretized by the $k$-step BDF convolution quadrature, called ID$m$-BDF$k$ method. We prove that the desired $k$th-order convergence can be recovered even if the source term is a weakly singular and the initial data is not compatible. Numerical experiments illustrate the theoretical results.
Abstract
Generative steganography (GS) is an emerging technique that generates stego images directly from secret data. Various GS methods based on GANs or Flow have been developed recently. However, existing GAN-based GS methods cannot completely recover the hidden secret data due to the lack of network invertibility, while Flow-based methods produce poor image quality due to the stringent reversibility restriction in each module. To address this issue, we propose a novel GS scheme called "Generative Steganography Diffusion" (GSD) by devising an invertible diffusion model named "StegoDiffusion". It not only generates realistic stego images but also allows for 100\% recovery of the hidden secret data. The proposed StegoDiffusion model leverages a non-Markov chain with a fast sampling technique to achieve efficient stego image generation. By constructing an ordinary differential equation (ODE) based on the transition probability of the generation process in StegoDiffusion, secret data and stego images can be converted to each other through the approximate solver of ODE -- Euler iteration formula, enabling the use of irreversible but more expressive network structures to achieve model invertibility. Our proposed GSD has the advantages of both reversibility and high performance, significantly outperforming existing GS methods in all metrics.
Iterative $α$-(de)Blending: a Minimalist Deterministic Diffusion Model
Authors: Eric Heitz, Laurent Belcour, Thomas Chambon
Abstract
We derive a minimalist but powerful deterministic denoising-diffusion model. While denoising diffusion has shown great success in many domains, its underlying theory remains largely inaccessible to non-expert users. Indeed, an understanding of graduate-level concepts such as Langevin dynamics or score matching appears to be required to grasp how it works. We propose an alternative approach that requires no more than undergrad calculus and probability. We consider two densities and observe what happens when random samples from these densities are blended (linearly interpolated). We show that iteratively blending and deblending samples produces random paths between the two densities that converge toward a deterministic mapping. This mapping can be evaluated with a neural network trained to deblend samples. We obtain a model that behaves like deterministic denoising diffusion: it iteratively maps samples from one density (e.g., Gaussian noise) to another (e.g., cat images). However, compared to the state-of-the-art alternative, our model is simpler to derive, simpler to implement, more numerically stable, achieves higher quality results in our experiments, and has interesting connections to computer graphics.
Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Authors: Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Duen Horng Chau
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract
Diffusion-based generative models' impressive ability to create convincing images has captured global attention. However, their complex internal structures and operations often make them difficult for non-experts to understand. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex components with detailed explanations of their underlying operations, enabling users to fluidly transition between multiple levels of abstraction through animations and interactive elements. By comparing the evolutions of image representations guided by two related text prompts over refinement timesteps, users can discover the impact of prompts on image generation. Diffusion Explainer runs locally in users' web browsers without the need for installation or specialized hardware, broadening the public's education access to modern AI techniques. Our open-sourced tool is available at: https://poloclub.github.io/diffusion-explainer/.
Data Curation for Image Captioning with Text-to-Image Generative Models
Authors: Wenyan Li, Jonas F. Lotz, Chen Qiu, Desmond Elliott
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract
Recent advances in image captioning are mainly driven by large-scale vision-language pretraining, relying heavily on computational resources and increasingly large multimodal datasets. Instead of scaling up pretraining data, we ask whether it is possible to improve performance by improving the quality of the samples in existing datasets. We pursue this question through two approaches to data curation: one that assumes that some examples should be avoided due to mismatches between the image and caption, and one that assumes that the mismatch can be addressed by replacing the image, for which we use the state-of-the-art Stable Diffusion model. These approaches are evaluated using the BLIP model on MS COCO and Flickr30K in both finetuning and few-shot learning settings. Our simple yet effective approaches consistently outperform baselines, indicating that better image captioning models can be trained by curating existing resources. Finally, we conduct a human study to understand the errors made by the Stable Diffusion model and highlight directions for future work in text-to-image generation.
Conditional Diffusion Feature Refinement for Continuous Sign Language Recognition
Abstract
In this work, we are dedicated to leveraging the denoising diffusion models' success and formulating feature refinement as the autoencoder-formed diffusion process. The state-of-the-art CSLR framework consists of a spatial module, a visual module, a sequence module, and a sequence learning function. However, this framework has faced sequence module overfitting caused by the objective function and small-scale available benchmarks, resulting in insufficient model training. To overcome the overfitting problem, some CSLR studies enforce the sequence module to learn more visual temporal information or be guided by more informative supervision to refine its representations. In this work, we propose a novel autoencoder-formed conditional diffusion feature refinement~(ACDR) to refine the sequence representations to equip desired properties by learning the encoding-decoding optimization process in an end-to-end way. Specifically, for the ACDR, a noising Encoder is proposed to progressively add noise equipped with semantic conditions to the sequence representations. And a denoising Decoder is proposed to progressively denoise the noisy sequence representations with semantic conditions. Therefore, the sequence representations can be imbued with the semantics of provided semantic conditions. Further, a semantic constraint is employed to prevent the denoised sequence representations from semantic corruption. Extensive experiments are conducted to validate the effectiveness of our ACDR, benefiting state-of-the-art methods and achieving a notable gain on three benchmarks.
Uncertainty Quantification for Fisher-Kolmogorov Equation on Graphs with Application to Patient-Specific Alzheimer Disease
Authors: Mattia Corti, Francesca Bonizzoni, Paola F. Antonietti, Alfio M. Quarteroni
Abstract
The Fisher-Kolmogorov equation is a diffusion-reaction PDE that is used to model the accumulation of prionic proteins, which are responsible for many different neurological disorders. Likely, the most important and studied misfolded protein in literature is the Amyloid-$\beta$, responsible for the onset of Alzheimer disease. Starting from medical images we construct a reduced-order model based on a graph brain connectome. The reaction coefficient of the proteins is modelled as a stochastic random field, taking into account all the many different underlying physical processes, which can hardly be measured. Its probability distribution is inferred by means of the Monte Carlo Markov Chain method applied to clinical data. The resulting model is patient-specific and can be employed for predicting the disease's future development. Forward uncertainty quantification techniques (Monte Carlo and sparse grid stochastic collocation) are applied with the aim of quantifying the impact of the variability of the reaction coefficient on the progression of protein accumulation within the next 20 years.
Training Is Everything: Artificial Intelligence, Copyright, and Fair Training
Authors: Andrew W. Torrance, Bill Tomlinson
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Abstract
To learn how to behave, the current revolutionary generation of AIs must be trained on vast quantities of published images, written works, and sounds, many of which fall within the core subject matter of copyright law. To some, the use of copyrighted works as training sets for AI is merely a transitory and non-consumptive use that does not materially interfere with owners' content or copyrights protecting it. Companies that use such content to train their AI engine often believe such usage should be considered "fair use" under United States law (sometimes known as "fair dealing" in other countries). By contrast, many copyright owners, as well as their supporters, consider the incorporation of copyrighted works into training sets for AI to constitute misappropriation of owners' intellectual property, and, thus, decidedly not fair use under the law. This debate is vital to the future trajectory of AI and its applications. In this article, we analyze the arguments in favor of, and against, viewing the use of copyrighted works in training sets for AI as fair use. We call this form of fair use "fair training". We identify both strong and spurious arguments on both sides of this debate. In addition, we attempt to take a broader perspective, weighing the societal costs (e.g., replacement of certain forms of human employment) and benefits (e.g., the possibility of novel AI-based approaches to global issues such as environmental disruption) of allowing AI to make easy use of copyrighted works as training sets to facilitate the development, improvement, adoption, and diffusion of AI. Finally, we suggest that the debate over AI and copyrighted works may be a tempest in a teapot when placed in the wider context of massive societal challenges such as poverty, equality, climate change, and loss of biodiversity, to which AI may be part of the solution.
Keyword: dynamic
Idiolect: A Reconfigurable Voice Coding Assistant
Authors: Breandan Considine, Nicholas Albion, Xujie Si
Abstract
This paper presents Idiolect, an open source (https://github.com/OpenASR/idiolect) IDE plugin for voice coding and a novel approach to building bots that allows for users to define custom commands on-the-fly. Unlike traditional chatbots, Idiolect does not pretend to be an omniscient virtual assistant but rather a reconfigurable voice programming system that empowers users to create their own commands and actions dynamically, without rebuilding or restarting the application. We offer an experience report describing the tool itself, illustrate some example use cases, and reflect on several lessons learned during the tool's development.
CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning
Authors: Sai Qian Zhang, Thierry Tambe, Nestor Cuevas, Gu-Yeon Wei, David Brooks
Abstract
The emergence of the Internet of Things (IoT) has resulted in a remarkable amount of data generated on edge devices, which are often processed using AI algorithms. On-device learning enables edge platforms to continually adapt the AI models to user personal data and further allows for a better service quality. However, AI training on resource-limited devices is extremely difficult because of the intensive computing workload and the significant amount of on-chip memory consumption exacted by deep neural networks (DNNs). To mitigate this, we propose to use embedded dynamic random-access memory (eDRAM) as the main storage medium of training data. Compared with static random-access memory (SRAM), eDRAM introduces more than $2\times$ improvement on storage density, enabling reduced off-chip memory traffic. However, to keep the stored data intact, eDRAM is required to perform the power-hungry data refresh operations. eDRAM refresh can be eliminated if the data is stored for a period of time that is shorter than the eDRAM retention time. To achieve this, we design a novel reversible DNN architecture that enables a significantly reduced data lifetime during the training process and removes the need for eDRAM refresh. We further design an efficient on-device training engine, termed~\textit{CAMEL}, that uses eDRAM as the main on-chip memory. CAMEL enables the intermediate results during training to fit fully in on-chip eDRAM arrays and completely eliminates the off-chip DRAM traffic during the training process. We evaluate our CAMEL system on multiple DNNs with different datasets, demonstrating a more than $3\times$ saving on total DNN training energy consumption than the other baselines, while achieving a similar (even better) performance in validation accuracy.
Understanding the Benefits of Hardware-Accelerated Communication in Model-Serving Applications
Authors: Walid A. Hanafy, Limin Wang, Hyunseok Chang, Sarit Mukherjee, T. V. Lakshman, Prashant Shenoy
Abstract
It is commonly assumed that the end-to-end networking performance of edge offloading is purely dictated by that of the network connectivity between end devices and edge computing facilities, where ongoing innovation in 5G/6G networking can help. However, with the growing complexity of edge-offloaded computation and dynamic load balancing requirements, an offloaded task often goes through a multi-stage pipeline that spans across multiple compute nodes and proxies interconnected via a dedicated network fabric within a given edge computing facility. As the latest hardware-accelerated transport technologies such as RDMA and GPUDirect RDMA are adopted to build such network fabric, there is a need for good understanding of the full potential of these technologies in the context of computation offload and the effect of different factors such as GPU scheduling and characteristics of computation on the net performance gain achievable by these technologies. This paper unveils detailed insights into the latency overhead in typical machine learning (ML)-based computation pipelines and analyzes the potential benefits of adopting hardware-accelerated communication. To this end, we build a model-serving framework that supports various communication mechanisms. Using the framework, we identify performance bottlenecks in state-of-the-art model-serving pipelines and show how hardware-accelerated communication can alleviate them. For example, we show that GPUDirect RDMA can save 15--50\% of model-serving latency, which amounts to 70--160 ms.
Hardware in Loop Learning with Spin Stochastic Neurons
Authors: A N M Nafiul Islam, Kezhou Yang, Amit K. Shukla, Pravin Khanal, Bowei Zhou, Wei-Gang Wang, Abhronil Sengupta
Abstract
Despite the promise of superior efficiency and scalability, real-world deployment of emerging nanoelectronic platforms for brain-inspired computing have been limited thus far, primarily because of inter-device variations and intrinsic non-idealities. In this work, we demonstrate mitigating these issues by performing learning directly on practical devices through a hardware-in-loop approach, utilizing stochastic neurons based on heavy metal/ferromagnetic spin-orbit torque heterostructures. We characterize the probabilistic switching and device-to-device variability of our fabricated devices of various sizes to showcase the effect of device dimension on the neuronal dynamics and its consequent impact on network-level performance. The efficacy of the hardware-in-loop scheme is illustrated in a deep learning scenario achieving equivalent software performance. This work paves the way for future large-scale implementations of neuromorphic hardware and realization of truly autonomous edge-intelligent devices.
PMP: Learning to Physically Interact with Environments using Part-wise Motion Priors
Authors: Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, Young Min Kim
Abstract
We present a method to animate a character incorporating multiple part-wise motion priors (PMP). While previous works allow creating realistic articulated motions from reference data, the range of motion is largely limited by the available samples. Especially for the interaction-rich scenarios, it is impractical to attempt acquiring every possible interacting motion, as the combination of physical parameters increases exponentially. The proposed PMP allows us to assemble multiple part skills to animate a character, creating a diverse set of motions with different combinations of existing data. In our pipeline, we can train an agent with a wide range of part-wise priors. Therefore, each body part can obtain a kinematic insight of the style from the motion captures, or at the same time extract dynamics-related information from the additional part-specific simulation. For example, we can first train a general interaction skill, e.g. grasping, only for the dexterous part, and then combine the expert trajectories from the pre-trained agent with the kinematic priors of other limbs. Eventually, our whole-body agent learns a novel physical interaction skill even with the absence of the object trajectories in the reference motion sequence.
Bayesian Reinforcement Learning with Limited Cognitive Load
Authors: Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy
Abstract
All biological and artificial agents must learn and make decisions given limits on their ability to process information. As such, a general theory of adaptive behavior should be able to account for the complex interactions between an agent's learning history, decisions, and capacity constraints. Recent work in computer science has begun to clarify the principles that shape these dynamics by bridging ideas from reinforcement learning, Bayesian decision-making, and rate-distortion theory. This body of work provides an account of capacity-limited Bayesian reinforcement learning, a unifying normative framework for modeling the effect of processing constraints on learning and action selection. Here, we provide an accessible review of recent algorithms and theoretical results in this setting, paying special attention to how these ideas can be applied to studying questions in the cognitive and behavioral sciences.
Occupancy Prediction-Guided Neural Planner for Autonomous Driving
Abstract
Forecasting the scalable future states of surrounding traffic participants in complex traffic scenarios is a critical capability for autonomous vehicles, as it enables safe and feasible decision-making. Recent successes in learning-based prediction and planning have introduced two primary challenges: generating accurate joint predictions for the environment and integrating prediction guidance for planning purposes. To address these challenges, we propose a two-stage integrated neural planning framework, termed OPGP, that incorporates joint prediction guidance from occupancy forecasting. The preliminary planning phase simultaneously outputs the predicted occupancy for various types of traffic actors based on imitation learning objectives, taking into account shared interactions, scene context, and actor dynamics within a unified Transformer structure. Subsequently, the transformed occupancy prediction guides optimization to further inform safe and smooth planning under Frenet coordinates. We train our planner using a large-scale, real-world driving dataset and validate it in open-loop configurations. Our proposed planner outperforms strong learning-based methods, exhibiting improved performance due to occupancy prediction guidance.
LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition
Authors: Fuyan Ma, Bin Sun, Shutao Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
Previous methods for dynamic facial expression recognition (DFER) in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos. Transformer-based methods for DFER can achieve better performances but result in higher FLOPs and computational costs. To solve these problems, the local-global spatio-temporal Transformer (LOGO-Former) is proposed to capture discriminative features within each frame and model contextual relationships among frames while balancing the complexity. Based on the priors that facial muscles move locally and facial expressions gradually change, we first restrict both the space attention and the time attention to a local window to capture local interactions among feature tokens. Furthermore, we perform the global attention by querying a token with features from each local window iteratively to obtain long-range information of the whole video sequence. In addition, we propose the compact loss regularization term to further encourage the learned features have the minimum intra-class distance and the maximum inter-class distance. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K) indicate that our method provides an effective way to make use of the spatial and temporal dependencies for DFER.
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic
Authors: Damien Sileo, Antoine Lernould
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Theory of Mind (ToM) is a critical component of intelligence, yet accurately measuring it continues to be a subject of debate. Prior research has attempted to apply human ToM assessments to natural language processing models using either human-created standardized tests or rule-based templates. However, these methods primarily focus on simplistic reasoning and require further validation. In this study, we utilize dynamic epistemic logic, which has established overlaps with ToM, to generate more intricate problems. We also introduce novel verbalization techniques to express these problems using natural language. Our findings indicate that certain language model scaling (from 70M to 6B and 350M to 174B) does not consistently yield results better than random chance. While GPT-4 demonstrates improved epistemic reasoning capabilities, there is still room for enhancement. Our code and datasets are publicly available https://github.com/antoinelrnld/modloghttps://huggingface.co/datasets/sileod/mindgames
CHAMELEON: OutSystems Live Bidirectional Transformations
Authors: Hugo Lourenço, João Costa Seco, Carla Ferreira, Tiago Simões, Vasco Silva, Filipe Assunção, André Menezes
Subjects: Software Engineering (cs.SE); Programming Languages (cs.PL)
Abstract
In model-driven engineering, the bidirectional transformation of models plays a crucial role in facilitating the use of editors that operate at different levels of abstraction. This is particularly important in the context of industrial-grade low-code platforms like OutSystems, which feature a comprehensive ecosystem of tools that complement the standard integrated development environment with domain-specific builders and abstract model viewers. We introduce CHAMELEON, a tool that enables the dynamic definition of a live bidirectional model transformation in a declarative manner by leveraging simple and intuitive component patterns. Through this approach, we can gradually define the view and synthesis paths to an abstract model built on top of a low-code metamodel. We devise a standard parser-generating technique for tree-like models that builds upon extended grammar definitions with constraints and name binders. We allow for a greater overlap of model patterns that can still be disambiguated for a clear lens-like behaviour of the transformation. CHAMELEON is evaluated in the fragment of the OutSystems language targeting the definition of user interfaces. To assess performance we used a large set of real OutSystems applications, with approximately 200K UI widgets, and a database of curated widget patterns. We found a worst-case processing time of 92ms for complete models in our benchmark, which is still suitable for the operation of an interactive model editor.
Context-triggered Abstraction-based Control Design
Authors: Satya Prakash Nayak, Lucas Neves Egidio, Matteo Della Rossa, Anne-Kathrin Schmuck, Raphaël Jungers
Subjects: Systems and Control (eess.SY); Computer Science and Game Theory (cs.GT)
Abstract
We consider the problem of automatically synthesizing a hybrid controller for non-linear dynamical systems which ensures that the closed-loop fulfills an arbitrary \emph{Linear Temporal Logic} specification. Moreover, the specification may take into account logical context switches induced by an external environment or the system itself. Finally, we want to avoid classical brute-force time- and space-discretization for scalability. We achieve these goals by a novel two-layer strategy synthesis approach, where the controller generated in the lower layer provides invariant sets and basins of attraction, which are exploited at the upper logical layer in an abstract way. In order to achieve this, we provide new techniques for both the upper- and lower-level synthesis. Our new methodology allows to leverage both the computing power of state space control techniques and the intelligence of finite game solving for complex specifications, in a scalable way.
Local Gaussian Modifiers (LGMs): UAV dynamic trajectory generation for onboard computation
Authors: Miguel Fernandez-Cortizas, David Perez-Saura, Javier Rodriguez-Vazquez, Pascual Campoy
Abstract
Agile autonomous drones are becoming increasingly popular in research due to the challenges they represent in fields like control, state estimation, or perception at high speeds. When all algorithms are computed onboard the uav, the computational limitations make the task of agile and robust flight even more difficult. One of the most computationally expensive tasks in agile flight is the generation of optimal trajectories that tackles the problem of planning a minimum time trajectory for a quadrotor over a sequence of specified waypoints. When these trajectories must be updated online due to changes in the environment or uncertainties, this high computational cost can leverage to not reach the desired waypoints or even crash in cluttered environments. In this paper, a fast lightweight dynamic trajectory modification approach is presented to allow modifying computational heavy trajectories using Local Gaussian Modifiers (LGMs), when recalculating a trajectory is not possible due to the time of computation. Our approach was validated in simulation, being able to pass through a race circuit with dynamic gates with top speeds up to 16.0 m/s, and was also validated in real flight reaching speeds up to 4.0 m/s in a fully autonomous onboard computing condition.
Data-inspired modeling of accidents in traffic flow networks
Abstract
We consider hyperbolic partial differential equations (PDEs) for a dynamic description of the traffic behavior in road networks. These equations are coupled to a Hawkes process that models traffic accidents taking into account their self-excitation property which means that accidents are more likely in areas in which another accident just occured. We discuss how both model components interact and influence each other. A data analysis reveals the self-excitation property of accidents and determines further parameters. Numerical simulations using risk measures underline and conclude the discussion of traffic accident effects in our model.
Iterative $α$-(de)Blending: a Minimalist Deterministic Diffusion Model
Authors: Eric Heitz, Laurent Belcour, Thomas Chambon
Abstract
We derive a minimalist but powerful deterministic denoising-diffusion model. While denoising diffusion has shown great success in many domains, its underlying theory remains largely inaccessible to non-expert users. Indeed, an understanding of graduate-level concepts such as Langevin dynamics or score matching appears to be required to grasp how it works. We propose an alternative approach that requires no more than undergrad calculus and probability. We consider two densities and observe what happens when random samples from these densities are blended (linearly interpolated). We show that iteratively blending and deblending samples produces random paths between the two densities that converge toward a deterministic mapping. This mapping can be evaluated with a neural network trained to deblend samples. We obtain a model that behaves like deterministic denoising diffusion: it iteratively maps samples from one density (e.g., Gaussian noise) to another (e.g., cat images). However, compared to the state-of-the-art alternative, our model is simpler to derive, simpler to implement, more numerically stable, achieves higher quality results in our experiments, and has interesting connections to computer graphics.
A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning
Abstract
In this paper, we present a multimodal \textit{and} dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
Initial Steps Towards Tackling High-dimensional Surrogate Modeling for Neuroevolution Using Kriging Partial Least Squares
Authors: Fergal Stapleton, Edgar Galván
Subjects: Neural and Evolutionary Computing (cs.NE)
Abstract
Surrogate-assisted evolutionary algorithms (SAEAs) aim to use efficient computational models with the goal of approximating the fitness function in evolutionary computation systems. This area of research has been active for over two decades and has received significant attention from the specialised research community in different areas, for example, single and many objective optimisation or dynamic and stationary optimisation problems. An emergent and exciting area that has received little attention from the SAEAs community is in neuroevolution. This refers to the use of evolutionary algorithms in the automatic configuration of artificial neural network (ANN) architectures, hyper-parameters and/or the training of ANNs. However, ANNs suffer from two major issues: (a) the use of highly-intense computational power for their correct training, and (b) the highly specialised human expertise required to correctly configure ANNs necessary to get a well-performing network. This work aims to fill this important research gap in SAEAs in neuroevolution by addressing these two issues. We demonstrate how one can use a Kriging Partial Least Squares method that allows efficient computation of good approximate surrogate models compared to the well-known Kriging method, which normally cannot be used in neuroevolution due to the high dimensionality of the data.
Retraining A Graph-based Recommender with Interests Disentanglement
Abstract
In a practical recommender system, new interactions are continuously observed. Some interactions are expected, because they largely follow users' long-term preferences. Some other interactions are indications of recent trends in user preference changes or marketing positions of new items. Accordingly, the recommender needs to be periodically retrained or updated to capture the new trends, and yet not to forget the long-term preferences. In this paper, we propose a novel and generic retraining framework called Disentangled Incremental Learning (DIL) for graph-based recommenders. We assume that long-term preferences are well captured in the existing model, in the form of model parameters learned from past interactions. New preferences can be learned from the user-item bipartite graph constructed using the newly observed interactions. In DIL, we design an Information Extraction Module to extract historical preferences from the existing model. Then we blend the historical and new preferences in the form of node embeddings in the new graph, through a Disentanglement Module. The essence of the disentanglement module is to decorrelate the historical and new preferences so that both can be well captured, via carefully designed losses. Through experiments on three benchmark datasets, we show the effectiveness of DIL in capturing dynamics of useritem interactions. We also demonstrate the robustness of DIL by attaching it to two base models - LightGCN and NGCF.
Asynchronous Events-based Panoptic Segmentation using Graph Mixer Neural Network
Abstract
In the context of robotic grasping, object segmentation encounters several difficulties when faced with dynamic conditions such as real-time operation, occlusion, low lighting, motion blur, and object size variability. In response to these challenges, we propose the Graph Mixer Neural Network that includes a novel collaborative contextual mixing layer, applied to 3D event graphs formed on asynchronous events. The proposed layer is designed to spread spatiotemporal correlation within an event graph at four nearest neighbor levels parallelly. We evaluate the effectiveness of our proposed method on the Event-based Segmentation (ESD) Dataset, which includes five unique image degradation challenges, including occlusion, blur, brightness, trajectory, scale variance, and segmentation of known and unknown objects. The results show that our proposed approach outperforms state-of-the-art methods in terms of mean intersection over the union and pixel accuracy. Code available at: https://github.com/sanket0707/GNN-Mixer.git
On the Effectiveness of Equivariant Regularization for Robust Online Continual Learning
Authors: Lorenzo Bonicelli, Matteo Boschini, Emanuele Frascaroli, Angelo Porrello, Matteo Pennisi, Giovanni Bellitto, Simone Palazzo, Concetto Spampinato, Simone Calderara
Abstract
Humans can learn incrementally, whereas neural networks forget previously acquired information catastrophically. Continual Learning (CL) approaches seek to bridge this gap by facilitating the transfer of knowledge to both previous tasks (backward transfer) and future ones (forward transfer) during training. Recent research has shown that self-supervision can produce versatile models that can generalize well to diverse downstream tasks. However, contrastive self-supervised learning (CSSL), a popular self-supervision technique, has limited effectiveness in online CL (OCL). OCL only permits one iteration of the input dataset, and CSSL's low sample efficiency hinders its use on the input data-stream. In this work, we propose Continual Learning via Equivariant Regularization (CLER), an OCL approach that leverages equivariant tasks for self-supervision, avoiding CSSL's limitations. Our method represents the first attempt at combining equivariant knowledge with CL and can be easily integrated with existing OCL methods. Extensive ablations shed light on how equivariant pretext tasks affect the network's information flow and its impact on CL dynamics.
Fast Dynamic Programming in Trees in the MPC Model
Authors: Chetan Gupta, Rustam Latypov, Yannic Maus, Shreyas Pai, Simo Särkkä, Jan Studený, Jukka Suomela, Jara Uitto, Hossein Vahidi
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Abstract
We present a deterministic algorithm for solving a wide range of dynamic programming problems in trees in $O(\log D)$ rounds in the massively parallel computation model (MPC), with $O(n^\delta)$ words of local memory per machine, for any given constant $0 < \delta < 1$. Here $D$ is the diameter of the tree and $n$ is the number of nodes--we emphasize that our running time is independent of $n$. Our algorithm can solve many classical graph optimization problems such as maximum weight independent set, maximum weight matching, minimum weight dominating set, and minimum weight vertex cover. It can also be used to solve many accumulation tasks in which some aggregate information is propagated upwards or downwards in the tree--this includes, for example, computing the sum, minimum, or maximum of the input labels in each subtree, as well as many inference tasks commonly solved with belief propagation. Our algorithm can also solve any locally checkable labeling problem (LCLs) in trees. Our algorithm works for any reasonable representation of the input tree; for example, the tree can be represented as a list of edges or as a string with nested parentheses or tags. The running time of $O(\log D)$ rounds is also known to be necessary, assuming the widely-believed $2$-cycle conjecture. Our algorithm strictly improves on two prior algorithms: (i) Bateni, Behnezhad, Derakhshan, Hajiaghayi, and Mirrokni [ICALP'18] solve problems of these flavors in $O(\log n)$ rounds, while our algorithm is much faster in low-diameter trees. Furthermore, their algorithm also uses randomness, while our algorithm is deterministic. (ii) Balliu, Latypov, Maus, Olivetti, and Uitto [SODA'23] solve only locally checkable labeling problems in $O(\log D)$ rounds, while our algorithm can be applied to a much broader family of problems.
LMEye: An Interactive Perception Network for Large Language Models
Authors: Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Min Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Training a Large Visual Language Model (LVLM) from scratch, like GPT-4, is resource-intensive. Our paper proposes an alternative method called LMEye, a play-plug-in Interactive Perception Network for Large Language Models (LLMs), aiming to improve the accuracy of image understanding for the LVLM. Previous methods that infuse visual information into LLMs utilize a static visual mapping network, but lack dynamic interaction between the LLMs and visual information. LMEye addresses this issue by allowing the LLM to incorporate the visual information that aligned with human instruction. Specifically, the LMEye network consists of a static visual mapping network to provide the basic perception of an image to LLMs. Then, it also contains additional linear layers responsible for acquiring requests from LLMs, decomposing image features, and transmitting the interleaved information to LLMs, respectively. In this way, LLMs act to be in charge of understanding human instructions, sending it to the interactive perception network, and generating the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal question answering and reasoning tasks, demonstrating that it significantly improves the zero-shot performance of LLMs on multimodal tasks compared to previous methods.
DSPDet3D: Dynamic Spatial Pruning for 3D Small Object Detection
Authors: Xiuwei Xu, Zhihao Sun, Ziwei Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we propose a new detection framework for 3D small object detection. Although deep learning-based 3D object detection methods have achieved great success in recent years, current methods still struggle on small objects due to weak geometric information. With in-depth study, we find increasing the spatial resolution of the feature maps significantly boosts the performance of 3D small object detection. And more interestingly, though the computational overhead increases dramatically with resolution, the growth mainly comes from the upsampling operation of the decoder. Inspired by this, we present a high-resolution multi-level detector with dynamic spatial pruning named DSPDet3D, which detects objects from large to small by iterative upsampling and meanwhile prunes the spatial representation of the scene at regions where there is no smaller object to be detected in higher resolution. As the 3D detector only needs to predict sparse bounding boxes, pruning a large amount of uninformative features does not degrade the detection performance but significantly reduces the computational cost of upsampling. In this way, our DSPDet3D achieves high accuracy on small object detection while requiring even less memory footprint and inference time. On ScanNet and TO-SCENE dataset, our method improves the detection performance of small objects to a new level while achieving leading inference speed among all mainstream indoor 3D object detection methods.
Keyword: efficient
Digital and Cloud Forensic Challenges
Inverse scattering of periodic surfaces with a superlens
Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
A Subjective Dataset for Multi-Screen Video Streaming Applications
NEUROPULS: NEUROmorphic energy-efficient secure accelerators based on Phase change materials aUgmented siLicon photonicS
Testing Convex Truncation
CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning
Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records
On Range Summary Queries
A Generative Modeling Framework for Inferring Families of Biomechanical Constitutive Laws in Data-Sparse Regimes
Multi-agent Delegated Search
Near-realtime Facial Animation by Deep 3D Simulation Super-Resolution
HeteroEdge: Addressing Asymmetry in Heterogeneous Collaborative Autonomous Systems
Rescue Conversations from Dead-ends: Efficient Exploration for Task-oriented Dialogue Policy Optimization
Semantic Segmentation using Vision Transformers: A survey
Numerical stability analysis of shock-capturing methods for strong shocks I: second-order MUSCL schemes
Composite Motion Learning with Task Control
StarPlat: A Versatile DSL for Graph Analytics
Solution existence, uniqueness, and stability of discrete basis sinograms in multispectral CT
The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation
DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation
Compressing audio CNNs with graph centrality based filter pruning
Adaptive Graph Convolutional Subspace Clustering
Using ChatGPT for Entity Matching
Repair of Reed-Solomon Codes in the Presence of Erroneous Nodes
Descend: A Safe GPU Systems Programming Language
Interactive Acquisition of Fine-grained Visual Concepts by Exploiting Semantics of Generic Characterizations in Discourse
Modular Polynomial Codes for Secure and Robust Distributed Matrix Multiplication
Generative Steganography Diffusion
Zoo Guide to Network Embedding
High-Level Context Representation for Emotion Recognition in Images
Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment
Over-the-Air Federated Averaging with Limited Power and Privacy Budgets
Energy-Latency Aware Intelligent Reflecting Surface Aided Multi-cell Mobile Edge Computing
Initial Steps Towards Tackling High-dimensional Surrogate Modeling for Neuroevolution Using Kriging Partial Least Squares
Uncertainty Quantification for Fisher-Kolmogorov Equation on Graphs with Application to Patient-Specific Alzheimer Disease
Verifiable Learning for Robust Tree Ensembles
On Preimage Approximation for Neural Networks
Keyword: faster
Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching
Online Gesture Recognition using Transformer and Natural Language Processing
On Preimage Approximation for Neural Networks
Fast Dynamic Programming in Trees in the MPC Model
On the Benefits of Semi-Supervised Test Case Generation for Cyber-Physical Systems
Keyword: mobile
Federated Ensemble-Directed Offline Reinforcement Learning
A Subjective Dataset for Multi-Screen Video Streaming Applications
A Large-scale Empirical Study of Online Automated Privacy Policy Generators for Mobile Apps
Enhanced Low-Complexity FDD System Feedback with Variable Bit Lengths via Generative Modeling
Next-generation Surgical Navigation: Multi-view Marker-less 6DoF Pose Estimation of Surgical Instruments
Energy-Latency Aware Intelligent Reflecting Surface Aided Multi-cell Mobile Edge Computing
Keyword: pruning
Program Synthesis for Robot Learning from Demonstrations
Compressing audio CNNs with graph centrality based filter pruning
Enhanced Low-Complexity FDD System Feedback with Variable Bit Lengths via Generative Modeling
Regular Methods for Operator Precedence Languages
Learn how to Prune Pixels for Multi-view Neural Image-based Synthesis
DSPDet3D: Dynamic Spatial Pruning for 3D Small Object Detection
Keyword: voxel
Smaller3d: Smaller Models for 3D Semantic Segmentation Using Minkowski Engine and Knowledge Distillation Methods
Keyword: lidar
HeteroEdge: Addressing Asymmetry in Heterogeneous Collaborative Autonomous Systems
Multi S-graphs: A Collaborative Semantic SLAM architecture
DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception
Keyword: diffusion
DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation
Guided Image Synthesis via Initial Image Editing in Diffusion Model
High-order BDF convolution quadrature for subdiffusion models with a singular source term
Generative Steganography Diffusion
Iterative $α$-(de)Blending: a Minimalist Deterministic Diffusion Model
Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Data Curation for Image Captioning with Text-to-Image Generative Models
Conditional Diffusion Feature Refinement for Continuous Sign Language Recognition
Uncertainty Quantification for Fisher-Kolmogorov Equation on Graphs with Application to Patient-Specific Alzheimer Disease
Training Is Everything: Artificial Intelligence, Copyright, and Fair Training
Keyword: dynamic
Idiolect: A Reconfigurable Voice Coding Assistant
CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning
Understanding the Benefits of Hardware-Accelerated Communication in Model-Serving Applications
Hardware in Loop Learning with Spin Stochastic Neurons
PMP: Learning to Physically Interact with Environments using Part-wise Motion Priors
Bayesian Reinforcement Learning with Limited Cognitive Load
Occupancy Prediction-Guided Neural Planner for Autonomous Driving
LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition
MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic
CHAMELEON: OutSystems Live Bidirectional Transformations
Context-triggered Abstraction-based Control Design
Local Gaussian Modifiers (LGMs): UAV dynamic trajectory generation for onboard computation
Data-inspired modeling of accidents in traffic flow networks
Iterative $α$-(de)Blending: a Minimalist Deterministic Diffusion Model
A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning
Initial Steps Towards Tackling High-dimensional Surrogate Modeling for Neuroevolution Using Kriging Partial Least Squares
Retraining A Graph-based Recommender with Interests Disentanglement
Asynchronous Events-based Panoptic Segmentation using Graph Mixer Neural Network
On the Effectiveness of Equivariant Regularization for Robust Online Continual Learning
Fast Dynamic Programming in Trees in the MPC Model
LMEye: An Interactive Perception Network for Large Language Models
DSPDet3D: Dynamic Spatial Pruning for 3D Small Object Detection