New submissions for Fri, 31 Mar 23

Keyword: efficient

Machine learning-based spin structure detection

Authors: Isaac Labrie-Boulay, Thomas Brian Winkler, Daniel Franzen, Alena Romanova, Hans Fangohr, Mathias Kläui
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Data Analysis, Statistics and Probability (physics.data-an)
Arxiv link: https://arxiv.org/abs/2303.16905
Pdf link: https://arxiv.org/pdf/2303.16905
Abstract One of the most important magnetic spin structures is the topologically stabilised skyrmion quasi-particle. Its interesting physical properties make it a candidate for memory and efficient neuromorphic computation schemes. For device operation, detection of the position, shape, and size of skyrmions is required, and magnetic imaging is typically employed. A frequently used technique is magneto-optical Kerr microscopy, where, depending on the sample's material composition, temperature, material growing procedures, etc., the measurements suffer from noise, low contrast, intensity gradients, or other optical artifacts. Conventional image analysis packages require manual treatment, and a more automatic solution is required. We report a convolutional neural network specifically designed for segmentation problems to detect the position and shape of skyrmions in our measurements. The network is tuned using selected techniques to optimize predictions and in particular the number of detected classes is found to govern the performance. The results of this study show that a well-trained network is a viable method of automating data pre-processing in magnetic microscopy. The approach is easily extendable to other spin structures and other magnetic imaging methods.
Optimizing Reconfigurable Intelligent Surfaces for Short Transmissions: How Detailed Configurations can be Afforded?
Authors: Anders Enqvist, Özlem Tuğfe Demir, Cicek Cavdar, Emil Björnson
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2303.16913
Pdf link: https://arxiv.org/pdf/2303.16913
Abstract In this paper, we examine how to minimize the total energy consumption of a user equipment (UE) when it transmits a finite-sized data payload of a given length. The receiving base station (BS) controls a reconfigurable intelligent surface (RIS) that can be utilized to improve the channel conditions, but only if additional pilot signals are transmitted to configure the RIS. The challenge is that the pilot resources spent on configuring the RIS increase the energy consumption, especially when small payloads are transmitted, so it must be balanced against the energy savings during data transmission. We derive a formula for the energy consumption, taking both the pilot and data transmission power into account. It also includes the effects of imperfect channel state information, the use of phase-shifts with finite resolution at the RIS, and the passive circuit energy consumption. We also consider how dividing the RIS into subarrays consisting of multiple RIS elements using the same reflection coefficient can shorten the pilot length. In particular, the pilot power and subarray size are tuned to the payload length to minimize the energy consumption while maintaining parts of the aperture gain. Our analytical results show that, for a given geometry and transmission payload length, there exists a unique energy-minimizing subarray size and pilot power. For small payloads and when the channel conditions between the BS and UE are favorable compared to the path to the RIS, the energy consumption is minimized using subarrays with many elements and low pilot transmission power. On the other hand, when the channel conditions to the RIS are better and the data payloads are large, it is preferable to use fewer elements per subarray, potentially configuring each element individually and transmitting the pilot signals with additional power.
T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals
Authors: James Giroux, Martin Bouchard, Robert Laganiere
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.16940
Pdf link: https://arxiv.org/pdf/2303.16940
Abstract Object detection utilizing Frequency Modulated Continuous Wave radar is becoming increasingly popular in the field of autonomous systems. Radar does not possess the same drawbacks seen by other emission-based sensors such as LiDAR, primarily the degradation or loss of return signals due to weather conditions such as rain or snow. However, radar does possess traits that make it unsuitable for standard emission-based deep learning representations such as point clouds. Radar point clouds tend to be sparse and therefore information extraction is not efficient. To overcome this, more traditional digital signal processing pipelines were adapted to form inputs residing directly in the frequency domain via Fast Fourier Transforms. Commonly, three transformations were used to form Range-Azimuth-Doppler cubes in which deep learning algorithms could perform object detection. This too has drawbacks, namely the pre-processing costs associated with performing multiple Fourier Transforms and normalization. We explore the possibility of operating on raw radar inputs from analog to digital converters via the utilization of complex transformation layers. Moreover, we introduce hierarchical Swin Vision transformers to the field of radar object detection and show their capability to operate on inputs varying in pre-processing, along with different radar configurations, i.e. relatively low and high numbers of transmitters and receivers, while obtaining on par or better results than the state-of-the-art.
Concise QBF Encodings for Games on a Grid (extended version)
Authors: Irfansha Shaik, Jaco van de Pol
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.16949
Pdf link: https://arxiv.org/pdf/2303.16949
Abstract Encoding 2-player games in QBF correctly and efficiently is challenging and error-prone. To enable concise specifications and uniform encodings of games played on grid boards, like Tic-Tac-Toe, Connect-4, Domineering, Pursuer-Evader and Breakthrough, we introduce Board-game Domain Definition Language (BDDL), inspired by the success of PDDL in the planning domain. We provide an efficient translation from BDDL into QBF, encoding the existence of a winning strategy of bounded depth. Our lifted encoding treats board positions symbolically and allows concise definitions of conditions, effects and winning configurations, relative to symbolic board positions. The size of the encoding grows linearly in the input model and the considered depth. To show the feasibility of such a generic approach, we use QBF solvers to compute the critical depths of winning strategies for instances of several known games. For several games, our work provides the first QBF encoding. Unlike plan validation in SAT-based planning, validating QBF-based winning strategies is difficult. We show how to validate winning strategies using QBF certificates and interactive game play.
Fairness-Aware Data Valuation for Supervised Learning
Authors: José Pombal, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2303.16963
Pdf link: https://arxiv.org/pdf/2303.16963
Abstract Data valuation is an ML field that studies the value of training instances towards a given predictive task. Although data bias is one of the main sources of downstream model unfairness, previous work in data valuation does not consider how training instances may influence both performance and fairness of ML models. Thus, we propose Fairness-Aware Data valuatiOn (FADO), a data valuation framework that can be used to incorporate fairness concerns into a series of ML-related tasks (e.g., data pre-processing, exploratory data analysis, active learning). We propose an entropy-based data valuation metric suited to address our two-pronged goal of maximizing both performance and fairness, which is more computationally efficient than existing metrics. We then show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques. Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline -- and promote fairness in a data-centric way, where a deeper understanding of data quality takes center stage.
Computationally efficient sampling methods for sparsity promoting hierarchical Bayesian models
Authors: Daniela Calvetti, Erkki Somersalo
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2303.16988
Pdf link: https://arxiv.org/pdf/2303.16988
Abstract Bayesian hierarchical models have been demonstrated to provide efficient algorithms for finding sparse solutions to ill-posed inverse problems. The models typically comprise a conditionally Gaussian prior model for the unknown, augmented by a hyperprior model for the variances. A widely used choice for the hyperprior is a member of the family of generalized gamma distributions. Most of the work in the literature has concentrated on numerical approximation of the maximum a posteriori (MAP) estimates, and less attention has been paid to sampling methods or other means for uncertainty quantification. Sampling from the hierarchical models is challenging mainly for two reasons: The hierarchical models are typically high-dimensional, thus suffering from the curse of dimensionality, and the strong correlation between the unknown of interest and its variance can make sampling rather inefficient. This work addresses mainly the first one of these obstacles. By using a novel reparametrization, it is shown how the posterior distribution can be transformed into one dominated by a Gaussian white noise, allowing sampling by using the preconditioned Crank-Nicolson (pCN) scheme that has been shown to be efficient for sampling from distributions dominated by a Gaussian component. Furthermore, a novel idea for speeding up the pCN in a special case is developed, and the question of how strongly the hierarchical models are concentrated on sparse solutions is addressed in light of a computed example.
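For readers unfamiliar with pCN, the following is a minimal sketch of the textbook scheme, not the reparametrized sampler or the speed-up developed in the paper. It assumes a target proportional to exp(-Phi(u)) times a standard Gaussian white-noise reference; the function name `pcn_sample`, the parameter `beta`, and the toy potential in the usage note are illustrative assumptions.

```python
import math
import random


def pcn_sample(neg_log_likelihood, dim, n_steps, beta=0.2, seed=0):
    """Textbook pCN sampler (illustrative sketch).

    Targets a density proportional to exp(-Phi(u)) * N(u; 0, I), where
    Phi is `neg_log_likelihood`. Returns the chain and the acceptance rate.
    """
    rng = random.Random(seed)
    u = [0.0] * dim
    phi_u = neg_log_likelihood(u)
    scale = math.sqrt(1.0 - beta * beta)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        # The proposal leaves the standard Gaussian reference invariant.
        v = [scale * ui + beta * rng.gauss(0.0, 1.0) for ui in u]
        phi_v = neg_log_likelihood(v)
        # The acceptance ratio involves only the non-Gaussian part Phi.
        log_alpha = phi_u - phi_v
        if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
            u, phi_u = v, phi_v
            accepted += 1
        samples.append(list(u))
    return samples, accepted / n_steps
```

As a usage example, `pcn_sample(lambda u: 0.5 * (u[0] - 1.0) ** 2, dim=1, n_steps=20000, beta=0.5)` samples a toy one-dimensional posterior whose exact mean is 0.5. Because the Gaussian part never enters the acceptance ratio, a constant Phi accepts every proposal, which is why the scheme suits distributions dominated by a Gaussian component.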
The G-invariant graph Laplacian
Authors: Eitan Rosen, Yoel Shkolnisky
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2303.17001
Pdf link: https://arxiv.org/pdf/2303.17001
Abstract Graph Laplacian based algorithms for data lying on a manifold have been proven effective for tasks such as dimensionality reduction, clustering, and denoising. In this work, we consider data sets whose data points not only lie on a manifold, but are also closed under the action of a continuous group. An example of such a data set is volumes that lie on a low dimensional manifold, where each volume may be rotated in three-dimensional space. We introduce the G-invariant graph Laplacian that generalizes the graph Laplacian by accounting for the action of the group on the data set. We show that like the standard graph Laplacian, the G-invariant graph Laplacian converges to the Laplace-Beltrami operator on the data manifold, but with a significantly improved convergence rate. Furthermore, we show that the eigenfunctions of the G-invariant graph Laplacian admit the form of tensor products between the group elements and eigenvectors of certain matrices, which can be computed efficiently using FFT-type algorithms. We demonstrate our construction and its advantages on the problem of filtering data on a noisy manifold closed under the action of the special unitary group SU(2).
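As background, the standard graph Laplacian that the paper generalizes can be sketched as follows. This is the ordinary random-walk normalized construction with a Gaussian kernel, not the G-invariant operator; the function name `random_walk_laplacian` and the `bandwidth` parameter are illustrative assumptions.

```python
import math


def random_walk_laplacian(points, bandwidth=1.0):
    """Standard graph Laplacian L = I - D^{-1} W (illustrative sketch).

    `points` is a list of equal-length coordinate tuples; W uses the
    Gaussian kernel exp(-||x_i - x_j||^2 / bandwidth).
    """
    n = len(points)
    weights = [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j])) / bandwidth)
            for j in range(n)
        ]
        for i in range(n)
    ]
    laplacian = []
    for i in range(n):
        degree = sum(weights[i])  # row sum, i.e. the diagonal entry of D
        laplacian.append(
            [(1.0 if i == j else 0.0) - weights[i][j] / degree for j in range(n)]
        )
    return laplacian
```

Each row of the result sums to zero, so constant functions lie in the null space, mirroring the behaviour of the Laplace-Beltrami operator it approximates as the sample size grows.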
The secret of immersion: actor driven camera movement generation for auto-cinematography
Authors: Xinyi Wu, Haohong Wang, Aggelos K. Katsaggelos
Subjects: Multimedia (cs.MM); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17041
Pdf link: https://arxiv.org/pdf/2303.17041
Abstract Immersion plays a vital role when designing cinematic creations, yet the difficulty in immersive shooting prevents designers from creating satisfactory outputs. In this work, we analyze the specific components that contribute to cinematographic immersion at the spatial, emotional, and aesthetic levels, and these components are then combined into a high-level evaluation mechanism. Guided by such an immersion mechanism, we propose a GAN-based camera control system that is able to generate actor-driven camera movements in the 3D virtual environment to obtain immersive film sequences. The proposed encoder-decoder architecture in the generation flow transfers character motion into camera trajectory conditioned on an emotion factor. This ensures spatial and emotional immersion by performing actor-camera synchronization physically and psychologically. The emotional immersion is further strengthened by incorporating regularization that controls camera shakiness for expressing different mental statuses. To achieve aesthetic immersion, we make an effort to improve aesthetic frame compositions by modifying the synthesized camera trajectory. Based on a self-supervised adjustor, the adjusted camera placements can project the character to the appropriate on-frame locations following aesthetic rules. The experimental results indicate that our proposed camera control system can efficiently offer immersive cinematic videos, both quantitatively and qualitatively, based on a fine-grained immersive shooting. Live examples are shown in the supplementary video.
Material-agnostic Shaping of Granular Materials with Optimal Transport
Authors: Nikhilesh Alatur, Olov Andersson, Roland Siegwart, Lionel Ott
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17047
Pdf link: https://arxiv.org/pdf/2303.17047
Abstract From construction materials, such as sand or asphalt, to kitchen ingredients, like rice, sugar, or salt; the world is full of granular materials. Despite impressive progress in robotic manipulation, manipulating and interacting with granular material remains a challenge due to difficulties in perceiving, representing, modelling, and planning for these variable materials that have complex internal dynamics. While some prior work has looked into estimating or learning accurate dynamics models for granular materials, the literature is still missing a more abstract planning method that can be used for planning manipulation actions for granular materials with unknown material properties. In this work, we leverage tools from optimal transport and connect them to robot motion planning. We propose a heuristics-based sweep planner that does not require knowledge of the material's properties and directly uses a height map representation to generate promising sweeps. These sweeps transform granular material from arbitrary start shapes into arbitrary target shapes. We apply the sweep planner in a fast and reactive feedback loop and avoid the need for model-based planning over multiple time steps. We validate our approach with a large set of simulation and hardware experiments where we show that our method is capable of efficiently solving several complex tasks, including gathering, separating, and shaping of several types of granular materials into different target shapes.
Transductive few-shot adapters for medical image segmentation
Authors: Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17051
Pdf link: https://arxiv.org/pdf/2303.17051
Abstract With the recent rise of foundation models in computer vision and NLP, the pretrain-and-adapt strategy, where a large-scale model is fine-tuned on downstream tasks, is gaining popularity. However, traditional fine-tuning approaches may still require significant resources and yield sub-optimal results when the labeled data of the target task is scarce. This is especially the case in clinical settings. To address this challenge, we formalize few-shot efficient fine-tuning (FSEFT), a novel and realistic setting for medical image segmentation. Furthermore, we introduce a novel parameter-efficient fine-tuning strategy tailored to medical image segmentation, with (a) spatial adapter modules that are more appropriate for dense prediction tasks; and (b) a constrained transductive inference, which leverages task-specific prior knowledge. Our comprehensive experiments on a collection of public CT datasets for organ segmentation reveal the limitations of standard fine-tuning methods in few-shot scenarios, point to the potential of vision adapters and transductive inference, and confirm the suitability of foundation models.
A Tensor-based Convolutional Neural Network for Small Dataset Classification
Authors: Zhenhua Chen, David Crandall
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2303.17061
Pdf link: https://arxiv.org/pdf/2303.17061
Abstract Inspired by the ConvNets with structured hidden representations, we propose a Tensor-based Neural Network, TCNN. Different from ConvNets, TCNNs are composed of structured neurons rather than scalar neurons, and the basic operation is neuron tensor transformation. Unlike other structured ConvNets, where the part-whole relationships are modeled explicitly, the relationships are learned implicitly in TCNNs. Also, the structured neurons in TCNNs are high-rank tensors rather than vectors or matrices. We compare TCNNs with current popular ConvNets, including ResNets, MobileNets, EfficientNets, RegNets, etc., on CIFAR10, CIFAR100, and Tiny ImageNet. The experiment shows that TCNNs have higher efficiency in terms of parameters. TCNNs also show higher robustness against white-box adversarial attacks on MNIST compared to ConvNets.
Reading Strategies for Graph Visualizations that Wrap Around in Torus Topology
Authors: Kun-Ting Chen, Quynh Quang Ngo, Kuno Kurzhals, Kim Marriott, Tim Dwyer, Michael Sedlmair, Daniel Weiskopf
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2303.17066
Pdf link: https://arxiv.org/pdf/2303.17066
Abstract We investigate reading strategies for node-link diagrams that wrap around the boundaries in a flattened torus topology by examining eye tracking data recorded in a previous controlled study. Prior work showed that torus drawing affords greater flexibility in clutter reduction than traditional node-link representations, but impedes link-and-path exploration tasks, while repeating tiles around boundaries aids comprehension. However, it remains unclear what strategies users apply in different wrapping settings. This is important for design implications for future work on more effective wrapped visualizations for network applications, and cyclic data that could benefit from wrapping. We perform visual-exploratory data analysis of gaze data, and conduct statistical tests derived from the patterns identified. Results show distinguishable gaze behaviors, with more visual glances and transitions between areas of interest in the non-replicated layout. Full-context has more successful visual searches than partial-context, but the gaze allocation indicates that the layout could be more space-efficient.
Dependent Task Offloading in Edge Computing Using GNN and Deep Reinforcement Learning
Authors: Zequn Cao, Xiaoheng Deng
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2303.17100
Pdf link: https://arxiv.org/pdf/2303.17100
Abstract Task offloading is a widely used technology in Mobile Edge Computing (MEC), which reduces the completion time of user tasks with the help of resourceful edge servers. Existing works mainly focus on the case where the computation density of a user task is homogeneous, so that it can be offloaded in full or by percentage. However, various user tasks in real life consist of several internally dependent subtasks, each of which is logically a minimum execution unit. Motivated by this gap, we aim to solve the Dependent Task Offloading (DTO) problem in a multi-user multi-edge scenario in this paper. We first use a Directed Acyclic Graph (DAG) to represent a dependent task, where nodes indicate subtasks and directed edges indicate dependencies among subtasks. Then we propose a scheme based on Graph Attention Network (GAT) and Deep Reinforcement Learning (DRL) to minimize the makespan of user tasks. To utilize GAT efficiently, we train it on the resourceful cloud in an unsupervised style due to the numerous data and computation resource requirements. In addition, we design a multi-discrete action space for the DRL algorithm to enhance the applicability of our proposed scheme. Experiments are conducted on broadly distributed synthetic data. The results demonstrate that our proposed approach can be adapted to both simple and complex MEC environments and outperforms other methods.
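To make the DAG and makespan terminology concrete, here is a minimal sketch that computes the makespan of one dependent task. It assumes every subtask starts as soon as all its predecessors finish (unlimited parallel servers, no transfer delay), which is a simplification and not the paper's offloading model; the names `makespan`, `durations`, and `deps` are illustrative.

```python
from graphlib import TopologicalSorter


def makespan(durations, deps):
    """Completion time of a dependent task modelled as a DAG (sketch).

    durations: {subtask: execution time}
    deps:      {subtask: set of predecessor subtasks}
    Assumes unlimited parallelism and zero communication cost.
    """
    # Make sure subtasks without predecessors are part of the graph too.
    graph = {task: deps.get(task, ()) for task in durations}
    finish = {}
    # static_order() yields each subtask after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[task]), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values(), default=0.0)
```

For example, with durations `{"a": 2, "b": 3, "c": 1, "d": 4}` and dependencies `{"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}`, the critical path is a, b, d and the makespan is 9. An offloading policy changes the per-subtask durations and adds transfer delays, which is what a scheduler would then try to minimize.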
Deep Generative Model and Its Applications in Efficient Wireless Network Management: A Tutorial and Case Study
Authors: Yinqiu Liu, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Dong In Kim, Abbas Jamalipour
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17114
Pdf link: https://arxiv.org/pdf/2303.17114
Abstract With the phenomenal success of diffusion models and ChatGPT, deep generation models (DGMs) have been experiencing explosive growth since 2022. Not limited to content generation, DGMs are also widely adopted in Internet of Things, Metaverse, and digital twin, due to their outstanding ability to represent complex patterns and generate plausible samples. In this article, we explore the applications of DGMs in a crucial task, i.e., improving the efficiency of wireless network management. Specifically, we first give an overview of generative AI, as well as three representative DGMs. Then, a DGM-empowered framework for wireless network management is proposed, in which we elaborate on the issues of the conventional network management approaches, why DGMs can address them efficiently, and the step-by-step workflow for applying DGMs in managing wireless networks. Moreover, we conduct a case study on network economics, using the state-of-the-art DGM model, i.e., diffusion model, to generate effective contracts for incentivizing the mobile AI-Generated Content (AIGC) services. Last but not least, we discuss important open directions for further research.
Conservation and stability in a discontinuous Galerkin method for the vector invariant spherical shallow water equations
Authors: Kieran Ricardo, David Lee, Kenneth Duru
Subjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2303.17120
Pdf link: https://arxiv.org/pdf/2303.17120
Abstract We develop a novel and efficient discontinuous Galerkin spectral element method (DG-SEM) for the spherical rotating shallow water equations in vector invariant form. We prove that the DG-SEM is energy stable, and discretely conserves mass, vorticity, and linear geostrophic balance on general curvilinear meshes. These theoretical results are possible due to our novel entropy stable numerical DG fluxes for the shallow water equations in vector invariant form. We experimentally verify these results on a cubed sphere mesh. Additionally, we show that our method is robust, that is, it can be run stably without any dissipation. The entropy stable fluxes are sufficient to control the grid scale noise generated by geostrophic turbulence without the need for artificial stabilisation.
C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation
Authors: Nazmul Karim, Niluthpol Chowdhury Mithun, Abhinav Rajvanshi, Han-pang Chiu, Supun Samarasekera, Nazanin Rahnavard
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17132
Pdf link: https://arxiv.org/pdf/2303.17132
Abstract Unsupervised domain adaptation (UDA) approaches focus on adapting models trained on a labeled source domain to an unlabeled target domain. UDA methods have a strong assumption that the source data is accessible during adaptation, which may not be feasible in many real-world scenarios due to privacy concerns and resource constraints of devices. In this regard, source-free domain adaptation (SFDA) excels as access to source data is no longer required during adaptation. Recent state-of-the-art (SOTA) methods on SFDA mostly focus on pseudo-label refinement based self-training which generally suffers from two issues: i) inevitable occurrence of noisy pseudo-labels that could lead to early training time memorization, ii) refinement process requires maintaining a memory bank which creates a significant burden in resource constraint scenarios. To address these concerns, we propose C-SFDA, a curriculum learning aided self-training framework for SFDA that adapts efficiently and reliably to changes across domains based on selective pseudo-labeling. Specifically, we employ a curriculum learning scheme to promote learning from a restricted amount of pseudo labels selected based on their reliabilities. This simple yet effective step successfully prevents label noise propagation during different stages of adaptation and eliminates the need for costly memory-bank based label refinement. Our extensive experimental evaluations on both image recognition and semantic segmentation tasks confirm the effectiveness of our method. C-SFDA is readily applicable to online test-time domain adaptation and also outperforms previous SOTA methods in this task.
DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving
Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Wangmeng Xiang, Binghui Chen, Bin Luo, Yifeng Geng, Xuansong Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17144
Pdf link: https://arxiv.org/pdf/2303.17144
Abstract Real-time perception, or streaming perception, is a crucial aspect of autonomous driving that has yet to be thoroughly explored in existing research. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms, delivering a cutting-edge solution. The key innovations of DAMO-StreamNet are: (1) A robust neck structure incorporating deformable convolution, enhancing the receptive field and feature alignment capabilities. (2) A dual-branch structure that integrates short-path semantic features and long-path temporal features, improving motion state prediction accuracy. (3) Logits-level distillation for efficient optimization, aligning the logits of teacher and student networks in semantic space. (4) A real-time forecasting mechanism that updates support frame features with the current frame, ensuring seamless streaming perception during inference. Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% (normal size (600, 960)) and 43.3% (large size (1200, 1920)) sAP without using extra data. This work not only sets a new benchmark for real-time perception but also provides valuable insights for future research. Additionally, DAMO-StreamNet can be applied to various autonomous systems, such as drones and robots, paving the way for real-time perception.
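Aligning teacher and student logits can be illustrated with the classic temperature-scaled distillation loss. This is a generic sketch of that well-known formulation, not the specific logits-level distillation used in DAMO-StreamNet; the function names and the `temperature` default are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge, so minimizing it pulls the student's predictions toward the teacher's.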
Convergence of the CEM-GMsFEM for compressible flow in highly heterogeneous media
Authors: Leonardo A. Poveda, Shubin Fu, Eric T. Chung, Lina Zhao
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2303.17157
Pdf link: https://arxiv.org/pdf/2303.17157
Abstract This paper presents and analyses a Constraint Energy Minimization Generalized Multiscale Finite Element Method (CEM-GMsFEM) for solving single-phase non-linear compressible flows in highly heterogeneous media. The construction of CEM-GMsFEM hinges on two crucial steps: First, the auxiliary space is constructed by solving local spectral problems, where the basis functions corresponding to small eigenvalues are captured. Then the basis functions are obtained by solving local energy minimization problems over the oversampling domains using the auxiliary space. The basis functions have exponential decay outside the corresponding local oversampling regions. The convergence of the proposed method is provided, and we show that this convergence only depends on the coarse grid size and is independent of the heterogeneities. An online enrichment guided by an a posteriori error estimator is developed to enhance computational efficiency. Several numerical experiments on a three-dimensional case to confirm the theoretical findings are presented, illustrating the performance of the method and giving efficient and accurate numerical.
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Authors: Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17169
Pdf link: https://arxiv.org/pdf/2303.17169
Abstract Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an appropriate prompt for each specific task. Recent CoCoOp further boosts the base-to-new generalization performance via an image-conditional prompt. However, it directly fuses identical image semantics to prompts of different labels and significantly weakens the discrimination among different classes as shown in our experiments. Motivated by this observation, we first propose a class-aware text prompt (CTP) to enrich generated prompts with label-related image information. Unlike CoCoOp, CTP can effectively involve image semantics and avoid introducing extra ambiguities into different prompts. On the other hand, instead of reserving the complete image representations, we propose text-guided feature tuning (TFT) to make the image branch attend to class-related representation. A contrastive loss is employed to align such augmented text and image representations on downstream tasks. In this way, the image-to-text CTP and text-to-image TFT can be mutually promoted to enhance the adaptation of VLMs for downstream tasks. Extensive experiments demonstrate that our method outperforms the existing methods by a significant margin. Especially, compared to CoCoOp, we achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
High-Performance Low-Complexity Hierarchical Frequency Synchronization for Distributed Massive MIMO-OFDMA Systems
Authors: Xiao-Yang Wang, Shaoshi Yang, Tian-Hao Yuan, Hou-Yu Zhai, Jianhua Zhang, Lajos Hanzo
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2303.17188
Pdf link: https://arxiv.org/pdf/2303.17188
Abstract We propose a high-performance yet low-complexity hierarchical frequency synchronization scheme for orthogonal frequency-division multiple-access (OFDMA) aided distributed massive multi-input multi-output (MIMO) systems, where multiple carrier frequency offsets (CFOs) have to be estimated in the uplink. To solve this multi-CFO estimation problem efficiently, we classify the active antenna units (AAUs) as the master and the slaves. Then, we split the scheme into two stages. During the first stage the distributed slave AAUs are synchronized with the master AAU, while the user equipment (UE) is synchronized with the closest slave AAU during the second stage. The mean square error (MSE) performance of our scheme is better than that of the representative state-of-the-art baseline schemes, while its computational complexity is substantially lower.
Practical self-supervised continual learning with continual fine-tuning
Authors: Chi Ian Tang, Lorena Qendro, Dimitris Spathis, Fahim Kawsar, Cecilia Mascolo, Akhil Mathur
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17235
Pdf link: https://arxiv.org/pdf/2303.17235
Abstract Self-supervised learning (SSL) has shown remarkable performance in computer vision tasks when trained offline. However, in a Continual Learning (CL) scenario where new data is introduced progressively, models still suffer from catastrophic forgetting. Retraining a model from scratch to adapt to newly generated data is time-consuming and inefficient. Previous approaches suggested re-purposing self-supervised objectives with knowledge distillation to mitigate forgetting across tasks, assuming that labels from all tasks are available during fine-tuning. In this paper, we generalize self-supervised continual learning in a practical setting where available labels can be leveraged in any step of the SSL process. With an increasing number of continual tasks, this offers more flexibility in the pre-training and fine-tuning phases. With Kaizen, we introduce a training architecture that is able to mitigate catastrophic forgetting for both the feature extractor and classifier with a carefully designed loss function. By using a set of comprehensive evaluation metrics reflecting different aspects of continual learning, we demonstrated that Kaizen significantly outperforms previous SSL models in competitive vision benchmarks, with up to 16.5% accuracy improvement on split CIFAR-100. Kaizen is able to balance the trade-off between knowledge retention and learning from new data with an end-to-end model, paving the way for practical deployment of continual learning systems.
Simultaneous reconstruction of sound speed and nonlinearity parameter in a paraxial model of vibro-acoustography in frequency domain
Authors: Barbara Kaltenbacher, Teresa Rauscher
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
Arxiv link: https://arxiv.org/abs/2303.17236
Pdf link: https://arxiv.org/pdf/2303.17236
Abstract In this paper we consider the inverse problem of vibro-acoustography, a technique for enhancing ultrasound imaging by making use of nonlinear effects. It amounts to determining two spatially variable coefficients in a system of PDEs describing propagation of two directed sound beams and the wave resulting from their nonlinear interaction. To justify the use of Newton's method for solving this inverse problem, on one hand we verify well-definedness and differentiability of the forward operator corresponding to two versions of the PDE model; on the other hand we consider an all-at-once formulation of the inverse problem and prove convergence of Newton's method for its solution.
Computationally efficient predictive control based on ANN state-space model
Authors: Jan H. Hoekstra, Bence Cseppentő, Gerben I. Beintema, Maarten Schoukens, Zsolt Kollár, Roland Tóth
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2303.17305
Pdf link: https://arxiv.org/pdf/2303.17305
Abstract Artificial neural networks (ANN) have been shown to be flexible and effective function estimators for identification of nonlinear state-space models. However, if the resulting models are used directly for nonlinear model predictive control (NMPC), the resulting nonlinear optimization problem is often overly complex due to the size of the network, requires the use of high-order observers to track the states of the ANN model, and the overall control scheme exploits little of the structural properties or available autograd tools for these models. In this paper, we propose an efficient approach to auto-convert ANN state-space models to linear parameter-varying (LPV) form and solve predictive control problems by successive solutions of linear model predictive problems, corresponding to quadratic programs (QPs). Furthermore, we show how existing ANN identification methods, such as the SUBNET method that uses a state encoder, can provide efficient implementation of MPCs. The performance of the proposed approach is demonstrated via a simulation study on an unbalanced disc system.
Masked Autoencoders as Image Processors
Authors: Huiyu Duan, Wei Shen, Xiongkuo Min, Danyang Tu, Long Teng, Jia Wang, Guangtao Zhai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17316
Pdf link: https://arxiv.org/pdf/2303.17316
Abstract Transformers have shown significant effectiveness for various vision tasks including both high-level vision and low-level vision. Recently, masked autoencoders (MAE) for feature pre-training have further unleashed the potential of Transformers, leading to state-of-the-art performances on various high-level vision tasks. However, the significance of MAE pre-training on low-level vision tasks has not been sufficiently explored. In this paper, we show that masked autoencoders are also scalable self-supervised learners for image processing tasks. We first present an efficient Transformer model considering both channel attention and shifted-window-based self-attention termed CSformer. Then we develop an effective MAE architecture for image processing (MAEIP) tasks. Extensive experimental results show that with the help of MAEIP pre-training, our proposed CSformer achieves state-of-the-art performance on various image processing tasks, including Gaussian denoising, real image denoising, single-image motion deblurring, defocus deblurring, and image deraining.
Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence
Authors: Anton Thielmann, Quentin Seifert, Arik Reuter, Elisabeth Bergherr, Benjamin Säfken
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2303.17324
Pdf link: https://arxiv.org/pdf/2303.17324
Abstract Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. This allows our model to detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared to state-of-the-art topic modeling and document clustering models.
Linear Insertion Deletion Codes in the High-Noise and High-Rate Regimes
Authors: Kuan Cheng, Zhengzhong Jin, Xin Li, Zhide Wei, Yu Zheng
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2303.17370
Pdf link: https://arxiv.org/pdf/2303.17370
Abstract This work continues the study of linear error correcting codes against adversarial insertion deletion errors (insdel errors). Previously, the work of Cheng, Guruswami, Haeupler, and Li [CGHL21] showed the existence of asymptotically good linear insdel codes that can correct arbitrarily close to $1$ fraction of errors over some constant size alphabet, or achieve rate arbitrarily close to $1/2$ even over the binary alphabet. As shown in [CGHL21], these bounds are also the best possible. However, known explicit constructions in [CGHL21], and subsequent improved constructions by Con, Shpilka, and Tamo [9770830] all fall short of meeting these bounds. Over any constant size alphabet, they can only achieve rate $< 1/8$ or correct $< 1/4$ fraction of errors; over the binary alphabet, they can only achieve rate $< 1/1216$ or correct $< 1/54$ fraction of errors. Apparently, previous techniques face inherent barriers to achieve rate better than $1/4$ or correct more than $1/2$ fraction of errors. In this work we give new constructions of such codes that meet these bounds, namely, asymptotically good linear insdel codes that can correct arbitrarily close to $1$ fraction of errors over some constant size alphabet, and binary asymptotically good linear insdel codes that can achieve rate arbitrarily close to $1/2$. All our constructions are efficiently encodable and decodable. Our constructions are based on a novel approach of code concatenation, which embeds the index information implicitly into codewords. This significantly differs from previous techniques and may be of independent interest. Finally, we also prove the existence of linear concatenated insdel codes with parameters that match random linear codes, and propose a conjecture about linear insdel codes.
Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions
Authors: Yicheng Luo, Jackie Kay, Edward Grefenstette, Marc Peter Deisenroth
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17396
Pdf link: https://arxiv.org/pdf/2303.17396
Abstract Offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment. Online finetuning of such offline models can further improve performance. But how should we ideally finetune agents obtained from offline RL training? While offline RL algorithms can in principle be used for finetuning, in practice, their online performance improves slowly. In contrast, we show that it is possible to use standard online off-policy algorithms for faster improvement. However, we find this approach may suffer from policy collapse, where the policy undergoes severe performance deterioration during initial online learning. We investigate the issue of policy collapse and how it relates to data diversity, algorithm choices and online replay distribution. Based on these insights, we propose a conservative policy optimization procedure that can achieve stable and sample-efficient online learning from offline pretraining.
An Efficient Mobile Gateway Selection and Discovery Based-Routing Protocol in Heterogeneous LTE-VANET Networks
Authors: Driss Abada, Rachid Adrdor, Omar Boutkhoum, Adil Bohouch
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2303.17439
Pdf link: https://arxiv.org/pdf/2303.17439
Abstract Coupling cellular communication networks with vehicular ad hoc networks (VANETs) is a promising way to provide Internet access to vehicles on the road. However, due to several specific characteristics of VANETs, efficient multi-hop routing from vehicular sources to Internet gateways through Long Term Evolution (LTE) technology is still challenging. In this paper, an Internet mobile gateway selection scheme is proposed to elect the most promising vehicles to act as Internet gateways in VANETs. The discovery and selection of routes to those mobile gateways is then carried out via an efficient multi-metric relay selection mechanism. The objective is to select the most reliable route to the mobile gateways while reducing the communication overhead and performing seamless handover. The proposed protocol is compared with one recent protocol in terms of packet delivery ratio, average end-to-end delay, and overhead. The results show that the proposed protocol significantly improves network performance compared with the other protocol.
NN-Copula-CD: A Copula-Guided Interpretable Neural Network for Change Detection in Heterogeneous Remote Sensing Images
Authors: Weiming Li, Xueqian Wang, Gang Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2303.17448
Pdf link: https://arxiv.org/pdf/2303.17448
Abstract Change detection (CD) in heterogeneous remote sensing images is a practical and challenging issue for real-life emergencies. In the past decade, the heterogeneous CD problem has significantly benefited from the development of deep neural networks (DNN). However, the data-driven DNNs always perform like a black box where the lack of interpretability limits the trustworthiness and controllability of DNNs in most practical CD applications. As a strong knowledge-driven tool to measure correlation between random variables, Copula theory has been introduced into CD, yet it suffers from non-robust CD performance without manual prior selection for Copula functions. To address the above issues, we propose a knowledge-data-driven heterogeneous CD method (NN-Copula-CD) based on the Copula-guided interpretable neural network. In our NN-Copula-CD, the mathematical characteristics of Copula are designed as the losses to supervise a simple fully connected neural network to learn the correlation between bi-temporal image patches, and then the changed regions are identified via binary classification for the correlation coefficients of all image patch pairs of the bi-temporal images. We conduct in-depth experiments on three datasets with multimodal images (e.g., Optical, SAR, and NIR), where the quantitative results and visualized analysis demonstrate both the effectiveness and interpretability of the proposed NN-Copula-CD.
HMES: A Scalable Human Mobility and Epidemic Simulation System with Fast Intervention Modeling
Authors: Haoyu Geng, Guanjie Zheng, Zhengqing Han, Hua Wei, Zhenhui Li
Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2303.17464
Pdf link: https://arxiv.org/pdf/2303.17464
Abstract Recently, the world has witnessed the most severe pandemic (COVID-19) in this century. Studies on epidemic prediction and simulation have received increasing attention. However, the current methods suffer from three issues. First, most of the current studies focus on epidemic prediction, which cannot provide adequate support for intervention policy making. Second, most of the current interventions are based on population groups rather than fine-grained individuals, so they cannot target measures at infected people and may waste medical resources. Third, current simulations are not efficient and flexible enough for large-scale complex systems. In this paper, we propose a new epidemic simulation framework called HMES to address the above three challenges. The proposed framework covers a full pipeline of epidemic simulation and enables comprehensive fine-grained control at large scale. In addition, we conduct experiments on real COVID-19 data. HMES demonstrates more accurate modeling of disease transmission for up to 300 million people and up to 3 times acceleration compared to the state-of-the-art methods.
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
Authors: Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, Chen Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17472
Pdf link: https://arxiv.org/pdf/2303.17472
Abstract Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the frame number is increased to obtain advanced estimation accuracy, and they are not robust to noise naturally brought by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2.
Efficient distributed representations beyond negative sampling
Authors: Lorenzo Dall'Amico, Enrico Maria Belliardo
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2303.17475
Pdf link: https://arxiv.org/pdf/2303.17475
Abstract This article describes an efficient method to learn distributed representations, also known as embeddings. This is accomplished by minimizing an objective function similar to the one introduced in the Word2Vec algorithm and later adopted in several works. The computational bottleneck of the optimization is the calculation of the softmax normalization constants, which requires a number of operations scaling quadratically with the sample size. This complexity is unsuited for large datasets, and negative sampling is a popular workaround, allowing one to obtain distributed representations in linear time with respect to the sample size. Negative sampling, however, changes the loss function and hence solves a different optimization problem from the one originally proposed. Our contribution is to show that the softmax normalization constants can be estimated in linear time, allowing us to design an efficient optimization strategy to learn distributed representations. We test our approximation on two popular applications related to word and node embeddings. The results show accuracy competitive with negative sampling at a remarkably lower computational time.
Teaching contact-rich tasks from visual demonstrations by constraint extraction
Authors: Christian Hegeler, Filippo Rozzi, Loris Roveda, Kevin Haninger
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17481
Pdf link: https://arxiv.org/pdf/2303.17481
Abstract Contact-rich manipulation involves kinematic constraints on the task motion, typically with discrete transitions between these constraints during the task. Allowing the robot to detect and reason about these contact constraints can support robust and dynamic manipulation, but how can these contact models be efficiently learned? Purely visual observations are an attractive data source, allowing passive task demonstrations with unmodified objects. Existing approaches for vision-only learning from demonstration are effective in pick-and-place applications and planar tasks. Nevertheless, accuracy/occlusions and unobserved task dynamics can limit their robustness in contact-rich manipulation. To use visual demonstrations for contact-rich robotic tasks, we consider the demonstration of pose trajectories with transitions between holonomic kinematic constraints, first clustering the trajectories into discrete contact modes, then fitting kinematic constraints for each mode. The fitted constraints are then used to (i) detect contact online with force/torque measurements and (ii) plan the robot policy with respect to the active constraint. We demonstrate the approach with real experiments, on cabling and rake tasks, showing the approach gives robust manipulation through contact transitions.
Edge Ranking of Graphs in Transportation Networks using a Graph Neural Network (GNN)
Authors: Debasish Jana, Sven Malama, Sriram Narasimhan, Ertugrul Taciroglu
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.17485
Pdf link: https://arxiv.org/pdf/2303.17485
Abstract Many networks, such as transportation, power, and water distribution, can be represented as graphs. A crucial challenge in graph representations is identifying the importance of graph edges and their influence on overall network efficiency and information flow performance. For example, important edges in a transportation network are those roads that, when affected, will significantly alter the network's overall efficiency. A commonly used approach to finding such important edges is "edge betweenness centrality" (EBC), an edge ranking measure to determine the influential edges of the graph based on connectivity and information spread. Computing the EBC utilizing the common Brandes algorithm involves calculating the shortest paths for every node pair, which can be computationally expensive and restrictive, especially for large graphs. Changes in the graph parameters, e.g., in the edge weight or the addition and deletion of nodes or edges, require the recalculation of the EBC. As the main contribution, we propose an approximate method to estimate the EBC using a Graph Neural Network (GNN), a deep learning-based approach. We show that it is computationally efficient compared to the conventional method, especially for large graphs. The proposed method of GNN-based edge ranking is evaluated on several synthetic graphs and a real-world transportation data set. We show that this framework can estimate the approximate edge ranking much faster compared to the conventional method. This approach is inductive, i.e., training and testing are performed on different sets of graphs with varying numbers of nodes and edges. The proposed method is especially suitable for applications on large-scale networks when edge information is desired, for example, in urban infrastructure improvement projects, power and water network resilience analyses, and optimizing resource allocations in engineering networks.
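For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of exact edge betweenness centrality in the style of the Brandes algorithm, restricted to unweighted, undirected graphs. It is not the authors' GNN method; the function name `edge_betweenness` and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def edge_betweenness(adj):
    """Unnormalised edge betweenness for an unweighted, undirected graph.

    `adj` maps each node to a list of its neighbours (hypothetical input format).
    Returns a dict keyed by frozenset({u, v}).
    """
    ebc = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                ebc[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered node pair was counted once from each endpoint.
    return {edge: score / 2.0 for edge, score in ebc.items()}


# Path graph a - b - c - d: the middle edge carries the most shortest paths.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = edge_betweenness(path)
```

The outer loop runs one search per node, which is the per-source cost that becomes restrictive on large graphs and that has to be repeated whenever the graph changes.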
3D Line Mapping Revisited
Authors: Shaohui Liu, Yifan Yu, Rémi Pautrat, Marc Pollefeys, Viktor Larsson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17504
Pdf link: https://arxiv.org/pdf/2303.17504
Abstract In contrast to sparse keypoints, a handful of line segments can concisely encode the high-level scene layout, as they often delineate the main structural elements. In addition to offering strong geometric cues, they are also omnipresent in urban landscapes and indoor scenes. Despite their apparent advantages, current line-based reconstruction methods are far behind their point-based counterparts. In this paper we aim to close the gap by introducing LIMAP, a library for 3D line mapping that robustly and efficiently creates 3D line maps from multi-view imagery. This is achieved through revisiting the degeneracy problem of line triangulation, carefully crafted scoring and track building, and exploiting structural priors such as line coincidence, parallelism, and orthogonality. Our code integrates seamlessly with existing point-based Structure-from-Motion methods and can leverage their 3D points to further improve the line reconstruction. Furthermore, as a byproduct, the method is able to recover 3D association graphs between lines and points / vanishing points (VPs). In thorough experiments, we show that LIMAP significantly outperforms existing approaches for 3D line mapping. Our robust 3D line maps also open up new research directions. We show two example applications: visual localization and bundle adjustment, where integrating lines alongside points yields the best results. Code is available at https://github.com/cvg/limap.
Sum-of-Squares Lower Bounds for Densest $k$-Subgraph
Authors: Chris Jones, Aaron Potechin, Goutham Rajendran, Jeff Xu
Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2303.17506
Pdf link: https://arxiv.org/pdf/2303.17506
Abstract Given a graph and an integer $k$, Densest $k$-Subgraph is the algorithmic task of finding the subgraph on $k$ vertices with the maximum number of edges. This is a fundamental problem that has been subject to intense study for decades, with applications spanning a wide variety of fields. The state-of-the-art algorithm is an $O(n^{1/4 + \epsilon})$-factor approximation (for any $\epsilon > 0$) due to Bhaskara et al. [STOC '10]. Moreover, the so-called log-density framework predicts that this is optimal, i.e. it is impossible for an efficient algorithm to achieve an $O(n^{1/4 - \epsilon})$-factor approximation. In the average case, Densest $k$-Subgraph is a prototypical noisy inference task which is conjectured to exhibit a statistical-computational gap. In this work, we provide the strongest evidence yet of hardness for Densest $k$-Subgraph by showing matching lower bounds against the powerful Sum-of-Squares (SoS) algorithm, a meta-algorithm based on convex programming that achieves state-of-the-art algorithmic guarantees for many optimization and inference problems. For $k \leq n^{\frac{1}{2}}$, we obtain a degree $n^{\delta}$ SoS lower bound for the hard regime as predicted by the log-density framework. To show this, we utilize the modern framework for proving SoS lower bounds on average-case problems pioneered by Barak et al. [FOCS '16]. A key issue is that small denser-than-average subgraphs in the input will greatly affect the value of the candidate pseudoexpectation operator around the subgraph. To handle this challenge, we devise a novel matrix factorization scheme based on the positive minimum vertex separator. We then prove an intersection tradeoff lemma to show that the error terms when using this separator are indeed small.
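To make the problem statement concrete, here is a brute-force sketch that enumerates every $k$-vertex subset and counts the edges it induces. It only illustrates the definition of Densest $k$-Subgraph; it runs in exponential time and has nothing to do with the approximation algorithm or the SoS lower bound discussed in the abstract. The function name and edge-list input are assumptions.

```python
from itertools import combinations


def densest_k_subgraph(edges, k):
    """Return (vertex subset, edge count) maximising induced edges over all k-subsets."""
    nodes = sorted({v for edge in edges for v in edge})
    edge_set = {frozenset(edge) for edge in edges}
    best_subset, best_count = None, -1
    for subset in combinations(nodes, k):
        # Count the edges with both endpoints inside the candidate subset.
        count = sum(
            1 for pair in combinations(subset, 2) if frozenset(pair) in edge_set
        )
        if count > best_count:
            best_subset, best_count = subset, count
    return best_subset, best_count


# A triangle on {0, 1, 2} with a tail 2 - 3 - 4 attached.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
subset, count = densest_k_subgraph(example_edges, 3)
```

With $n$ vertices the loop visits $\binom{n}{k}$ subsets, which is why efficient approximation guarantees are the quantity of interest.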
Learning in Factored Domains with Information-Constrained Visual Representations
Authors: Tyler Malloy, Miao Liu, Matthew D. Riemer, Tim Klinger, Gerald Tesauro, Chris R. Sims
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2303.17508
Pdf link: https://arxiv.org/pdf/2303.17508
Abstract Humans learn quickly even in tasks that contain complex visual information. This is due in part to the efficient formation of compressed representations of visual information, allowing for better generalization and robustness. However, compressed representations alone are insufficient for explaining the high speed of human learning. Reinforcement learning (RL) models that seek to replicate this impressive efficiency may do so through the use of factored representations of tasks. These informationally simplistic representations of tasks are similarly motivated as the use of compressed representations of visual information. Recent studies have connected biological visual perception to disentangled and compressed representations. This raises the question of how humans learn to efficiently represent visual information in a manner useful for learning tasks. In this paper we present a model of human factored representation learning based on an altered form of a $\beta$-Variational Auto-encoder used in a visual learning task. Modelling results demonstrate a trade-off in the informational complexity of model latent dimension spaces, between the speed of learning and the accuracy of reconstructions.
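For reference, the standard $\beta$-Variational Auto-encoder objective is shown below. The abstract states that the model uses an altered form of it, so this is only the unmodified baseline; the symbols $q_\phi$ (encoder), $p_\theta$ (decoder), and $p(z)$ (prior) are the usual notation rather than the paper's own.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - \beta \, D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

Larger $\beta$ penalizes the information carried by the latent code more heavily, which trades reconstruction accuracy for a more compressed representation.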
Hybrid Dealiasing of Complex Convolutions
Authors: Noel Murasko, John C. Bowman
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2303.17510
Pdf link: https://arxiv.org/pdf/2303.17510
Abstract Efficient algorithms for computing linear convolutions based on the fast Fourier transform are developed. A hybrid approach is described that combines the conventional practice of explicit dealiasing (explicitly padding the input data with zeros) and implicit dealiasing (mathematically accounting for these zero values). The new approach generalizes implicit dealiasing to arbitrary padding ratios and includes explicit dealiasing as a special case. Unlike existing implementations of implicit dealiasing, hybrid dealiasing tailors its subtransform sizes to the convolution geometry. Multidimensional convolutions are implemented with hybrid dealiasing by decomposing them into lower-dimensional convolutions. Convolutions of complex-valued and Hermitian inputs of equal length are illustrated with pseudocode and implemented in the open-source FFTW++ library. Hybrid dealiasing is shown to outperform explicit dealiasing in one, two, and three dimensions.
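The explicit dealiasing mentioned above can be illustrated with a small pure-Python sketch: multiplying unpadded transforms yields a circular convolution whose tail wraps around, while zero-padding the inputs first recovers the linear convolution. This is not the FFTW++ implementation or the hybrid algorithm; the radix-2 `fft` helper and all function names are assumptions for illustration.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalised)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve(f, g):
    """Circular convolution via FFT; without padding the result is aliased."""
    n = len(f)
    product = [a * b for a, b in zip(fft(f), fft(g))]
    return [value / n for value in fft(product, inverse=True)]


def linear_convolve_padded(f, g):
    """Explicit dealiasing: zero-pad both inputs, then convolve circularly."""
    m = len(f) + len(g) - 1
    size = 1
    while size < m:
        size *= 2
    f_padded = list(f) + [0] * (size - len(f))
    g_padded = list(g) + [0] * (size - len(g))
    return circular_convolve(f_padded, g_padded)[:m]


def direct_convolve(f, g):
    """Reference O(n^2) linear convolution."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


f, g = [1, 2, 3, 4], [1, 1, 1, 1]
aliased = circular_convolve(f, g)          # wrapped-around result
dealiased = linear_convolve_padded(f, g)   # agrees with direct_convolve(f, g)
```

Here the padded transforms have twice the input length, so half of every transform input is zeros; that redundant work is what implicit and hybrid dealiasing aim to avoid.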
Power-Optimal HARQ Protocol for Reliable Free Space Optical Communication
Authors: Georgios D. Chondrogiannis, Nikos A. Mitsiou, Nestor D. Chatzidiamantis, Alexandros-Apostolos A. Boulogeorgos, George K. Karagiannidis
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2303.17512
Pdf link: https://arxiv.org/pdf/2303.17512
Abstract This paper investigates the usage of hybrid automatic repeat request (HARQ) protocols for power-efficient and reliable communications over free space optical (FSO) links. By exploiting the large coherence time of the FSO channel, the proposed transmission schemes combat turbulence-induced fading by retransmitting the failed packets in the same coherence interval. To assess the performance of the presented HARQ technique, we extract a theoretical framework for the outage performance. In more detail, a closed-form expression for the outage probability (OP) is reported and an approximation for the high signal-to-noise ratio (SNR) region is extracted. Building upon the theoretical framework, we formulate a transmission power allocation problem throughout the retransmission rounds. This optimization problem is solved numerically through the use of an iterative algorithm. In addition, the average throughput of the HARQ schemes under consideration is examined. Simulation results validate the theoretical analysis under different turbulence conditions and demonstrate the performance improvement, in terms of both OP and throughput, of the proposed HARQ schemes compared to fixed transmit power HARQ benchmarks.
Nonlinear Approximation with Subsampled Rank-1 Lattices
Authors: Felix Bartel, Fabian Taubert
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2303.17541
Pdf link: https://arxiv.org/pdf/2303.17541
Abstract In this paper we approximate high-dimensional functions $f\colon\mathbb T^d\to\mathbb C$ by sparse trigonometric polynomials based on function evaluations. Recently it was shown that a dimension-incremental sparse Fourier transform (SFT) approach does not require the signal to be exactly sparse and is applicable in this setting. We combine this approach with subsampling techniques for rank-1 lattices. This way our approach benefits from the underlying structure in the sampling points making fast Fourier algorithms applicable whilst achieving the good sampling complexity of random points (logarithmic oversampling). In our analysis we show detection guarantees of the frequencies corresponding to the Fourier coefficients of largest magnitude. In numerical experiments we make a comparison to full rank-1 lattices and uniformly random points to confirm our findings.
Active User Identification in Fast Fading Massive Random Access Channels
Authors: Jyotish Robin, Elza Erkip
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2303.17543
Pdf link: https://arxiv.org/pdf/2303.17543
Abstract Reliable and prompt identification of active users is critical for enabling random access in massive machine-to-machine type networks which typically operate within stringent access delay and energy constraints. In this paper, an energy efficient active user identification protocol is envisioned in which the active users simultaneously transmit On-Off Keying (OOK) modulated preambles whereas the base station uses non-coherent detection to avoid the channel estimation overheads. The minimum number of channel-uses required for active user identification in the asymptotic regime of total number of users $\ell$ when the number of active devices $k$ scales as $k = \Theta(1)$ is characterized along with an achievability scheme relying on the equivalence of activity detection to a group testing problem. A practical scheme for active user identification based on a belief propagation strategy is also proposed and its performance is compared against the theoretical bounds.
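The link between activity detection and group testing can be sketched with a toy, noiseless model: each user owns a binary OOK preamble, the base station only observes whether any energy is present in each slot, and a user is ruled out if it would have transmitted in a slot that stayed silent. This elimination rule is the simple COMP group-testing decoder, used here purely as an illustration; it is not the paper's achievability scheme or its belief propagation decoder, and it ignores fading and noise. All names and the example preambles are hypothetical.

```python
def observe(preambles, active):
    """Noiseless energy detection: a slot is 'on' if any active user transmits in it."""
    n_slots = len(next(iter(preambles.values())))
    return [int(any(preambles[u][t] for u in active)) for t in range(n_slots)]


def comp_decode(preambles, observed):
    """Keep every user whose 'on' slots all show energy (may include false positives)."""
    return {
        user
        for user, preamble in preambles.items()
        if all(obs for bit, obs in zip(preamble, observed) if bit)
    }


# Hypothetical preamble assignment for four users over four channel uses.
preambles = {
    "u1": [1, 0, 0, 1],
    "u2": [0, 1, 0, 1],
    "u3": [0, 0, 1, 0],
    "u4": [1, 1, 0, 0],
}
observed = observe(preambles, active={"u2"})
estimate = comp_decode(preambles, observed)
```

The decoder never misses an active user in this noiseless setting, but it can report inactive users whose preambles happen to be covered by the active ones, which is one reason the number of channel uses matters.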
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
Authors: Chenpeng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, Jiang Bian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2303.17550
Pdf link: https://arxiv.org/pdf/2303.17550
Abstract While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.
DDP: Diffusion Model for Dense Visual Prediction
Authors: Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17559
Pdf link: https://arxiv.org/pdf/2303.17559
Abstract We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks with six diverse benchmarks: without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts, for example in semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research.
Using AI to Measure Parkinson's Disease Severity at Home
Authors: Md Saiful Islam, Wasifur Rahman, Abdelrahman Abdelkader, Phillip T. Yang, Sangwu Lee, Jamie L. Adams, Ruth B. Schneider, E. Ray Dorsey, Ehsan Hoque
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.17573
Pdf link: https://arxiv.org/pdf/2303.17573
Abstract We present an artificial intelligence system to remotely assess the motor performance of individuals with Parkinson's disease (PD). Participants performed a motor task (i.e., tapping fingers) in front of a webcam, and data from 250 global participants were rated by three expert neurologists following the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). The neurologists' ratings were highly reliable, with an intra-class correlation coefficient (ICC) of 0.88. We developed computer algorithms to obtain objective measurements that align with the MDS-UPDRS guideline and are strongly correlated with the neurologists' ratings. Our machine learning model trained on these measures outperformed an MDS-UPDRS certified rater, with a mean absolute error (MAE) of 0.59 compared to the rater's MAE of 0.79. However, the model performed slightly worse than the expert neurologists (0.53 MAE). The methodology can be replicated for similar motor tasks, providing the possibility of evaluating individuals with PD and other movement disorders remotely, objectively, and in areas with limited access to neurological care.
Human-Robot Interaction using VAHR: Virtual Assistant, Human, and Robots in the Loop
Authors: Ahmad Amine, Mostafa Aldilati, Hadi Hasan, Noel Maalouf, Imad H. Elhajj
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17582
Pdf link: https://arxiv.org/pdf/2303.17582
Abstract Robots have become ubiquitous tools in various industries and households, highlighting the importance of human-robot interaction (HRI). This has increased the need for easy and accessible communication between humans and robots. Recent research has focused on the intersection of virtual assistant technology, such as Amazon's Alexa, with robots and its effect on HRI. This paper presents the Virtual Assistant, Human, and Robots in the loop (VAHR) system, which utilizes bidirectional communication to control multiple robots through Alexa. VAHR's performance was evaluated through a human-subjects experiment, comparing objective and subjective metrics of traditional keyboard and mouse interfaces to VAHR. The results showed that VAHR required 41% less Robot Attention Demand and ensured 91% more Fan-out time compared to the standard method. Additionally, VAHR led to a 62.5% improvement in multi-tasking, highlighting the potential for efficient human-robot interaction in physically- and mentally-demanding scenarios. However, subjective metrics revealed a need for human operators to build confidence and trust with this new method of operation.
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
Authors: Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17591
Pdf link: https://arxiv.org/pdf/2303.17591
Abstract The unlearning problem of deep learning models, once primarily an academic concern, has become a prevalent issue in the industry. The significant advances in text-to-image generation techniques have prompted global discussions on privacy, copyright, and safety, as numerous unauthorized personal IDs, content, artistic creations, and potentially harmful materials have been learned by these models and later utilized to generate and distribute uncontrolled content. To address this challenge, we propose Forget-Me-Not, an efficient and low-cost solution designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds, without impairing its ability to generate other content. Alongside our method, we introduce the Memorization Score (M-Score) and ConceptBench to measure the models' capacity to generate general concepts, grouped into three primary categories: ID, object, and style. Using M-Score and ConceptBench, we demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts. Furthermore, Forget-Me-Not offers two practical extensions: a) removal of potentially harmful or NSFW content, and b) enhancement of model accuracy, inclusion and diversity through concept correction and disentanglement. It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution. To encourage future research in this critical area and promote the development of safe and inclusive generative models, we will open-source our code and ConceptBench at https://github.com/SHI-Labs/Forget-Me-Not.
MobileInst: Video Instance Segmentation on the Mobile
Authors: Renhong Zhang, Tianheng Cheng, Shusheng Yang, Haoyi Jiang, Shuai Zhang, Jiancheng Lyu, Xin Li, Xiaowen Ying, Dashan Gao, Wenyu Liu, Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17594
Pdf link: https://arxiv.org/pdf/2303.17594
Abstract Although recent approaches aiming for video instance segmentation have achieved promising results, it is still difficult to employ those approaches for real-world applications on mobile devices, which mainly suffer from (1) heavy computation and memory cost and (2) complicated heuristics for tracking objects. To address those issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices. Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on a mobile CPU core of Qualcomm Snapdragon-778G, without other methods of acceleration. On the COCO dataset, MobileInst achieves 30.5 mask AP and 176 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP on YouTube-VIS 2019 and 30.1 AP on YouTube-VIS 2021. Code will be available to facilitate real-world applications and future research.
Token Merging for Fast Stable Diffusion
Authors: Daniel Bolya, Judy Hoffman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17604
Pdf link: https://arxiv.org/pdf/2303.17604
Abstract The landscape of image generation has been forever changed by open vocabulary diffusion models. However, at their core these models use transformers, which makes generation slow. Better implementations to increase the throughput of these transformers have emerged, but they still evaluate the entire model. In this paper, we instead speed up diffusion models by exploiting natural redundancy in generated images by merging redundant tokens. After making some diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable Diffusion can reduce the number of tokens in an existing Stable Diffusion model by up to 60% while still producing high quality images without any extra training. In the process, we speed up image generation by up to 2x and reduce memory consumption by up to 5.6x. Furthermore, this speed-up stacks with efficient implementations such as xFormers, minimally impacting quality while being up to 5.4x faster for large images. Code is available at https://github.com/dbolya/tomesd.
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Authors: Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, Song Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17605
Pdf link: https://arxiv.org/pdf/2303.17605
Abstract High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to translate into actual speedup for CNNs since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
Keyword: faster

Urgency-aware Routing in Single Origin-destination Itineraries through Artificial Currencies
Authors: Leonardo Pedroso, W.P.M.H. Heemels, Mauro Salazar
Subjects: Systems and Control (eess.SY); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2303.16945
Pdf link: https://arxiv.org/pdf/2303.16945
Abstract Within mobility systems, the presence of self-interested users can lead to aggregate routing patterns that are far from the societal optimum which could be achieved by centrally controlling the users' choices. In this paper, we design a fair incentive mechanism to steer the selfish behavior of the users to align with the societally optimal aggregate routing. The proposed mechanism is based on an artificial currency that cannot be traded or bought, but only spent or received when traveling. Specifically, we consider a parallel-arc network with a single origin and destination node within a repeated game setting whereby each user chooses from one of the available arcs to reach their destination on a daily basis. In this framework, taking faster routes comes at a cost, whereas taking slower routes is incentivized by a reward. The users are thus playing against their future selves when choosing their present actions. To capture this complex behavior, we assume the users to be rational and to minimize an urgency-weighted combination of their immediate and future discomfort. To design the optimal pricing, we first derive a closed-form expression for the best individual response strategy. Second, we formulate the pricing design problem for each arc to achieve the societally optimal aggregate flows, and reformulate it so that it can be solved with gradient-free optimization methods. Our numerical simulations show that it is possible to achieve a near-optimal routing whilst significantly reducing the users' perceived discomfort when compared to a centralized optimal but urgency-unaware policy.
PopSparse: Accelerated block sparse matrix multiplication on IPU
Authors: Zhiyi Li, Douglas Orr, Valeriu Ohan, Godfrey Da costa, Tom Murray, Adam Sanders, Deniz Beker, Dominic Masters
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.16999
Pdf link: https://arxiv.org/pdf/2303.16999
Abstract Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such as NVIDIA GPUs using low precision number formats. In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs as well as any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run. We present benchmark results for matrix multiplication for both of these modes on IPU with a range of block sizes, matrix sizes and densities. Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels with large matrix size and block size. Furthermore, static sparsity in general outperforms dynamic sparsity. While previous work on GPAs has shown speedups only for very high sparsity (typically 99% and above), the present work demonstrates that our static sparse implementation outperforms equivalent dense calculations in FP16 at lower sparsity (around 90%).
Overcoming Challenges to Continuous Integration in HPC
Authors: Todd Gamblin, Daniel S. Katz
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2303.17034
Pdf link: https://arxiv.org/pdf/2303.17034
Abstract Continuous integration (CI) has become a ubiquitous practice in modern software development, with major code hosting services offering free automation on popular platforms. CI offers major benefits, as it enables detecting bugs in code prior to committing changes. While high-performance computing (HPC) research relies heavily on software, HPC machines are not considered "common" platforms. This presents several challenges that hinder the adoption of CI in HPC environments, making it difficult to maintain bug-free HPC projects, and resulting in adverse effects on the research community. In this article, we explore the challenges that impede HPC CI, such as hardware diversity, security, isolation, administrative policies, and non-standard authentication, environments, and job submission mechanisms. We propose several solutions that could enhance the quality of HPC software and the experience of developers. Implementing these solutions would require significant changes at HPC centers, but if these changes are made, it would ultimately enable faster and better science.
ACM with Overlapping Partitions: Implementation and Periodicity Analysis
Authors: Anthony O'Dea
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2303.17069
Pdf link: https://arxiv.org/pdf/2303.17069
Abstract The Arnold Cat Map (ACM) is a popular chaotic map used in image encryption. Chaotic maps are known for their sensitivity to initial conditions and their ability to mix, or rearrange, pixels. However, ACM is periodic, and the period is relatively short. This periodicity decreases the effective key space for a cryptosystem. Further, ACM can only be performed on square matrices. For non-square images, this issue can be solved by performing ACM on multiple square partitions of the image. If these partitions overlap, the periodicity will greatly increase. The resulting system will be referred to as overlapping ACM or OACM. This paper will cover the implementation and periodicity analysis for these overlapping systems, which previous papers involving similar overlapping block partitions did not. Viewing OACM as a scan as opposed to a map allows for faster implementation and period analysis.
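The periodicity that the abstract describes can be observed with a minimal sketch of the plain (non-overlapping) Arnold Cat Map. It assumes the common form (x, y) -> (x + y, x + 2y) mod n; the paper may use a different but equivalent matrix, and its overlapping OACM variant is not reproduced here. The function names are hypothetical.

```python
def arnold_cat_map(image):
    """Apply one iteration of the Arnold Cat Map to a square image (list of rows)."""
    n = len(image)
    if any(len(row) != n for row in image):
        raise ValueError("ACM is only defined for square images")
    out = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            # Assumed form of the map: (x, y) -> (x + y, x + 2y) mod n.
            out[(x + y) % n][(x + 2 * y) % n] = image[x][y]
    return out


def acm_period(n):
    """Smallest k >= 1 such that k iterations restore an n x n image of distinct pixels."""
    original = [[(x, y) for y in range(n)] for x in range(n)]
    current = arnold_cat_map(original)
    k = 1
    while current != original:
        current = arnold_cat_map(current)
        k += 1
    return k
```

For example, `acm_period(5)` evaluates to 10: a 5 x 5 image returns to its original state after only ten iterations, which illustrates why a short period limits the effective key space.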
TreePiece: Faster Semantic Parsing via Tree Tokenization
Authors: Sid Wang, Akshat Shrivastava, Sasha Livshits
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.17161
Pdf link: https://arxiv.org/pdf/2303.17161
Abstract Autoregressive (AR) encoder-decoder neural networks have proved successful in many NLP problems, including Semantic Parsing -- a task that translates natural language to machine-readable parse trees. However, the sequential prediction process of AR models can be slow. To accelerate AR for semantic parsing, we introduce a new technique called TreePiece that tokenizes a parse tree into subtrees and generates one subtree per decoding step. On the TopV2 benchmark, TreePiece shows 4.6 times faster decoding speed than standard AR, and comparable speed but significantly higher accuracy compared to Non-Autoregressive (NAR).
DPP-based Client Selection for Federated Learning with Non-IID Data
Authors: Yuxuan Zhang, Chao Xu, Howard H. Yang, Xijun Wang, Tony Q. S. Quek
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17358
Pdf link: https://arxiv.org/pdf/2303.17358
Abstract This paper proposes a client selection (CS) method to tackle the communication bottleneck of federated learning (FL) while concurrently coping with FL's data heterogeneity issue. Specifically, we first analyze the effect of CS in FL and show that FL training can be accelerated by adequately choosing participants to diversify the training dataset in each round of training. Based on this, we leverage data profiling and determinantal point process (DPP) sampling techniques to develop an algorithm termed Federated Learning with DPP-based Participant Selection (FL-DP$^3$S). This algorithm effectively diversifies the participants' datasets in each round of training while preserving their data privacy. We conduct extensive experiments to examine the efficacy of our proposed method. The results show that our scheme attains a faster convergence rate, as well as a smaller communication overhead than several baselines.
Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions
Authors: Yicheng Luo, Jackie Kay, Edward Grefenstette, Marc Peter Deisenroth
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17396
Pdf link: https://arxiv.org/pdf/2303.17396
Abstract Offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment. Online finetuning of such offline models can further improve performance. But how should we ideally finetune agents obtained from offline RL training? While offline RL algorithms can in principle be used for finetuning, in practice, their online performance improves slowly. In contrast, we show that it is possible to use standard online off-policy algorithms for faster improvement. However, we find this approach may suffer from policy collapse, where the policy undergoes severe performance deterioration during initial online learning. We investigate the issue of policy collapse and how it relates to data diversity, algorithm choices and online replay distribution. Based on these insights, we propose a conservative policy optimization procedure that can achieve stable and sample-efficient online learning from offline pretraining.
Edge Ranking of Graphs in Transportation Networks using a Graph Neural Network (GNN)
Authors: Debasish Jana, Sven Malama, Sriram Narasimhan, Ertugrul Taciroglu
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.17485
Pdf link: https://arxiv.org/pdf/2303.17485
Abstract Many networks, such as transportation, power, and water distribution, can be represented as graphs. A crucial challenge in graph representations is identifying the importance of graph edges and their influence on overall network efficiency and information flow performance. For example, important edges in a transportation network are those roads that, when affected, will significantly alter the network's overall efficiency. A commonly used approach to finding such important edges is "edge betweenness centrality" (EBC), an edge ranking measure to determine the influential edges of the graph based on connectivity and information spread. Computing the EBC utilizing the common Brandes algorithm involves calculating the shortest paths for every node pair, which can be computationally expensive and restrictive, especially for large graphs. Changes in the graph parameters, e.g., in the edge weight or the addition and deletion of nodes or edges, require the recalculation of the EBC. As the main contribution, we propose an approximate method to estimate the EBC using a Graph Neural Network (GNN), a deep learning-based approach. We show that it is computationally efficient compared to the conventional method, especially for large graphs. The proposed method of GNN-based edge ranking is evaluated on several synthetic graphs and a real-world transportation data set. We show that this framework can estimate the approximate edge ranking much faster compared to the conventional method. This approach is inductive, i.e., training and testing are performed on different sets of graphs with varying numbers of nodes and edges. The proposed method is especially suitable for applications on large-scale networks when edge information is desired, for example, in urban infrastructure improvement projects, power and water network resilience analyses, and optimizing resource allocations in engineering networks.
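For reference, the conventional baseline named in the abstract can be sketched as follows. This is a minimal Brandes-style computation of exact edge betweenness, assuming an unweighted, undirected graph given as an adjacency dictionary; it is not the paper's GNN estimator, and the function name and toy graph are hypothetical. Running one breadth-first search per node is the part that becomes expensive on large graphs.

```python
from collections import deque


def edge_betweenness(adj):
    """Exact, unnormalised edge betweenness for an unweighted, undirected graph.

    adj maps each node to a list of neighbours (every neighbour must also be a key).
    Returns a dict keyed by frozenset({u, v}).
    """
    ebc = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                ebc[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered node pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in ebc.items()}


# Hypothetical toy network: two triangles joined by the single "road" 2-3.
TOY_GRAPH = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```

On the toy graph the bridge 2-3 ranks highest, since all nine shortest paths between the two triangles cross it.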
Pgx: Hardware-accelerated parallel game simulation for reinforcement learning
Authors: Sotetsu Koyamada, Shinri Okano, Soichiro Nishimori, Yu Murata, Keigo Habara, Haruka Kita, Shin Ishii
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17503
Pdf link: https://arxiv.org/pdf/2303.17503
Abstract We propose Pgx, a collection of board game simulators written in JAX. Thanks to auto-vectorization and Just-In-Time compilation of JAX, Pgx scales easily to thousands of parallel executions on GPU/TPU accelerators. We found that the simulation of Pgx on a single A100 GPU is 10x faster than that of existing reinforcement learning libraries. Pgx implements games considered vital benchmarks in artificial intelligence research, such as Backgammon, Shogi, and Go. Pgx is available at https://github.com/sotetsuk/pgx.
Token Merging for Fast Stable Diffusion
Authors: Daniel Bolya, Judy Hoffman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17604
Pdf link: https://arxiv.org/pdf/2303.17604
Abstract The landscape of image generation has been forever changed by open vocabulary diffusion models. However, at their core these models use transformers, which makes generation slow. Better implementations to increase the throughput of these transformers have emerged, but they still evaluate the entire model. In this paper, we instead speed up diffusion models by exploiting natural redundancy in generated images by merging redundant tokens. After making some diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable Diffusion can reduce the number of tokens in an existing Stable Diffusion model by up to 60% while still producing high quality images without any extra training. In the process, we speed up image generation by up to 2x and reduce memory consumption by up to 5.6x. Furthermore, this speed-up stacks with efficient implementations such as xFormers, minimally impacting quality while being up to 5.4x faster for large images. Code is available at https://github.com/dbolya/tomesd.
Keyword: mobile

A Tensor-based Convolutional Neural Network for Small Dataset Classification
Authors: Zhenhua Chen, David Crandall
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2303.17061
Pdf link: https://arxiv.org/pdf/2303.17061
Abstract Inspired by the ConvNets with structured hidden representations, we propose a Tensor-based Neural Network, TCNN. Different from ConvNets, TCNNs are composed of structured neurons rather than scalar neurons, and the basic operation is neuron tensor transformation. Unlike other structured ConvNets, where the part-whole relationships are modeled explicitly, the relationships are learned implicitly in TCNNs. Also, the structured neurons in TCNNs are high-rank tensors rather than vectors or matrices. We compare TCNNs with current popular ConvNets, including ResNets, MobileNets, EfficientNets, RegNets, etc., on CIFAR10, CIFAR100, and Tiny ImageNet. The experiment shows that TCNNs have higher efficiency in terms of parameters. TCNNs also show higher robustness against white-box adversarial attacks on MNIST compared to ConvNets.
Dependent Task Offloading in Edge Computing Using GNN and Deep Reinforcement Learning
Authors: Zequn Cao, Xiaoheng Deng
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2303.17100
Pdf link: https://arxiv.org/pdf/2303.17100
Abstract Task offloading is a widely used technology in Mobile Edge Computing (MEC), which reduces the completion time of user tasks with the help of resourceful edge servers. Existing works mainly focus on the case where the computation density of a user task is homogeneous, so that it can be offloaded in full or by percentage. However, various user tasks in real life consist of several internally dependent subtasks, each of which is logically a minimum execution unit. Motivated by this gap, we aim to solve the Dependent Task Offloading (DTO) problem under a multi-user multi-edge scenario in this paper. We first use a Directed Acyclic Graph (DAG) to represent a dependent task, where nodes indicate subtasks and directed edges indicate dependencies among subtasks. Then we propose a scheme based on a Graph Attention Network (GAT) and Deep Reinforcement Learning (DRL) to minimize the makespan of user tasks. To utilize the GAT efficiently, we train it on the resourceful cloud in an unsupervised style due to its large data and computation resource requirements. In addition, we design a multi-discrete action space for the DRL algorithm to enhance the applicability of our proposed scheme. Experiments are conducted on broadly distributed synthetic data. The results demonstrate that our proposed approach can be adapted to both simple and complex MEC environments and outperforms other methods.
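The DAG representation of a dependent task can be made concrete with a small sketch. It computes the critical-path length of a task, which is the makespan under the idealised assumption of unlimited parallel servers and zero transmission delay, and therefore only a lower bound on what any offloading policy can achieve. It does not model the paper's GAT/DRL scheme; the function name, argument layout, and example task are hypothetical.

```python
from collections import deque


def critical_path_length(durations, deps):
    """Lower bound on the makespan of one dependent task.

    durations: {subtask: execution time}
    deps: {subtask: [subtasks that must finish first]}
    """
    indegree = {t: len(deps.get(t, [])) for t in durations}
    children = {t: [] for t in durations}
    for task, prerequisites in deps.items():
        for p in prerequisites:
            children[p].append(task)

    earliest_start = {t: 0.0 for t in durations}
    finish = {}
    ready = deque(t for t, d in indegree.items() if d == 0)
    # Kahn's algorithm: process subtasks in topological order.
    while ready:
        t = ready.popleft()
        finish[t] = earliest_start[t] + durations[t]
        for c in children[t]:
            # A subtask can start only after its slowest prerequisite finishes.
            earliest_start[c] = max(earliest_start[c], finish[t])
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(finish) != len(durations):
        raise ValueError("dependency graph contains a cycle, so it is not a DAG")
    return max(finish.values(), default=0.0)


# Hypothetical task: "a" feeds "b" and "c", which both feed "d".
EXAMPLE_DURATIONS = {"a": 2, "b": 3, "c": 1, "d": 4}
EXAMPLE_DEPS = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
```

For the example task the critical path is a, b, d with length 2 + 3 + 4 = 9, so no placement of subtasks on edge servers can finish it faster under these assumptions.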
Deep Generative Model and Its Applications in Efficient Wireless Network Management: A Tutorial and Case Study
Authors: Yinqiu Liu, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Dong In Kim, Abbas Jamalipour
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17114
Pdf link: https://arxiv.org/pdf/2303.17114
Abstract With the phenomenal success of diffusion models and ChatGPT, deep generation models (DGMs) have been experiencing explosive growth since 2022. Not limited to content generation, DGMs are also widely adopted in Internet of Things, Metaverse, and digital twin, due to their outstanding ability to represent complex patterns and generate plausible samples. In this article, we explore the applications of DGMs in a crucial task, i.e., improving the efficiency of wireless network management. Specifically, we first give an overview of generative AI, as well as three representative DGMs. Then, a DGM-empowered framework for wireless network management is proposed, in which we elaborate on the issues of conventional network management approaches, why DGMs can address them efficiently, and the step-by-step workflow for applying DGMs in managing wireless networks. Moreover, we conduct a case study on network economics, using the state-of-the-art DGM, i.e., the diffusion model, to generate effective contracts for incentivizing mobile AI-Generated Content (AIGC) services. Last but not least, we discuss important open directions for further research.
GAT-COBO: Cost-Sensitive Graph Neural Network for Telecom Fraud Detection
Authors: Xinxin Hu, Haotian Chen, Junjie Zhang, Hongchang Chen, Shuxin Liu, Xing Li, Yahui Wang, Xiangyang Xue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.17334
Pdf link: https://arxiv.org/pdf/2303.17334
Abstract Along with the rapid evolution of mobile communication technologies, such as 5G, there has been a drastic increase in telecom fraud, which significantly dissipates individual fortune and social wealth. In recent years, graph mining techniques have gradually become a mainstream solution for detecting telecom fraud. However, the graph imbalance problem, caused by the Pareto principle, brings severe challenges to graph data mining. This is a new and challenging problem that has received little attention in previous work. In this paper, we propose a Graph ATtention network with COst-sensitive BOosting (GAT-COBO) for the graph imbalance problem. First, we design a GAT-based base classifier to learn the embeddings of all nodes in the graph. Then, we feed the embeddings into a well-designed cost-sensitive learner for imbalanced learning. Next, we update the weights according to the misclassification cost to make the model focus more on the minority class. Finally, we sum the node embeddings obtained by multiple cost-sensitive learners to obtain a comprehensive node representation, which is used for the downstream anomaly detection task. Extensive experiments on two real-world telecom fraud detection datasets demonstrate that our proposed method is effective for the graph imbalance problem, outperforming the state-of-the-art GNNs and GNN-based fraud detectors. In addition, our model is also helpful for solving the widespread over-smoothing problem in GNNs. The GAT-COBO code and datasets are available at https://github.com/xxhu94/GAT-COBO.
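The step "update the weights according to the misclassification cost" can be sketched with a generic AdaBoost-style rule. This is an assumption for illustration, not the update used in GAT-COBO: the exponential form, the `alpha` step size, and the per-class cost table are all hypothetical.

```python
import math


def update_weights(weights, y_true, y_pred, costs, alpha=1.0):
    """Raise the weight of misclassified samples in proportion to class cost.

    weights: current sample weights (assumed positive).
    costs: dict mapping true class -> misclassification cost (hypothetical).
    Returns new weights normalized to sum to 1.
    """
    updated = []
    for w, truth, pred in zip(weights, y_true, y_pred):
        if truth != pred:
            # Mistakes on a costly (minority) class are amplified more strongly.
            w = w * math.exp(alpha * costs[truth])
        updated.append(w)
    total = sum(updated)
    return [w / total for w in updated]


# Class 1 is the rare fraud class and is assigned a higher cost.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [0, 0, 0, 1]
y_pred = [0, 1, 0, 0]
new_weights = update_weights(weights, y_true, y_pred, costs={0: 1.0, 1: 5.0})
print(new_weights)
```

After the update, the misclassified fraud sample carries the largest weight, so the next learner in the ensemble pays more attention to it.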
An Efficient Mobile Gateway Selection and Discovery Based-Routing Protocol in Heterogeneous LTE-VANET Networks
Authors: Driss Abada, Rachid Adrdor, Omar Boutkhoum, Adil Bohouch
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2303.17439
Pdf link: https://arxiv.org/pdf/2303.17439
Abstract Coupling cellular communication networks with vehicular ad hoc networks (VANETs) can be a promising way to provide Internet access to vehicles on the road. However, due to several specific characteristics of VANETs, achieving efficient multi-hop routing from vehicular sources to Internet gateways through Long Term Evolution (LTE) technology is still challenging. In this paper, an Internet mobile gateway selection scheme is proposed to elect vehicles with greater potential to act as gateways to the Internet in VANETs. The discovery and selection of routes to those mobile gateways are then carried out via an efficient multiple metrics-based relay selection mechanism. The objective is to select the most reliable route to the mobile gateways while reducing the communication overhead and performing seamless handover. The proposed protocol is compared with one recent protocol in terms of packet delivery ratio, average end-to-end delay and overhead. The results show that the proposed protocol significantly improves network performance compared with the other protocol.
Cost Sensitive GNN-based Imbalanced Learning for Mobile Social Network Fraud Detection
Authors: Xinxin Hu, Haotian Chen, Hongchang Chen, Shuxin Liu, Xing Li, Shibo Zhang, Yahui Wang, Xiangyang Xue
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17486
Pdf link: https://arxiv.org/pdf/2303.17486
Abstract With the rapid development of mobile networks, people's social contacts have been considerably facilitated. However, the rise of mobile social network fraud on those networks has caused a great deal of distress, depleting personal and social wealth and potentially doing significant economic harm. To detect fraudulent users, call detail record (CDR) data, which portrays the social behavior of users in mobile networks, has been widely utilized. But the imbalance problem in the aforementioned data, which could severely hinder the effectiveness of fraud detectors based on graph neural networks (GNNs), has hardly been addressed in previous work. In this paper, we present a novel Cost-Sensitive Graph Neural Network (CSGNN) by creatively combining cost-sensitive learning and graph neural networks. We conduct extensive experiments on two open-source real-world mobile network fraud datasets. The results show that CSGNN can effectively solve the graph imbalance problem and thereby achieve better detection performance than the state-of-the-art algorithms. We believe that our research can be applied to solve the graph imbalance problems in other fields. The CSGNN code and datasets are publicly available at https://github.com/xxhu94/CSGNN.
MobileInst: Video Instance Segmentation on the Mobile
Authors: Renhong Zhang, Tianheng Cheng, Shusheng Yang, Haoyi Jiang, Shuai Zhang, Jiancheng Lyu, Xin Li, Xiaowen Ying, Dashan Gao, Wenyu Liu, Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17594
Pdf link: https://arxiv.org/pdf/2303.17594
Abstract Although recent approaches aiming for video instance segmentation have achieved promising results, it is still difficult to employ those approaches for real-world applications on mobile devices, which mainly suffer from (1) heavy computation and memory cost and (2) complicated heuristics for tracking objects. To address those issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices. Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on a mobile CPU core of Qualcomm Snapdragon-778G, without other methods of acceleration. On the COCO dataset, MobileInst achieves 30.5 mask AP and 176 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP on YouTube-VIS 2019 and 30.1 AP on YouTube-VIS 2021. Code will be available to facilitate real-world applications and future research.
Keyword: pruning

Explainable Intrusion Detection Systems Using Competitive Learning Techniques
Authors: Jesse Ables, Thomas Kirby, Sudip Mittal, Ioana Banicescu, Shahram Rahimi, William Anderson, Maria Seale
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17387
Pdf link: https://arxiv.org/pdf/2303.17387
Abstract The current state-of-the-art systems in Artificial Intelligence (AI) enabled intrusion detection use a variety of black box methods. These black box methods are generally trained using Error Based Learning (EBL) techniques with a focus on creating accurate models. These models have high performative costs and are not easily explainable. A white box Competitive Learning (CL) based eXplainable Intrusion Detection System (X-IDS) offers a potential solution to these problems. CL models utilize an entirely different learning paradigm than EBL approaches. This different learning process makes the CL family of algorithms innately explainable and less resource intensive. In this paper, we create an X-IDS architecture that is based on DARPA's recommendation for explainable systems. In our architecture we leverage CL algorithms such as Self Organizing Maps (SOM), Growing Self Organizing Maps (GSOM), and Growing Hierarchical Self Organizing Map (GHSOM). The resulting models can be data-mined to create statistical and visual explanations. Our architecture is tested using NSL-KDD and CIC-IDS-2017 benchmark datasets, and produces accuracies that are 1% - 3% less than EBL models. However, CL models are much more explainable than EBL models. Additionally, we use a pruning process that is able to significantly reduce the size of these CL based models. By pruning our models, we are able to increase prediction speeds. Lastly, we analyze the statistical and visual explanations generated by our architecture, and we give a strategy that users could use to help navigate the set of explanations. These explanations will help users build trust with an Intrusion Detection System (IDS), and allow users to discover ways to increase the IDS's potency.
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Authors: Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, Song Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17605
Pdf link: https://arxiv.org/pdf/2303.17605
Abstract High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to be translated into actual speedup for CNNs since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT that revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned with different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply the evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
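Window activation pruning can be illustrated with a minimal sketch. The abstract does not state how windows are scored, so ranking by the L2 norm of each window's activations is an assumption here, and `select_windows` is a hypothetical helper rather than part of any SparseViT release.

```python
def select_windows(window_activations, sparsity):
    """Return the indices of windows to keep after pruning.

    window_activations: list of flat activation lists, one per window.
    sparsity: fraction of windows to skip, in [0, 1).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    # Assumed importance score: L2 norm of the window's activations.
    scores = [sum(v * v for v in window) ** 0.5 for window in window_activations]
    keep = max(1, round(len(scores) * (1.0 - sparsity)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return kept indices in spatial order so they can be batched together.
    return sorted(ranked[:keep])


windows = [[0.1, 0.1], [2.0, 1.0], [0.0, 0.3], [1.5, 1.5], [0.2, 0.1]]
print(select_windows(windows, sparsity=0.6))  # [1, 3]
```

Only the kept windows would then be passed through window attention, which is where the latency saving comes from; a layerwise `sparsity` value per block corresponds to the configuration that the abstract says is found by evolutionary search.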
Keyword: voxel

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions
Authors: Lingdong Kong, Youquan Liu, Xin Li, Runnan Chen, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17597
Pdf link: https://arxiv.org/pdf/2303.17597
Abstract The robustness of 3D perception systems under natural corruptions from environments and sensors is pivotal for safety-critical applications. Existing large-scale 3D perception datasets often contain data that are meticulously cleaned. Such configurations, however, cannot reflect the reliability of perception models during the deployment stage. In this work, we present Robo3D, the first comprehensive benchmark heading toward probing the robustness of 3D detectors and segmentors under out-of-distribution scenarios against natural corruptions that occur in real-world environments. Specifically, we consider eight corruption types stemming from adversarial weather conditions, external disturbances, and internal sensor failure. We uncover that, although promising results have been progressively achieved on standard benchmarks, state-of-the-art 3D perception models are at risk of being vulnerable to corruptions. We draw key observations on the use of data representations, augmentation schemes, and training strategies, that could severely affect the model's performance. To pursue better robustness, we propose a density-insensitive training framework along with a simple flexible voxelization strategy to enhance the model resiliency. We hope our benchmark and approach could inspire future research in designing more robust and reliable 3D perception models. Our robustness benchmark suite is publicly available.
Keyword: lidar

T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals
Authors: James Giroux, Martin Bouchard, Robert Laganiere
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.16940
Pdf link: https://arxiv.org/pdf/2303.16940
Abstract Object detection utilizing Frequency Modulated Continuous Wave radar is becoming increasingly popular in the field of autonomous systems. Radar does not possess the same drawbacks seen by other emission-based sensors such as LiDAR, primarily the degradation or loss of return signals due to weather conditions such as rain or snow. However, radar does possess traits that make it unsuitable for standard emission-based deep learning representations such as point clouds. Radar point clouds tend to be sparse and therefore information extraction is not efficient. To overcome this, more traditional digital signal processing pipelines were adapted to form inputs residing directly in the frequency domain via Fast Fourier Transforms. Commonly, three transformations were used to form Range-Azimuth-Doppler cubes in which deep learning algorithms could perform object detection. This too has drawbacks, namely the pre-processing costs associated with performing multiple Fourier Transforms and normalization. We explore the possibility of operating on raw radar inputs from analog to digital converters via the utilization of complex transformation layers. Moreover, we introduce hierarchical Swin Vision transformers to the field of radar object detection and show their capability to operate on inputs varying in pre-processing, along with different radar configurations, i.e. relatively low and high numbers of transmitters and receivers, while obtaining on par or better results than the state-of-the-art.
BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation
Authors: Hongxiang Cai, Zeyuan Zhang, Zhenyu Zhou, Ziyin Li, Wenbo Ding, Jiuhua Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17099
Pdf link: https://arxiv.org/pdf/2303.17099
Abstract Integrating LiDAR and Camera information into Bird's-Eye-View (BEV) has become an essential topic for 3D object detection in autonomous driving. Existing methods mostly adopt an independent dual-branch framework to generate LiDAR and camera BEV, then perform an adaptive modality fusion. Since point clouds provide more accurate localization and geometry information, they could serve as a reliable spatial prior to acquiring relevant semantic information from the images. Therefore, we design a LiDAR-Guided View Transformer (LGVT) to effectively obtain the camera representation in BEV space and thus benefit the whole dual-branch fusion system. LGVT takes camera BEV as the primitive semantic query, repeatedly leveraging the spatial cue of LiDAR BEV for extracting image features across multiple camera views. Moreover, we extend our framework into the temporal domain with our proposed Temporal Deformable Alignment (TDA) module, which aims to aggregate BEV features from multiple historical frames. Including these two modules, our framework dubbed BEVFusion4D achieves state-of-the-art results in 3D object detection, with 72.0% mAP and 73.5% NDS on the nuScenes validation set, and 73.3% mAP and 74.7% NDS on nuScenes test set, respectively.
Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving
Authors: Zijian Zhu, Yichi Zhang, Hai Chen, Yinpeng Dong, Shu Zhao, Wenbo Ding, Jiachen Zhong, Shibao Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2303.17297
Pdf link: https://arxiv.org/pdf/2303.17297
Abstract 3D object detection is an essential perception task in autonomous driving to understand the environments. The Bird's-Eye-View (BEV) representations have significantly improved the performance of 3D detectors with camera inputs on popular benchmarks. However, a systematic understanding of the robustness of these vision-dependent BEV models, which is closely related to the safety of autonomous driving systems, is still lacking. In this paper, we evaluate the natural and adversarial robustness of various representative models under extensive settings, to fully understand their behaviors influenced by explicit BEV features compared with those without BEV. In addition to the classic settings, we propose a 3D consistent patch attack by applying adversarial patches in the 3D space to guarantee the spatiotemporal consistency, which is more realistic for the scenario of autonomous driving. With substantial experiments, we draw several findings: 1) BEV models tend to be more stable than previous methods under different natural conditions and common corruptions due to the expressive spatial representations; 2) BEV models are more vulnerable to adversarial noises, mainly caused by the redundant BEV features; 3) Camera-LiDAR fusion models have superior performance under different settings with multi-modal inputs, but the BEV fusion model is still vulnerable to adversarial noises of both point cloud and image. These findings highlight safety issues in the applications of BEV detectors and could facilitate the development of more robust models.
Event-based Agile Object Catching with a Quadrupedal Robot
Authors: Benedek Forrai, Takahiro Miki, Daniel Gehrig, Marco Hutter, Davide Scaramuzza
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17479
Pdf link: https://arxiv.org/pdf/2303.17479
Abstract Quadrupedal robots are conquering various indoor and outdoor applications due to their ability to navigate challenging uneven terrains. Exteroceptive information greatly enhances this capability since perceiving their surroundings allows them to adapt their controller and thus achieve higher levels of robustness. However, sensors such as LiDARs and RGB cameras do not provide sufficient information to quickly and precisely react in a highly dynamic environment since they suffer from a bandwidth-latency tradeoff. They require significant bandwidth at high frame rates while featuring significant perceptual latency at lower frame rates, thereby limiting their versatility on resource-constrained platforms. In this work, we tackle this problem by equipping our quadruped with an event camera, which does not suffer from this tradeoff due to its asynchronous and sparse operation. In leveraging the low latency of the events, we push the limits of quadruped agility and demonstrate high-speed ball catching for the first time. We show that our quadruped equipped with an event camera can catch objects with speeds up to 15 m/s from 4 meters, with a success rate of 83%. Using a VGA event camera, our method runs at 100 Hz on an NVIDIA Jetson Orin.
Keyword: diffusion

HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion
Authors: Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, Angela Dai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17015
Pdf link: https://arxiv.org/pdf/2303.17015
Abstract Implicit neural fields, typically encoded by a multilayer perceptron (MLP) that maps from coordinates (e.g., xyz) to signals (e.g., signed distances), have shown remarkable promise as a high-fidelity and compact representation. However, the lack of a regular and explicit grid structure also makes it challenging to apply generative modeling directly on implicit neural fields in order to synthesize new data. To this end, we propose HyperDiffusion, a novel approach for unconditional generative modeling of implicit neural fields. HyperDiffusion operates directly on MLP weights and generates new neural implicit fields encoded by synthesized MLP parameters. Specifically, a collection of MLPs is first optimized to faithfully represent individual data samples. Subsequently, a diffusion process is trained in this MLP weight space to model the underlying distribution of neural implicit fields. HyperDiffusion enables diffusion modeling over an implicit, compact, and yet high-fidelity representation of complex signals across 3D shapes and 4D mesh animations within one single unified framework.
DiffCollage: Parallel Generation of Large Content with Diffusion Models
Authors: Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, Ming-Yu Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17076
Pdf link: https://arxiv.org/pdf/2303.17076
Abstract We present DiffCollage, a compositional diffusion model that can generate large content by leveraging diffusion models trained on generating pieces of the large content. Our approach is based on a factor graph representation where each factor node represents a portion of the content and a variable node represents their overlap. This representation allows us to aggregate intermediate outputs from diffusion models defined on individual nodes to generate content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure. We apply DiffCollage to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation. Extensive experimental results with a comparison to strong autoregressive baselines verify the effectiveness of our approach.
Deep Generative Model and Its Applications in Efficient Wireless Network Management: A Tutorial and Case Study
Authors: Yinqiu Liu, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Dong In Kim, Abbas Jamalipour
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17114
Pdf link: https://arxiv.org/pdf/2303.17114
Abstract With the phenomenal success of diffusion models and ChatGPT, deep generative models (DGMs) have been experiencing explosive growth since 2022. Not limited to content generation, DGMs are also widely adopted in the Internet of Things, Metaverse, and digital twins, due to their outstanding ability to represent complex patterns and generate plausible samples. In this article, we explore the applications of DGMs in a crucial task, i.e., improving the efficiency of wireless network management. Specifically, we first give an overview of generative AI, as well as three representative DGMs. Then, a DGM-empowered framework for wireless network management is proposed, in which we elaborate on the issues of conventional network management approaches, why DGMs can address them efficiently, and the step-by-step workflow for applying DGMs in managing wireless networks. Moreover, we conduct a case study on network economics, using the state-of-the-art DGM, i.e., the diffusion model, to generate effective contracts for incentivizing mobile AI-Generated Content (AIGC) services. Last but not least, we discuss important open directions for further research.
Discriminative Class Tokens for Text-to-Image Diffusion Models
Authors: Idan Schwartz, Vésteinn Snæbjarnarson, Sagie Benaim, Hila Chefer, Ryan Cotterell, Lior Wolf, Serge Belongie
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2303.17155
Pdf link: https://arxiv.org/pdf/2303.17155
Abstract Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. However, generated images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. One way of alleviating these issues is to train diffusion models on class-labeled datasets. This comes with a downside, as doing so limits their expressive power: (i) supervised datasets are generally small compared to large-scale scraped text-image datasets on which text-to-image models are trained, and so the quality and diversity of generated images are severely affected, or (ii) the input is a hard-coded label, as opposed to free-form text, which limits the control over the generated images. In this work, we propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier, which guides the generation. This is done by iteratively modifying the embedding of a single input token of a text-to-image diffusion model, using the classifier, by steering generated images toward a given target class. Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images or retraining of a noise-tolerant classifier. We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier. The code is available at https://github.com/idansc/discriminative_class_tokens
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
Authors: Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17189
Pdf link: https://arxiv.org/pdf/2303.17189
Abstract Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatial related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID, CAS by relatively 46.35%, 26.70% on COCO-stuff and 44.29%, 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
Authors: Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17546
Pdf link: https://arxiv.org/pdf/2303.17546
Abstract Image editing using diffusion models has witnessed extremely fast-paced growth recently. There are various ways in which previous works enable controlling and editing images. Some works use high-level conditioning such as text, while others use low-level conditioning. Nevertheless, most of them lack fine-grained control over the properties of the different objects present in the image, i.e. object-level image editing. In this work, we consider an image as a composition of multiple objects, each defined by various properties. Out of these properties, we identify structure and appearance as the most intuitive to understand and useful for editing purposes. We propose Structure-and-Appearance Paired Diffusion model (PAIR-Diffusion), which is trained using structure and appearance information explicitly extracted from the images. The proposed model enables users to inject a reference image's appearance into the input image at both the object and global levels. Additionally, PAIR-Diffusion allows editing the structure while maintaining the style of individual components of the image unchanged. We extensively evaluate our method on LSUN datasets and the CelebA-HQ face dataset, and we demonstrate fine-grained control over both structure and appearance at the object level. We also applied the method to Stable Diffusion to edit any real image at the object level.
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
Authors: Chenpng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, Jiang Bian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2303.17550
Pdf link: https://arxiv.org/pdf/2303.17550
Abstract While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.
DDP: Diffusion Model for Dense Visual Prediction
Authors: Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17559
Pdf link: https://arxiv.org/pdf/2303.17559
Abstract We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks with six diverse benchmarks; without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts. Examples include semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research.
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
Authors: Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17591
Pdf link: https://arxiv.org/pdf/2303.17591
Abstract The unlearning problem of deep learning models, once primarily an academic concern, has become a prevalent issue in the industry. The significant advances in text-to-image generation techniques have prompted global discussions on privacy, copyright, and safety, as numerous unauthorized personal IDs, content, artistic creations, and potentially harmful materials have been learned by these models and later utilized to generate and distribute uncontrolled content. To address this challenge, we propose Forget-Me-Not, an efficient and low-cost solution designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds, without impairing its ability to generate other content. Alongside our method, we introduce the Memorization Score (M-Score) and ConceptBench to measure the models' capacity to generate general concepts, grouped into three primary categories: ID, object, and style. Using M-Score and ConceptBench, we demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts. Furthermore, Forget-Me-Not offers two practical extensions: a) removal of potentially harmful or NSFW content, and b) enhancement of model accuracy, inclusion and diversity through concept correction and disentanglement. It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution. To encourage future research in this critical area and promote the development of safe and inclusive generative models, we will open-source our code and ConceptBench at https://github.com/SHI-Labs/Forget-Me-Not.
Consistent View Synthesis with Pose-Guided Diffusion Models
Authors: Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, Johannes Kopf
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17598
Pdf link: https://arxiv.org/pdf/2303.17598
Abstract Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications that provide immersive experiences. However, most existing techniques can only synthesize novel views within a limited range of camera motion or fail to generate consistent and high-quality novel views under significant camera movement. In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image. We design an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed diffusion model against state-of-the-art transformer-based and GAN-based approaches.
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
Authors: Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, Chunhua Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17599
Pdf link: https://arxiv.org/pdf/2303.17599
Abstract Large-scale text-to-image diffusion models achieve unprecedented success in image generation and editing. However, how to extend such success to video editing is unclear. Recent initial attempts at video editing require significant text-to-video data and computation resources for training, which is often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video. At the core of our method is a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code will be made available at https://github.com/baaivision/vid2vid-zero.
Token Merging for Fast Stable Diffusion
Authors: Daniel Bolya, Judy Hoffman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17604
Pdf link: https://arxiv.org/pdf/2303.17604
Abstract The landscape of image generation has been forever changed by open vocabulary diffusion models. However, at their core these models use transformers, which makes generation slow. Better implementations to increase the throughput of these transformers have emerged, but they still evaluate the entire model. In this paper, we instead speed up diffusion models by exploiting natural redundancy in generated images by merging redundant tokens. After making some diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable Diffusion can reduce the number of tokens in an existing Stable Diffusion model by up to 60% while still producing high quality images without any extra training. In the process, we speed up image generation by up to 2x and reduce memory consumption by up to 5.6x. Furthermore, this speed-up stacks with efficient implementations such as xFormers, minimally impacting quality while being up to 5.4x faster for large images. Code is available at https://github.com/dbolya/tomesd.
AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control
Authors: Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17606
Pdf link: https://arxiv.org/pdf/2303.17606
Abstract Neural implicit fields are powerful for representing 3D scenes and generating high-quality novel views, but it remains challenging to use such implicit representations for creating a 3D human avatar with a specific identity and artistic style that can be easily animated. Our proposed method, AvatarCraft, addresses this challenge by using diffusion models to guide the learning of geometry and texture for a neural avatar based on a single text prompt. We carefully design the optimization framework of neural implicit fields, including a coarse-to-fine multi-bounding box training strategy, shape regularization, and diffusion-based constraints, to produce high-quality geometry and texture. Additionally, we make the human avatar animatable by deforming the neural implicit field with an explicit warping field that maps the target human mesh to a template human mesh, both represented using parametric human models. This simplifies animation and reshaping of the generated avatar by controlling pose and shape parameters. Extensive experiments on various text descriptions show that AvatarCraft is effective and robust in creating human avatars and rendering novel views, poses, and shapes. Our project page is: https://avatar-craft.github.io/.
Keyword: dynamic

Thrust vector control and state estimation architecture for low-cost small-scale launchers
Authors: Pedro dos Santos, Paulo Oliveira
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2303.16983
Pdf link: https://arxiv.org/pdf/2303.16983
Abstract This paper proposes an integrated architecture for Thrust Vector Control (TVC) and state estimation for low-cost small-scale launchers, naturally unstable, and propelled by a solid motor. The architecture is based on a non-linear, six-degrees-of-freedom model for the generic thrust-vector-controlled launcher dynamics and kinematics, deduced and implemented in a realistic simulation environment. For estimation and control design purposes, a linearized version of the model is proposed. Single-nozzle TVC actuation is adopted, allowing for pitch and yaw control, with the control law being derived from the Linear Quadratic Regulator (LQR) with additional integral action (LQI). The control system is implemented through gain scheduling. Full state estimation is performed using complementary kinematic filters, closely related to linear Kalman filtering theory. The architecture, composed of the navigation and control systems, is tested in a simulation environment, demonstrating satisfactory attitude tracking performance and robustness to both external disturbances and model uncertainties.
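Note: the abstract above derives its control law from the Linear Quadratic Regulator. As a rough illustration of that idea only, the sketch below computes an LQR gain for a scalar, discrete-time, unstable plant by iterating the Riccati recursion. The plant `x[k+1] = a*x[k] + b*u[k]`, the weights `q` and `r`, and the function name `scalar_dlqr` are all assumptions made for this example; the paper's six-degrees-of-freedom model, integral action, and gain scheduling are not reproduced here.

```python
def scalar_dlqr(a, b, q, r, iterations=200):
    """Return (K, P) for the scalar plant x[k+1] = a*x[k] + b*u[k].

    Minimises sum(q*x^2 + r*u^2) with the feedback law u = -K*x.
    This is a toy, hypothetical setup, not the launcher model of the paper.
    """
    p = q  # initial guess for the Riccati solution
    for _ in range(iterations):
        # Scalar discrete-time Riccati recursion
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return k, p


# Open-loop unstable plant (|a| > 1), loosely mirroring a "naturally unstable" vehicle
a, b = 1.2, 1.0
K, P = scalar_dlqr(a, b, q=1.0, r=1.0)
closed_loop_pole = a - b * K  # magnitude below 1 means the loop is stable
```

With these illustrative numbers the closed-loop pole `a - b*K` lies inside the unit circle, so the feedback stabilises the plant. A gain-scheduled design would recompute `K` for each linearisation point along the trajectory.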
PopSparse: Accelerated block sparse matrix multiplication on IPU
Authors: Zhiyi Li, Douglas Orr, Valeriu Ohan, Godfrey Da costa, Tom Murray, Adam Sanders, Deniz Beker, Dominic Masters
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.16999
Pdf link: https://arxiv.org/pdf/2303.16999
Abstract Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such as NVIDIA GPUs using low precision number formats. In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs as well as any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run. We present benchmark results for matrix multiplication for both of these modes on IPU with a range of block sizes, matrix sizes and densities. Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels with large matrix size and block size. Furthermore, static sparsity in general outperforms dynamic sparsity. While previous work on GPAs has shown speedups only for very high sparsity (typically 99% and above), the present work demonstrates that our static sparse implementation outperforms equivalent dense calculations in FP16 at lower sparsity (around 90%).
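Note: to make "block sparse matrix multiplication" concrete, here is a minimal pure-Python sketch of a block-sparse times dense product. The storage layout (a dictionary from block coordinates to dense blocks) and the names `block_sparse_matmul` and `block_density` are assumptions chosen for illustration; this is not the PopSparse API, and it says nothing about how the library maps the work onto IPU hardware.

```python
def block_sparse_matmul(blocks, block_size, n_rows, dense):
    """Multiply a block-sparse matrix by a dense matrix.

    blocks: dict mapping (block_row, block_col) -> block_size x block_size list of lists.
            Only non-zero blocks are stored (hypothetical layout for this sketch).
    dense:  dense right-hand matrix as a list of rows.
    """
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for (br, bc), block in blocks.items():
        r0, c0 = br * block_size, bc * block_size
        for i in range(block_size):
            out_row = out[r0 + i]
            for k in range(block_size):
                a = block[i][k]
                rhs_row = dense[c0 + k]
                for j in range(n_cols):
                    out_row[j] += a * rhs_row[j]
    return out


def block_density(blocks, block_size, n_rows, n_cols):
    """Fraction of matrix entries covered by stored blocks."""
    return len(blocks) * block_size * block_size / (n_rows * n_cols)


# 4x4 matrix with two stored 2x2 blocks on the block diagonal
blocks = {
    (0, 0): [[1, 2], [3, 4]],
    (1, 1): [[5, 6], [7, 8]],
}
dense = [[1, 0], [0, 1], [1, 1], [2, 0]]
result = block_sparse_matmul(blocks, 2, 4, dense)
density = block_density(blocks, 2, 4, 4)
```

Only the stored blocks are visited, so the work scales with the density rather than with the full matrix size. In the abstract's terms, a static pattern would fix the keys of `blocks` ahead of time, while a dynamic pattern would let them change between runs.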
Scalable Implicit Solvers with Dynamic Mesh Adaptation for a Relativistic Drift-Kinetic Fokker-Planck-Boltzmann Model
Authors: Johann Rudi, Max Heldman, Emil M. Constantinescu, Qi Tang, Xian-Zhu Tang
Subjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Plasma Physics (physics.plasm-ph)
Arxiv link: https://arxiv.org/abs/2303.17019
Pdf link: https://arxiv.org/pdf/2303.17019
Abstract In this work we consider a relativistic drift-kinetic model for runaway electrons along with a Fokker-Planck operator for small-angle Coulomb collisions, a radiation damping operator, and a secondary knock-on (Boltzmann) collision source. We develop a new scalable fully implicit solver utilizing finite volume and conservative finite difference schemes and dynamic mesh adaptivity. A new data management framework in the PETSc library based on the p4est library is developed to enable simulations with dynamic adaptive mesh refinement (AMR), parallel computation, and load balancing. This framework is tested through the development of the runaway electron solver that is able to dynamically capture both bulk Maxwellian at the low-energy region and a runaway tail at the high-energy region. To effectively capture features via the AMR algorithm, a new AMR indicator prediction strategy is proposed that is performed alongside the implicit time evolution of the solution. This strategy is complemented by the introduction of computationally cheap feature-based AMR indicators that are analyzed theoretically. Numerical results quantify the advantages of the prediction strategy in better capturing features compared with nonpredictive strategies; and we demonstrate trade-offs regarding computational costs. The full solver is further verified through several benchmark problems including manufactured solutions and solutions of physics models. We particularly focus on demonstrating the advantages of using implicit time stepping and AMR for runaway electron simulations.
Stability bounds of droop-controlled inverters in power grid networks
Authors: Philipp C. Böttcher, Leonardo Rydin Gorjão, Dirk Witthaut
Subjects: Systems and Control (eess.SY); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2303.17032
Pdf link: https://arxiv.org/pdf/2303.17032
Abstract The energy mix of future power systems will include high shares of wind power and solar PV. These generation facilities are generally connected via power-electronic inverters. While conventional generation responds dynamically to the state of the electric power system, inverters are power electronic hardware and need to be programmed to react to the state of the system. Choosing an appropriate control scheme and the corresponding parameters is necessary to guarantee that the system operates safely. A prominent control scheme for inverters is droop control, which mimics the response of conventional generation. In this work, we investigate the stability of coupled systems of droop-controlled inverters in arbitrary network topologies. Employing linear stability analysis, we derive effective local stability criteria that consider both the overall network topology as well as its interplay with the inverters' intrinsic parameters. First, we explore the stability of an inverter coupled to an infinite grid in an analytic fashion and uncover stability and instability regions. Secondly, we extend the analysis to a generic topology of inverters and provide mathematical criteria for stability and instability of the system. Last, we showcase the usefulness of the criteria by examining two model systems using numerical simulations. The developed criteria show which parameters might lead to an unstable operating state.
Material-agnostic Shaping of Granular Materials with Optimal Transport
Authors: Nikhilesh Alatur, Olov Andersson, Roland Siegwart, Lionel Ott
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17047
Pdf link: https://arxiv.org/pdf/2303.17047
Abstract From construction materials, such as sand or asphalt, to kitchen ingredients, like rice, sugar, or salt, the world is full of granular materials. Despite impressive progress in robotic manipulation, manipulating and interacting with granular material remains a challenge due to difficulties in perceiving, representing, modelling, and planning for these variable materials that have complex internal dynamics. While some prior work has looked into estimating or learning accurate dynamics models for granular materials, the literature is still missing a more abstract planning method that can be used for planning manipulation actions for granular materials with unknown material properties. In this work, we leverage tools from optimal transport and connect them to robot motion planning. We propose a heuristics-based sweep planner that does not require knowledge of the material's properties and directly uses a height map representation to generate promising sweeps. These sweeps transform granular material from arbitrary start shapes into arbitrary target shapes. We apply the sweep planner in a fast and reactive feedback loop and avoid the need for model-based planning over multiple time steps. We validate our approach with a large set of simulation and hardware experiments where we show that our method is capable of efficiently solving several complex tasks, including gathering, separating, and shaping of several types of granular materials into different target shapes.
Modularized Control Synthesis for Complex Signal Temporal Logic Specifications
Authors: Zengjie Zhang, Sofie Haesaert
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2303.17086
Pdf link: https://arxiv.org/pdf/2303.17086
Abstract The control synthesis of a dynamic system subject to signal temporal logic (STL) specifications is commonly formulated as a mixed-integer linear programming (MILP) problem. Solving a MILP problem is computationally expensive when the STL formulas are long and complex. In this paper, we propose a framework to transform a long and complex STL formula into a syntactically separate form, i.e., the logical combination of a series of short and simple subformulas with non-overlapping timing intervals. Using this framework, one can easily modularize the synthesis of a complex formula using the synthesis solutions of the subformulas, which improves the efficiency of solving a MILP problem. Specifically, we propose a group of separation principles to guarantee the syntactic equivalence between the original formula and its syntactically separate counterpart. Then, we propose novel methods to solve the largest satisfaction region and the open-loop controller of the specification in a modularized manner. The efficacy of the methods is validated with a robot monitoring case study in simulation. Our work is promising to promote the efficiency of control synthesis for systems with complicated specifications.
Learning Reliable Representations for Incomplete Multi-View Partial Multi-Label Classification
Authors: Chengliang Liu, Jie Wen, Yong Xu, Liqiang Nie, Min Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17117
Pdf link: https://arxiv.org/pdf/2303.17117
Abstract As a cross-topic of multi-view learning and multi-label classification, multi-view multi-label classification has gradually gained traction in recent years. The application of multi-view contrastive learning has further facilitated this process; however, the existing multi-view contrastive learning methods crudely separate the so-called negative pair, which largely results in the separation of samples belonging to the same category or similar ones. Besides, plenty of multi-view multi-label learning methods ignore the possible absence of views and labels. To address these issues, in this paper, we propose an incomplete multi-view partial multi-label classification network named RANK. In this network, a label-driven multi-view contrastive learning strategy is proposed to leverage supervised information to preserve the structure within view and perform consistent alignment across views. Furthermore, we break through the view-level weights inherent in existing methods and propose a quality-aware sub-network to dynamically assign quality scores to each view of each sample. The label correlation information is fully utilized in the final multi-label cross-entropy classification loss, effectively improving the discriminative power. Last but not least, our model is not only able to handle complete multi-view multi-label datasets, but also works on datasets with missing instances and labels. Extensive experiments confirm that our RANK outperforms existing state-of-the-art methods.
Weighted Scheduling of Time-Sensitive Coflows
Authors: Olivier Brun, Rachid El-Azouzi, Quang-Trung Luu, Francesco De Pellergrini, Balakrishna J. Prabhu, Cédric Richier
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2303.17175
Pdf link: https://arxiv.org/pdf/2303.17175
Abstract Datacenter networks routinely support the data transfers of distributed computing frameworks in the form of coflows, i.e., sets of concurrent flows related to a common task. The vast majority of the literature has focused on the problem of scheduling coflows for completion time minimization, i.e., to maximize the average rate at which coflows are dispatched in the network fabric. However, many modern applications generate coflows dedicated to online services and mission-critical computing tasks which have to comply with specific completion deadlines. In this paper, we introduce $\mathtt{WDCoflow}$, a new algorithm to maximize the weighted number of coflows that complete before their deadline. By combining a dynamic programming algorithm along with parallel inequalities, our heuristic solution performs at once coflow admission control and coflow prioritization, imposing a $\sigma$-order on the set of coflows. With extensive simulation, we demonstrate the effectiveness of our algorithm, which enables up to $3\times$ more coflows to meet their deadline in comparison to the best SotA solution, namely $\mathtt{CS\text{-}MHA}$. Furthermore, when weights are used to differentiate coflow classes, $\mathtt{WDCoflow}$ is able to improve the admission per class up to $4\times$, while increasing the average weighted coflow admission rate.
Innovative Countermeasures to Defeat Cyber Attacks Against Blockchain Wallets: A Crypto Terminal Use Case
Authors: Pascal Urien (LTCI)
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2303.17206
Pdf link: https://arxiv.org/pdf/2303.17206
Abstract Blockchain transactions are signed by private keys. Secure key storage and tamper-proof computers are essential requirements for deploying a trusted infrastructure. In this paper, we identify some threats against blockchain wallets and propose a set of physical and logical countermeasures to thwart them. We present the crypto terminal device, operating with a removable secure element, built on open software and hardware architectures, capable of detecting a cloned device or corrupted software. These technologies are based on tamper-resistant computing (javacard), smart card anti-cloning, smart card content attestation, application firewall, bare-metal architecture, remote attestation, dynamic Physical Unclonable Function (dPUF), and programming tokens as a root of trust. This paper is an extended version of the paper "Innovative Countermeasures to Defeat Cyber Attacks Against Blockchain Wallets," 2021 5th Cyber Security in Networking Conference (CSNet), 2021, pp. 49-54, doi: 10.1109/CSNet52717.2021.9614649
Multifactor Sequential Disentanglement via Structured Koopman Autoencoders
Authors: Nimrod Berman, Ilan Naiman, Omri Azencot
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17264
Pdf link: https://arxiv.org/pdf/2303.17264
Abstract Disentangling complex data into its latent factors of variation is a fundamental task in representation learning. Existing work on sequential disentanglement mostly provides two-factor representations, i.e., it separates the data into time-varying and time-invariant factors. In contrast, we consider multifactor disentanglement in which multiple (more than two) semantic disentangled components are generated. Key to our approach is a strong inductive bias where we assume that the underlying dynamics can be represented linearly in the latent space. Under this assumption, it becomes natural to exploit the recently introduced Koopman autoencoder models. However, disentangled representations are not guaranteed in Koopman approaches, and thus we propose a novel spectral loss term which leads to structured Koopman matrices and disentanglement. Overall, we propose a simple and easy-to-code new deep model that is fully unsupervised and supports multifactor disentanglement. We showcase new disentangling abilities such as swapping of individual static factors between characters, and an incremental swap of disentangled factors from the source to the target. Moreover, we evaluate our method extensively on two-factor standard benchmark tasks where we significantly improve over competing unsupervised approaches, and we perform competitively in comparison to weakly- and self-supervised state-of-the-art approaches. The code is available at https://github.com/azencot-group/SKD.
Improved a posteriori Error Bounds for Reduced port-Hamiltonian Systems
Authors: Johannes Rettberg, Dominik Wittwar, Patrick Buchfink, Robin Herkert, Jörg Fehr, Bernard Haasdonk
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2303.17329
Pdf link: https://arxiv.org/pdf/2303.17329
Abstract Projection-based model order reduction of dynamical systems usually introduces an error between the high-fidelity model and its counterpart of lower dimension. This unknown error can be bounded by residual-based methods, which are typically known to be highly pessimistic in the sense of largely overestimating the true error. This work applies two improved error bounding techniques, namely (a) a hierarchical error bound and (b) an error bound based on an auxiliary linear problem, to the case of port-Hamiltonian systems. The approaches rely on a second approximation of (a) the dynamical system and (b) the error system. In this paper, these methods are for the first time adapted to port-Hamiltonian systems by exploiting their structure. The mathematical relationship between the two methods is discussed both theoretically and numerically. The effectiveness of the described methods is demonstrated using a challenging three-dimensional port-Hamiltonian model of a classical guitar with fluid-structure interaction.
Uniform Substitution for Dynamic Logic with Communicating Hybrid Programs
Authors: Marvin Brieger, Stefan Mitsch, André Platzer
Subjects: Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2303.17333
Pdf link: https://arxiv.org/pdf/2303.17333
Abstract This paper introduces a uniform substitution calculus for $d\mathcal{L}_\text{CHP}$, the dynamic logic of communicating hybrid programs. Uniform substitution enables parsimonious prover kernels by using axioms instead of axiom schemata. Instantiations can be recovered from a single proof rule responsible for soundness-critical instantiation checks rather than being spread across axiom schemata in side conditions. Even though communication and parallelism reasoning are notorious for necessitating subtle soundness-critical side conditions, uniform substitution when generalized to $d\mathcal{L}_\text{CHP}$ manages to limit and isolate their conceptual overhead. Since uniform substitution has proven to simplify the implementation of hybrid systems provers substantially, uniform substitution for $d\mathcal{L}_\text{CHP}$ paves the way for a parsimonious implementation of theorem provers for hybrid systems with communication and parallelism.
The Essential Algorithms for the Matrix Chain
Authors: Francisco López, Lars Karlsson, Paolo Bientinesi
Subjects: Discrete Mathematics (cs.DM)
Arxiv link: https://arxiv.org/abs/2303.17352
Pdf link: https://arxiv.org/pdf/2303.17352
Abstract For a given product of $n$ matrices, the matrix chain multiplication problem asks for a parenthesisation that minimises the number of arithmetic operations. In 1973, Godbole presented a now classical dynamic programming formulation with cubic time complexity on the length of the chain. The best known algorithms run in linearithmic time, and the best known approximation algorithms run in linear time with an approximation factor smaller than two. All solutions have in common that they select an optimal parenthesisation from a set of $C_{n-1}$ (Catalan number $n - 1$) distinct parenthesisations. We studied the set of parenthesisations and discovered (a) that all of the exponentially many parenthesisations are useful in the sense that they are optimal in an infinite subset of the input space, (b) that only $n + 1$ parenthesisations are essential in the sense that they are arbitrarily better than the second best on an infinite subset of the input space, and (c) that the best essential parenthesisation is never more than twice as costly as the best non-essential parenthesisation. Through random sampling of the input space, we further discovered that the set of essential parenthesisations includes an optimal parenthesisation in the vast majority of inputs, and that the best essential parenthesisation is on average much closer to optimal than the worst-case bound. The results have direct consequences for the development of compilers for linear algebra expressions where the matrix sizes are unknown at compile-time.
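Note: the classical cubic-time dynamic programming formulation mentioned in the abstract above can be sketched as follows. The cost model (counting `p*q*r` scalar multiplications for a `p x q` times `q x r` product) and the function names are assumptions made for this illustration; the sketch covers only the textbook formulation, not the essential-parenthesisation results of the paper.

```python
from math import comb


def catalan(m):
    """m-th Catalan number; a chain of n matrices has catalan(n - 1) parenthesisations."""
    return comb(2 * m, m) // (m + 1)


def matrix_chain_order(dims):
    """Classical O(n^3) dynamic program for the matrix chain problem.

    dims has length n + 1; matrix A_i has shape dims[i-1] x dims[i].
    Returns (minimal scalar-multiplication count, parenthesisation string).
    """
    n = len(dims) - 1
    if n < 1:
        raise ValueError("need at least one matrix")
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            best = None
            for k in range(i, j):           # split between positions k and k + 1
                c = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                if best is None or c < best:
                    best = c
                    split[i][j] = k
            cost[i][j] = best

    def render(i, j):
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({render(i, k)} {render(k + 1, j)})"

    return cost[0][n - 1], render(0, n - 1)


# Three matrices of shapes 10x30, 30x5, 5x60
best_cost, order = matrix_chain_order([10, 30, 5, 60])
```

For this input, multiplying `A1 A2` first costs 1500 + 3000 = 4500 scalar multiplications, against 27000 for the other order, so the program selects `((A1 A2) A3)`. The two candidates match `catalan(2) == 2`.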
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery
Authors: Nan Pu, Zhun Zhong, Nicu Sebe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17393
Pdf link: https://arxiv.org/pdf/2303.17393
Abstract Generalized category discovery (GCD) is a recently proposed open-world problem, which aims to automatically cluster partially labeled data. The main challenge is that the unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories. This renders traditional novel category discovery (NCD) methods unsuitable for GCD, due to their assumption that unlabeled data are only from novel categories. One effective way for GCD is applying self-supervised learning to learn discriminative representations for unlabeled data. However, this approach largely ignores underlying relationships between instances of the same concepts (e.g., class, super-class, and sub-class), which results in inferior representation learning. In this paper, we propose a Dynamic Conceptional Contrastive Learning (DCCL) framework, which can effectively improve clustering accuracy by alternately estimating underlying visual conceptions and learning conceptional representation. In addition, we design a dynamic conception generation and update mechanism, which is able to ensure consistent conception learning and thus further facilitate the optimization of DCCL. Extensive experiments show that DCCL achieves new state-of-the-art performances on six generic and fine-grained visual recognition datasets, especially on fine-grained ones. For example, our method significantly surpasses the best competitor by 16.2% on the new classes for the CUB-200 dataset. Code is available at https://github.com/TPCD/DCCL.
Fast inference of latent space dynamics in huge relational event networks
Authors: Igor Artico, Ernst Wit
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2303.17460
Pdf link: https://arxiv.org/pdf/2303.17460
Abstract Relational events are a type of social interaction, sometimes referred to as dynamic networks. Their dynamics typically depend on emerging patterns, so-called endogenous variables, or external forces, referred to as exogenous variables. Comprehensive information on the actors in the network, especially for huge networks, is rare, however. A latent space approach in network analysis has been a popular way to account for unmeasured covariates that are driving network configurations. Bayesian and EM-type algorithms have been proposed for inferring the latent space, but both the sheer size of many social network applications and the dynamic nature of the process, and therefore of the latent space, make computations prohibitively expensive. In this work we propose a likelihood-based algorithm that can deal with huge relational event networks. We propose a hierarchical strategy for inferring network community dynamics embedded into an interpretable latent space. Node dynamics are described by smooth spline processes. To make the framework feasible for large networks we borrow from machine learning optimization methodology. Model-based clustering is carried out via a convex clustering penalization, encouraging shared trajectories for ease of interpretation. We propose a model-based approach for separating macro-microstructures and perform a hierarchical analysis within successive hierarchies. The method can fit millions of nodes on a public Colab GPU in a few minutes. The code and a tutorial are available in a GitHub repository.
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
Authors: Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, Chen Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17472
Pdf link: https://arxiv.org/pdf/2303.17472
Abstract Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the frame number is increased to obtain advanced estimation accuracy, and they are not robust to noise naturally brought by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2.
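To make the idea of a compact frequency-domain representation concrete, here is a minimal sketch, assuming a discrete cosine transform (DCT) along the frame axis and keeping only the lowest-frequency coefficients. The abstract does not name the transform, so the DCT, the function names `dct_matrix`, `compress_sequence`, and `reconstruct_sequence`, and the array layout `(frames, joints, 2)` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of size n x n (each row is one basis vector)."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # frame (time) index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)  # rescale the DC row so that mat @ mat.T == I
    return mat


def compress_sequence(seq: np.ndarray, keep: int) -> np.ndarray:
    """Map a (frames, joints, 2) sequence to its `keep` lowest DCT coefficients.

    Hypothetical helper: the result has shape (keep, joints, 2).
    """
    n_frames = seq.shape[0]
    coeffs = np.tensordot(dct_matrix(n_frames), seq, axes=(1, 0))
    return coeffs[:keep]


def reconstruct_sequence(coeffs: np.ndarray, n_frames: int) -> np.ndarray:
    """Invert the truncated DCT, treating the dropped coefficients as zero."""
    keep = coeffs.shape[0]
    basis = dct_matrix(n_frames)[:keep]  # shape (keep, n_frames)
    return np.tensordot(basis.T, coeffs, axes=(1, 0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, joints = 81, 17
    # Synthetic smooth motion plus detector-like noise (made-up data).
    time = np.linspace(0.0, 1.0, frames)[:, None, None]
    clean = np.sin(2 * np.pi * time) * np.ones((1, joints, 2))
    noisy = clean + 0.05 * rng.standard_normal(clean.shape)

    coeffs = compress_sequence(noisy, keep=8)
    smooth = reconstruct_sequence(coeffs, frames)
    print("tokens along time:", frames, "->", coeffs.shape[0])
    print("error before:", np.linalg.norm(noisy - clean))
    print("error after: ", np.linalg.norm(smooth - clean))
```

Dropping high-frequency coefficients shortens the sequence a downstream model has to attend over and discards much of the frame-to-frame jitter, which is the intuition behind both the efficiency and the robustness claims in the abstract.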
Differentiable Environment Primitives for Contact State Estimation
Authors: Kevin Haninger, Kangwagye Samuel, Filippo Rozzi, Sehoon Oh, Loris Roveda
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17476
Pdf link: https://arxiv.org/pdf/2303.17476
Abstract In contact-rich manipulation, the robot dynamics are coupled with an environment that has application-specific dynamic properties (stiffness, inertia) and geometry (contact normal). Knowledge of these environmental parameters can improve control and monitoring, but they are often unobserved and may vary, either online or between task instances. Observers, such as the extended Kalman filter, can be used to estimate these parameters, but such model-based techniques can require too much engineering work to scale up to complex environments, such as multi-point contact. To accelerate environment modeling, we propose environment primitives: parameterized environment dynamics that can be connected in parallel and are expressed in an automatic differentiation framework. This simplifies offline gradient-based optimization to fit model parameters and linearization of the coupled dynamics for an observer. This method is implemented for stiffness contact models, allowing the fitting of contact geometry and stiffness offline or their online estimation by an extended Kalman filter. This method is applied to a collaborative robot, estimating external force, contact stiffness, and contact geometry from the motor position and current. The estimates of external force and stiffness are compared with a momentum observer and direct force measurements.
On the Analysis of Computational Delays in Reinforcement Learning-based Rate Adaptation Algorithms
Authors: Ricardo Trancoso, Ruben Queiros, Helder Fontes, Rui Campos
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2303.17477
Pdf link: https://arxiv.org/pdf/2303.17477
Abstract Several research works have applied Reinforcement Learning (RL) algorithms to solve the Rate Adaptation (RA) problem in Wi-Fi networks. The dynamic nature of the radio link requires the algorithms to be responsive to changes in link quality. Delays in the execution of the algorithm may be detrimental to its performance, which in turn may decrease network performance. This aspect has been overlooked in the state of the art. In this paper, we present an analysis of common computational delays in RL-based RA algorithms, and propose a methodology that may be applied to reduce these computational delays and increase the efficiency of this type of algorithm. We apply the proposed methodology to an existing RL-based RA algorithm. The obtained experimental results indicate a reduction of one order of magnitude in the execution time of the algorithm, improving its responsiveness to link quality changes.
Event-based Agile Object Catching with a Quadrupedal Robot
Authors: Benedek Forrai, Takahiro Miki, Daniel Gehrig, Marco Hutter, Davide Scaramuzza
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17479
Pdf link: https://arxiv.org/pdf/2303.17479
Abstract Quadrupedal robots are conquering various indoor and outdoor applications due to their ability to navigate challenging uneven terrains. Exteroceptive information greatly enhances this capability since perceiving their surroundings allows them to adapt their controller and thus achieve higher levels of robustness. However, sensors such as LiDARs and RGB cameras do not provide sufficient information to quickly and precisely react in a highly dynamic environment since they suffer from a bandwidth-latency tradeoff. They require significant bandwidth at high frame rates while featuring significant perceptual latency at lower frame rates, thereby limiting their versatility on resource-constrained platforms. In this work, we tackle this problem by equipping our quadruped with an event camera, which does not suffer from this tradeoff due to its asynchronous and sparse operation. In leveraging the low latency of the events, we push the limits of quadruped agility and demonstrate high-speed ball catching for the first time. We show that our quadruped equipped with an event camera can catch objects with speeds up to 15 m/s from 4 meters, with a success rate of 83%. Using a VGA event camera, our method runs at 100 Hz on an NVIDIA Jetson Orin.
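The figures quoted in the abstract imply a tight reaction budget. As a rough back-of-the-envelope sketch, assuming the object travels the full 4 m in a straight line at a constant 15 m/s (gravity and drag are ignored, and 15 m/s is the stated upper bound, so this is a worst case), the flight time and the number of 100 Hz perception updates available can be estimated as follows. The function names are hypothetical.

```python
import math


def flight_time_s(distance_m: float, speed_mps: float) -> float:
    """Time for an object to cover distance_m at a constant speed_mps."""
    return distance_m / speed_mps


def full_update_cycles(duration_s: float, rate_hz: float) -> int:
    """Number of complete update cycles that fit into duration_s."""
    return math.floor(duration_s * rate_hz)


if __name__ == "__main__":
    t = flight_time_s(distance_m=4.0, speed_mps=15.0)
    cycles = full_update_cycles(t, rate_hz=100.0)
    print(f"flight time: {t:.3f} s")       # prints "flight time: 0.267 s"
    print(f"100 Hz updates: {cycles}")     # prints "100 Hz updates: 26"
```

Under these assumptions the robot has roughly a quarter of a second, or about 26 perception updates, to detect the object, predict its trajectory, and move.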
Teaching contact-rich tasks from visual demonstrations by constraint extraction
Authors: Christian Hegeler, Filippo Rozzi, Loris Roveda, Kevin Haninger
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2303.17481
Pdf link: https://arxiv.org/pdf/2303.17481
Abstract Contact-rich manipulation involves kinematic constraints on the task motion, typically with discrete transitions between these constraints during the task. Allowing the robot to detect and reason about these contact constraints can support robust and dynamic manipulation, but how can these contact models be efficiently learned? Purely visual observations are an attractive data source, allowing passive task demonstrations with unmodified objects. Existing approaches for vision-only learning from demonstration are effective in pick-and-place applications and planar tasks. Nevertheless, limited accuracy, occlusions, and unobserved task dynamics can limit their robustness in contact-rich manipulation. To use visual demonstrations for contact-rich robotic tasks, we consider the demonstration of pose trajectories with transitions between holonomic kinematic constraints, first clustering the trajectories into discrete contact modes, then fitting kinematic constraints for each mode. The fitted constraints are then used to (i) detect contact online with force/torque measurements and (ii) plan the robot policy with respect to the active constraint. We demonstrate the approach with real experiments, on cabling and rake tasks, showing the approach gives robust manipulation through contact transitions.
DDP: Diffusion Model for Dense Visual Prediction
Authors: Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17559
Pdf link: https://arxiv.org/pdf/2303.17559
Abstract We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks with six diverse benchmarks: without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts. Examples include semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research.
TiDy-PSFs: Computational Imaging with Time-Averaged Dynamic Point-Spread-Functions
Authors: Sachin Shah, Sakshum Kulshrestha, Christopher A. Metzler
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17583
Pdf link: https://arxiv.org/pdf/2303.17583
Abstract Point-spread-function (PSF) engineering is a powerful computational imaging technique wherein a custom phase mask is integrated into an optical system to encode additional information into captured images. Used in combination with deep learning, such systems now offer state-of-the-art performance at monocular depth estimation, extended depth-of-field imaging, lensless imaging, and other tasks. Inspired by recent advances in spatial light modulator (SLM) technology, this paper answers a natural question: Can one encode additional information and achieve superior performance by changing a phase mask dynamically over time? We first prove that the set of PSFs described by static phase masks is non-convex and that, as a result, time-averaged PSFs generated by dynamic phase masks are fundamentally more expressive. We then demonstrate, in simulation, that time-averaged dynamic (TiDy) phase masks can offer substantially improved monocular depth estimation and extended depth-of-field imaging performance.
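The step from "non-convex" to "more expressive" can be spelled out with a short worked formulation. This is a sketch under the standard scalar Fourier-optics model, which the abstract does not state explicitly: the symbols $A$ (aperture amplitude), $\phi$ (phase mask), $\mathcal{F}$ (2D Fourier transform), and $\mathcal{S}$ (set of static-mask PSFs) are notation assumed here for illustration.

```latex
% Assumed notation, not taken from the abstract.
\begin{align}
  % PSF produced by a single static phase mask \phi
  h_{\phi}(x) &= \left| \mathcal{F}\{ A(u)\, e^{i\phi(u)} \}(x) \right|^{2},
  \qquad \mathcal{S} = \{\, h_{\phi} : \phi \text{ a phase mask} \,\} \\
  % PSF seen by the sensor when T masks are shown within one exposure
  \bar{h}(x) &= \frac{1}{T} \sum_{t=1}^{T} h_{\phi_t}(x)
  \;\in\; \operatorname{conv}(\mathcal{S})
\end{align}
```

A time-averaged PSF is a convex combination of static PSFs, so it always lies in the convex hull of $\mathcal{S}$. Because the abstract states that $\mathcal{S}$ is non-convex, that hull is strictly larger than $\mathcal{S}$, which means some time-averaged PSFs cannot be produced by any single static mask.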
Polarity is all you need to learn and transfer faster
Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric Bridgeford, Joshua T. Vogelstein
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2303.17589
Pdf link: https://arxiv.org/pdf/2303.17589
Abstract Natural intelligences (NIs) thrive in a dynamic world - they learn quickly, sometimes with only a few samples. In contrast, artificial intelligences (AIs) typically learn with a prohibitive number of training samples and a prohibitive amount of computational power. What difference in design principles between NI and AI could contribute to such a discrepancy? Here, we propose an angle from weight polarity: development processes initialize NIs with advantageous polarity configurations; as NIs grow and learn, synapse magnitudes update yet polarities are largely kept unchanged. We demonstrate with simulation and image classification tasks that if weight polarities are adequately set a priori, then networks learn with less time and data. We also explicitly illustrate situations in which setting the weight polarities a priori is disadvantageous for networks. Our work illustrates the value of weight polarities from the perspective of statistical and computational efficiency during learning.
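One simple way to picture "magnitudes update, polarities stay fixed" is a gradient step followed by a projection that stops any weight from crossing zero. The sketch below is an illustrative assumption about how such a constraint could be enforced, not the training rule used in the paper; the function name `polarity_preserving_step` and the clamp-to-zero projection are hypothetical.

```python
import numpy as np


def polarity_preserving_step(
    weights: np.ndarray,
    grads: np.ndarray,
    polarity: np.ndarray,
    lr: float = 0.1,
) -> np.ndarray:
    """Take one gradient step while keeping every weight on its assigned sign.

    polarity holds +1 or -1 per weight and is fixed before training starts.
    """
    updated = weights - lr * grads
    # A weight whose sign now disagrees with its polarity is clamped to zero,
    # so its magnitude shrinks but it never switches to the opposite sign.
    flipped = np.sign(updated) * polarity < 0
    updated[flipped] = 0.0
    return updated


if __name__ == "__main__":
    weights = np.array([0.5, -0.5, 0.1])
    polarity = np.sign(weights)          # polarities set a priori
    grads = np.array([1.0, -1.0, 2.0])   # made-up gradients
    print(polarity_preserving_step(weights, grads, polarity))
```

In this toy run the third weight would have moved from 0.1 to -0.1, so it is clamped to zero, while the first two weights shrink in magnitude and keep their signs.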
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
Authors: Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, Chunhua Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2303.17599
Pdf link: https://arxiv.org/pdf/2303.17599
Abstract Large-scale text-to-image diffusion models achieve unprecedented success in image generation and editing. However, how to extend such success to video editing is unclear. Recent initial attempts at video editing require significant text-to-video data and computation resources for training, which are often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code will be made available at https://github.com/baaivision/vid2vid-zero.

New submissions for Fri, 31 Mar 23 #22

Keyword: efficient

Machine learning-based spin structure detection

Optimizing Reconfigurable Intelligent Surfaces for Short Transmissions: How Detailed Configurations can be Afforded?

T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals

Concise QBF Encodings for Games on a Grid (extended version)

Fairness-Aware Data Valuation for Supervised Learning

Computationally efficient sampling methods for sparsity promoting hierarchical Bayesian models

The G-invariant graph Laplacian

The secret of immersion: actor driven camera movement generation for auto-cinematography

Material-agnostic Shaping of Granular Materials with Optimal Transport

Transductive few-shot adapters for medical image segmentation

A Tensor-based Convolutional Neural Network for Small Dataset Classification

Reading Strategies for Graph Visualizations that Wrap Around in Torus Topology

Dependent Task Offloading in Edge Computing Using GNN and Deep Reinforcement Learning

Deep Generative Model and Its Applications in Efficient Wireless Network Management: A Tutorial and Case Study

Conservation and stability in a discontinuous Galerkin method for the vector invariant spherical shallow water equations

C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation

DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving

Convergence of the CEM-GMsFEM for compressible flow in highly heterogeneous media

Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models

High-Performance Low-Complexity Hierarchical Frequency Synchronization for Distributed Massive MIMO-OFDMA Systems

Practical self-supervised continual learning with continual fine-tuning

Simultaneous reconstruction of sound speed and nonlinearity parameter in a paraxial model of vibro-acoustography in frequency domain

Computationally efficient predictive control based on ANN state-space model

Masked Autoencoders as Image Processors

Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence

Linear Insertion Deletion Codes in the High-Noise and High-Rate Regimes

Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions

An Efficient Mobile Gateway Selection and Discovery Based-Routing Protocol in Heterogeneous LTE-VANET Networks

NN-Copula-CD: A Copula-Guided Interpretable Neural Network for Change Detection in Heterogeneous Remote Sensing Images

HMES: A Scalable Human Mobility and Epidemic Simulation System with Fast Intervention Modeling

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation

Efficient distributed representations beyond negative sampling

Teaching contact-rich tasks from visual demonstrations by constraint extraction

Edge Ranking of Graphs in Transportation Networks using a Graph Neural Network (GNN)

3D Line Mapping Revisited

Sum-of-Squares Lower Bounds for Densest $k$-Subgraph

Learning in Factored Domains with Information-Constrained Visual Representations

Hybrid Dealiasing of Complex Convolutions

Power-Optimal HARQ Protocol for Reliable Free Space Optical Communication

Nonlinear Approximation with Subsampled Rank-1 Lattices

Active User Identification in Fast Fading Massive Random Access Channels

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

DDP: Diffusion Model for Dense Visual Prediction

Using AI to Measure Parkinson's Disease Severity at Home

Human-Robot Interaction using VAHR: Virtual Assistant, Human, and Robots in the Loop

Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models

MobileInst: Video Instance Segmentation on the Mobile

Token Merging for Fast Stable Diffusion

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Keyword: faster

Urgency-aware Routing in Single Origin-destination Itineraries through Artificial Currencies

PopSparse: Accelerated block sparse matrix multiplication on IPU

Overcoming Challenges to Continuous Integration in HPC

ACM with Overlapping Partitions: Implementation and Periodicity Analysis

TreePiece: Faster Semantic Parsing via Tree Tokenization

DPP-based Client Selection for Federated Learning with Non-IID Data

Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions

Edge Ranking of Graphs in Transportation Networks using a Graph Neural Network (GNN)

Pgx: Hardware-accelerated parallel game simulation for reinforcement learning

Token Merging for Fast Stable Diffusion

Keyword: mobile

A Tensor-based Convolutional Neural Network for Small Dataset Classification

Dependent Task Offloading in Edge Computing Using GNN and Deep Reinforcement Learning

Deep Generative Model and Its Applications in Efficient Wireless Network Management: A Tutorial and Case Study

GAT-COBO: Cost-Sensitive Graph Neural Network for Telecom Fraud Detection

An Efficient Mobile Gateway Selection and Discovery Based-Routing Protocol in Heterogeneous LTE-VANET Networks

Cost Sensitive GNN-based Imbalanced Learning for Mobile Social Network Fraud Detection

MobileInst: Video Instance Segmentation on the Mobile

Keyword: pruning

Explainable Intrusion Detection Systems Using Competitive Learning Techniques

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Keyword: voxel

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

Keyword: lidar

T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals

BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation

Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving