【CS-part2】New submissions for Fri, 5 Apr 24

Keyword: webgpu

There is no result

Keyword: webgl

AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent

Authors: Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.03648
Pdf link: https://arxiv.org/pdf/2404.03648
Abstract Large language models (LLMs) have fueled many intelligent agent tasks, such as web navigation -- but most existing agents perform far from satisfying in real-world webpages due to three factors: (1) the versatility of actions on webpages, (2) HTML text exceeding model processing capacity, and (3) the complexity of decision-making due to the open-domain nature of web. In light of the challenge, we develop AutoWebGLM, a GPT-4-outperforming automated web navigation agent built upon ChatGLM3-6B. Inspired by human browsing patterns, we design an HTML simplification algorithm to represent webpages, preserving vital information succinctly. We employ a hybrid human-AI method to build web browsing data for curriculum training. Then, we bootstrap the model by reinforcement learning and rejection sampling to further facilitate webpage comprehension, browser operations, and efficient task decomposition by itself. For testing, we establish a bilingual benchmark -- AutoWebBench -- for real-world web browsing tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, revealing its improvements but also underlying challenges to tackle real environments. Related code, model, and data will be released at \url{https://github.com/THUDM/AutoWebGLM}.
Keyword: pre-rendering

There is no result

Keyword: prerendering

There is no result

Keyword: motion prediction

Towards more realistic human motion prediction with attention to motion coordination
Authors: Pengxiang Ding, Jianqin Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03584
Pdf link: https://arxiv.org/pdf/2404.03584
Abstract Joint relation modeling is a curial component in human motion prediction. Most existing methods rely on skeletal-based graphs to build the joint relations, where local interactive relations between joint pairs are well learned. However, the motion coordination, a global joint relation reflecting the simultaneous cooperation of all joints, is usually weakened because it is learned from part to whole progressively and asynchronously. Thus, the final predicted motions usually appear unrealistic. To tackle this issue, we learn a medium, called coordination attractor (CA), from the spatiotemporal features of motion to characterize the global motion features, which is subsequently used to build new relative joint relations. Through the CA, all joints are related simultaneously, and thus the motion coordination of all joints can be better learned. Based on this, we further propose a novel joint relation modeling module, Comprehensive Joint Relation Extractor (CJRE), to combine this motion coordination with the local interactions between joint pairs in a unified manner. Additionally, we also present a Multi-timescale Dynamics Extractor (MTDE) to extract enriched dynamics from the raw position information for effective prediction. Extensive experiments show that the proposed framework outperforms state-of-the-art methods in both short- and long-term predictions on H3.6M, CMU-Mocap, and 3DPW.
Keyword: incremental learning

Future-Proofing Class Incremental Learning
Authors: Quentin Jodelet, Xin Liu, Yin Jun Phua, Tsuyoshi Murata
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03200
Pdf link: https://arxiv.org/pdf/2404.03200
Abstract Exemplar-Free Class Incremental Learning is a highly challenging setting where replay memory is unavailable. Methods relying on frozen feature extractors have drawn attention recently in this setting due to their impressive performances and lower computational costs. However, those methods are highly dependent on the data used to train the feature extractor and may struggle when an insufficient amount of classes are available during the first incremental step. To overcome this limitation, we propose to use a pre-trained text-to-image diffusion model in order to generate synthetic images of future classes and use them to train the feature extractor. Experiments on the standard benchmarks CIFAR100 and ImageNet-Subset demonstrate that our proposed method can be used to improve state-of-the-art methods for exemplar-free class incremental learning, especially in the most difficult settings where the first incremental step only contains few classes. Moreover, we show that using synthetic samples of future classes achieves higher performance than using real data from different classes, paving the way for better and less costly pre-training methods for incremental learning.
Keyword: svm incremental

There is no result

Keyword: nerf

VF-NeRF: Viewshed Fields for Rigid NeRF Registration
Authors: Leo Segre, Shai Avidan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03349
Pdf link: https://arxiv.org/pdf/2404.03349
Abstract 3D scene registration is a fundamental problem in computer vision that seeks the best 6-DoF alignment between two scenes. This problem was extensively investigated in the case of point clouds and meshes, but there has been relatively limited work regarding Neural Radiance Fields (NeRF). In this paper, we consider the problem of rigid registration between two NeRFs when the position of the original cameras is not given. Our key novelty is the introduction of Viewshed Fields (VF), an implicit function that determines, for each 3D point, how likely it is to be viewed by the original cameras. We demonstrate how VF can help in the various stages of NeRF registration, with an extensive evaluation showing that VF-NeRF achieves SOTA results on various datasets with different capturing approaches such as LLFF and Objaverese.
OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views
Authors: Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03650
Pdf link: https://arxiv.org/pdf/2404.03650
Abstract Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of the visual-language models. Indeed, point cloud and 3D meshes typically have a lower resolution than images and the reconstructed 3D scene geometry might not project well to the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF, however our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
RaFE: Generative Radiance Fields Restoration
Authors: Zhongkai Wu, Ziyu Wan, Jing Zhang, Jing Liao, Dong Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03654
Pdf link: https://arxiv.org/pdf/2404.03654
Abstract NeRF (Neural Radiance Fields) has demonstrated tremendous potential in novel view synthesis and 3D reconstruction, but its performance is sensitive to input image quality, which struggles to achieve high-fidelity rendering when provided with low-quality sparse input viewpoints. Previous methods for NeRF restoration are tailored for specific degradation type, ignoring the generality of restoration. To overcome this limitation, we propose a generic radiance fields restoration pipeline, named RaFE, which applies to various types of degradations, such as low resolution, blurriness, noise, compression artifacts, or their combinations. Our approach leverages the success of off-the-shelf 2D restoration methods to recover the multi-view images individually. Instead of reconstructing a blurred NeRF by averaging inconsistencies, we introduce a novel approach using Generative Adversarial Networks (GANs) for NeRF generation to better accommodate the geometric and appearance inconsistencies present in the multi-view images. Specifically, we adopt a two-level tri-plane architecture, where the coarse level remains fixed to represent the low-quality NeRF, and a fine-level residual tri-plane to be added to the coarse level is modeled as a distribution with GAN to capture potential variations in restoration. We validate RaFE on both synthetic and real cases for various restoration tasks, demonstrating superior performance in both quantitative and qualitative evaluations, surpassing other 3D restoration methods specific to single task. Please see our project website https://zkaiwu.github.io/RaFE-Project/.
Keyword: multiorgan

There is no result

Keyword: multi-organ

There is no result

Keyword: multi organ

There is no result

Keyword: SAM

READ: Improving Relation Extraction from an ADversarial Perspective
Authors: Dawei Li, William Hogan, Jingbo Shang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.02931
Pdf link: https://arxiv.org/pdf/2404.02931
Abstract Recent works in relation extraction (RE) have achieved promising benchmark accuracy; however, our adversarial attack experiments show that these works excessively rely on entities, making their generalization capability questionable. To address this issue, we propose an adversarial training method specifically designed for RE. Our approach introduces both sequence- and token-level perturbations to the sample and uses a separate perturbation vocabulary to improve the search for entity and context perturbations. Furthermore, we introduce a probabilistic strategy for leaving clean tokens in the context during adversarial training. This strategy enables a larger attack budget for entities and coaxes the model to leverage relational patterns embedded in the context. Extensive experiments show that compared to various adversarial training methods, our method significantly improves both the accuracy and robustness of the model. Additionally, experiments on different data availability settings highlight the effectiveness of our method in low-resource scenarios. We also perform in-depth analyses of our proposed method and provide further hints. We will release our code at https://github.com/David-Li0406/READ.
Learning in Convolutional Neural Networks Accelerated by Transfer Entropy
Authors: Adrian Moldovan, Angel Caţaron, Răzvan Andonie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2404.02943
Pdf link: https://arxiv.org/pdf/2404.02943
Abstract Recently, there is a growing interest in applying Transfer Entropy (TE) in quantifying the effective connectivity between artificial neurons. In a feedforward network, the TE can be used to quantify the relationships between neuron output pairs located in different layers. Our focus is on how to include the TE in the learning mechanisms of a Convolutional Neural Network (CNN) architecture. We introduce a novel training mechanism for CNN architectures which integrates the TE feedback connections. Adding the TE feedback parameter accelerates the training process, as fewer epochs are needed. On the flip side, it adds computational overhead to each epoch. According to our experiments on CNN classifiers, to achieve a reasonable computational overhead--accuracy trade-off, it is efficient to consider only the inter-neural information transfer of a random subset of the neuron pairs from the last two fully connected layers. The TE acts as a smoothing factor, generating stability and becoming active only periodically, not after processing each input sample. Therefore, we can consider the TE is in our model a slowly changing meta-parameter.
Optimizing the Deployment of Tiny Transformers on Low-Power MCUs
Authors: Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2404.02945
Pdf link: https://arxiv.org/pdf/2404.02945
Abstract Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNN, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISA, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14ms and energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform.
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
Authors: Fanxu Meng, Zhaohui Wang, Muhan Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.02948
Pdf link: https://arxiv.org/pdf/2404.02948
Abstract As the parameters of LLMs expand, the computational cost of fine-tuning the entire model becomes prohibitive. To address this challenge, we introduce a PEFT method, Principal Singular values and Singular vectors Adaptation (PiSSA), which optimizes a significantly reduced parameter space while achieving or surpassing the performance of full-parameter fine-tuning. PiSSA is inspired by Intrinsic SAID, which suggests that pre-trained, over-parametrized models inhabit a space of low intrinsic dimension. Consequently, PiSSA represents a matrix W within the model by the product of two trainable matrices A and B, plus a residual matrix $W^{res}$ for error correction. SVD is employed to factorize W, and the principal singular values and vectors of W are utilized to initialize A and B. The residual singular values and vectors initialize the residual matrix $W^{res}$, which keeps frozen during fine-tuning. Notably, PiSSA shares the same architecture with LoRA. However, LoRA approximates Delta W through the product of two matrices, A, initialized with Gaussian noise, and B, initialized with zeros, while PiSSA initializes A and B with principal singular values and vectors of the original matrix W. PiSSA can better approximate the outcomes of full-parameter fine-tuning at the beginning by changing the essential parts while freezing the "noisy" parts. In comparison, LoRA freezes the original matrix and updates the "noise". This distinction enables PiSSA to convergence much faster than LoRA and also achieve better performance in the end. Due to the same architecture, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and compatibility with quantization. Leveraging a fast SVD method, the initialization of PiSSA takes only a few seconds, inducing negligible cost of switching LoRA to PiSSA.
Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New Connections
Authors: Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L. Caterini, Jesse C. Cresswell
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.02954
Pdf link: https://arxiv.org/pdf/2404.02954
Abstract In recent years there has been increased interest in understanding the interplay between deep generative models (DGMs) and the manifold hypothesis. Research in this area focuses on understanding the reasons why commonly-used DGMs succeed or fail at learning distributions supported on unknown low-dimensional manifolds, as well as developing new models explicitly designed to account for manifold-supported data. This manifold lens provides both clarity as to why some DGMs (e.g. diffusion models and some generative adversarial networks) empirically surpass others (e.g. likelihood-based models such as variational autoencoders, normalizing flows, or energy-based models) at sample generation, and guidance for devising more performant DGMs. We carry out the first survey of DGMs viewed through this lens, making two novel contributions along the way. First, we formally establish that numerical instability of high-dimensional likelihoods is unavoidable when modelling low-dimensional data. We then show that DGMs on learned representations of autoencoders can be interpreted as approximately minimizing Wasserstein distance: this result, which applies to latent diffusion models, helps justify their outstanding empirical results. The manifold lens provides a rich perspective from which to understand DGMs, which we aim to make more accessible and widespread.
Universal Functional Regression with Neural Operator Flows
Authors: Yaozhong Shi, Angela F. Gao, Zachary E. Ross, Kamyar Azizzadenesheli
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.02986
Pdf link: https://arxiv.org/pdf/2404.02986
Abstract Regression on function spaces is typically limited to models with Gaussian process priors. We introduce the notion of universal functional regression, in which we aim to learn a prior distribution over non-Gaussian function spaces that remains mathematically tractable for functional regression. To do this, we develop Neural Operator Flows (OpFlow), an infinite-dimensional extension of normalizing flows. OpFlow is an invertible operator that maps the (potentially unknown) data function space into a Gaussian process, allowing for exact likelihood estimation of functional point evaluations. OpFlow enables robust and accurate uncertainty quantification via drawing posterior samples of the Gaussian process and subsequently mapping them into the data function space. We empirically study the performance of OpFlow on regression and generation tasks with data generated from Gaussian processes with known posterior forms and non-Gaussian processes, as well as real-world earthquake seismograms with an unknown closed-form distribution.
Risk-averse Learning with Non-Stationary Distributions
Authors: Siyi Wang, Zifan Wang, Xinlei Yi, Michael M. Zavlanos, Karl H. Johansson, Sandra Hirche
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.02988
Pdf link: https://arxiv.org/pdf/2404.02988
Abstract Considering non-stationary environments in online optimization enables decision-maker to effectively adapt to changes and improve its performance over time. In such cases, it is favorable to adopt a strategy that minimizes the negative impact of change to avoid potentially risky situations. In this paper, we investigate risk-averse online optimization where the distribution of the random cost changes over time. We minimize risk-averse objective function using the Conditional Value at Risk (CVaR) as risk measure. Due to the difficulty in obtaining the exact CVaR gradient, we employ a zeroth-order optimization approach that queries the cost function values multiple times at each iteration and estimates the CVaR gradient using the sampled values. To facilitate the regret analysis, we use a variation metric based on Wasserstein distance to capture time-varying distributions. Given that the distribution variation is sub-linear in the total number of episodes, we show that our designed learning algorithm achieves sub-linear dynamic regret with high probability for both convex and strongly convex functions. Moreover, theoretical results suggest that increasing the number of samples leads to a reduction in the dynamic regret bounds until the sampling number reaches a specific limit. Finally, we provide numerical experiments of dynamic pricing in a parking lot to illustrate the efficacy of the designed algorithm.
An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models
Authors: Emmy Liu, Graham Neubig, Jacob Andreas
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.03028
Pdf link: https://arxiv.org/pdf/2404.03028
Abstract Modern language models (LMs) can learn to perform new tasks in different ways: in instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly with a small number of examples; in instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description before making predictions. Each of these procedures may be thought of as invoking a different form of reasoning: instruction following involves deductive reasoning, few-shot prompting involves inductive reasoning, and instruction inference involves abductive reasoning. How do these different capabilities relate? Across four LMs (from the gpt and llama families) and two learning problems (involving arithmetic functions and machine translation) we find a strong dissociation between the different types of reasoning: LMs can sometimes learn effectively from few-shot prompts even when they are unable to explain their own prediction rules; conversely, they sometimes infer useful task descriptions while completely failing to learn from human-generated descriptions of the same task. Our results highlight the non-systematic nature of reasoning even in some of today's largest LMs, and underscore the fact that very different learning mechanisms may be invoked by seemingly similar prompting procedures.
Model-based Reinforcement Learning for Parameterized Action Spaces
Authors: Renhao Zhang, Haotian Fu, Yilin Miao, George Konidaris
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.03037
Pdf link: https://arxiv.org/pdf/2404.03037
Abstract We propose a novel model-based reinforcement learning algorithm -- Dynamics Learning and predictive control with Parameterized Actions (DLPA) -- for Parameterized Action Markov Decision Processes (PAMDPs). The agent learns a parameterized-action-conditioned dynamics model and plans with a modified Model Predictive Path Integral control. We theoretically quantify the difference between the generated trajectory and the optimal trajectory during planning in terms of the value they achieved through the lens of Lipschitz Continuity. Our empirical results on several standard benchmarks show that our algorithm achieves superior sample efficiency and asymptotic performance than state-of-the-art PAMDP methods.
vPALs: Towards Verified Performance-aware Learning System For Resource Management
Authors: Guoliang He, Gingfung Yeung, Sheriffo Ceesay, Adam Barker
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2404.03079
Pdf link: https://arxiv.org/pdf/2404.03079
Abstract Accurately predicting task performance at runtime in a cluster is advantageous for a resource management system to determine whether a task should be migrated due to performance degradation caused by interference. This is beneficial for both cluster operators and service owners. However, deploying performance prediction systems with learning methods requires sophisticated safeguard mechanisms due to the inherent stochastic and black-box natures of these models, such as Deep Neural Networks (DNNs). Vanilla Neural Networks (NNs) can be vulnerable to out-of-distribution data samples that can lead to sub-optimal decisions. To take a step towards a safe learning system in performance prediction, We propose vPALs that leverage well-correlated system metrics, and verification to produce safe performance prediction at runtime, providing an extra layer of safety to integrate learning techniques to cluster resource management systems. Our experiments show that vPALs can outperform vanilla NNs across our benchmark workload.
Unsupervised, Bottom-up Category Discovery for Symbol Grounding with a Curious Robot
Authors: Catherine Henry, Casey Kennington
Subjects: Computation and Language (cs.CL); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.03092
Pdf link: https://arxiv.org/pdf/2404.03092
Abstract Towards addressing the Symbol Grounding Problem and motivated by early childhood language development, we leverage a robot which has been equipped with an approximate model of curiosity with particular focus on bottom-up building of unsupervised categories grounded in the physical world. That is, rather than starting with a top-down symbol (e.g., a word referring to an object) and providing meaning through the application of predetermined samples, the robot autonomously and gradually breaks up its exploration space into a series of increasingly specific unlabeled categories at which point an external expert may optionally provide a symbol association. We extend prior work by using a robot that can observe the visual world, introducing a higher dimensional sensory space, and using a more generalizable method of category building. Our experiments show that the robot learns categories based on actions and what it visually observes, and that those categories can be symbolically grounded into.https://info.arxiv.org/help/prep#comments
Low Frequency Sampling in Model Predictive Path Integral Control
Authors: Bogdan Vlahov, Jason Gibson, David D. Fan, Patrick Spieler, Ali-akbar Agha-mohammadi, Evangelos A. Theodorou
Subjects: Robotics (cs.RO); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2404.03094
Pdf link: https://arxiv.org/pdf/2404.03094
Abstract Sampling-based model-predictive controllers have become a powerful optimization tool for planning and control problems in various challenging environments. In this paper, we show how the default choice of uncorrelated Gaussian distributions can be improved upon with the use of a colored noise distribution. Our choice of distribution allows for the emphasis on low frequency control signals, which can result in smoother and more exploratory samples. We use this frequency-based sampling distribution with Model Predictive Path Integral (MPPI) in both hardware and simulation experiments to show better or equal performance on systems with various speeds of input response.
MARL-LNS: Cooperative Multi-agent Reinforcement Learning via Large Neighborhoods Search
Authors: Weizhe Chen, Sven Koenig, Bistra Dilkina
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03101
Pdf link: https://arxiv.org/pdf/2404.03101
Abstract Cooperative multi-agent reinforcement learning (MARL) has been an increasingly important research topic in the last half-decade because of its great potential for real-world applications. Because of the curse of dimensionality, the popular "centralized training decentralized execution" framework requires a long time in training, yet still cannot converge efficiently. In this paper, we propose a general training framework, MARL-LNS, to algorithmically address these issues by training on alternating subsets of agents using existing deep MARL algorithms as low-level trainers, while not involving any additional parameters to be trained. Based on this framework, we provide three algorithm variants based on the framework: random large neighborhood search (RLNS), batch large neighborhood search (BLNS), and adaptive large neighborhood search (ALNS), which alternate the subsets of agents differently. We test our algorithms on both the StarCraft Multi-Agent Challenge and Google Research Football, showing that our algorithms can automatically reduce at least 10% of training time while reaching the same final skill level as the original algorithm.
Many-to-many Image Generation with Auto-regressive Diffusion Models
Authors: Ying Shen, Yizhe Zhang, Shuangfei Zhai, Lifu Huang, Joshua M. Susskind, Jiatao Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03109
Pdf link: https://arxiv.org/pdf/2404.03109
Abstract Recent advancements in image generation have made significant progress, yet existing models present limitations in perceiving and generating an arbitrary number of interrelated images within a broad context. This limitation becomes increasingly critical as the demand for multi-image scenarios, such as multi-view images and visual narratives, grows with the expansion of multimedia platforms. This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images, offering a scalable solution that obviates the need for task-specific solutions across different multi-image scenarios. To facilitate this, we present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images. Utilizing Stable Diffusion with varied latent noises, our method produces a set of interconnected images from a single caption. Leveraging MIS, we learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework. Throughout training on the synthetic MIS, the model excels in capturing style and content from preceding images - synthetic or real - and generates novel images following the captured patterns. Furthermore, through task-specific fine-tuning, our model demonstrates its adaptability to various multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation.
Ego-Motion Aware Target Prediction Module for Robust Multi-Object Tracking
Authors: Navid Mahdian, Mohammad Jani, Amir M. Soufi Enayati, Homayoun Najjaran
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03110
Pdf link: https://arxiv.org/pdf/2404.03110
Abstract Multi-object tracking (MOT) is a prominent task in computer vision with application in autonomous driving, responsible for the simultaneous tracking of multiple object trajectories. Detection-based multi-object tracking (DBT) algorithms detect objects using an independent object detector and predict the imminent location of each target. Conventional prediction methods in DBT utilize Kalman Filter(KF) to extrapolate the target location in the upcoming frames by supposing a constant velocity motion model. These methods are especially hindered in autonomous driving applications due to dramatic camera motion or unavailable detections. Such limitations lead to tracking failures manifested by numerous identity switches and disrupted trajectories. In this paper, we introduce a novel KF-based prediction module called the Ego-motion Aware Target Prediction (EMAP) module by focusing on the integration of camera motion and depth information with object motion models. Our proposed method decouples the impact of camera rotational and translational velocity from the object trajectories by reformulating the Kalman Filter. This reformulation enables us to reject the disturbances caused by camera motion and maximizes the reliability of the object motion model. We integrate our module with four state-of-the-art base MOT algorithms, namely OC-SORT, Deep OC-SORT, ByteTrack, and BoT-SORT. In particular, our evaluation on the KITTI MOT dataset demonstrates that EMAP remarkably drops the number of identity switches (IDSW) of OC-SORT and Deep OC-SORT by 73% and 21%, respectively. At the same time, it elevates other performance metrics such as HOTA by more than 5%. Our source code is available at https://github.com/noyzzz/EMAP.
Biodegradable Interactive Materials
Authors: Zhihan Zhang, Mallory Parker, Kuotian Liao, Jerry Cao, Anandghan Waghmare, Joseph Breda, Chris Matsumura, Serena Eley, Eleftheria Roumeli, Shwetak Patel, Vikram Iyer
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.03130
Pdf link: https://arxiv.org/pdf/2404.03130
Abstract The sense of touch is fundamental to how we interact with the physical and digital world. Conventional interactive surfaces and tactile interfaces use electronic sensors embedded into objects, however this approach poses serious challenges both for environmental sustainability and a future of truly ubiquitous interaction systems where information is encoded into everyday objects. In this work, we present Biodegradable Interactive Materials: backyard-compostable interactive interfaces that leverage information encoded in material properties. Inspired by natural systems, we propose an architecture that programmatically encodes multidimensional information into materials themselves and combines them with wearable devices that extend human senses to perceive the embedded data. We combine unrefined biological matter from plants and algae like chlorella with natural minerals like graphite and magnetite to produce materials with varying electrical, magnetic, and surface properties. We perform in-depth analysis using physics models, computational simulations, and real-world experiments to characterize their information density and develop decoding methods. Our passive, chip-less materials can robustly encode 12 bits of information, equivalent to 4096 unique classes. We further develop wearable device prototypes that can decode this information during touch interactions using off-the-shelf sensors. We demonstrate sample applications such as customized buttons, tactile maps, and interactive surfaces. We further demonstrate the natural degradation of these interactive materials in degrade outdoors within 21 days and perform a comparative environmental analysis of the benefits of this approach.
A Framework for Guided Motion Planning
Authors: Amnon Attali, Stav Ashur, Isaac Burton Love, Courtney McBeth, James Motes, Marco Morales, Nancy M. Amato
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.03133
Pdf link: https://arxiv.org/pdf/2404.03133
Abstract Randomized sampling based algorithms are widely used in robot motion planning due to the problem's intractability, and are experimentally effective on a wide range of problem instances. Most variants bias their sampling using various heuristics related to the known underlying structure of the search space. In this work, we formalize the intuitive notion of guided search by defining the concept of a guiding space. This new language encapsulates many seemingly distinct prior methods under the same framework, and allows us to reason about guidance, a previously obscured core contribution of different algorithms. We suggest an information theoretic method to evaluate guidance, which experimentally matches intuition when tested on known algorithms in a variety of environments. The language and evaluation of guidance suggests improvements to existing methods, and allows for simple hybrid algorithms that combine guidance from multiple sources.
Multi-modal Learning for WebAssembly Reverse Engineering
Authors: Hanxian Huang, Jishen Zhao
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2404.03171
Pdf link: https://arxiv.org/pdf/2404.03171
Abstract The increasing adoption of WebAssembly (Wasm) for performance-critical and security-sensitive tasks drives the demand for WebAssembly program comprehension and reverse engineering. Recent studies have introduced machine learning (ML)-based WebAssembly reverse engineering tools. Yet, the generalization of task-specific ML solutions remains challenging, because their effectiveness hinges on the availability of an ample supply of high-quality task-specific labeled data. Moreover, previous works overlook the high-level semantics present in source code and its documentation. Acknowledging the abundance of available source code with documentation, which can be compiled into WebAssembly, we propose to learn representations of them concurrently and harness their mutual relationships for effective WebAssembly reverse engineering. In this paper, we present WasmRev, the first multi-modal pre-trained language model for WebAssembly reverse engineering. WasmRev is pre-trained using self-supervised learning on a large-scale multi-modal corpus encompassing source code, code documentation and the compiled WebAssembly, without requiring labeled data. WasmRev incorporates three tailored multi-modal pre-training tasks to capture various characteristics of WebAssembly and cross-modal relationships. WasmRev is only trained once to produce general-purpose representations that can broadly support WebAssembly reverse engineering tasks through few-shot fine-tuning with much less labeled data, improving data efficiency. We fine-tune WasmRev onto three important reverse engineering tasks: type recovery, function purpose identification and WebAssembly summarization. Our results show that WasmRev pre-trained on the corpus of multi-modal samples establishes a robust foundation for these tasks, achieving high task accuracy and outperforming the state-of-the-art ML methods for WebAssembly reverse engineering.
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Authors: Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2404.03179
Pdf link: https://arxiv.org/pdf/2404.03179
Abstract Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.
MonoCD: Monocular 3D Object Detection with Complementary Depths
Authors: Longfei Yan, Pei Yan, Shengzhou Xiong, Xuanyu Xiang, Yihua Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03181
Pdf link: https://arxiv.org/pdf/2404.03181
Abstract Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the correlation of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3d object detectors. Code is available at https://github.com/elvintanhust/MonoCD.
Reservoir Sampling over Joins
Authors: Binyang Dai, Xiao Hu, Ke Yi
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2404.03194
Pdf link: https://arxiv.org/pdf/2404.03194
Abstract Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical queries or training machine learning models. In this paper, we study the problem of how to maintain a random sample over joins while the tuples are streaming in. Without the join, this problem can be solved by some simple and classical reservoir sampling algorithms. However, the join operator makes the problem significantly harder, as the join size can be polynomially larger than the input. We present a new algorithm for this problem that achieves a near-linear complexity. The key technical components are a generalized reservoir sampling algorithm that supports a predicate, and a dynamic index for sampling over joins. We also conduct extensive experiments on both graph and relational data over various join queries, and the experimental results demonstrate significant performance improvement over the state of the art.
Okay, Let's Do This! Modeling Event Coreference with Generated Rationales and Knowledge Distillation
Authors: Abhijnan Nath, Shadi Manafi, Avyakta Chelle, Nikhil Krishnaswamy
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.03196
Pdf link: https://arxiv.org/pdf/2404.03196
Abstract In NLP, Event Coreference Resolution (ECR) is the task of connecting event clusters that refer to the same underlying real-life event, usually via neural systems. In this work, we investigate using abductive free-text rationales (FTRs) generated by modern autoregressive LLMs as distant supervision of smaller student models for cross-document coreference (CDCR) of events. We implement novel rationale-oriented event clustering and knowledge distillation methods for event coreference scoring that leverage enriched information from the FTRs for improved CDCR without additional annotation or expensive document clustering. Our model using coreference specific knowledge distillation achieves SOTA B3 F1 on the ECB+ and GVC corpora and we establish a new baseline on the AIDA Phase 1 corpus. Our code can be found at https://github.com/csu-signal/llama_cdcr
Future-Proofing Class Incremental Learning
Authors: Quentin Jodelet, Xin Liu, Yin Jun Phua, Tsuyoshi Murata
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03200
Pdf link: https://arxiv.org/pdf/2404.03200
Abstract Exemplar-Free Class Incremental Learning is a highly challenging setting where replay memory is unavailable. Methods relying on frozen feature extractors have drawn attention recently in this setting due to their impressive performances and lower computational costs. However, those methods are highly dependent on the data used to train the feature extractor and may struggle when an insufficient amount of classes are available during the first incremental step. To overcome this limitation, we propose to use a pre-trained text-to-image diffusion model in order to generate synthetic images of future classes and use them to train the feature extractor. Experiments on the standard benchmarks CIFAR100 and ImageNet-Subset demonstrate that our proposed method can be used to improve state-of-the-art methods for exemplar-free class incremental learning, especially in the most difficult settings where the first incremental step only contains few classes. Moreover, we show that using synthetic samples of future classes achieves higher performance than using real data from different classes, paving the way for better and less costly pre-training methods for incremental learning.
Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption
Authors: Jianming Tong, Jingtian Dang, Anupam Golder, Callie Hao, Arijit Raychowdhury, Tushar Krishna
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.03216
Pdf link: https://arxiv.org/pdf/2404.03216
Abstract As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both data and the ML model. However, it slows down non-secure inference by up to five magnitudes, with a root cause of replacing non-polynomial operators (ReLU and MaxPooling) with high-degree Polynomial Approximated Function (PAF). We propose SmartPAF, a framework to replace non-polynomial operators with low-degree PAF and then recover the accuracy of PAF-approximated model through four techniques: (1) Coefficient Tuning (CT) -- adjust PAF coefficients based on the input distributions before training, (2) Progressive Approximation (PA) -- progressively replace one non-polynomial operator at a time followed by a fine-tuning, (3) Alternate Training (AT) -- alternate the training between PAFs and other linear operators in the decoupled manner, and (4) Dynamic Scale (DS) / Static Scale (SS) -- dynamically scale PAF input value within (-1, 1) in training, and fix the scale as the running max value in FHE deployment. The synergistic effect of CT, PA, AT, and DS/SS enables SmartPAF to enhance the accuracy of the various models approximated by PAFs with various low degrees under multiple datasets. For ResNet-18 under ImageNet-1k, the Pareto-frontier spotted by SmartPAF in latency-accuracy tradeoff space achieves 1.42x ~ 13.64x accuracy improvement and 6.79x ~ 14.9x speedup than prior works. Further, SmartPAF enables a 14-degree PAF (f1^2 g_1^2) to achieve 7.81x speedup compared to the 27-degree PAF obtained by minimax approximation with the same 69.4% post-replacement accuracy. Our code is available at https://github.com/TorchFHE/SmartPAF.
iSeg: Interactive 3D Segmentation via Interactive Attention
Authors: Itai Lang, Fei Xu, Dale Decatur, Sudarshan Babu, Rana Hanocka
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2404.03219
Pdf link: https://arxiv.org/pdf/2404.03219
Abstract We present iSeg, a new interactive technique for segmenting 3D shapes. Previous works have focused mainly on leveraging pre-trained 2D foundation models for 3D segmentation based on text. However, text may be insufficient for accurately describing fine-grained spatial segmentations. Moreover, achieving a consistent 3D segmentation using a 2D model is challenging since occluded areas of the same semantic region may not be visible together from any 2D view. Thus, we design a segmentation method conditioned on fine user clicks, which operates entirely in 3D. Our system accepts user clicks directly on the shape's surface, indicating the inclusion or exclusion of regions from the desired shape partition. To accommodate various click settings, we propose a novel interactive attention module capable of processing different numbers and types of clicks, enabling the training of a single unified interactive segmentation model. We apply iSeg to a myriad of shapes from different domains, demonstrating its versatility and faithfulness to the user's specifications. Our project page is at https://threedle.github.io/iSeg/.
FACTUAL: A Novel Framework for Contrastive Learning Based Robust SAR Image Classification
Authors: Xu Wang, Tian Ye, Rajgopal Kannan, Viktor Prasanna
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03225
Pdf link: https://arxiv.org/pdf/2404.03225
Abstract Deep Learning (DL) Models for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR), while delivering improved performance, have been shown to be quite vulnerable to adversarial attacks. Existing works improve robustness by training models on adversarial samples. However, by focusing mostly on attacks that manipulate images randomly, they neglect the real-world feasibility of such attacks. In this paper, we propose FACTUAL, a novel Contrastive Learning framework for Adversarial Training and robust SAR classification. FACTUAL consists of two components: (1) Differing from existing works, a novel perturbation scheme that incorporates realistic physical adversarial attacks (such as OTSA) to build a supervised adversarial pre-training network. This network utilizes class labels for clustering clean and perturbed images together into a more informative feature space. (2) A linear classifier cascaded after the encoder to use the computed representations to predict the target labels. By pre-training and fine-tuning our model on both clean and adversarial samples, we show that our model achieves high prediction accuracy on both cases. Our model achieves 99.7% accuracy on clean samples, and 89.6% on perturbed samples, both outperforming previous state-of-the-art methods.
Learn What You Want to Unlearn: Unlearning Inversion Attacks against Machine Unlearning
Authors: Hongsheng Hu, Shuo Wang, Tian Dong, Minhui Xue
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.03233
Pdf link: https://arxiv.org/pdf/2404.03233
Abstract Machine unlearning has become a promising solution for fulfilling the "right to be forgotten", under which individuals can request the deletion of their data from machine learning models. However, existing studies of machine unlearning mainly focus on the efficacy and efficiency of unlearning methods, while neglecting the investigation of the privacy vulnerability during the unlearning process. With two versions of a model available to an adversary, that is, the original model and the unlearned model, machine unlearning opens up a new attack surface. In this paper, we conduct the first investigation to understand the extent to which machine unlearning can leak the confidential content of the unlearned data. Specifically, under the Machine Learning as a Service setting, we propose unlearning inversion attacks that can reveal the feature and label information of an unlearned sample by only accessing the original and unlearned model. The effectiveness of the proposed unlearning inversion attacks is evaluated through extensive experiments on benchmark datasets across various model architectures and on both exact and approximate representative unlearning approaches. The experimental results indicate that the proposed attack can reveal the sensitive information of the unlearned data. As such, we identify three possible defenses that help to mitigate the proposed attacks, while at the cost of reducing the utility of the unlearned model. The study in this paper uncovers an underexplored gap between machine unlearning and the privacy of unlearned data, highlighting the need for the careful design of mechanisms for implementing unlearning without leaking the information of the unlearned data.
Learning Transferable Negative Prompts for Out-of-Distribution Detection
Authors: Tianqi Li, Guansong Pang, Xiao Bai, Wenjun Miao, Jin Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03248
Pdf link: https://arxiv.org/pdf/2404.03248
Abstract Existing prompt learning methods have shown certain capabilities in Out-of-Distribution (OOD) detection, but the lack of OOD images in the target dataset in their training can lead to mismatches between OOD images and In-Distribution (ID) categories, resulting in a high false positive rate. To address this issue, we introduce a novel OOD detection method, named 'NegPrompt', to learn a set of negative prompts, each representing a negative connotation of a given class label, for delineating the boundaries between ID and OOD images. It learns such negative prompts with ID data only, without any reliance on external outlier data. Further, current methods assume the availability of samples of all ID classes, rendering them ineffective in open-vocabulary learning scenarios where the inference stage can contain novel ID classes not present during training. In contrast, our learned negative prompts are transferable to novel class labels. Experiments on various ImageNet benchmarks show that NegPrompt surpasses state-of-the-art prompt-learning-based OOD detection methods and maintains a consistent lead in hard OOD detection in closed- and open-vocabulary classification scenarios. Code is available at https://github.com/mala-lab/negprompt.
Cryptographic Hardness of Score Estimation
Authors: Min Jae Song
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.03272
Pdf link: https://arxiv.org/pdf/2404.03272
Abstract We show that $L^2$-accurate score estimation, in the absence of strong assumptions on the data distribution, is computationally hard even when sample complexity is polynomial in the relevant problem parameters. Our reduction builds on the result of Chen et al. (ICLR 2023), who showed that the problem of generating samples from an unknown data distribution reduces to $L^2$-accurate score estimation. Our hard-to-estimate distributions are the "Gaussian pancakes" distributions, originally due to Diakonikolas et al. (FOCS 2017), which have been shown to be computationally indistinguishable from the standard Gaussian under widely believed hardness assumptions from lattice-based cryptography (Bruna et al., STOC 2021; Gupte et al., FOCS 2022).
Gaussian-Smoothed Sliced Probability Divergences
Authors: Mokhtar Z. Alaya (LMAC), Alain Rakotomamonjy (LITIS), Maxime Berar (LITIS), Gilles Gasso (LITIS)
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.03273
Pdf link: https://arxiv.org/pdf/2404.03273
Abstract Gaussian smoothed sliced Wasserstein distance has been recently introduced for comparing probability distributions, while preserving privacy on the data. It has been shown that it provides performances similar to its non-smoothed (non-private) counterpart. However, the computationaland statistical properties of such a metric have not yet been well-established. This work investigates the theoretical properties of this distance as well as those of generalized versions denoted as Gaussian-smoothed sliced divergences. We first show that smoothing and slicing preserve the metric property and the weak topology. To study the sample complexity of such divergences, we then introduce $\hat{\hat\mu}{n}$ the double empirical distribution for the smoothed-projected $\mu$. The distribution $\hat{\hat\mu}{n}$ is a result of a double sampling process: one from sampling according to the origin distribution $\mu$ and the second according to the convolution of the projection of $\mu$ on the unit sphere and the Gaussian smoothing. We particularly focus on the Gaussian smoothed sliced Wasserstein distance and prove that it converges with a rate $O(n^{-1/2})$. We also derive other properties, including continuity, of different divergences with respect to the smoothing parameter. We support our theoretical findings with empirical studies in the context of privacy-preserving domain adaptation.
Traversability-aware Adaptive Optimization for Path Planning and Control in Mountainous Terrain
Authors: Se-Wook Yoo, E In Son, Seung-Woo Seo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.03274
Pdf link: https://arxiv.org/pdf/2404.03274
Abstract Autonomous navigation in extreme mountainous terrains poses challenges due to the presence of mobility-stressing elements and undulating surfaces, making it particularly difficult compared to conventional off-road driving scenarios. In such environments, estimating traversability solely based on exteroceptive sensors often leads to the inability to reach the goal due to a high prevalence of non-traversable areas. In this paper, we consider traversability as a relative value that integrates the robot's internal state, such as speed and torque to exhibit resilient behavior to reach its goal successfully. We separate traversability into apparent traversability and relative traversability, then incorporate these distinctions in the optimization process of sampling-based planning and motion predictive control. Our method enables the robots to execute the desired behaviors more accurately while avoiding hazardous regions and getting stuck. Experiments conducted on simulation with 27 diverse types of mountainous terrain and real-world demonstrate the robustness of the proposed framework, with increasingly better performance observed in more complex environments.
MPOFI: Multichannel Partially Observed Functional Modeling for Defect Classification with Imbalanced Dataset via Deep Metric Learning
Authors: Yukun Xie, Juan Du, Chen Zhang
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.03329
Pdf link: https://arxiv.org/pdf/2404.03329
Abstract In modern manufacturing, most of the product lines are conforming. Few products are nonconforming but with different defect types. The identification of defect types can help further root cause diagnosis of production lines. With the sensing development, continuous signals of process variables can be collected in high resolution, which can be regarded as multichannel functional data. They have abundant information to characterize the process and help identify the defect types. Motivated by a real example from the pipe tightening process, we target at detect classification when each sample is a multichannel functional data. However, the available samples for each defect type are limited and imbalanced. Moreover, the functions are partially observed since the pre-tightening process before the pipe tightening process is unobserved. To classify the defect samples based on imbalanced, multichannel, and partially observed functional data is very important but challenging. Thus, we propose an innovative framework known as "Multichannel Partially Observed Functional Modeling for Defect Classification with an Imbalanced Dataset" (MPOFI). The framework leverages the power of deep metric learning in conjunction with a neural network specially crafted for processing functional data. This paper introduces a neural network explicitly tailored for handling multichannel and partially observed functional data, complemented by developing a corresponding loss function for training on imbalanced datasets. The results from a real-world case study demonstrate the superior accuracy of our framework when compared to existing benchmarks.
Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks
Authors: Lei Zhang, Yuhang Zhou, Yi Yang, Xinbo Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03340
Pdf link: https://arxiv.org/pdf/2404.03340
Abstract Despite providing high-performance solutions for computer vision tasks, the deep neural network (DNN) model has been proved to be extremely vulnerable to adversarial attacks. Current defense mainly focuses on the known attacks, but the adversarial robustness to the unknown attacks is seriously overlooked. Besides, commonly used adaptive learning and fine-tuning technique is unsuitable for adversarial defense since it is essentially a zero-shot problem when deployed. Thus, to tackle this challenge, we propose an attack-agnostic defense method named Meta Invariance Defense (MID). Specifically, various combinations of adversarial attacks are randomly sampled from a manually constructed Attacker Pool to constitute different defense tasks against unknown attacks, in which a student encoder is supervised by multi-consistency distillation to learn the attack-invariant features via a meta principle. The proposed MID has two merits: 1) Full distillation from pixel-, feature- and prediction-level between benign and adversarial samples facilitates the discovery of attack-invariance. 2) The model simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration. Theoretical and empirical studies on numerous benchmarks such as ImageNet verify the generalizable robustness and superiority of MID under various attacks.
Graph Neural Networks for Electric and Hydraulic Data Fusion to Enhance Short-term Forecasting of Pumped-storage Hydroelectricity
Authors: Raffael Theiler, Olga Fink
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2404.03368
Pdf link: https://arxiv.org/pdf/2404.03368
Abstract Pumped-storage hydropower plants (PSH) actively participate in grid power-frequency control and therefore often operate under dynamic conditions, which results in rapidly varying system states. Predicting these dynamically changing states is essential for comprehending the underlying sensor and machine conditions. This understanding aids in detecting anomalies and faults, ensuring the reliable operation of the connected power grid, and in identifying faulty and miscalibrated sensors. PSH are complex, highly interconnected systems encompassing electrical and hydraulic subsystems, each characterized by their respective underlying networks that can individually be represented as graphs. To take advantage of this relational inductive bias, graph neural networks (GNNs) have been separately applied to state forecasting tasks in the individual subsystems, but without considering their interdependencies. In PSH, however, these subsystems depend on the same control input, making their operations highly interdependent and interconnected. Consequently, hydraulic and electrical sensor data should be fused across PSH subsystems to improve state forecasting accuracy. This approach has not been explored in GNN literature yet because many available PSH graphs are limited to their respective subsystem boundaries, which makes the method unsuitable to be applied directly. In this work, we introduce the application of spectral-temporal graph neural networks, which leverage self-attention mechanisms to concurrently capture and learn meaningful subsystem interdependencies and the dynamic patterns observed in electric and hydraulic sensors. Our method effectively fuses data from the PSH's subsystems by operating on a unified, system-wide graph, learned directly from the data, This approach leads to demonstrably improved state forecasting performance and enhanced generalizability.
A superconvergence result in the RBF-FD method
Authors: Andrej Kolar-Požun, Mitja Jančič, Gregor Kosec
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2404.03393
Pdf link: https://arxiv.org/pdf/2404.03393
Abstract Radial Basis Function-generated Finite Differences (RBF-FD) is a meshless method that can be used to numerically solve partial differential equations. The solution procedure consists of two steps. First, the differential operator is discretised on given scattered nodes and afterwards, a global sparse matrix is assembled and inverted to obtain an approximate solution. Focusing on Polyharmonic Splines as our Radial Basis Functions (RBFs) of choice, appropriately augmented with monomials, it is well known that the truncation error of the differential operator approximation is determined by the degree of monomial augmentation. Naively, one might think that the solution error will have the same order of convergence. We present a superconvergence result that shows otherwise - for some augmentation degrees, order of convergence is higher than expected.
Accurate estimation of feature importance faithfulness for tree models
Authors: Mateusz Gajewski, Adam Karczmarz, Mateusz Rapicki, Piotr Sankowski
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03426
Pdf link: https://arxiv.org/pdf/2404.03426
Abstract In this paper, we consider a perturbation-based metric of predictive faithfulness of feature rankings (or attributions) that we call PGI squared. When applied to decision tree-based regression models, the metric can be computed accurately and efficiently for arbitrary independent feature perturbation distributions. In particular, the computation does not involve Monte Carlo sampling that has been typically used for computing similar metrics and which is inherently prone to inaccuracies. Moreover, we propose a method of ranking features by their importance for the tree model's predictions based on PGI squared. Our experiments indicate that in some respects, the method may identify the globally important features better than the state-of-the-art SHAP explainer
Design and Optimization of Cooperative Sensing With Limited Backhaul Capacity
Authors: Wenrui Li, Min Li, An Liu, Tony Xiao Han
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.03440
Pdf link: https://arxiv.org/pdf/2404.03440
Abstract This paper introduces a cooperative sensing framework designed for integrated sensing and communication cellular networks. The framework comprises one base station (BS) functioning as the sensing transmitter, while several nearby BSs act as sensing receivers. The primary objective is to facilitate cooperative target localization by enabling each receiver to share specific information with a fusion center (FC) over a limited capacity backhaul link. To achieve this goal, we propose an advanced cooperative sensing design that enhances the communication process between the receivers and the FC. Each receiver independently estimates the time delay and the reflecting coefficient associated with the reflected path from the target. Subsequently, each receiver transmits the estimated values and the received signal samples centered around the estimated time delay to the FC. To efficiently quantize the signal samples, a Karhunen-Lo`eve Transform coding scheme is employed. Furthermore, an optimization problem is formulated to allocate backhaul resources for quantizing different samples, improving target localization. Numerical results validate the effectiveness of our proposed advanced design and demonstrate its superiority over a baseline design, where only the locally estimated values are transmitted from each receiver to the FC.
SP$^2$OT: Semantic-Regularized Progressive Partial Optimal Transport for Imbalanced Clustering
Authors: Chuyu Zhang, Hui Ren, Xuming He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03446
Pdf link: https://arxiv.org/pdf/2404.03446
Abstract Deep clustering, which learns representation and semantic clustering without labels information, poses a great challenge for deep learning-based approaches. Despite significant progress in recent years, most existing methods focus on uniformly distributed datasets, significantly limiting the practical applicability of their methods. In this paper, we propose a more practical problem setting named deep imbalanced clustering, where the underlying classes exhibit an imbalance distribution. To address this challenge, we introduce a novel optimal transport-based pseudo-label learning framework. Our framework formulates pseudo-label generation as a Semantic-regularized Progressive Partial Optimal Transport (SP$^2$OT) problem, which progressively transports each sample to imbalanced clusters under several prior distribution and semantic relation constraints, thus generating high-quality and imbalance-aware pseudo-labels. To solve SP$^2$OT, we develop a Majorization-Minimization-based optimization algorithm. To be more precise, we employ the strategy of majorization to reformulate the SP$^2$OT problem into a Progressive Partial Optimal Transport problem, which can be transformed into an unbalanced optimal transport problem with augmented constraints and can be solved efficiently by a fast matrix scaling algorithm. Experiments on various datasets, including a human-curated long-tailed CIFAR100, challenging ImageNet-R, and large-scale subsets of fine-grained iNaturalist2018 datasets, demonstrate the superiority of our method.
Computational analysis of a contraction rheometer for the grade-two fluid model
Authors: Sara Pollock, L. Ridgway Scott
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2404.03450
Pdf link: https://arxiv.org/pdf/2404.03450
Abstract We explore the possibility of simulating the grade-two fluid model in a geometry related to a contraction rheometer, and we provide details on several key aspects of the computation. We show how the results can be used to determine the viscosity $\nu$ from experimental data. We also explore the identifiability of the grade-two parameters $\alpha_1$ and $\alpha_2$ from experimental data. In particular, as the flow rate varies, force data appears to be nearly the same for certain distinct pairs of values $\alpha_1$ and $\alpha_2$; however we determine a regime for $\alpha_1$ and $\alpha_2$ for which the parameters may be identifiable with a contraction rheometer.
Generalization Bounds for Message Passing Networks on Mixture of Graphons
Authors: Sohir Maskey, Gitta Kutyniok, Ron Levie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03473
Pdf link: https://arxiv.org/pdf/2404.03473
Abstract We study the generalization capabilities of Message Passing Neural Networks (MPNNs), a prevalent class of Graph Neural Networks (GNN). We derive generalization bounds specifically for MPNNs with normalized sum aggregation and mean aggregation. Our analysis is based on a data generation model incorporating a finite set of template graphons. Each graph within this framework is generated by sampling from one of the graphons with a certain degree of perturbation. In particular, we extend previous MPNN generalization results to a more realistic setting, which includes the following modifications: 1) we analyze simple random graphs with Bernoulli-distributed edges instead of weighted graphs; 2) we sample both graphs and graph signals from perturbed graphons instead of clean graphons; and 3) we analyze sparse graphs instead of dense graphs. In this more realistic and challenging scenario, we provide a generalization bound that decreases as the average number of nodes in the graphs increases. Our results imply that MPNNs with higher complexity than the size of the training set can still generalize effectively, as long as the graphs are sufficiently large.
Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images
Authors: Rita Pucci, Vincent J. Kalkman, Dan Stowell
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.03474
Pdf link: https://arxiv.org/pdf/2404.03474
Abstract With fine-grained classification, we identify unique characteristics to distinguish among classes of the same super-class. We are focusing on species recognition in Insecta, as they are critical for biodiversity monitoring and at the base of many ecosystems. With citizen science campaigns, billions of images are collected in the wild. Once these are labelled, experts can use them to create distribution maps. However, the labelling process is time-consuming, which is where computer vision comes in. The field of computer vision offers a wide range of algorithms, each with its strengths and weaknesses; how do we identify the algorithm that is in line with our application? To answer this question, we provide a full and detailed evaluation of nine algorithms among deep convolutional networks (CNN), vision transformers (ViT), and locality-based vision transformers (LBVT) on 4 different aspects: classification performance, embedding quality, computational cost, and gradient activity. We offer insights that we haven't yet had in this domain proving to which extent these algorithms solve the fine-grained tasks in Insecta. We found that the ViT performs the best on inference speed and computational cost while the LBVT outperforms the others on performance and embedding quality; the CNN provide a trade-off among the metrics.
Learn When (not) to Trust Language Models: A Privacy-Centric Adaptive Model-Aware Approach
Authors: Chengkai Huang, Rui Wang, Kaige Xie, Tong Yu, Lina Yao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.03514
Pdf link: https://arxiv.org/pdf/2404.03514
Abstract Retrieval-augmented large language models (LLMs) have been remarkably competent in various NLP tasks. Despite their great success, the knowledge provided by the retrieval process is not always useful for improving the model prediction, since in some samples LLMs may already be quite knowledgeable and thus be able to answer the question correctly without retrieval. Aiming to save the cost of retrieval, previous work has proposed to determine when to do/skip the retrieval in a data-aware manner by analyzing the LLMs' pretraining data. However, these data-aware methods pose privacy risks and memory limitations, especially when requiring access to sensitive or extensive pretraining data. Moreover, these methods offer limited adaptability under fine-tuning or continual learning settings. We hypothesize that token embeddings are able to capture the model's intrinsic knowledge, which offers a safer and more straightforward way to judge the need for retrieval without the privacy risks associated with accessing pre-training data. Moreover, it alleviates the need to retain all the data utilized during model pre-training, necessitating only the upkeep of the token embeddings. Extensive experiments and in-depth analyses demonstrate the superiority of our model-aware approach.
If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces
Authors: Andrea Atzori, Fadi Boutros, Naser Damer, Gianni Fenu, Mirko Marras
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03537
Pdf link: https://arxiv.org/pdf/2404.03537
Abstract Recent advances in deep face recognition have spurred a growing demand for large, diverse, and manually annotated face datasets. Acquiring authentic, high-quality data for face recognition has proven to be a challenge, primarily due to privacy concerns. Large face datasets are primarily sourced from web-based images, lacking explicit user consent. In this paper, we examine whether and how synthetic face data can be used to train effective face recognition models with reduced reliance on authentic images, thereby mitigating data collection concerns. First, we explored the performance gap among recent state-of-the-art face recognition models, trained with synthetic data only and authentic (scarce) data only. Then, we deepened our analysis by training a state-of-the-art backbone with various combinations of synthetic and authentic data, gaining insights into optimizing the limited use of the latter for verification accuracy. Finally, we assessed the effectiveness of data augmentation approaches on synthetic and authentic data, with the same goal in mind. Our results highlighted the effectiveness of FR trained on combined datasets, particularly when combined with appropriate augmentation techniques.
No Panacea in Planning: Algorithm Selection for Suboptimal Multi-Agent Path Finding
Authors: Weizhe Chen, Zhihan Wang, Jiaoyang Li, Sven Koenig, Bistra Dilkina
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2404.03554
Pdf link: https://arxiv.org/pdf/2404.03554
Abstract Since more and more algorithms are proposed for multi-agent path finding (MAPF) and each of them has its strengths, choosing the correct one for a specific scenario that fulfills some specified requirements is an important task. Previous research in algorithm selection for MAPF built a standard workflow and showed that machine learning can help. In this paper, we study general solvers for MAPF, which further include suboptimal algorithms. We propose different groups of optimization objectives and learning tasks to handle the new tradeoff between runtime and solution quality. We conduct extensive experiments to show that the same loss can not be used for different groups of optimization objectives, and that standard computer vision models are no worse than customized architecture. We also provide insightful discussions on how feature-sensitive pre-processing is needed for learning for MAPF, and how different learning metrics are correlated to different learning tasks.
Factored Task and Motion Planning with Combined Optimization, Sampling and Learning
Authors: Joaquim Ortiz-Haro
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.03567
Pdf link: https://arxiv.org/pdf/2404.03567
Abstract In this thesis, we aim to improve the performance of TAMP algorithms from three complementary perspectives. First, we investigate the integration of discrete task planning with continuous trajectory optimization. Our main contribution is a conflict-based solver that automatically discovers why a task plan might fail when considering the constraints of the physical world. This information is then fed back into the task planner, resulting in an efficient, bidirectional, and intuitive interface between task and motion, capable of solving TAMP problems with multiple objects, robots, and tight physical constraints. In the second part, we first illustrate that, given the wide range of tasks and environments within TAMP, neither sampling nor optimization is superior in all settings. To combine the strengths of both approaches, we have designed meta-solvers for TAMP, adaptive solvers that automatically select which algorithms and computations to use and how to best decompose each problem to find a solution faster. In the third part, we combine deep learning architectures with model-based reasoning to accelerate computations within our TAMP solver. Specifically, we target infeasibility detection and nonlinear optimization, focusing on generalization, accuracy, compute time, and data efficiency. At the core of our contributions is a refined, factored representation of the trajectory optimization problems inside TAMP. This structure not only facilitates more efficient planning, encoding of geometric infeasibility, and meta-reasoning but also provides better generalization in neural architectures.
DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
Authors: Haoran Li, Haolin Shi, Wenli Zhang, Wenjun Wu, Yong Liao, Lin Wang, Lik-hang Lee, Pengyuan Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03575
Pdf link: https://arxiv.org/pdf/2404.03575
Abstract Text-to-3D scene generation holds immense potential for the gaming, film, and architecture sectors. Despite significant progress, existing methods struggle with maintaining high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene generation framework, to tackle the aforementioned three challenges mainly via two strategies. First, DreamScene employs Formation Pattern Sampling (FPS), a multi-timestep sampling strategy guided by the formation patterns of 3D objects, to form fast, semantically rich, and high-quality representations. FPS uses 3D Gaussian filtering for optimization stability, and leverages reconstruction techniques to generate plausible textures. Second, DreamScene employs a progressive three-stage camera sampling strategy, specifically designed for both indoor and outdoor settings, to effectively ensure object-environment integration and scene-wide 3D consistency. Last, DreamScene enhances scene editing flexibility by integrating objects and environments, enabling targeted adjustments. Extensive experiments validate DreamScene's superiority over current state-of-the-art techniques, heralding its wide-ranging potential for diverse applications. Code and demos will be released at https://dreamscene-project.github.io .
Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm
Authors: Miao Lu, Han Zhong, Tong Zhang, Jose Blanchet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.03578
Pdf link: https://arxiv.org/pdf/2404.03578
Abstract The sim-to-real gap, which represents the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to addressing this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that achieves good performance under the worst-case scenario among all environments within a pre-specified uncertainty set centered around the training environment. Unlike previous work, which relies on a generative model or a pre-collected offline dataset enjoying good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error. In this robust RL paradigm, two main challenges emerge: managing distributional robustness while striking a balance between exploration and exploitation during data collection. Initially, we establish that sample-efficient learning without additional assumptions is unattainable owing to the curse of support shift; i.e., the potential disjointedness of the distributional supports between the training and testing environments. To circumvent such a hardness result, we introduce the vanishing minimal value assumption to RMDPs with a total-variation (TV) distance robust set, postulating that the minimal value of the optimal robust value function is zero. We prove that such an assumption effectively eliminates the support shift issue for RMDPs with a TV distance robust set, and present an algorithm with a provable sample complexity guarantee. Our work makes the initial step to uncovering the inherent difficulty of robust RL via interactive data collection and sufficient conditions for designing a sample-efficient algorithm accompanied by sharp sample complexity analysis.
LCM-Lookahead for Encoder-based Text-to-Image Personalization
Authors: Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2404.03620
Pdf link: https://arxiv.org/pdf/2404.03620
Abstract Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both.
Training LLMs over Neurally Compressed Text
Authors: Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.03626
Pdf link: https://arxiv.org/pdf/2404.03626
Abstract In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text na\"ively compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.
WorDepth: Variational Language Prior for Monocular Depth Estimation
Authors: Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2404.03635
Pdf link: https://arxiv.org/pdf/2404.03635
Abstract Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder, which is then decoded to the output depth map. Our approach is trained alternatingly between the text and image branches: in one optimization step, we predict the mean and standard deviation from the text description and sample from a standard Gaussian, and in the other, we sample using a (image) conditional sampler. Once trained, we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where we show that language can consistently improve performance in both.
DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior
Authors: Yiming Zhang, Zhe Wang, Xinjie Li, Yunchen Yuan, Chengsong Zhang, Xiao Sun, Zhihang Zhong, Jian Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.03642
Pdf link: https://arxiv.org/pdf/2404.03642
Abstract Human body restoration plays a vital role in various applications related to the human body. Despite recent advances in general image restoration using generative models, their performance in human body restoration remains mediocre, often resulting in foreground and background blending, over-smoothing surface textures, missing accessories, and distorted limbs. Addressing these challenges, we propose a novel approach by constructing a human body-aware diffusion model that leverages domain-specific knowledge to enhance performance. Specifically, we employ a pretrained body attention module to guide the diffusion model's focus on the foreground, addressing issues caused by blending between the subject and background. We also demonstrate the value of revisiting the language modality of the diffusion model in restoration tasks by seamlessly incorporating text prompt to improve the quality of surface texture and additional clothing and accessories details. Additionally, we introduce a diffusion sampler tailored for fine-grained human body parts, utilizing local semantic information to rectify limb distortions. Lastly, we collect a comprehensive dataset for benchmarking and advancing the field of human body restoration. Extensive experimental validation showcases the superiority of our approach, both quantitatively and qualitatively, over existing methods.
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
Authors: Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.03648
Pdf link: https://arxiv.org/pdf/2404.03648
Abstract Large language models (LLMs) have fueled many intelligent agent tasks, such as web navigation -- but most existing agents perform far from satisfying in real-world webpages due to three factors: (1) the versatility of actions on webpages, (2) HTML text exceeding model processing capacity, and (3) the complexity of decision-making due to the open-domain nature of web. In light of the challenge, we develop AutoWebGLM, a GPT-4-outperforming automated web navigation agent built upon ChatGLM3-6B. Inspired by human browsing patterns, we design an HTML simplification algorithm to represent webpages, preserving vital information succinctly. We employ a hybrid human-AI method to build web browsing data for curriculum training. Then, we bootstrap the model by reinforcement learning and rejection sampling to further facilitate webpage comprehension, browser operations, and efficient task decomposition by itself. For testing, we establish a bilingual benchmark -- AutoWebBench -- for real-world web browsing tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, revealing its improvements but also underlying challenges to tackle real environments. Related code, model, and data will be released at \url{https://github.com/THUDM/AutoWebGLM}.

Yukeaaa / arxiv-daily

【CS-part2】New submissions for Fri, 5 Apr 24 #1347

Keyword: webgpu

Keyword: webgl

AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent

Keyword: pre-rendering

Keyword: prerendering

Keyword: motion prediction

Towards more realistic human motion prediction with attention to motion coordination

Keyword: incremental learning

Future-Proofing Class Incremental Learning

Keyword: svm incremental

Keyword: nerf

VF-NeRF: Viewshed Fields for Rigid NeRF Registration

OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

RaFE: Generative Radiance Fields Restoration

Keyword: multiorgan

Keyword: multi-organ

Keyword: multi organ

Keyword: SAM

READ: Improving Relation Extraction from an ADversarial Perspective

Learning in Convolutional Neural Networks Accelerated by Transfer Entropy

Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Deep Generative Models through the Lens of the Manifold Hypothesis: A Survey and New Connections

Universal Functional Regression with Neural Operator Flows

Risk-averse Learning with Non-Stationary Distributions

An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models

Model-based Reinforcement Learning for Parameterized Action Spaces

vPALs: Towards Verified Performance-aware Learning System For Resource Management

Unsupervised, Bottom-up Category Discovery for Symbol Grounding with a Curious Robot

Low Frequency Sampling in Model Predictive Path Integral Control

MARL-LNS: Cooperative Multi-agent Reinforcement Learning via Large Neighborhoods Search

Many-to-many Image Generation with Auto-regressive Diffusion Models

Ego-Motion Aware Target Prediction Module for Robust Multi-Object Tracking

Biodegradable Interactive Materials

A Framework for Guided Motion Planning

Multi-modal Learning for WebAssembly Reverse Engineering

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization

MonoCD: Monocular 3D Object Detection with Complementary Depths

Reservoir Sampling over Joins

Okay, Let's Do This! Modeling Event Coreference with Generated Rationales and Knowledge Distillation

Future-Proofing Class Incremental Learning

Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption

iSeg: Interactive 3D Segmentation via Interactive Attention

FACTUAL: A Novel Framework for Contrastive Learning Based Robust SAR Image Classification

Learn What You Want to Unlearn: Unlearning Inversion Attacks against Machine Unlearning

Learning Transferable Negative Prompts for Out-of-Distribution Detection

Cryptographic Hardness of Score Estimation

Gaussian-Smoothed Sliced Probability Divergences

Traversability-aware Adaptive Optimization for Path Planning and Control in Mountainous Terrain

MPOFI: Multichannel Partially Observed Functional Modeling for Defect Classification with Imbalanced Dataset via Deep Metric Learning

Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks

Graph Neural Networks for Electric and Hydraulic Data Fusion to Enhance Short-term Forecasting of Pumped-storage Hydroelectricity

A superconvergence result in the RBF-FD method

Accurate estimation of feature importance faithfulness for tree models

Design and Optimization of Cooperative Sensing With Limited Backhaul Capacity

SP$^2$OT: Semantic-Regularized Progressive Partial Optimal Transport for Imbalanced Clustering

Computational analysis of a contraction rheometer for the grade-two fluid model

Generalization Bounds for Message Passing Networks on Mixture of Graphons

Performance of computer vision algorithms for fine-grained classification using crowdsourced insect images

Learn When (not) to Trust Language Models: A Privacy-Centric Adaptive Model-Aware Approach

If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces

No Panacea in Planning: Algorithm Selection for Suboptimal Multi-Agent Path Finding

Factored Task and Motion Planning with Combined Optimization, Sampling and Learning

DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm

LCM-Lookahead for Encoder-based Text-to-Image Personalization

Training LLMs over Neurally Compressed Text

WorDepth: Variational Language Prior for Monocular Depth Estimation

DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior

AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent