Abstract
With the great success of deep neural networks, adversarial learning has received widespread attention in various studies, ranging from multi-class learning to multi-label learning. However, existing adversarial attacks on multi-label learning pursue only the traditional visual imperceptibility and ignore a new perceptibility problem arising from measures such as Precision@$k$ and mAP@$k$. Specifically, when a well-trained multi-label classifier performs far below expectation on some samples, the victim can easily realize that this performance degradation stems from an attack rather than from the model itself. Therefore, an ideal multi-label adversarial attack should manage not only to deceive visual perception but also to evade monitoring by such measures. To this end, this paper first proposes the concept of measure imperceptibility. Then, a novel loss function is devised to generate adversarial perturbations that achieve both visual and measure imperceptibility. Furthermore, an efficient algorithm with a convex objective is established to optimize this loss. Finally, extensive experiments on large-scale benchmark datasets, such as PASCAL VOC 2012, MS COCO, and NUS WIDE, demonstrate the superiority of our proposed method in attacking top-$k$ multi-label systems.
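For concreteness, Precision@$k$, one of the measures such an attack must leave looking normal, can be computed as in this minimal sketch (the array names are illustrative):

```python
import numpy as np

def precision_at_k(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    # scores: model confidences per label; labels: binary ground truth.
    # Fraction of the top-k ranked labels that are truly relevant.
    topk = np.argsort(-scores)[:k]
    return labels[topk].sum() / k
```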
Improving NeRF Quality by Progressive Camera Placement for Unrestricted Navigation in Complex Environments
Authors: Georgios Kopanas, George Drettakis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Abstract
Neural Radiance Fields, or NeRFs, have drastically improved novel view synthesis and 3D reconstruction for rendering. NeRFs achieve impressive results on object-centric reconstructions, but the quality of novel view synthesis with free-viewpoint navigation in complex environments (rooms, houses, etc.) is often problematic. While algorithmic improvements play an important role in the resulting quality of novel view synthesis, in this work, we show that because optimizing a NeRF is inherently a data-driven process, good quality data play a fundamental role in the final quality of the reconstruction. As a consequence, it is critical to choose the data samples -- in this case the cameras -- in a way that will eventually allow the optimization to converge to a solution that supports free-viewpoint navigation with good quality. Our main contribution is an algorithm that efficiently proposes new camera placements that improve visual quality with minimal assumptions. Our solution can be used with any NeRF model and outperforms baselines and similar work.
Physics-Based Trajectory Design for Cellular-Connected UAV in Rainy Environments Based on Deep Reinforcement Learning
Authors: Hao Qin, Zhaozhou Wu, Xingqi Zhang
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Abstract
Cellular-connected unmanned aerial vehicles (UAVs) have gained increasing attention due to their potential to enhance conventional UAV capabilities by leveraging existing cellular infrastructure for reliable communications between UAVs and base stations. They have been used for various applications, including weather forecasting and search and rescue operations. However, under extreme weather conditions such as rainfall, trajectory design for cellular-connected UAVs is challenging, due to weak coverage regions in the sky, limited UAV flight time, and signal attenuation caused by raindrops. To this end, this paper proposes a physics-based trajectory design approach for cellular-connected UAVs in rainy environments. A physics-based electromagnetic simulator is utilized to take into account detailed environment information and the impact of rain on radio wave propagation. The trajectory optimization problem is formulated to jointly consider UAV flying time and signal-to-interference ratio, and is solved through a Markov decision process using deep reinforcement learning algorithms based on multi-step learning and double Q-learning. Optimal UAV trajectories are compared in examples with a homogeneous atmospheric medium and a rain medium. Additionally, a thorough study of the effect of varying weather conditions on trajectory design is provided, and the impact of weight coefficients in the problem formulation is discussed. The proposed approach demonstrates great potential for UAV trajectory design under rainy weather conditions.
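As a rough illustration of the learning components named above (not the paper's exact algorithm), a double Q-learning target combined with multi-step returns looks like this; `q_online` and `q_target` are hypothetical value functions:

```python
import numpy as np

def double_q_target(n_step_return, gamma_n, next_state, q_online, q_target):
    # Online network selects the next action; target network evaluates it,
    # which curbs the over-estimation bias of vanilla Q-learning.
    a_star = np.argmax(q_online(next_state))
    # n_step_return is the discounted sum of the next n rewards
    # (multi-step learning); gamma_n = gamma ** n.
    return n_step_return + gamma_n * q_target(next_state)[a_star]
```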
An Energy-Aware Approach to Design Self-Adaptive AI-based Applications on the Edge
Abstract
The advent of edge devices dedicated to machine learning tasks enabled the execution of AI-based applications that efficiently process and classify the data acquired by the resource-constrained devices populating the Internet of Things. The proliferation of such applications (e.g., critical monitoring in smart cities) demands new strategies to make these systems also sustainable from an energetic point of view. In this paper, we present an energy-aware approach for the design and deployment of self-adaptive AI-based applications that can balance application objectives (e.g., accuracy in object detection and frame processing rate) with energy consumption. We address the problem of determining the set of configurations that can be used to self-adapt the system with a meta-heuristic search procedure that only needs a small number of empirical samples. The final set of configurations is selected using weighted gray relational analysis and mapped to the operation modes of the self-adaptive application. We validate our approach on an AI-based application for pedestrian detection. Results show that our self-adaptive application can outperform non-adaptive baseline configurations by saving up to 81% of energy while losing only between 2% and 6% in accuracy.
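A minimal sketch of the weighted gray relational analysis step, in its standard textbook form (the normalization scheme and the value of `rho` are our assumptions, not the paper's):

```python
import numpy as np

def weighted_gra(scores: np.ndarray, weights: np.ndarray, rho: float = 0.5):
    # scores: (n_configs, n_objectives), larger is better after normalization;
    # weights: per-objective importance, summing to 1; rho: the usual
    # distinguishing coefficient. Returns one gray relational grade per config.
    lo, hi = scores.min(0), scores.max(0)
    norm = (scores - lo) / (hi - lo + 1e-12)
    delta = 1.0 - norm                       # distance to the ideal reference
    coeff = (delta.min() + rho * delta.max()) / (delta + rho * delta.max())
    return coeff @ weights
```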
Continual Learning From a Stream of APIs
Authors: Enneng Yang, Zhenyi Wang, Li Shen, Nan Yin, Tongliang Liu, Guibing Guo, Xingwei Wang, Dacheng Tao
Abstract
Continual learning (CL) aims to learn new tasks without forgetting previous tasks. However, existing CL methods require a large amount of raw data, which is often unavailable due to copyright considerations and privacy risks. Instead, stakeholders usually release pre-trained machine learning models as a service (MLaaS), which users can access via APIs. This paper considers two practical-yet-novel CL settings: data-efficient CL (DECL-APIs) and data-free CL (DFCL-APIs), which achieve CL from a stream of APIs with partial or no raw data. Performing CL under these two new settings faces several challenges: unavailable full raw data, unknown model parameters, heterogeneous models of arbitrary architecture and scale, and catastrophic forgetting of previous APIs. To overcome these issues, we propose a novel data-free cooperative continual distillation learning framework that distills knowledge from a stream of APIs into a CL model by generating pseudo data, just by querying APIs. Specifically, our framework includes two cooperative generators and one CL model, forming their training as an adversarial game. We first use the CL model and the current API as fixed discriminators to train the generators via a derivative-free method. The generators adversarially generate hard and diverse synthetic data to maximize the response gap between the CL model and the API. Next, we train the CL model by minimizing the gap between the responses of the CL model and the black-box API on synthetic data, to transfer the API's knowledge to the CL model. Furthermore, we propose a new regularization term based on network similarity to prevent catastrophic forgetting of previous APIs. Our method performs comparably to classic CL with full raw data on MNIST and SVHN in the DFCL-APIs setting. In the DECL-APIs setting, our method achieves 0.97x, 0.75x, and 0.69x the performance of classic CL on CIFAR10, CIFAR100, and MiniImageNet, respectively.
Efficient Multi-View Graph Clustering with Local and Global Structure Preservation
Authors: Yi Wen, Suyuan Liu, Xinhang Wan, Siwei Wang, Ke Liang, Xinwang Liu, Xihong Yang, Pei Zhang
Abstract
Anchor-based multi-view graph clustering (AMVGC) has received considerable attention owing to its high efficiency and its capability to capture complementary structural information across multiple views. Intuitively, a high-quality anchor graph plays an essential role in the success of AMVGC. However, existing AMVGC methods only consider single-structure information, i.e., local or global structure, which provides insufficient information for the learning task. To be specific, an over-scattered global structure leads to learned anchors failing to depict the cluster partition well. In contrast, a local structure with an improper similarity measure results in potentially inaccurate anchor assignment, ultimately leading to sub-optimal clustering performance. To tackle this issue, we propose a novel anchor-based multi-view graph clustering framework termed Efficient Multi-View Graph Clustering with Local and Global Structure Preservation (EMVGC-LG). Specifically, a unified framework with a theoretical guarantee is designed to capture local and global information. Besides, EMVGC-LG jointly optimizes anchor construction and graph learning to enhance clustering quality. In addition, EMVGC-LG inherits the linear complexity of existing AMVGC methods with respect to the number of samples, which is time-economical and scales well with data size. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method.
YaRN: Efficient Context Window Extension of Large Language Models
Abstract
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. We publish the checkpoints of Llama 2 7B/13B fine-tuned using YaRN with 64k and 128k context windows at https://github.com/jquesnelle/yarn
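For context, here is the machinery YaRN modifies, not YaRN itself: RoPE rotates query/key pairs by position-dependent angles, and the simplest extension trick (position interpolation) divides positions by the extension factor `s`; YaRN instead rescales the frequencies non-uniformly.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0,
                s: float = 1.0) -> np.ndarray:
    # Per-pair rotary frequencies; s > 1 compresses positions back into
    # the trained range (the position-interpolation baseline).
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return np.outer(positions / s, inv_freq)  # (seq_len, dim // 2) angles
```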
Few-shot Diagnosis of Chest x-rays Using an Ensemble of Random Discriminative Subspaces
Authors: Kshitiz, Garvit Garg, Angshuman Paul
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Due to the scarcity of annotated data in the medical domain, few-shot learning may be useful for medical image analysis tasks. We design a few-shot learning method using an ensemble of random subspaces for the diagnosis of chest x-rays (CXRs). Our design is computationally efficient and almost 1.8 times faster than a method that uses the popular truncated singular value decomposition (t-SVD) for subspace decomposition. The proposed method is trained by minimizing a novel loss function that helps create well-separated clusters of training data in discriminative subspaces. As a result, minimizing the loss maximizes the distance between the subspaces, making them discriminative and assisting in better classification. Experiments on large-scale publicly available CXR datasets yield promising results. Code for the project will be available at https://github.com/Few-shot-Learning-on-chest-x-ray/fsl_subspace.
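One cheap way to obtain random subspaces without a t-SVD, sketched under our own assumptions (QR orthonormalization of Gaussian matrices; not necessarily the paper's construction):

```python
import numpy as np

def random_subspaces(dim: int, k: int, n_subspaces: int, seed: int = 0):
    # Each ensemble member classifies in its own random k-dimensional
    # subspace; QR gives an orthonormal basis without a costly SVD.
    rng = np.random.default_rng(seed)
    return [np.linalg.qr(rng.standard_normal((dim, k)))[0]
            for _ in range(n_subspaces)]
```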
Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection
Abstract
Vision Transformer (ViT) models have demonstrated breakthroughs in a wide range of computer vision tasks. However, compared to Convolutional Neural Network (CNN) models, ViT models struggle to capture the high-frequency components of images, which can limit their ability to detect local textures and edge information. As abnormalities in human tissue, such as tumors and lesions, may vary greatly in structure, texture, and shape, high-frequency information such as texture is crucial for effective semantic segmentation. To address this limitation of ViT models, we propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid. More specifically, our method uses a dual attention mechanism: an efficient attention module that reduces the complexity of self-attention to linear while producing the same output, and a frequency attention module that selectively intensifies the contribution of shape and texture features. Furthermore, we introduce a novel efficient enhancement multi-scale bridge that effectively transfers spatial information from the encoder to the decoder while preserving the fundamental features. We demonstrate the efficacy of Laplacian-Former on multi-organ and skin lesion segmentation tasks, with +1.87% and +0.76% dice score improvements over SOTA approaches, respectively. Our implementation is publicly available at https://github.com/mindflow-institue/Laplacian-Former
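The Laplacian pyramid underlying the frequency re-calibration separates an image into band-pass levels; a minimal sketch (the Gaussian kernel width and level count are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def laplacian_pyramid(img: np.ndarray, levels: int = 3):
    # Each level stores the high-frequency detail removed by one blur;
    # the final residual holds the remaining low frequencies.
    bands, current = [], img.astype(float)
    for _ in range(levels):
        smooth = gaussian_filter(current, sigma=1.0)
        bands.append(current - smooth)   # band-pass (texture/edge) layer
        current = smooth[::2, ::2]       # downsample for the next octave
    bands.append(current)                # low-frequency residual
    return bands
```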
Inverse designing surface curvatures by deep learning
Authors: Yaqi Guo, Saurav Sharma, Siddhant Kumar
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Abstract
Smooth and curved microstructural topologies found in nature - from soap films to trabecular bone - have inspired several mimetic design spaces for architected metamaterials and bio-scaffolds. However, the design approaches so far have been ad hoc, raising the challenge: how to systematically and efficiently inverse design such artificial microstructures with targeted topological features? Here, we explore surface curvature as a design modality and present a deep learning framework to produce topologies with as-desired curvature profiles. The inverse design framework can generalize to diverse topological features such as tubular, membranous, and particulate features. Moreover, we demonstrate successful generalization beyond both the design and data space by inverse designing topologies that mimic the curvature profile of trabecular bone, spinodoid topologies, and periodic nodal surfaces for application in bio-scaffolds and implants. Lastly, we bridge curvature and mechanics by showing how topological curvature can be designed to promote mechanically beneficial stretching-dominated deformation over bending-dominated deformation.
Diffusion Model with Clustering-based Conditioning for Food Image Generation
Authors: Yue Han, Jiangpeng He, Mridul Gupta, Edward J. Delp, Fengqing Zhu
Abstract
Image-based dietary assessment serves as an efficient and accurate solution for recording and analyzing nutrition intake using eating occasion images as input. Deep learning-based techniques are commonly used to perform image analysis such as food classification, segmentation, and portion size estimation, which rely on large amounts of food images with annotations for training. However, such data dependency poses significant barriers to real-world applications, because acquiring a substantial, diverse, and balanced set of food images can be challenging. One potential solution is to use synthetic food images for data augmentation. Although existing work has explored the use of generative adversarial network (GAN)-based structures for generation, the quality of synthetic food images remains subpar. In addition, while diffusion-based generative models have shown promising results for general image generation tasks, the generation of food images can be challenging due to the substantial intra-class variance. In this paper, we investigate the generation of synthetic food images based on the conditional diffusion model and propose an effective clustering-based training framework, named ClusDiff, for generating high-quality and representative food images. The proposed method is evaluated on the Food-101 dataset and shows improved performance when compared with existing image generation works. We also demonstrate that the synthetic food images generated by ClusDiff can help address the severe class imbalance issue in long-tailed food classification using the VFN-LT dataset.
Data-Driven Projection for Reducing Dimensionality of Linear Programs: Generalization Bound and Learning Methods
Abstract
This paper studies a simple data-driven approach to high-dimensional linear programs (LPs). Given data of past $n$-dimensional LPs, we learn an $n\times k$ \textit{projection matrix} ($n > k$), which reduces the dimensionality from $n$ to $k$. Then, we address future LP instances by solving $k$-dimensional LPs and recovering $n$-dimensional solutions by applying the projection matrix. This idea is compatible with any user-preferred LP solver, hence a versatile approach to faster LP solving. One natural question is: how much data is sufficient to ensure the quality of the recovered solutions? We address this question based on the idea of \textit{data-driven algorithm design}, which relates the amount of data sufficient for generalization guarantees to the \textit{pseudo-dimension} of performance metrics. We present an $\tilde{\mathrm{O}}(nk^2)$ upper bound on the pseudo-dimension ($\tilde{\mathrm{O}}$ compresses logarithmic factors) and complement it with an $\Omega(nk)$ lower bound, hence tight up to an $\tilde{\mathrm{O}}(k)$ factor. On the practical side, we study two natural methods for learning projection matrices: PCA- and gradient-based methods. While the former is simple and efficient, the latter sometimes leads to better solution quality. Experiments confirm that learned projection matrices are beneficial for reducing the time for solving LPs while maintaining high solution quality.
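The mechanics of the reduce-then-recover step, as a sketch with SciPy (we assume the original constraints, including any sign constraints on $x$, are encoded in $Ax \le b$; learning the matrix $P$ is the paper's subject and is not shown):

```python
import numpy as np
from scipy.optimize import linprog

def solve_projected_lp(c, A, b, P):
    # Solve min_y (P^T c)^T y  s.t.  (A P) y <= b  in k dimensions,
    # then recover the n-dimensional solution x = P y.
    res = linprog(P.T @ c, A_ub=A @ P, b_ub=b, bounds=(None, None))
    return P @ res.x
```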
Gap and Overlap Detection in Automated Fiber Placement
Authors: Assef Ghamisi, Homayoun Najjaran
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract
The identification and correction of manufacturing defects, particularly gaps and overlaps, are crucial for ensuring high-quality composite parts produced through Automated Fiber Placement (AFP). These imperfections are the most commonly observed issues that can significantly impact the overall quality of the composite parts. Manual inspection is both time-consuming and labor-intensive, making it an inefficient approach. To overcome this challenge, the implementation of an automated defect detection system serves as the optimal solution. In this paper, we introduce a novel method that uses an Optical Coherence Tomography (OCT) sensor and computer vision techniques to detect and locate gaps and overlaps in composite parts. Our approach involves generating a depth map image of the composite surface that highlights the elevation of composite tapes (or tows) on the surface. By detecting the boundaries of each tow, our algorithm can compare consecutive tows and identify gaps or overlaps that may exist between them. Any gaps or overlaps exceeding a predefined tolerance threshold are considered manufacturing defects. To evaluate the performance of our approach, we compare the detected defects with the ground truth annotated by experts. The results demonstrate a high level of accuracy and efficiency in gap and overlap segmentation.
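The final thresholding step can be pictured as follows (a sketch; the tolerance value is a made-up placeholder, not the paper's):

```python
def classify_seam(signed_distance_mm: float, tol_mm: float = 0.5) -> str:
    # Signed distance between adjacent tow boundaries: positive means
    # material missing (gap), negative means material doubled (overlap).
    if signed_distance_mm > tol_mm:
        return "gap"
    if signed_distance_mm < -tol_mm:
        return "overlap"
    return "ok"
```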
Large Language Models for Semantic Monitoring of Corporate Disclosures: A Case Study on Korea's Top 50 KOSPI Companies
Authors: Junwon Sung, Woojin Heo, Yunkyung Byun, Youngsam Kim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
In the rapidly advancing domain of artificial intelligence, state-of-the-art language models such as OpenAI's GPT-3.5-turbo and GPT-4 offer unprecedented opportunities for automating complex tasks. This research paper delves into the capabilities of these models for semantically analyzing corporate disclosures in the Korean context, specifically for timely disclosure. The study focuses on the top 50 publicly traded companies listed on the Korean KOSPI, based on market capitalization, and scrutinizes their monthly disclosure summaries over a period of 17 months. Each summary was assigned a sentiment rating on a scale ranging from 1 (very negative) to 5 (very positive). To gauge the effectiveness of the language models, their sentiment ratings were compared with those generated by human experts. Our findings reveal a notable performance disparity between GPT-3.5-turbo and GPT-4, with the latter demonstrating superior accuracy in human evaluation tests. The Spearman correlation coefficient was registered at 0.61, while the simple concordance rate was recorded at 0.82. This research contributes valuable insights into the evaluative characteristics of GPT models, thereby laying the groundwork for future innovations in the field of automated semantic monitoring.
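The two reported agreement statistics can be reproduced as follows (the ratings below are illustrative placeholders, not the study's data):

```python
from scipy.stats import spearmanr

model_ratings = [4, 2, 5, 3, 1]   # hypothetical 1-5 sentiment scores
human_ratings = [5, 2, 4, 3, 1]

rho, _ = spearmanr(model_ratings, human_ratings)   # rank correlation
concordance = sum(m == h for m, h in zip(model_ratings, human_ratings)) / len(model_ratings)
```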
JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialog Policy Learning
Abstract
Dialogue policy learning (DPL) is a crucial component of dialogue modelling. Its primary role is to determine the appropriate abstract response, commonly referred to as the "dialogue action". Traditional DPL methodologies have treated this as a sequential decision problem, using pre-defined action candidates extracted from a corpus. However, these incomplete candidates can significantly limit the diversity of responses and pose challenges when dealing with edge cases, scenarios that occur only under extreme operating conditions. To address these limitations, we introduce a novel framework, JoTR. This framework is unique in that it leverages a text-to-text Transformer-based model to generate flexible dialogue actions. Unlike traditional methods, JoTR formulates a word-level policy that allows for more dynamic and adaptable dialogue action generation, without the need for any action templates. This setting enhances the diversity of responses and improves the system's ability to handle edge cases effectively. In addition, JoTR employs reinforcement learning with a reward-shaping mechanism to efficiently fine-tune the word-level dialogue policy, which allows the model to learn from its interactions, improving its performance over time. Our extensive evaluation shows that JoTR achieves state-of-the-art performance on two benchmark dialogue modelling tasks, as assessed by both user simulators and human evaluators.
DiffuGen: Adaptable Approach for Generating Labeled Image Datasets using Stable Diffusion Models
Authors: Michael Shenoda, Edward Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Generating high-quality labeled image datasets is crucial for training accurate and robust machine learning models in the field of computer vision. However, the process of manually labeling real images is often time-consuming and costly. To address these challenges associated with dataset generation, we introduce "DiffuGen," a simple and adaptable approach that harnesses the power of stable diffusion models to create labeled image datasets efficiently. By leveraging stable diffusion models, our approach not only ensures the quality of generated datasets but also provides a versatile solution for label generation. In this paper, we present the methodology behind DiffuGen, which combines the capabilities of diffusion models with two distinct labeling techniques: unsupervised and supervised. Distinctively, DiffuGen employs prompt templating for adaptable image generation and textual inversion to enhance diffusion model capabilities.
Optimal Repair Strategy Against Advanced Persistent Threats Under Time-Varying Networks
Abstract
Advanced persistent threat (APT) is a kind of stealthy, sophisticated, and long-term cyberattack that has brought severe financial losses and critical infrastructure damage. Existing works mainly focus on APT defense under stable network topologies, while the problem under time-varying dynamic networks (e.g., vehicular networks) remains unexplored, which motivates our work. Besides, the spatiotemporal dynamics of defense resources, attackers' complex lateral movement behaviors, and the lack of timely defense make APT defense a challenging issue under time-varying networks. In this paper, we propose a novel game-theoretical APT defense approach to promote real-time and optimal defense strategy-making under both periodic time-varying and general time-varying environments. Specifically, we first model the interactions between attackers and defenders in an APT process as a dynamic APT repair game, and then formulate the APT damage minimization problem as the precise prevention and control (PPAC) problem. To derive the optimal defense strategy under both latency and defense resource constraints, we further devise an online optimal control-based mechanism integrated with two backtracking-forward algorithms to rapidly derive near-optimal solutions of the PPAC problem in real time. Extensive experiments are carried out, and the results demonstrate that our proposed scheme can efficiently obtain an optimal defense strategy in 54481 ms under seven attack-defense interactions with 9.64% resource occupancy in simulated periodic time-varying and general time-varying networks. Besides, even under static networks, our proposed scheme still outperforms existing representative APT defense approaches in terms of service stability and defense resource utilization.
SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks
Authors: Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi
Abstract
As the size of deep learning models continues to grow, finding optimal models under memory and computation constraints becomes increasingly important. Although the architecture and constituent building blocks of neural networks usually allow them to be used in a modular way, their training process is not aware of this modularity. Consequently, conventional neural network training lacks the flexibility to adapt the computational load of the model during inference. This paper proposes SortedNet, a generalized and scalable solution to harness the inherent modularity of deep neural networks across various dimensions for efficient dynamic inference. Our training considers a nested architecture for the sub-models with shared parameters and trains them together with the main model in a sorted and probabilistic manner. This sorted training of sub-networks enables us to scale the number of sub-networks to hundreds using a single round of training. We utilize a novel updating scheme during training that combines random sampling of sub-networks with gradient accumulation to improve training efficiency. Furthermore, the sorted nature of our training leads to search-free sub-network selection at inference time, and the nested architecture of the resulting sub-networks leads to minimal storage requirements and efficient switching between sub-networks at inference. Our general dynamic training approach is demonstrated across various architectures and tasks, including large language models and pre-trained vision models. Experimental results show the efficacy of the proposed approach in achieving efficient sub-networks while outperforming state-of-the-art dynamic training approaches. Our findings demonstrate the feasibility of training up to 160 different sub-models simultaneously, showcasing the extensive scalability of our proposed method while maintaining 96% of the model performance.
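A minimal sketch of what nested, "sorted" parameter sharing can look like (our own simplification, not the paper's exact scheme): every sub-model uses a prefix slice of each layer's units, so smaller sub-models are strict prefixes of larger ones.

```python
import torch
import torch.nn.functional as F
from torch import nn

WIDTHS = [0.25, 0.5, 0.75, 1.0]  # hypothetical sorted width fractions

def sorted_linear(layer: nn.Linear, x: torch.Tensor, frac: float) -> torch.Tensor:
    # A sub-network of fraction `frac` uses only the first k output units,
    # so all sub-models share the leading parameters of the full model.
    k = max(1, int(layer.out_features * frac))
    return F.linear(x, layer.weight[:k], layer.bias[:k])
```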
Co-Tuning of Cloud Infrastructure and Distributed Data Processing Platforms
Authors: Isuru Dharmadasa, Faheem Ullah
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Distributed Data Processing Platforms (e.g., Hadoop, Spark, and Flink) are widely used to store and process data in a cloud environment. These platforms distribute the storage and processing of data among the computing nodes of a cloud. The efficient use of these platforms requires users to (i) configure the cloud, i.e., determine the number and type of computing nodes, and (ii) tune the configuration parameters (e.g., data replication factor) of the platform. However, both of these tasks require in-depth knowledge of the cloud infrastructure and distributed data processing platforms. Therefore, in this paper, we first study the relationship between the configuration of the cloud and the configuration of distributed data processing platforms to determine how cloud configuration impacts platform configuration. After understanding the impacts, we propose a co-tuning approach for recommending optimal co-configuration of cloud and distributed data processing platforms. The proposed approach utilizes machine learning and optimization techniques to maximize the performance of the distributed data processing system deployed on the cloud. We evaluated our approach for Hadoop, Spark, and Flink in a cluster deployed on the OpenStack cloud. We used three benchmarking workloads (WordCount, Sort, and K-means) in our evaluation. Our results reveal that, in comparison to default settings, our co-tuning approach reduces execution time by 17.5% and cost by 14.9% solely via configuration tuning.
On the Aggregation of Rules for Knowledge Graph Completion
Authors: Patrick Betz, Stefan Lüdtke, Christian Meilicke, Heiner Stuckenschmidt
Abstract
Rule learning approaches for knowledge graph completion are efficient, interpretable, and competitive with purely neural models. The rule aggregation problem is concerned with finding one plausibility score for a candidate fact which was simultaneously predicted by multiple rules. Although the problem is ubiquitous, as data-driven rule learning can result in noisy and large rulesets, it is underrepresented in the literature and its theoretical foundations have not been studied before in this context. In this work, we demonstrate that existing aggregation approaches can be expressed as marginal inference operations over the predicting rules. In particular, we show that the common Max-aggregation strategy, which scores candidates based on the rule with the highest confidence, has a probabilistic interpretation. Finally, we propose an efficient and overlooked baseline which combines the previous strategies and is competitive to computationally more expensive approaches.
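Two common aggregation baselines discussed in this line of work, sketched for a candidate fact predicted by several rules with confidences in [0, 1]:

```python
def max_aggregate(confidences):
    # Score the candidate by its single most confident predicting rule.
    return max(confidences)

def noisy_or(confidences):
    # Treat rules as independent evidence: the candidate is implausible
    # only if every predicting rule fires incorrectly.
    score = 1.0
    for c in confidences:
        score *= 1.0 - c
    return 1.0 - score
```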
Multi-fidelity reduced-order surrogate modeling
Authors: Paolo Conti, Mengwu Guo, Andrea Manzoni, Attilio Frangi, Steven L. Brunton, J. Nathan Kutz
Abstract
High-fidelity numerical simulations of partial differential equations (PDEs) given a restricted computational budget can significantly limit the number of parameter configurations considered and/or the time window evaluated for modeling a given system. Multi-fidelity surrogate modeling aims to leverage less accurate, lower-fidelity models that are computationally inexpensive in order to enhance predictive accuracy when high-fidelity data are limited or scarce. However, low-fidelity models, while often displaying important qualitative spatio-temporal features, fail to accurately capture the onset of instability and critical transients observed in the high-fidelity models, making them impractical as surrogate models. To address this shortcoming, we present a new data-driven strategy that combines dimensionality reduction with multi-fidelity neural network surrogates. The key idea is to generate a spatial basis by applying the classical proper orthogonal decomposition (POD) to high-fidelity solution snapshots, and to approximate the dynamics of the reduced states - the time- and parameter-dependent expansion coefficients of the POD basis - using a multi-fidelity long short-term memory (LSTM) network. By mapping low-fidelity reduced states to their high-fidelity counterparts, the proposed reduced-order surrogate model enables the efficient recovery of full solution fields over time and parameter variations in a non-intrusive manner. The generality and robustness of this method are demonstrated on a collection of parametrized, time-dependent PDE problems where the low-fidelity model can be defined by coarser meshes and/or time stepping, as well as by misspecified physical features. Importantly, the onset of instabilities and transients is well captured by this surrogate modeling technique.
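The POD step amounts to a truncated SVD of the snapshot matrix; a minimal sketch:

```python
import numpy as np

def pod_basis(snapshots: np.ndarray, r: int) -> np.ndarray:
    # snapshots: (n_dof, n_snapshots) matrix of high-fidelity solutions.
    # Columns of the returned matrix are the first r POD modes; the
    # reduced states are the coefficients modes.T @ snapshots, whose
    # dynamics the multi-fidelity LSTM then approximates.
    U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]
```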
Towards a "Swiss Army Knife" for Scalable User-Defined Temporal $(k,\mathcal{X})$-Core Analysis
Abstract
Querying cohesive subgraphs on temporal graphs (e.g., social networks, finance networks, etc.) under various conditions has attracted intensive research interest recently. In this paper, we study a novel Temporal $(k,\mathcal{X})$-Core Query (TXCQ) that extends the fundamental Temporal $k$-Core Query (TCQ) proposed in our conference paper by optimizing or constraining an arbitrary metric $\mathcal{X}$ of $k$-core, such as size, engagement, interaction frequency, time span, burstiness, periodicity, etc. Our objective is to address specific TXCQ instances with conditions on different $\mathcal{X}$ in a unified algorithm framework that guarantees scalability. To that end, this journal paper proposes a taxonomy of measurements $\mathcal{X}(\cdot)$ and achieves our objective using a two-phase framework when $\mathcal{X}(\cdot)$ is time-insensitive or time-monotonic. Specifically, Phase 1 leverages the query processing algorithm of TCQ to induce all distinct $k$-cores during a given time range, and meanwhile locates the "time zones" in which the cores emerge. Then, Phase 2 conducts fast local search and $\mathcal{X}$ evaluation in each time zone with respect to the time insensitivity or monotonicity of $\mathcal{X}(\cdot)$. By revealing two insightful concepts named tightest time interval and loosest time interval that bound time zones, the redundant core induction and unnecessary $\mathcal{X}$ evaluation in a zone can be reduced dramatically. Our experimental results demonstrate that TXCQ can be addressed as efficiently as TCQ, which achieves the latest state-of-the-art performance, by using a general algorithm framework that leaves $\mathcal{X}(\cdot)$ as a user-defined function.
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning
Abstract
Large language models (LLMs) have demonstrated great capabilities in various NLP tasks. Different entities can further improve the performance of those LLMs on their specific downstream tasks by fine-tuning LLMs. When several entities are interested in similar tasks but their data cannot be shared due to privacy concerns and regulations, federated learning (FL) is a mainstream solution to leverage the data of different entities. However, fine-tuning LLMs in federated learning settings still lacks adequate support from existing FL frameworks, because it has to deal with optimizing the consumption of significant communication and computational resources, data preparation for different tasks, and distinct information protection demands. This paper first discusses these challenges of federated fine-tuning of LLMs and introduces our package FS-LLM as the main contribution, which consists of the following components: (1) we build an end-to-end benchmarking pipeline, automating the processes of dataset preprocessing, federated fine-tuning execution, and performance evaluation for federated LLM fine-tuning; (2) we provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios with low communication and computation costs, even without accessing the full model; (3) we adopt several accelerating and resource-efficient operators for fine-tuning LLMs with limited resources, and flexible, pluggable sub-routines for interdisciplinary study. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings, which also yields valuable insights into federated fine-tuning of LLMs for the research community. To facilitate further research and adoption, we release FS-LLM at https://github.com/alibaba/FederatedScope/tree/llm.
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained on unlabeled video data, thus benefiting from high-quality, easily available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation.
Fine-grained Recognition with Learnable Semantic Data Augmentation
Abstract
Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their random editing-region behavior is prone to destroy the discriminative visual cues residing in the subtle regions. In this paper, we propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance on several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
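The feature-level augmentation can be pictured as sampling translations from a predicted covariance; a simplified, diagonal-covariance sketch (`cov_net` is a hypothetical stand-in for the covariance prediction network):

```python
import torch

def augment_features(feats: torch.Tensor, cov_net) -> torch.Tensor:
    # feats: (B, d) deep features. Translate each feature along random
    # semantic directions drawn from its predicted per-sample covariance.
    var = cov_net(feats).clamp(min=0.0)          # (B, d) diagonal variances
    return feats + torch.randn_like(feats) * var.sqrt()
```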
Selective Scene Text Removal
Abstract
Scene text removal (STR) is an image transformation task that removes text regions from scene images. Conventional STR methods remove all scene text, which means they cannot select which text to remove. In this paper, we propose a novel task setting named selective scene text removal (SSTR) that removes only target words specified by the user. Although SSTR is a more complex task than STR, the proposed multi-module structure enables efficient training for SSTR. Experimental results show that the proposed method can remove target words as expected.
A Locality-based Neural Solver for Optical Motion Capture
Authors: Xiaoyu Pan, Bowen Zheng, Xinwei Jiang, Guanglong Xu, Xianli Gu, Jingxiang Li, Qilong Kou, He Wang, Tianjia Shao, Kun Zhou, Xiaogang Jin
Abstract
We present a novel locality-based learning method for cleaning and solving optical motion capture data. Given noisy marker data, we propose a new heterogeneous graph neural network which treats markers and joints as different types of nodes, and uses graph convolution operations to extract the local features of markers and joints and transform them into clean motions. To deal with anomalous markers (e.g., occluded or with large tracking errors), the key insight is that a marker's motion shows strong correlations with the motions of its immediate neighboring markers but less so with other markers, a.k.a. locality, which enables us to efficiently fill in missing markers (e.g., due to occlusion). Additionally, we identify marker outliers caused by tracking errors by investigating their acceleration profiles. Furthermore, we propose a training regime based on representation learning and data augmentation that trains the model on data with masking; the masking schemes mimic the occluded and noisy markers often observed in real data. Finally, we show that our method achieves high accuracy on multiple metrics across various datasets. Extensive comparisons show that our method outperforms state-of-the-art methods in prediction accuracy of occluded marker position error by approximately 20%, which leads to a further error reduction on the reconstructed joint rotations and positions of 30%. The code and data for this paper are available at https://github.com/non-void/LocalMoCap.
Yet another Improvement of Plantard Arithmetic for Faster Kyber on Low-end 32-bit IoT Devices
Authors: Junhao Huang, Haosong Zhao, Jipeng Zhang, Wangchen Dai, Lu Zhou, Ray C.C. Cheung, Cetin Kaya Koc, Donglong Chen
Abstract
This paper presents another improved version of Plantard arithmetic that speeds up Kyber implementations on two low-end 32-bit IoT platforms (ARM Cortex-M3 and RISC-V) without SIMD extensions. Specifically, we further enlarge the input range of the Plantard arithmetic without modifying its computation steps. After tailoring the Plantard arithmetic to Kyber's modulus, we show that the input range of Plantard multiplication by a constant is at least 2.45 times larger than the original design in TCHES2022. Then, two optimization techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V are presented. We show that the Plantard arithmetic supersedes both Montgomery and Barrett arithmetic on low-end 32-bit platforms. With the enlarged input range and the efficient implementation of the Plantard arithmetic on these platforms, we propose various optimization strategies for the NTT/INTT. We minimize or entirely eliminate the modular reduction of coefficients in the NTT/INTT by taking advantage of the larger input range of the proposed Plantard arithmetic on low-end 32-bit platforms. Furthermore, we propose two memory optimization strategies that reduce stack usage by 23.50% to 28.31% for the speed-version Kyber implementation compared to its counterpart on Cortex-M4. The proposed optimizations make the speed-version implementation more feasible on low-end IoT devices. Thanks to the aforementioned optimizations, our NTT/INTT implementation shows considerable speedups compared to the state-of-the-art work. Overall, we demonstrate the applicability of the speed-version Kyber implementation on memory-constrained IoT platforms and set new speed records for Kyber on these platforms.
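To make the arithmetic concrete, here is a self-contained sketch of Plantard multiplication by a constant for Kyber's q = 3329 (l = 16, alpha = 3), following our reading of the TCHES2022 formulation; the admissible input range (here unsigned a < 2^16) is exactly what this paper enlarges:

```python
Q = 3329                        # Kyber modulus
QINV = pow(Q, -1, 1 << 32)      # q^{-1} mod 2^32

def plantard_mul_const(a: int, b: int) -> int:
    # Multiplication by a constant b: fold b * (-2^32) mod q into the
    # precomputed value so Plantard's (-2^{-32}) factor cancels out.
    c = (b * -(1 << 32)) % Q
    bqinv = (c * QINV) % (1 << 32)      # precomputed once per constant
    t = (a * bqinv) & 0xFFFFFFFF        # low 32 bits of a * bqinv
    return (((t >> 16) + 8) * Q) >> 16  # ((t_hi + 2^alpha) * q) >> 16

# Sanity check against schoolbook reduction (unsigned a < 2^16).
for a in (0, 1, 17, 3328, 65535):
    for b in (1, 2, 1234, 3328):
        assert plantard_mul_const(a, b) == a * b % Q
```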
Trust your Good Friends: Source-free Domain Adaptation by Reciprocal Neighborhood Clustering
Authors: Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, Shangling Jui, Jian Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Domain adaptation (DA) aims to alleviate the domain shift between source domain and target domain. Most DA methods require access to the source data, but often that is not possible (e.g., due to data privacy or intellectual property). In this paper, we address the challenging source-free domain adaptation (SFDA) problem, where the source pretrained model is adapted to the target domain in the absence of source data. Our method is based on the observation that target data, which might not align with the source domain classifier, still forms clear clusters. We capture this intrinsic structure by defining local affinity of the target data, and encourage label consistency among data with high local affinity. We observe that higher affinity should be assigned to reciprocal neighbors. To aggregate information with more context, we consider expanded neighborhoods with small affinity values. Furthermore, we consider the density around each target sample, which can alleviate the negative impact of potential outliers. In our experiments, we verify that the inherent structure of the target features is an important source of information for domain adaptation. We demonstrate that this local structure can be efficiently captured by considering the local neighbors, the reciprocal neighbors, and the expanded neighborhood. Finally, we achieve state-of-the-art performance on several 2D image and 3D point cloud recognition datasets.
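The reciprocal-neighbor relation at the heart of the method can be computed as in this sketch (using cosine similarity on L2-normalized features is our assumption):

```python
import numpy as np

def reciprocal_neighbors(feats: np.ndarray, k: int = 5) -> np.ndarray:
    # feats: (N, d) L2-normalized target features. i and j are reciprocal
    # neighbors when each is in the other's k-nearest-neighbor list.
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)
    knn = np.argsort(-sim, axis=1)[:, :k]
    is_nn = np.zeros(sim.shape, dtype=bool)
    is_nn[np.arange(len(feats))[:, None], knn] = True
    return is_nn & is_nn.T
```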
FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
Abstract
Serverless computing (FaaS) has been extensively utilized for deep learning (DL) inference due to its ease of deployment and pay-per-use benefits. However, existing FaaS platforms utilize GPUs in a coarse manner for DL inference, without taking into account spatio-temporal resource multiplexing and isolation, which results in severe GPU under-utilization, high usage expenses, and SLO (Service Level Objectives) violations. There is an imperative need for an efficient and SLO-aware GPU-sharing mechanism in serverless computing to facilitate cost-effective DL inference. In this paper, we propose \textbf{FaST-GShare}, an efficient \textit{\textbf{Fa}aS-oriented \textbf{S}patio-\textbf{T}emporal \textbf{G}PU \textbf{Sharing}} architecture for deep learning inference. In the architecture, we introduce the FaST-Manager to limit and isolate spatio-temporal resources for GPU multiplexing. To characterize function performance, we propose the automatic and flexible FaST-Profiler, which profiles function throughput under various resource allocations. Based on the profiling data and the isolation mechanism, we introduce the FaST-Scheduler with heuristic auto-scaling and efficient resource allocation to guarantee function SLOs. Meanwhile, the FaST-Scheduler schedules functions with efficient GPU node selection to maximize GPU usage, and model sharing is exploited to mitigate memory contention. Our prototype implementation on the OpenFaaS platform and experiments on an MLPerf-based benchmark show that FaST-GShare can ensure resource isolation and function SLOs. Compared to the time-sharing mechanism, FaST-GShare improves throughput by 3.15x, GPU utilization by 1.34x, and SM (Streaming Multiprocessor) occupancy by 3.13x on average.
Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models
Authors: Janghoon Ock, Chakradhar Guntuboina, Amir Barati Farimani
Subjects: Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph)
Abstract
Efficient catalyst screening necessitates predictive models for adsorption energy, a key property of reactivity. However, prevailing methods, notably graph neural networks (GNNs), demand precise atomic coordinates for constructing graph representations, and integrating observable attributes remains challenging. This research introduces CatBERTa, an energy prediction Transformer model using textual inputs. Built on a pretrained Transformer encoder, CatBERTa processes human-interpretable text incorporating target features. Attention score analysis reveals CatBERTa's focus on tokens related to adsorbates, bulk composition, and their interacting atoms. Moreover, interacting atoms emerge as effective descriptors for adsorption configurations, while factors such as bond length and the atomic properties of these atoms offer limited predictive contributions. By predicting adsorption energy from the textual representation of initial structures, CatBERTa achieves a mean absolute error (MAE) of 0.75 eV, comparable to vanilla graph neural networks (GNNs). Furthermore, for chemically similar systems, subtracting CatBERTa-predicted energies cancels out systematic errors, reducing the error by as much as 19.3% and surpassing the error reduction observed in GNNs. This outcome highlights its potential to enhance the accuracy of energy difference predictions. This research establishes a fundamental framework for text-based catalyst property prediction, without relying on graph representations, while also unveiling intricate feature-property relationships.
Geometry-Informed Neural Operator for Large-Scale 3D PDEs
Authors: Zongyi Li, Nikola Borislavov Kovachki, Chris Choy, Boyi Li, Jean Kossaifi, Shourya Prakash Otta, Mohammad Amin Nabian, Maximilian Stadler, Christian Hundt, Kamyar Azizzadenesheli, Anima Anandkumar
Abstract
We propose the geometry-informed neural operator (GINO), a highly efficient approach to learning the solution operator of large-scale partial differential equations with varying geometries. GINO uses a signed distance function and point-cloud representations of the input shape, together with neural operators based on graph and Fourier architectures, to learn the solution operator. The graph neural operator handles irregular grids and transforms them into and from regular latent grids on which the Fourier neural operator can be efficiently applied. GINO is discretization-convergent, meaning the trained model can be applied to arbitrary discretizations of the continuous domain and converges to the continuum operator as the discretization is refined. To empirically validate the performance of our method on large-scale simulation, we generate an industry-standard aerodynamics dataset of 3D vehicle geometries with Reynolds numbers as high as five million. For such large-scale 3D fluid simulations, computing surface pressure with numerical methods is expensive. We successfully trained GINO to predict the pressure on car surfaces using only five hundred data points. The cost-accuracy experiments show a $26,000 \times$ speed-up compared to optimized GPU-based computational fluid dynamics (CFD) simulators on computing the drag coefficient. When tested on new combinations of geometries and boundary conditions (inlet velocities), GINO obtains a one-fourth reduction in error rate compared to deep neural network approaches.
Laminar: A New Serverless Stream-based Framework with Semantic Code Search and Code Completion
Authors: Zaynab Zahra, Zihao Li, Rosa Filgueira
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract
This paper introduces Laminar, a novel serverless framework based on dispel4py, a parallel stream-based dataflow library. Laminar efficiently manages streaming workflows and components through a dedicated registry, offering a seamless serverless experience. Leveraging large language models, Laminar enhances the framework with semantic code search, code summarization, and code completion. This contribution enhances serverless computing by simplifying the execution of streaming computations, managing data streams more efficiently, and offering a valuable tool for both researchers and practitioners.
Discrete Morphological Neural Networks
Authors: Diego Marcondes, Junior Barrera
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
A classical approach to designing binary image operators is Mathematical Morphology (MM). We propose the Discrete Morphological Neural Networks (DMNN) for binary image analysis to represent W-operators and estimate them via machine learning. A DMNN architecture, which is represented by a Morphological Computational Graph, is designed as in the classical heuristic design of morphological operators, in which the designer combines a set of MM operators and Boolean operations based on prior information and theoretical knowledge. Then, once the architecture is fixed, instead of adjusting its parameters (i.e., structural elements or maximal intervals) by hand, we propose a lattice gradient descent algorithm (LGDA) to train these parameters based on a sample of input and output images under the usual machine learning approach. We also propose a stochastic version of the LGDA that is more efficient and scalable and obtains small errors in practical problems. The class represented by a DMNN can be quite general or specialized according to expected properties of the target operator, i.e., prior information, and the semantics expressed by the algebraic properties of classes of operators is a differential relative to other methods. The main contribution of this paper is the merger of the two main paradigms for designing morphological operators, conciliating classical heuristic design with automatic design via machine learning. We apply the DMNN to recognize the boundary of digits with noise, and we discuss many topics for future research.
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
Abstract
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, requiring no 3D instruction data yet exhibiting superior 3D and multi-modal question-answering capacity. We hope our work encourages the community to extend 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
Keyword: faster
Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis
Authors: Linsen Song, Wayne Wu, Chaoyou Fu, Chen Change Loy, Ran He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges in designing a method for UGC: 1) the appearances of speakers are diverse and arbitrary, as the method needs to generalize across users; 2) the available video data for one speaker are very limited. To tackle the above challenges, we first introduce a new Style Translation Network to integrate the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module. It enables our model to quickly adapt to a new speaker. Then, we further develop a semi-parametric video renderer, which takes full advantage of the limited training data of the unseen speaker via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer, generating more continuous videos. Extensive experiments show that our method generates videos that accurately preserve various speaking styles, with a considerably lower amount of training data and training time compared to existing methods. Besides, our method achieves a faster testing speed than most recent methods.
Few-shot Diagnosis of Chest x-rays Using an Ensemble of Random Discriminative Subspaces
Authors: Kshitiz, Garvit Garg, Angshuman Paul
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Due to the scarcity of annotated data in the medical domain, few-shot learning may be useful for medical image analysis tasks. We design a few-shot learning method using an ensemble of random subspaces for the diagnosis of chest x-rays (CXRs). Our design is computationally efficient and almost 1.8 times faster than a method that uses the popular truncated singular value decomposition (t-SVD) for subspace decomposition. The proposed method is trained by minimizing a novel loss function that helps create well-separated clusters of training data in discriminative subspaces. As a result, minimizing the loss maximizes the distance between the subspaces, making them discriminative and assisting in better classification. Experiments on large-scale publicly available CXR datasets yield promising results. Code for the project will be available at https://github.com/Few-shot-Learning-on-chest-x-ray/fsl_subspace.
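For intuition, here is a hedged toy sketch of classification with an ensemble of random subspaces using nearest-class-mean voting; the paper's actual method additionally learns the subspaces with its cluster-separation loss, which is not reproduced here. Non-negative integer class labels are assumed.

```python
import numpy as np

def random_subspace_predict(X_tr, y_tr, X_te, n_subspaces=10, k=32, seed=0):
    """Majority vote of nearest-class-mean classifiers over random subspaces."""
    rng = np.random.default_rng(seed)
    classes, votes = np.unique(y_tr), []
    for _ in range(n_subspaces):
        idx = rng.choice(X_tr.shape[1], size=k, replace=False)   # random subspace
        centroids = np.stack([X_tr[y_tr == c][:, idx].mean(0) for c in classes])
        dists = np.linalg.norm(X_te[:, idx, None] - centroids.T[None], axis=1)
        votes.append(classes[dists.argmin(1)])
    votes = np.stack(votes)                                      # (n_subspaces, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```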
A matching pursuit approach to the geophysical inverse problem of seismic travel time tomography under the ray theory approximation
Authors: Naomi Schneider, Volker Michel, Karin Sigloch, Eoghan J. Totten
Abstract
Seismic travel time tomography is a geophysical imaging method to infer the 3-D interior structure of the solid Earth. Most commonly formulated as a linear(ized) inverse problem, it maps differences between observed and expected wave travel times to interior regions where waves propagate faster or slower than the expected average. The Earth's interior is typically parametrized by a single kind of localized basis function. Here we present an alternative approach that uses matching pursuits on large dictionaries of basis functions. Over the past decade, the (Learning) Inverse Problem Matching Pursuits ((L)IPMPs) have been developed. They combine global and local trial functions: an approximation is built in a so-called best basis, chosen iteratively from an intentionally overcomplete set, or dictionary. In each iteration, the choice of the next best basis element reduces the Tikhonov-Phillips functional. This contrasts with classical methods that use either global or local basis functions. The LIPMPs have proven their applicability in inverse problems such as the downward continuation of the gravitational potential as well as the MEG-/EEG-problem from medical imaging. Here, we remodel the Learning Regularized Functional Matching Pursuit (LRFMP), one of the LIPMPs, for travel time tomography in a ray-theoretical setting. In particular, we introduce the operator, some possible trial functions, and the regularization. We show a numerical proof of concept for artificial travel time delays obtained from a synthetic model of velocity differences. The corresponding code is available at https://doi.org/10.5281/zenodo.8227888 under the licence CC-BY-NC-SA 3.0 DE.
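To make the greedy step concrete, here is a hedged finite-dimensional toy version of a regularized matching pursuit: in each iteration it picks the dictionary element whose optimal coefficient most reduces the Tikhonov-Phillips functional ||y - Ax||^2 + lam*||x||^2. The operator A and dictionary D are stand-ins for the paper's ray-theoretical operator and mixed global/local trial functions.

```python
import numpy as np

def regularized_matching_pursuit(A, D, y, lam=1e-2, iters=50):
    """Greedily reduce ||y - A x||^2 + lam*||x||^2, building x from columns of D."""
    x = np.zeros(D.shape[0])
    AD = A @ D                                    # forward-mapped dictionary elements
    dd = (D ** 2).sum(0)                          # ||d_i||^2
    aa = (AD ** 2).sum(0)                         # ||A d_i||^2
    for _ in range(iters):
        r = y - A @ x
        num = AD.T @ r - lam * (D.T @ x)          # numerator of the optimal coefficient
        gain = num ** 2 / (aa + lam * dd)         # functional decrease per element
        i = int(gain.argmax())                    # best basis element this iteration
        x += (num[i] / (aa[i] + lam * dd[i])) * D[:, i]
    return x
```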
Charliecloud's layer-free, Git-based container build cache
Authors: Reid Priedhorsky (1), Jordan Ogas (1), Claude H. (Rusty) Davis IV (1), Z. Noah Hounshel (1 and 2), Ashlyn Lee (1 and 3), Benjamin Stormer (1 and 4), R. Shane Goff (1) ((1) Los Alamos National Laboratory, (2) University of North Carolina Wilmington, (3) Colorado State University, (4) University of Texas at Austin)
Abstract
A popular approach to deploying scientific applications in high performance computing (HPC) is Linux containers, which package an application and all its dependencies as a single unit. This image is built by interpreting instructions in a machine-readable recipe, which is faster with a build cache that stores instruction results for re-use. The standard approach (used e.g. by Docker and Podman) is a many-layered union filesystem, encoding differences between layers as tar archives. We instead describe a layer-free build cache based on Git. Our experiments show this performs similarly to layered caches on both build time and disk usage, with a considerable advantage for many-instruction recipes. Our approach also has structural advantages: a better diff format, lower cache overhead, and better file de-duplication. These results show that a Git-based cache for layer-free container implementations is not only possible but may outperform the layered approach on important dimensions.
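As a toy illustration of the caching idea (not Charliecloud's implementation), the sketch below keys each instruction's result by a hash of its parent state and the instruction text, so builds sharing a recipe prefix re-use cached results:

```python
import hashlib

cache: dict[str, str] = {}                        # key -> stored image state (stand-in)

def build(recipe: list[str]) -> str:
    """Replay a recipe, re-using cached results for any already-built prefix."""
    parent = "root"
    for instruction in recipe:
        key = hashlib.sha256(f"{parent}:{instruction}".encode()).hexdigest()
        if key not in cache:                      # cache miss: "execute" the instruction
            cache[key] = f"state-after({instruction})"
        parent = key                              # cache hit: skip straight to the result
    return cache[parent]
```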
Data-Driven Projection for Reducing Dimensionality of Linear Programs: Generalization Bound and Learning Methods
Abstract
This paper studies a simple data-driven approach to high-dimensional linear programs (LPs). Given data on past $n$-dimensional LPs, we learn an $n\times k$ \textit{projection matrix} ($n > k$), which reduces the dimensionality from $n$ to $k$. Then, we address future LP instances by solving $k$-dimensional LPs and recovering $n$-dimensional solutions by multiplying them by the projection matrix. This idea is compatible with any user-preferred LP solver, hence offers a versatile approach to faster LP solving. One natural question is: how much data is sufficient to ensure the quality of the recovered solutions? We address this question based on the idea of \textit{data-driven algorithm design}, which relates the amount of data sufficient for generalization guarantees to the \textit{pseudo-dimension} of performance metrics. We present an $\tilde{\mathrm{O}}(nk^2)$ upper bound on the pseudo-dimension ($\tilde{\mathrm{O}}$ compresses logarithmic factors) and complement it with an $\Omega(nk)$ lower bound, hence tight up to an $\tilde{\mathrm{O}}(k)$ factor. On the practical side, we study two natural methods for learning projection matrices: PCA-based and gradient-based methods. While the former is simple and efficient, the latter sometimes leads to better solution quality. Experiments confirm that learned projection matrices are beneficial for reducing the time for solving LPs while maintaining high solution quality.
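The reduce-then-recover step itself is easy to demonstrate. A hedged sketch with a random (rather than learned) projection, assuming inequality constraints Ax <= b with y = 0 feasible; the box bounds on y are an artificial addition that keeps the toy reduced LP bounded:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, k, m = 200, 20, 50
P = rng.standard_normal((n, k))            # stand-in for a *learned* projection matrix
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = np.abs(rng.standard_normal(m)) + 1.0   # y = 0 is feasible since b > 0

# Reduced k-dimensional LP: min (P^T c)^T y  s.t.  (A P) y <= b, toy box bounds on y.
res = linprog(P.T @ c, A_ub=A @ P, b_ub=b, bounds=[(-1.0, 1.0)] * k)
x_recovered = P @ res.x                    # n-dimensional solution, feasible for Ax <= b
```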
Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU
Authors: Nick Brown, Maurice Jamieson, Joseph Lee
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
The Sophon SG2042 is the world's first commodity 64-core RISC-V CPU for high performance workloads, and an important question is whether the SG2042 has the potential to encourage the HPC community to embrace RISC-V. In this paper we undertake a performance exploration of the SG2042 against existing RISC-V hardware and the high performance x86 CPUs in use by modern supercomputers. Leveraging the RAJAPerf benchmarking suite, we discover that, on average, the SG2042 delivers between five and ten times the per-core performance of the nearest widely available RISC-V hardware. We also found that, on average, the high performance x86 CPUs under test outperform the SG2042 by between four and eight times for multi-threaded workloads, although some individual kernels do perform faster on the SG2042. The result of this work is a performance study that not only contrasts this new RISC-V CPU against existing technologies, but also shares performance best practices.
Keyword: mobile
The Role of User-Agent Interactions on Mobile Money Practices in Kenya and Tanzania
Abstract
Digital financial services have catalyzed financial inclusion in Africa. Commonly implemented as a mobile wallet service referred to as mobile money (MoMo), the technology provides enormous benefits to its users, some of whom have long been unbanked. While the benefits of mobile money services have largely been documented, the challenges that arise -- especially in the interactions between human stakeholders -- remain relatively unexplored. In this study, we investigate the practices of mobile money users in their interactions with mobile money agents. We conduct 72 structured interviews in Kenya and Tanzania (n=36 per country). The results show that users and agents design workarounds in response to limitations and challenges that users face within the ecosystem. These include advances or loans from agents, relying on the user-agent relationships in place of legal identification requirements, and altering the intended transaction execution to improve convenience. Overall, the workarounds modify one or more of what we see as the core components of mobile money: the user, the agent, and the transaction itself. The workarounds pose new risks and challenges for users and the overall ecosystem. The results suggest a need for rethinking privacy and security of various components of the ecosystem, as well as policy and regulatory controls to safeguard interactions while ensuring the usability of mobile money.
Spiking based Cellular Learning Automata (SCLA) algorithm for mobile robot motion formulation
Authors: Vahid Pashaei Rad, Vahid Azimi Rad, Saleh Valizadeh Sotubadi
Subjects: Robotics (cs.RO); Neural and Evolutionary Computing (cs.NE)
Abstract
In this paper, a new method called SCLA (Spiking-based Cellular Learning Automata) is proposed for a mobile robot to reach a target from any random initial point. The proposed method integrates cellular automata and spiking neural networks. The environment consists of multiple squares of the same size, and the robot only observes the squares neighboring its current square; it moves only up, down, left, or right. The environment returns feedback to the learning automata to optimize its decision making in subsequent steps, thereby training the cellular automata. Simultaneously, a spiking neural network is trained to implement long-term improvements and reductions on the paths. The results show that integrating cellular automata with a spiking neural network reinforces the proper paths while reducing training time.
Beyond Screens: Supporting Co-located Augmented Reality Experiences with Smart Home Devices
Authors: Ava Robinson, Yu Jiang Tham, Rajan Vaish, Andrés Monroy-Hernández
Abstract
We introduce Spooky Spirits, an AR game that makes novel use of everyday smart home devices to support co-located play. Recent exploration of co-located AR experiences consists mainly of digital visual augmentations on mobile or head-mounted screens. In this work, we leverage widely adopted smart lightbulbs to expand AR capabilities beyond the digital and into the physical world, further leveraging the physicality of users' shared environment.
Learning State-Space Models for Mapping Spatial Motion Patterns
Abstract
Mapping the surrounding environment is essential for the successful operation of autonomous robots. While extensive research has focused on mapping geometric structures and static objects, the environment is also influenced by the movement of dynamic objects. Incorporating information about spatial motion patterns can allow mobile robots to navigate and operate successfully in populated areas. In this paper, we propose a deep state-space model that learns map representations of spatial motion patterns and how they change over time at a given place. To evaluate our methods, we use two datasets: a generated dataset with specific motion patterns and another with real-world pedestrian data. We test the performance of our model by evaluating its learning ability, mapping quality, and applicability to downstream tasks. The results demonstrate that our model can effectively learn the corresponding motion patterns and has the potential to be applied to downstream robotic tasks.
An Improved Encoder-Decoder Framework for Food Energy Estimation
Authors: Jack Ma, Jiangpeng He, Fengqing Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Dietary assessment is essential to maintaining a healthy lifestyle. Automatic image-based dietary assessment is a growing field of research due to the increasing prevalence of image-capturing devices (e.g., mobile phones). In this work, we estimate food energy from a single monocular image, a difficult task due to the limited and hard-to-extract energy information present in an image. To do so, we employ an improved encoder-decoder framework for energy estimation: the encoder transforms the image into a representation embedded with food energy information in an easier-to-extract format, from which the decoder then extracts the energy information. To implement our method, we compile a high-quality food image dataset, verified by registered dietitians, containing eating scene images, food-item segmentation masks, and ground truth calorie values. Our method improves upon previous caloric estimation methods by over 10\% and 30 kCal in terms of MAPE and MAE, respectively.
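For reference, the two reported error metrics, as commonly defined (the dataset and model are the paper's and are not reproduced here):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes y_true > 0)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def mae(y_true, y_pred):
    """Mean absolute error (here, in kCal)."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))
```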
Keyword: pruning
There is no result
Keyword: diffusion
BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models
Authors: Yao Wei, George Vosselman, Michael Ying Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
3D building generation with low data acquisition costs, such as single image-to-3D, is becoming increasingly important. However, most existing single image-to-3D building creation works are restricted to images with specific viewing angles, and hence are difficult to scale to the general-view images that commonly appear in practical cases. To fill this gap, we propose a novel 3D building shape generation method that exploits point cloud diffusion models with image conditioning schemes, demonstrating flexibility with respect to the input images. By coupling two conditional diffusion models and introducing a regularization strategy during the denoising process, our method is able to synthesize building roofs while maintaining the overall structures. We validate our framework on two newly built datasets, and extensive experiments show that our method outperforms previous works in terms of building generation quality.
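As background for how image-conditioned point cloud diffusion works, here is a hedged sketch of one generic conditional DDPM reverse step (with the simple sigma_t^2 = beta_t variance choice); `eps_model` and `cond` are placeholders, not BuilDiff's actual networks or conditioning scheme:

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(x_t, t, eps_model, cond, alphas, alpha_bars):
    """One ancestral sampling step x_t -> x_{t-1} of a conditional DDPM."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    eps = eps_model(x_t, t, cond)                  # predicted noise, image-conditioned
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(1.0 - a_t) * torch.randn_like(x_t)
```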
Diffusion Model with Clustering-based Conditioning for Food Image Generation
Authors: Yue Han, Jiangpeng He, Mridul Gupta, Edward J. Delp, Fengqing Zhu
Abstract
Image-based dietary assessment serves as an efficient and accurate solution for recording and analyzing nutrition intake using eating occasion images as input. Deep learning-based techniques are commonly used to perform image analysis such as food classification, segmentation, and portion size estimation, which rely on large amounts of food images with annotations for training. However, such data dependency poses significant barriers to real-world applications, because acquiring a substantial, diverse, and balanced set of food images can be challenging. One potential solution is to use synthetic food images for data augmentation. Although existing work has explored the use of generative adversarial network (GAN)-based structures for generation, the quality of synthetic food images remains subpar. In addition, while diffusion-based generative models have shown promising results for general image generation tasks, the generation of food images can be challenging due to the substantial intra-class variance. In this paper, we investigate the generation of synthetic food images based on the conditional diffusion model and propose an effective clustering-based training framework, named ClusDiff, for generating high-quality and representative food images. The proposed method is evaluated on the Food-101 dataset and shows improved performance when compared with existing image generation works. We also demonstrate that the synthetic food images generated by ClusDiff can help address the severe class imbalance issue in long-tailed food classification using the VFN-LT dataset.
DiffuGen: Adaptable Approach for Generating Labeled Image Datasets using Stable Diffusion Models
Authors: Michael Shenoda, Edward Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
Generating high-quality labeled image datasets is crucial for training accurate and robust machine learning models in the field of computer vision. However, the process of manually labeling real images is often time-consuming and costly. To address these challenges associated with dataset generation, we introduce "DiffuGen," a simple and adaptable approach that harnesses the power of stable diffusion models to create labeled image datasets efficiently. By leveraging stable diffusion models, our approach not only ensures the quality of generated datasets but also provides a versatile solution for label generation. In this paper, we present the methodology behind DiffuGen, which combines the capabilities of diffusion models with two distinct labeling techniques: unsupervised and supervised. Distinctively, DiffuGen employs prompt templating for adaptable image generation and textual inversion to enhance diffusion model capabilities.
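Prompt templating is straightforward to illustrate; here is a hypothetical sketch (the template wording and class lists are made up, not DiffuGen's actual templates):

```python
import itertools

TEMPLATES = ["a photo of a {cls} on a road", "a {cls} in {weather} conditions"]
CLASSES = ["car", "bicycle", "pedestrian"]
WEATHER = ["sunny", "foggy", "rainy"]

def generate_prompts():
    """Expand templates over class/attribute grids to drive labeled generation
    (str.format ignores unused keyword arguments, so templates may omit fields)."""
    prompts = {tpl.format(cls=c, weather=w)
               for tpl, c, w in itertools.product(TEMPLATES, CLASSES, WEATHER)}
    return sorted(prompts)
```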
Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution
Authors: Charles Laroche, Andrés Almansa, Eva Coupete
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Using diffusion models to solve inverse problems is a growing field of research. Current methods assume the degradation to be known and provide impressive results in terms of restoration quality and diversity. In this work, we leverage the efficiency of those models to jointly estimate the restored image and the unknown parameters of the degradation model. In particular, we design an algorithm based on the well-known Expectation-Maximization (EM) estimation method and diffusion models. Our method alternates between approximating the expected log-likelihood of the inverse problem using samples drawn from a diffusion model and a maximization step to estimate the unknown model parameters. For the maximization step, we also introduce a novel blur kernel regularization based on a Plug \& Play denoiser. Because diffusion models are slow to run, we also provide a fast version of our algorithm. Extensive experiments on blind image deblurring demonstrate the effectiveness of our method compared to other state-of-the-art approaches.
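Schematically, the alternation reads as below; every callable is a hypothetical placeholder standing in for the paper's components (the posterior diffusion sampler and the regularized kernel update):

```python
def diffusion_em(y, kernel, posterior_sampler, update_kernel, n_iters=10, n_samples=4):
    """Skeleton of an EM loop for blind inverse problems.

    E-step: approximate the expected log-likelihood with samples
            x ~ p(x | y, kernel) drawn from a diffusion model.
    M-step: re-estimate the degradation kernel from those samples, with a
            Plug-and-Play denoiser acting as the kernel regularizer.
    """
    for _ in range(n_iters):
        samples = [posterior_sampler(y, kernel) for _ in range(n_samples)]  # E-step
        kernel = update_kernel(y, samples, kernel)                          # M-step
    return kernel
```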
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
In this paper, we present VideoGen, a text-to-video generation approach that can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image of high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality, easily available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation.
Iterative Multi-granular Image Editing using Diffusion Models
Abstract
Recent advances in text-guided image synthesis have dramatically changed how creative professionals generate artistic and aesthetically pleasing visual assets. To fully support such creative endeavors, the process should possess the ability to: 1) iteratively edit the generations and 2) control the spatial reach of desired changes (global, local, or anything in between). We formalize this pragmatic problem setting as Iterative Multi-granular Editing. While there has been substantial progress with diffusion-based models for image synthesis and editing, they are all one-shot (i.e., no iterative editing capabilities) and do not naturally yield multi-granular control (i.e., covering the full spectrum of local-to-global edits). To overcome these drawbacks, we propose EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent iteration strategy, which re-purposes a pre-trained diffusion model to facilitate iterative editing. This is complemented by a gradient control operation for multi-granular control. We introduce a new benchmark dataset to evaluate our newly proposed setting. We conduct exhaustive quantitative and qualitative evaluation against recent state-of-the-art approaches adapted to our task, to bring out the mettle of EMILIE. We hope our work will attract attention to this newly identified, pragmatic problem setting.
Keyword: adaptive
An Energy-Aware Approach to Design Self-Adaptive AI-based Applications on the Edge
Abstract
The advent of edge devices dedicated to machine learning tasks has enabled the execution of AI-based applications that efficiently process and classify the data acquired by the resource-constrained devices populating the Internet of Things. The proliferation of such applications (e.g., critical monitoring in smart cities) demands new strategies to make these systems also sustainable from an energy point of view. In this paper, we present an energy-aware approach for the design and deployment of self-adaptive AI-based applications that can balance application objectives (e.g., accuracy in object detection and frame processing rate) with energy consumption. We address the problem of determining the set of configurations that can be used to self-adapt the system with a meta-heuristic search procedure that only needs a small number of empirical samples. The final set of configurations is selected using weighted grey relational analysis and mapped to the operation modes of the self-adaptive application. We validate our approach on an AI-based application for pedestrian detection. Results show that our self-adaptive application can outperform non-adaptive baseline configurations by saving up to 81\% of energy while losing only between 2\% and 6\% in accuracy.
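Weighted grey relational analysis is the selection step spelled out above; here is a hedged sketch of the textbook formulation (with the usual distinguishing coefficient zeta = 0.5), assuming every objective is rescaled so that larger is better. The weights and example values are hypothetical:

```python
import numpy as np

def grey_relational_grades(X, weights, zeta=0.5):
    """Weighted grey relational grade of each configuration (row) against the
    ideal configuration, for objectives (columns) where larger is better."""
    X = np.asarray(X, float)
    norm = (X - X.min(0)) / (X.max(0) - X.min(0))      # per-objective min-max scaling
    delta = np.abs(1.0 - norm)                         # distance to the ideal series
    coeff = (delta.min() + zeta * delta.max()) / (delta + zeta * delta.max())
    return coeff @ np.asarray(weights, float)

# Hypothetical configurations scored on accuracy, frame rate, negated energy use.
grades = grey_relational_grades([[0.90, 24, -60], [0.85, 30, -45]], [0.5, 0.25, 0.25])
```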
Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection
Abstract
Vision Transformer (ViT) models have demonstrated breakthroughs in a wide range of computer vision tasks. However, compared to Convolutional Neural Network (CNN) models, ViT models have been observed to struggle to capture the high-frequency components of images, which can limit their ability to detect local textures and edge information. As abnormalities in human tissue, such as tumors and lesions, may vary greatly in structure, texture, and shape, high-frequency information such as texture is crucial for effective semantic segmentation tasks. To address this limitation of ViT models, we propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid. More specifically, our proposed method utilizes a dual attention mechanism combining efficient attention and frequency attention: the efficient attention mechanism reduces the complexity of self-attention to linear while producing the same output, and the frequency attention selectively intensifies the contribution of shape and texture features. Furthermore, we introduce a novel efficient enhancement multi-scale bridge that effectively transfers spatial information from the encoder to the decoder while preserving the fundamental features. We demonstrate the efficacy of Laplacian-Former on multi-organ and skin lesion segmentation tasks with +1.87\% and +0.76\% Dice score improvements over SOTA approaches, respectively. Our implementation is publicly available at https://github.com/mindflow-institue/Laplacian-Former
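For reference, a minimal sketch of the Laplacian pyramid decomposition that underlies the frequency re-calibration (a difference-of-Gaussians variant for a 2D grayscale image; Laplacian-Former's attention layers are not reproduced):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass levels plus a coarse low-pass residual."""
    pyramid, current = [], np.asarray(img, float)
    for _ in range(levels):
        low = gaussian_filter(current, sigma=1.0)
        pyramid.append(current - low)       # high-frequency band at this scale
        current = low[::2, ::2]             # downsample the low-pass image
    pyramid.append(current)                 # coarsest low-frequency level
    return pyramid
```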
Typing on Any Surface: A Deep Learning-based Method for Real-Time Keystroke Detection in Augmented Reality
Authors: Xingyu Fu, Mingze Xi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract
Frustrating text entry interfaces have been a major obstacle to participating in social activities in augmented reality (AR). Popular options, such as mid-air keyboard interfaces, wireless keyboards, or voice input, either suffer from poor ergonomic design or limited accuracy, or are simply embarrassing to use in public. This paper proposes and validates a deep-learning-based approach that enables AR applications to accurately predict keystrokes from the user-perspective RGB video stream that can be captured by any AR headset. This enables a user to perform typing activities on any flat surface and eliminates the need for a physical or virtual keyboard. A two-stage model, combining an off-the-shelf hand landmark extractor and a novel adaptive Convolutional Recurrent Neural Network (C-RNN), was trained using our newly built dataset. The final model was capable of adaptively processing user-perspective video streams at ~32 FPS. This base model achieved an overall accuracy of $91.05\%$ when typing at 40 Words per Minute (wpm), which is how fast an average person types with two hands on a physical keyboard. The Normalised Levenshtein Distance further confirmed the real-world applicability of our approach. The promising results highlight the viability of our approach and the potential for our method to be integrated into various applications. We also discuss the limitations and future research required to bring such a technique into a production system.
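The reported Normalised Levenshtein Distance is, as commonly defined, the edit distance between typed and intended text divided by the longer string's length; a minimal sketch:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein edit distance divided by the longer length (0 = identical)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n, 1)
```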
Vision-aided nonlinear control framework for shake table tests
Abstract
The structural response under earthquake excitations can be simulated by scaled-down or full-scale model shake table tests. In this paper, adaptive control theory is used as a nonlinear shake table control algorithm that accounts for the inherent nonlinearity of the shake table system and the Control-Structural Interaction (CSI) effect, which linear controllers, such as the Proportional-Integral-Derivative (PID) controller, cannot consider. The mass of the specimen is treated as an unknown variation, and this unknown parameter is replaced by an estimated value in the proposed control framework. The signal generated by the control law of the adaptive control method is implemented by a loop-shaping controller. To verify the stability and feasibility of the proposed control framework, a simulation of a bare shake table was performed, and experiments with a bare shake table and with a two-story frame were carried out. This study randomly selects earthquake recordings from the Pacific Earthquake Engineering Research Center (PEER) database. The simulation and experimental results show that the proposed control framework can be effectively used in shake table control.
Data-Driven Safety Filter: An Input-Output Perspective
Abstract
Implementation of learning-based control remains challenging due to the absence of safety guarantees. Safe control methods have turned to model-based safety filters to address these challenges, but this is paradoxical when the ultimate goal is a model-free, data-driven control solution. Addressing the core question of "Can we ensure the safety of any learning-based algorithm without explicit prediction models and state estimation?" this paper proposes a Data-Driven Safety Filter (DDSF) grounded in Behavioral System Theory (BST). The proposed method needs only a single system trajectory available in an offline dataset to modify unsafe learning inputs to safe inputs. This contribution addresses safe control in the input-output framework and therefore does not require full state measurements or explicit state estimation. Since no explicit model is required, the proposed safe control solution is not affected by unmodeled dynamics and unstructured uncertainty and can provide a safe solution for systems with unknown time delays. The effectiveness of the proposed DDSF is illustrated in simulation for a high-order six-degree-of-freedom aerial robot and a time-delay adaptive cruise control system.
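Behavioral-system-theory methods of this kind typically work directly with block-Hankel matrices built from one recorded input-output trajectory (Willems' fundamental lemma characterizes all length-L trajectories of an LTI system as the column span of such matrices, given a persistently exciting input). A hedged helper sketch; the safety-filter optimization itself is not reproduced:

```python
import numpy as np

def block_hankel(signal: np.ndarray, depth: int) -> np.ndarray:
    """Block-Hankel matrix of a (T, d) signal; each column stacks one
    length-`depth` window of the recorded trajectory."""
    T, _ = signal.shape
    windows = [signal[i:i + depth].reshape(-1) for i in range(T - depth + 1)]
    return np.stack(windows, axis=1)

# Example: inputs and outputs of one offline trajectory, window length L = 10.
rng = np.random.default_rng(0)
u = rng.standard_normal((100, 1))                           # recorded inputs
y = rng.standard_normal((100, 2))                           # recorded outputs
H = np.vstack([block_hankel(u, 10), block_hankel(y, 10)])   # stacked i/o data matrix
```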
Identifiable Cognitive Diagnosis with Encoder-decoder for Modelling Students' Performance
Abstract
Cognitive diagnosis aims to diagnose students' knowledge proficiencies based on their response scores on exam questions, which is the basis of many domains such as computerized adaptive testing. Existing cognitive diagnosis models (CDMs) follow a proficiency-response paradigm, which views diagnostic results as learnable embeddings that are the cause of students' responses and learns the diagnostic results through optimization. However, such a paradigm can easily lead to unidentifiable diagnostic results and an explainability overfitting problem, which is harmful to the quantification of students' learning performance. To address these problems, we propose a novel identifiable cognitive diagnosis framework. Specifically, we first propose a flexible diagnostic module which directly diagnoses identifiable and explainable examinee traits and question features from response logs. Next, we leverage a general predictive module to reconstruct response logs from the diagnostic results to ensure their preciseness. We furthermore propose an implementation of the framework, ID-CDM, to demonstrate its feasibility. Finally, we demonstrate the identifiability, explainability, and preciseness of the diagnostic results of ID-CDM through experiments on four public real-world datasets.
Deep Joint Source-Channel Coding for Adaptive Image Transmission over MIMO Channels
Abstract
This paper introduces a vision transformer (ViT)-based deep joint source and channel coding (DeepJSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the classical separation-based benchmarks with robustness to channel estimation errors and showcases remarkable flexibility in adapting to diverse channel conditions and antenna numbers without requiring retraining. Specifically, by harnessing the self-attention mechanism of ViT, DeepJSCC-MIMO intelligently learns feature mapping and power allocation strategies tailored to the unique characteristics of the source image and prevailing channel conditions. Extensive numerical experiments validate the significant improvements in transmission quality achieved by DeepJSCC-MIMO for both open-loop and closed-loop MIMO systems across a wide range of scenarios. Moreover, DeepJSCC-MIMO exhibits robustness to varying channel conditions, channel estimation errors, and different antenna numbers, making it an appealing solution for emerging semantic communication systems.
A Machine Vision Method for Correction of Eccentric Error: Based on Adaptive Enhancement Algorithm
Abstract
In the procedure of surface defect detection for large-aperture aspherical optical elements, it is of vital significance to accurately adjust the optical axis of the element to be coaxial with the mechanical spin axis. Therefore, a machine vision method for eccentric error correction is proposed in this paper. Focusing on the severe defocus blur of the reference crosshair image caused by the imaging characteristics of the aspherical optical element, which may lead to the failure of correction, an Adaptive Enhancement Algorithm (AEA) is proposed to strengthen the crosshair image. AEA consists of an existing Guided Filter Dark Channel Dehazing Algorithm (GFA) and a proposed lightweight Multi-scale Densely Connected Network (MDC-Net). The enhancement effect of GFA is excellent but time-consuming, while that of MDC-Net is slightly inferior but strongly real-time. As AEA is executed dozens of times during each correction procedure, its real-time performance is very important. Therefore, by setting an empirical threshold on the definition evaluation function SMD2, GFA and MDC-Net are applied to highly and slightly blurred crosshair images respectively, so as to ensure the enhancement effect while saving as much time as possible. AEA has a certain robustness in time consumption, taking an average of 0.2721 s and 0.0963 s to execute GFA and MDC-Net, respectively, on ten 200 × 200-pixel Region of Interest (ROI) images with different degrees of blur. The eccentric error can be reduced to within 10 μm by our method.
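The routing criterion is the SMD2 sharpness (definition evaluation) function. A hedged sketch using the common SMD2 formulation (sum over pixels of the product of horizontal and vertical absolute differences); the threshold value and the two enhancer callables are hypothetical placeholders:

```python
import numpy as np

def smd2(img: np.ndarray) -> float:
    """SMD2 sharpness score: sum of |f(x,y)-f(x+1,y)| * |f(x,y)-f(x,y+1)|."""
    f = np.asarray(img, float)
    dx = np.abs(f[:-1, :-1] - f[1:, :-1])
    dy = np.abs(f[:-1, :-1] - f[:-1, 1:])
    return float((dx * dy).sum())

def enhance(img, gfa, mdc_net, threshold=100.0):
    """Route highly blurred images (low SMD2) to GFA, mild blur to MDC-Net."""
    return gfa(img) if smd2(img) < threshold else mdc_net(img)
```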
Fast and Regret Optimal Best Arm Identification: Fundamental Limits and Low-Complexity Algorithms
Abstract
This paper considers a stochastic multi-armed bandit (MAB) problem with dual objectives: (i) quick identification and commitment to the optimal arm, and (ii) reward maximization throughout a sequence of $T$ consecutive rounds. Though each objective has been individually well-studied, i.e., best arm identification for (i) and regret minimization for (ii), the simultaneous realization of both objectives remains an open problem, despite its practical importance. This paper introduces \emph{Regret Optimal Best Arm Identification} (ROBAI) which aims to achieve these dual objectives. To solve ROBAI with both pre-determined stopping time and adaptive stopping time requirements, we present the $\mathsf{EOCP}$ algorithm and its variants respectively, which not only achieve asymptotic optimal regret in both Gaussian and general bandits, but also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with adaptive stopping time. We further characterize lower bounds on the commitment time (equivalent to sample complexity) of ROBAI, showing that $\mathsf{EOCP}$ and its variants are sample optimal with pre-determined stopping time, and almost sample optimal with adaptive stopping time. Numerical results confirm our theoretical analysis and reveal an interesting ``over-exploration'' phenomenon carried by classic $\mathsf{UCB}$ algorithms, such that $\mathsf{EOCP}$ has smaller regret even though it stops exploration much earlier than $\mathsf{UCB}$ ($\mathcal{O}(\log T)$ versus $\mathcal{O}(T)$), which suggests over-exploration is unnecessary and potentially harmful to system performance.
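To illustrate the "identify, commit, and keep regret small" idea in the simplest possible form, here is a toy explore-then-commit variant of UCB; this is emphatically not the paper's $\mathsf{EOCP}$ algorithm, and the commit rule and constants are illustrative assumptions.

```python
import numpy as np

def explore_then_commit_ucb(arm_means, T, seed=0):
    """Toy Gaussian bandit: explore with UCB, then commit permanently once the
    empirically best arm's lower bound dominates every other arm's upper bound."""
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    counts, sums = np.zeros(K), np.zeros(K)
    committed = None
    for t in range(1, T + 1):
        if committed is not None:
            a = committed
        elif t <= K:
            a = t - 1                                  # initialization: pull each arm once
        else:
            mu = sums / counts
            rad = np.sqrt(2.0 * np.log(T) / counts)    # confidence radii
            best = int(mu.argmax())
            if mu[best] - rad[best] >= np.delete(mu + rad, best).max():
                committed = a = best                   # identification complete: commit
            else:
                a = int((mu + rad).argmax())           # otherwise keep exploring via UCB
        r = rng.normal(arm_means[a], 1.0)              # unit-variance Gaussian reward
        counts[a] += 1
        sums[a] += r
    return committed, sums.sum()
```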
Keyword: quantization
There is no result