Abstract
In this paper, we present RStab, a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods, we introduce a 3D multi-frame perspective to generate stabilized images, addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR), a volume rendering module that fuses multi-frame information in 3D space and extends beyond image fusion by incorporating feature fusion. Specifically, SR involves warping features and colors from multiple frames by projection, fusing them into descriptors to render the stabilized image. However, the precision of warped information depends on the projection accuracy, a factor significantly influenced by dynamic regions. In response, we introduce the Adaptive Ray Range (ARR) module to integrate depth priors, adaptively defining the sampling range for the projection process. Additionally, we propose Color Correction (CC), which assists geometric constraints with optical flow for accurate color aggregation. Thanks to these three modules, RStab demonstrates superior performance compared with previous stabilizers in field of view (FOV), image quality, and video stability across various datasets.
Keyword: volumetric render
There is no result
Keyword: remote render
There is no result
Keyword: hybrid render
There is no result
Keyword: raycast
There is no result
Keyword: medical imaging
COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images
Abstract
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X as abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction label. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, image-level labels are utilized, which are substantially easier to acquire than detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
Keyword: medical visualization
There is no result
Keyword: interactive volume
There is no result
Keyword: rendering
DeviceRadar: Online IoT Device Fingerprinting in ISPs using Programmable Switches
Authors: Ruoyu Li, Qing Li, Tao Lin, Qingsong Zou, Dan Zhao, Yucheng Huang, Gareth Tyson, Guorui Xie, Yong Jiang
Subjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
Abstract
Device fingerprinting can be used by Internet Service Providers (ISPs) to identify vulnerable IoT devices for early prevention of threats. However, due to the wide deployment of middleboxes in ISP networks, some important data, e.g., 5-tuples and flow statistics, are often obscured, rendering many existing approaches invalid. It is further challenged by the high-speed traffic of hundreds of terabytes per day in ISP networks. This paper proposes DeviceRadar, an online IoT device fingerprinting framework that achieves accurate, real-time processing in ISPs using programmable switches. We innovatively exploit "key packets" as a basis of fingerprints only using packet sizes and directions, which appear periodically while exhibiting differences across different IoT devices. To utilize them, we propose a packet size embedding model to discover the spatial relationships between packets. Meanwhile, we design an algorithm to extract the "key packets" of each device, and propose an approach that jointly considers the spatial relationships and the key packets to produce a neighboring key packet distribution, which can serve as a feature vector for machine learning models for inference. Last, we design a model transformation method and a feature extraction process to deploy the model on a programmable data plane within its constrained arithmetic operations and memory to achieve line-speed processing. Our experiments show that DeviceRadar can achieve state-of-the-art accuracy across 77 IoT devices with 40 Gbps throughput, and requires only 1.3% of the processing time compared to GPU-accelerated approaches.
EfficientGS: Streamlining Gaussian Splatting for Large-Scale High-Resolution Scene Representation
Authors: Wenkai Liu, Tao Guan, Bin Zhu, Lili Ju, Zikai Song, Dan Li, Yuesong Wang, Wei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology. However, its application to large-scale, high-resolution scenes (exceeding 4k$\times$4k pixels) is hindered by the excessive computational requirements for managing a large number of Gaussians. Addressing this, we introduce 'EfficientGS', an advanced approach that optimizes 3DGS for high-resolution, large-scale scenes. We analyze the densification process in 3DGS and identify areas of Gaussian over-proliferation. We propose a selective strategy, limiting Gaussian increase to key primitives, thereby enhancing the representational efficiency. Additionally, we develop a pruning mechanism to remove redundant Gaussians, those that are merely auxiliary to adjacent ones. For further enhancement, we integrate a sparse order increment for Spherical Harmonics (SH), designed to alleviate storage constraints and reduce training overhead. Our empirical evaluations, conducted on a range of datasets including extensive 4K+ aerial images, demonstrate that 'EfficientGS' not only expedites training and rendering times but also achieves this with a model size approximately tenfold smaller than conventional 3DGS while maintaining high rendering fidelity.
Unveiling the Ambiguity in Neural Inverse Rendering: A Parameter Compensation Analysis
Abstract
Inverse rendering aims to reconstruct the scene properties of objects solely from multiview images. However, it is an ill-posed problem prone to producing ambiguous estimations deviating from physically accurate representations. In this paper, we utilize Neural Microfacet Fields (NMF), a state-of-the-art neural inverse rendering method to illustrate the inherent ambiguity. We propose an evaluation framework to assess the degree of compensation or interaction between the estimated scene properties, aiming to explore the mechanisms behind this ill-posed problem and potential mitigation strategies. Specifically, we introduce artificial perturbations to one scene property and examine how adjusting another property can compensate for these perturbations. To facilitate such experiments, we introduce a disentangled NMF where material properties are independent. The experimental findings underscore the intrinsic ambiguity present in neural inverse rendering and highlight the importance of providing additional guidance through geometry, material, and illumination priors.
3D Multi-frame Fusion for Video Stabilization
Authors: Zhan Peng, Xinyi Ye, Weiyue Zhao, Tianqi Liu, Huiqiang Sun, Baopu Li, Zhiguo Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
In this paper, we present RStab, a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods, we introduce a 3D multi-frame perspective to generate stabilized images, addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR), a volume rendering module that fuses multi-frame information in 3D space and extends beyond image fusion by incorporating feature fusion. Specifically, SR involves warping features and colors from multiple frames by projection, fusing them into descriptors to render the stabilized image. However, the precision of warped information depends on the projection accuracy, a factor significantly influenced by dynamic regions. In response, we introduce the Adaptive Ray Range (ARR) module to integrate depth priors, adaptively defining the sampling range for the projection process. Additionally, we propose Color Correction (CC), which assists geometric constraints with optical flow for accurate color aggregation. Thanks to these three modules, RStab demonstrates superior performance compared with previous stabilizers in field of view (FOV), image quality, and video stability across various datasets.
Keyword: cinematic rendering
There is no result
Keyword: volume data
There is no result
Keyword: remote visualization
There is no result
Keyword: direct volume rendering
There is no result
Keyword: mobile device
ESPM-D: Efficient Sparse Polynomial Multiplication for Dilithium on ARM Cortex-M4 and Apple M2
Authors: Jieyu Zheng, Hong Zhang, Le Tian, Zhuo Zhang, Hanyu Wei, Zhiwei Chu, Yafang Yang, Yunlei Zhao
Abstract
Dilithium is a lattice-based digital signature scheme standardized by the NIST post-quantum cryptography (PQC) project. In this study, we focus on developing efficient sparse polynomial multiplication implementations of Dilithium for ARM Cortex-M4 and Apple M2, which are both based on the ARM architecture. The ARM Cortex-M4 is commonly utilized in resource-constrained devices such as sensors. Conversely, the Apple M2 is typically found on mobile devices, emphasizing high performance and versatility. Accordingly, our optimization strategies differ between ARM Cortex-M4 and Apple M2. We prioritize optimizing stack usage for the former while enhancing computational efficiency for the latter. Our optimized sparse polynomial multiplication achieves significant speedups of up to 30% on ARM Cortex-M4 and 55% on Apple M2 compared to the state-of-the-art Number-Theoretic Transform (NTT) implementation. Additionally, we integrate the sparse polynomial multiplication with the infinity norm judgments in the Dilithium signing process, further enhancing signing efficiency. Our optimized implementation not only reduces stack usage by 10.8%, 1.2%, and 7.7% in the signing procedure of Dilithium2, Dilithium3, and Dilithium5, respectively, but also enhances signing performance by 0.4% to 0.8% compared to the state-of-the-art ARM Cortex-M4 implementation. Furthermore, we optimize polynomial sampling, rounding functions, and polynomial packing and unpacking using ARM Cortex-M4 DSP instructions, resulting in a 0.4%-3.2% improvement in key generation and verification procedures. On the MacBook Air 2022, our Dilithium implementation achieves 10% to 11% speedups in the signing procedure. To the best of our knowledge, our work sets new performance records for Dilithium on both ARM Cortex-M4 and Apple M2 platforms.
Systematic Evaluation of Forensic Data Acquisition using Smartphone Local Backup
Authors: Julian Geus, Jenny Ottmann, Felix Freiling
Abstract
Due to the increasing security standards of modern smartphones, forensic data acquisition from such devices is a growing challenge. One rather generic way to access data on smartphones in practice is to use the local backup mechanism offered by the mobile operating systems. We study the suitability of such mechanisms for forensic data acquisition by performing a thorough evaluation of iOS's and Android's local backup mechanisms on two mobile devices. Based on a systematic and generic evaluation procedure comparing the contents of local backup to the original storage, we show that in our exemplary practical evaluations, in most cases (but not all) local backup actually yields a correct copy of the original data from storage. Our study also highlights corner cases, such as database files with pending changes, that need to be considered when assessing the integrity and authenticity of evidence acquired through local backup.
Keyword: transfer function
There is no result
Keyword: retrieval
Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds
Authors: Oliver Lemke, Zuria Bauer, René Zurbrügg, Marc Pollefeys, Francis Engelmann, Hermann Blum
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, as well as robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects, and opening of drawers. We show the performance and robustness of our model in two sets of real-world experiments including dynamic object retrieval and drawer opening, reporting a 51% and 82% success rate respectively. Code of our framework as well as videos are available on: https://spot-compose.github.io/.
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
Abstract
Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose Thoth, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design Thoth, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. Thoth proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement Thoth and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that Thoth reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
Dubo-SQL: Diverse Retrieval-Augmented Generation and Fine Tuning for Text-to-SQL
Authors: Dayton G. Thorpe, Andrew J. Duberstein, Ian A. Kinsey
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Abstract
The current state-of-the-art (SOTA) for automated text-to-SQL still falls well short of expert human performance as measured by execution accuracy (EX) on the BIRD-SQL benchmark. The most accurate methods are also slow and expensive. To advance the SOTA for text-to-SQL while reducing cost and improving speed, we explore the combination of low-cost fine tuning, novel methods for diverse retrieval-augmented generation (RAG) and new input and output formats that help large language models (LLMs) achieve higher EX. We introduce two new methods, Dubo-SQL v1 and v2. Dubo-SQL v1 sets a new record for EX on the holdout test set of BIRD-SQL. Dubo-SQL v2 achieves even higher performance on the BIRD-SQL dev set. Dubo-SQL v1 relies on LLMs from OpenAI, but uses the low-cost GPT-3.5 Turbo while exceeding the performance of the next-best model using OpenAI, which instead uses the more expensive GPT-4. Dubo-SQL v1 exceeds the performance of the next-best model using GPT-3.5 by over 20%. Dubo-SQL v2 uses GPT-4 Turbo and RAG in place of fine tuning to push EX higher.
Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
Abstract
Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of sample learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter outperforms previous adapter-based methods significantly regarding accuracy, generalization, and efficiency.
MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction
Authors: Zixuan Gong, Qi Zhang, Guangyin Bao, Lei Zhu, Ke Liu, Liang Hu, Duoqian Miao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but less on cross-subject tasks. Reconstructing high-quality images in cross-subject tasks is a challenging problem due to profound individual differences between subjects and the scarcity of data annotation. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality and rich-semantic reconstructions using only 1 hour of fMRI training data, benefiting from the phenomenon of visual fingerprint in the human visual system and a novel fMRI-to-text alignment paradigm. Firstly, we pre-train a multi-subject model among 7 subjects and fine-tune it with scarce data on new subjects, where LoRAs with Skip-LoRAs are utilized to learn the visual fingerprint. Then, we take the image modality as the intermediate pivot modality to achieve fMRI-to-text alignment, which achieves impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. The results of both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether using training data of 1 hour or 40 hours.
Towards Human-centered Proactive Conversational Agents
Abstract
Recent research on proactive conversational agents (PCAs) mainly focuses on improving the system's capabilities in anticipating and planning action sequences to accomplish tasks and achieve goals before users articulate their requests. This perspectives paper highlights the importance of moving towards building human-centered PCAs that emphasize human needs and expectations, and that consider the ethical and social implications of these agents, rather than solely focusing on technological capabilities. The distinction between a proactive and a reactive system lies in the proactive system's initiative-taking nature. Without thoughtful design, proactive systems risk being perceived as intrusive by human users. We address the issue by establishing a new taxonomy concerning three key dimensions of human-centered PCAs, namely Intelligence, Adaptivity, and Civility. We discuss potential research opportunities and challenges based on this new taxonomy upon the five stages of PCA system construction. This perspectives paper lays a foundation for the emerging area of conversational information retrieval research and paves the way towards advancing human-centered proactive conversational systems.
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Abstract
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.
The Solution for the CVPR2024 NICE Image Captioning Challenge
Authors: Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li, Qingguo Chen, Yang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
This report introduces a solution to Topic 1, Zero-shot Image Captioning, of 2024 NICE: New frontiers for zero-shot Image Captioning Evaluation. In contrast to NICE 2023 datasets, this challenge involves new annotations by humans with significant differences in caption style and content. Therefore, we enhance image captions effectively through retrieval augmentation and caption grading methods. At the data level, we utilize high-quality captions generated by image caption models as training data to address the gap in text styles. At the model level, we employ OFA (a large-scale visual-language pre-training model based on handcrafted templates) to perform the image captioning task. Subsequently, we propose a caption-level strategy for the high-quality caption data generated by the image caption models and integrate them with a retrieval augmentation strategy into the template to compel the model to generate higher quality, more matching, and semantically enriched captions based on the retrieval augmentation prompts. Our approach ranks first on the leaderboard, achieving a CIDEr score of 234.11 and 1st in all other metrics.
Generating Test Scenarios from NL Requirements using Retrieval-Augmented LLMs: An Industrial Study
Abstract
Test scenarios are specific instances of test cases that describe actions to validate a particular software functionality. By outlining the conditions under which the software operates and the expected outcomes, test scenarios ensure that the software functionality is tested in an integrated manner. Test scenarios are crucial for systematically testing an application under various conditions, including edge cases, to identify potential issues and guarantee overall performance and reliability. Specifying test scenarios is tedious and requires a deep understanding of software functionality and the underlying domain. It further demands substantial effort and investment from already time- and budget-constrained requirements engineers and testing teams. This paper presents an automated approach (RAGTAG) for test scenario generation using Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs). RAG allows the integration of specific domain knowledge with LLMs' generation capabilities. We evaluate RAGTAG on two industrial projects from Austrian Post with bilingual requirements in German and English. Our results from an interview survey conducted with four experts on five dimensions -- relevance, coverage, correctness, coherence and feasibility, affirm the potential of RAGTAG in automating test scenario generation. Specifically, our results indicate that, despite the difficult task of analyzing bilingual requirements, RAGTAG is able to produce scenarios that are well-aligned with the underlying requirements and provide coverage of different aspects of the intended functionality. The generated scenarios are easily understandable to experts and feasible for testing in the project environment. The overall correctness is deemed satisfactory; however, gaps in capturing exact action sequences and domain nuances remain, underscoring the need for domain expertise when applying LLMs.
Coexistence of Push Wireless Access with Pull Communication for Content-based Wake-up Radios
Authors: Junya Shiraishi, Sara Cavallero, Shashi Raj Pandey, Fabio Saggese, Petar Popovski
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
This paper considers energy-efficient connectivity for Internet of Things (IoT) devices in a coexistence scenario between two distinctive communication models: pull- and push-based. In pull-based, the base station (BS) decides when to retrieve a specific type of data from the IoT devices, while in push-based, the IoT device decides when and which data to transmit. To this end, this paper advocates introducing the content-based wake-up (CoWu), which enables the BS to remotely activate only a subset of pull-based nodes equipped with wake-up receivers, observing the relevant data. In this setup, a BS pulls data with CoWu at a specific time instance to fulfill its tasks while collecting data from the nodes operating with a push-based communication model. The resource allocation plays an important role: longer data collection duration for pull-based nodes can lead to high retrieval accuracy while decreasing the probability of data transmission success for push-based nodes, and vice versa. Numerical results show that CoWu can manage communication requirements for both pull-based and push-based nodes while realizing the high energy efficiency (up to 38%) of IoT devices, compared to the baseline scheduling method.
Benchmarking the performance of a self-custody, non-ledger-based, obliviously managed digital payment system
Abstract
As global governments intensify efforts to operationalize retail central bank digital currencies (CBDCs), the imperative for architectures that preserve user privacy has never been more pronounced. This paper advances an existing retail CBDC framework developed at University College London. Utilizing the capabilities of the Comet research framework, our proposed design allows users to retain direct custody of their assets without the need for intermediary service providers, all while preserving transactional anonymity. The study unveils a novel technique to expedite the retrieval of Proof of Provenance, significantly accelerating the verification of transaction legitimacy through the refinement of Merkle Trie structures. In parallel, we introduce a streamlined Digital Ledger designed to offer fast, immutable, and decentralized transaction validation within a permissioned ecosystem. The ultimate objective of this research is to benchmark the performance of the legacy system formulated by the original Comet research team against the newly devised system elucidated in this paper. Our endeavour is to establish a foundational design for a scalable national infrastructure proficient in seamlessly processing thousands of transactions in real-time, without compromising consumer privacy or data integrity.
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
Authors: Yang Luo, Zangwei Zheng, Zirui Zhu, Yang You
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract
The increase in parameter size of multimodal large language models (MLLMs) introduces significant capabilities, particularly in-context learning, where MLLMs enhance task performance without updating pre-trained parameters. This effectiveness, however, hinges on the appropriate selection of in-context examples, a process that is currently biased towards visual data, overlooking textual information. Furthermore, the area of supervised retrievers for MLLMs, crucial for optimal in-context example selection, continues to be uninvestigated. Our study offers an in-depth evaluation of the impact of textual information on the unsupervised selection of in-context examples in multimodal contexts, uncovering a notable sensitivity of retriever performance to the employed modalities. Responding to this, we introduce a novel supervised MLLM-retriever MSIER that employs a neural network to select examples that enhance multimodal in-context learning efficiency. This approach is validated through extensive testing across three distinct tasks, demonstrating the method's effectiveness. Additionally, we investigate the influence of modalities on our supervised retrieval method's training and pinpoint factors contributing to our model's success. This exploration paves the way for future advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.
Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation
Abstract
While Retrieval-Augmented Generation (RAG) plays a crucial role in the application of Large Language Models (LLMs), existing retrieval methods in knowledge-dense domains like law and medicine still suffer from a lack of multi-perspective views, which are essential for improving interpretability and reliability. Previous research on multi-view retrieval often focused solely on different semantic forms of queries, neglecting the expression of specific domain knowledge perspectives. This paper introduces a novel multi-view RAG framework, MVRAG, tailored for knowledge-dense domains that utilizes intention-aware query rewriting from multiple domain viewpoints to enhance retrieval precision, thereby improving the effectiveness of the final inference. Experiments conducted on legal and medical case retrieval demonstrate significant improvements in recall and precision rates with our framework. Our multi-perspective retrieval approach unleashes the potential of multi-view information enhancing RAG tasks, accelerating the further application of LLMs in knowledge-intensive fields.
Cloud-based Digital Twin for Cognitive Robotics
Authors: Arthur Niedźwiecki, Sascha Jongebloed, Yanxiang Zhan, Michaela Kümpel, Jörn Syrbe, Michael Beetz
Subjects: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
The paper presents a novel cloud-based digital twin learning platform for teaching and training concepts of cognitive robotics. Instead of forcing interested learners or students to install a new operating system and bulky, fragile software onto their personal laptops just to solve tutorials or coding assignments of a single lecture on robotics, it would be beneficial to avoid technical setups and directly dive into the content of cognitive robotics. To achieve this, the authors utilize containerization technologies and Kubernetes to deploy and operate containerized applications, including robotics simulation environments and software collections based on the Robot Operating System (ROS). The web-based Integrated Development Environment JupyterLab is integrated with RvizWeb and XPRA to provide real-time visualization of sensor data and robot behavior in a user-friendly environment for interacting with robotics software. The paper also discusses the application of the platform in teaching Knowledge Representation, Reasoning, Acquisition and Retrieval, and Task-Executives. The authors conclude that the proposed platform is a valuable tool for education and research in cognitive robotics, and that it has the potential to democratize access to these fields. The platform has already been successfully employed in various academic courses, demonstrating its effectiveness in fostering knowledge and skill development.
Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs
Authors: Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Abstract
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational perception has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate that there is a distinct difference in ratings assigned by both annotator groups in the two setups, indicating user feedback does influence system evaluation. Workers are more susceptible to user feedback on usefulness and interestingness compared to LLMs on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.
Keyword: video retrieval
There is no result
Keyword: mobile
Requirements Satisfiability with In-Context Learning
Authors: Sarah Santos, Travis Breaux, Thomas Norton, Sara Haghighi, Sepideh Ghanavati
Abstract
Language models that can learn a task at inference time, called in-context learning (ICL), show increasing promise in natural language inference tasks. In ICL, a model user constructs a prompt to describe a task with a natural language instruction and zero or more examples, called demonstrations. The prompt is then input to the language model to generate a completion. In this paper, we apply ICL to the design and evaluation of satisfaction arguments, which describe how a requirement is satisfied by a system specification and associated domain knowledge. The approach builds on three prompt design patterns, including augmented generation, prompt tuning, and chain-of-thought prompting, and is evaluated on a privacy problem to check whether a mobile app scenario and associated design description satisfies eight consent requirements from the EU General Data Protection Regulation (GDPR). The overall results show that GPT-4 can be used to verify requirements satisfaction with 96.7% accuracy and dissatisfaction with 93.2% accuracy. Inverting the requirement improves verification of dissatisfaction to 97.2%. Chain-of-thought prompting improves overall GPT-3.5 performance by 9.0% accuracy. We discuss the trade-offs among templates, models and prompt strategies and provide a detailed analysis of the generated specifications to inform how the approach can be applied in practice.
ESPM-D: Efficient Sparse Polynomial Multiplication for Dilithium on ARM Cortex-M4 and Apple M2
Authors: Jieyu Zheng, Hong Zhang, Le Tian, Zhuo Zhang, Hanyu Wei, Zhiwei Chu, Yafang Yang, Yunlei Zhao
Abstract
Dilithium is a lattice-based digital signature scheme standardized by the NIST post-quantum cryptography (PQC) project. In this study, we focus on developing efficient sparse polynomial multiplication implementations of Dilithium for ARM Cortex-M4 and Apple M2, which are both based on the ARM architecture. The ARM Cortex-M4 is commonly utilized in resource-constrained devices such as sensors. Conversely, the Apple M2 is typically found on mobile devices, emphasizing high performance and versatility. Accordingly, our optimization strategies differ between ARM Cortex-M4 and Apple M2. We prioritize optimizing stack usage for the former while enhancing computational efficiency for the latter. Our optimized sparse polynomial multiplication achieves significant speedups of up to 30% on ARM Cortex-M4 and 55% on Apple M2 compared to the state-of-the-art Number-Theoretic Transform (NTT) implementation. Additionally, we integrate the sparse polynomial multiplication with the infinity norm judgments in the Dilithium signing process, further enhancing signing efficiency. Our optimized implementation not only reduces stack usage by 10.8%, 1.2%, and 7.7% in the signing procedure of Dilithium2, Dilithium3, and Dilithium5, respectively, but also enhances signing performance by 0.4% to 0.8% compared to the state-of-the-art ARM Cortex-M4 implementation. Furthermore, we optimize polynomial sampling, rounding functions, and polynomial packing and unpacking using ARM Cortex-M4 DSP instructions, resulting in a 0.4%-3.2% improvement in key generation and verification procedures. On the MacBook Air 2022, our Dilithium implementation achieves 10% to 11% speedups in the signing procedure. To the best of our knowledge, our work sets new performance records for Dilithium on both ARM Cortex-M4 and Apple M2 platforms.
Systematic Evaluation of Forensic Data Acquisition using Smartphone Local Backup
Authors: Julian Geus, Jenny Ottmann, Felix Freiling
Abstract
Due to the increasing security standards of modern smartphones, forensic data acquisition from such devices is a growing challenge. One rather generic way to access data on smartphones in practice is to use the local backup mechanism offered by the mobile operating systems. We study the suitability of such mechanisms for forensic data acquisition by performing a thorough evaluation of iOS's and Android's local backup mechanisms on two mobile devices. Based on a systematic and generic evaluation procedure comparing the contents of local backup to the original storage, we show that in our exemplary practical evaluations, in most cases (but not all) local backup actually yields a correct copy of the original data from storage. Our study also highlights corner cases, such as database files with pending changes, that need to be considered when assessing the integrity and authenticity of evidence acquired through local backup.
A Mobile Additive Manufacturing Robot Framework for Smart Manufacturing Systems
Abstract
Recent technological innovations in the areas of additive manufacturing and collaborative robotics have paved the way toward realizing the concept of on-demand, personalized production on the shop floor. Additive manufacturing processes can provide the capability of printing highly customized parts based on various customer requirements. Autonomous, mobile systems provide the flexibility to move custom parts around the shop floor to various manufacturing operations, as needed by product requirements. In this work, we propose a mobile additive manufacturing robot framework for merging an additive manufacturing process system with an autonomous mobile base. Two case studies showcase the potential benefits of the proposed mobile additive manufacturing framework. The first case study overviews the effect that a mobile system can have on a fused deposition modeling process. The second case study showcases how integrating a mobile additive manufacturing machine can improve the throughput of the manufacturing system. The major findings of this study are that the proposed mobile robotic AM has increased throughput by taking advantage of the travel time between operations/processing sites. It is particularly suited to perform intermittent operations (e.g., preparing feedstock) during the travel time of the robotic AM. One major implication of this study is its application in manufacturing structural components (e.g., concrete construction, and feedstock preparation during reconnaissance missions) in remote or extreme terrains with on-site or on-demand feedstocks.
Keyword: smartphone
RetailOpt: An Opt-In, Easy-to-Deploy Trajectory Estimation System Leveraging Smartphone Motion Data and Retail Facility Information
Authors: Ryo Yonetani, Jun Baba, Yasutaka Furukawa
Abstract
We present RetailOpt, a novel opt-in, easy-to-deploy system for tracking customer movements in indoor retail environments. The system utilizes information presently accessible to customers through smartphones and retail apps: motion data, store map, and purchase records. The approach eliminates the need for additional hardware installations/maintenance and ensures customers maintain full control of their data. Specifically, RetailOpt first employs inertial navigation to recover relative trajectories from smartphone motion data. The store map and purchase records are then cross-referenced to identify a list of visited shelves, providing anchors to localize the relative trajectories in a store through continuous and discrete optimization. We demonstrate the effectiveness of our system through systematic experiments in five diverse environments. The proposed system, if successful, would produce accurate customer movement data, essential for a broad range of retail applications, including customer behavior analysis and in-store navigation. The potential application could also extend to other domains such as entertainment and assistive technologies.
VoxAtnNet: A 3D Point Clouds Convolutional Neural Network for Generalizable Face Presentation Attack Detection
Authors: Raghavendra Ramachandra, Narayan Vetrekar, Sushma Venkatesh, Savita Nageshker, Jag Mohan Singh, R. S. Gad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Abstract
Facial biometrics are an essential component of smartphones to ensure reliable and trustworthy authentication. However, face biometric systems are vulnerable to Presentation Attacks (PAs), and the availability of more sophisticated presentation attack instruments such as 3D silicone face masks will allow attackers to deceive face recognition systems easily. In this work, we propose a novel Presentation Attack Detection (PAD) algorithm based on 3D point clouds captured using the frontal camera of a smartphone to detect presentation attacks. The proposed PAD algorithm, VoxAtnNet, processes 3D point clouds to obtain voxelization to preserve the spatial structure. Then, the voxelized 3D samples were trained using the novel convolutional attention network to detect PAs on the smartphone. Extensive experiments were carried out on the newly constructed 3D face point cloud dataset comprising bona fide and two different 3D PAIs (3D silicone face mask and wrap photo mask), resulting in 3480 samples. The performance of the proposed method was compared with existing methods to benchmark the detection performance using three different evaluation protocols. The experimental results demonstrate the improved performance of the proposed method in detecting both known and unknown face presentation attacks.
Systematic Evaluation of Forensic Data Acquisition using Smartphone Local Backup
Authors: Julian Geus, Jenny Ottmann, Felix Freiling
Abstract
Due to the increasing security standards of modern smartphones, forensic data acquisition from such devices is a growing challenge. One rather generic way to access data on smartphones in practice is to use the local backup mechanism offered by the mobile operating systems. We study the suitability of such mechanisms for forensic data acquisition by performing a thorough evaluation of iOS's and Android's local backup mechanisms on two mobile devices. Based on a systematic and generic evaluation procedure comparing the contents of local backup to the original storage, we show that in our exemplary practical evaluations, in most cases (but not all) local backup actually yields a correct copy of the original data from storage. Our study also highlights corner cases, such as database files with pending changes, that need to be considered when assessing the integrity and authenticity of evidence acquired through local backup.
Nyon Unchained: Forensic Analysis of Bosch's eBike Board Computers
Authors: Marcel Stachak, Julian Geus, Gaston Pugliese, Felix Freiling
Abstract
Modern eBike on-board computers are basically small PCs that not only offer motor control, navigation, and performance monitoring, but also store lots of sensitive user data. The Bosch Nyon series of board computers are cutting-edge devices from one of the market leaders in the eBike business, which is why they are especially interesting for forensics. Therefore, we conducted an in-depth forensic analysis of the two available Nyon models released in 2014 and 2021. On a first-generation Nyon device, Telnet access could be established by abusing a design flaw in the update procedure, which allowed the acquisition of relevant data without risking damage to the hardware. Besides the user's personal information, the data analysis revealed databases containing user activities, including timestamps and GPS coordinates. Furthermore, it was possible to forge the data on the device and transfer it to Bosch's servers to be persisted across their online service and smartphone app. On a current second-generation Nyon device, no software-based access could be obtained. For this reason, more intrusive hardware-based options were considered, and the data could be extracted via chip-off eventually. Despite encryption, the user data could be accessed and evaluated. Besides location and user information, the newer model holds even more forensically relevant data, such as nearby Bluetooth devices.
Keyword: volume render
3D Multi-frame Fusion for Video Stabilization
Keyword: volumetric render
There is no result
Keyword: remote render
There is no result
Keyword: hybrid render
There is no result
Keyword: raycast
There is no result
Keyword: medical imaging
COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images
Keyword: medical visualization
There is no result
Keyword: interactive volume
There is no result
Keyword: rendering
DeviceRadar: Online IoT Device Fingerprinting in ISPs using Programmable Switches
EfficientGS: Streamlining Gaussian Splatting for Large-Scale High-Resolution Scene Representation
Unveiling the Ambiguity in Neural Inverse Rendering: A Parameter Compensation Analysis
3D Multi-frame Fusion for Video Stabilization
Keyword: cinematic rendering
There is no result
Keyword: volume data
There is no result
Keyword: remote visualization
There is no result
Keyword: direct volume rendering
There is no result
Keyword: mobile device
ESPM-D: Efficient Sparse Polynomial Multiplication for Dilithium on ARM Cortex-M4 and Apple M2
Systematic Evaluation of Forensic Data Acquisition using Smartphone Local Backup
Keyword: transfer function
There is no result
Keyword: retrieval
Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
Dubo-SQL: Diverse Retrieval-Augmented Generation and Fine Tuning for Text-to-SQL
Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction
Towards Human-centered Proactive Conversational Agents
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
The Solution for the CVPR2024 NICE Image Captioning Challenge
Generating Test Scenarios from NL Requirements using Retrieval-Augmented LLMs: An Industrial Study
Coexistence of Push Wireless Access with Pull Communication for Content-based Wake-up Radios
Benchmarking the performance of a self-custody, non-ledger-based, obliviously managed digital payment system
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation
Cloud-based Digital Twin for Cognitive Robotics
Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs
Keyword: video retrieval
There is no result
Keyword: mobile
Requirements Satisfiability with In-Context Learning
ESPM-D: Efficient Sparse Polynomial Multiplication for Dilithium on ARM Cortex-M4 and Apple M2
Systematic Evaluation of Forensic Data Acquisition using Smartphone Local Backup
A Mobile Additive Manufacturing Robot Framework for Smart Manufacturing Systems
Keyword: smartphone
RetailOpt: An Opt-In, Easy-to-Deploy Trajectory Estimation System Leveraging Smartphone Motion Data and Retail Facility Information
VoxAtnNet: A 3D Point Clouds Convolutional Neural Network for Generalizable Face Presentation Attack Detection
Systematic Evaluation of Forensic Data Acquisition using Smartphone Local Backup
Nyon Unchained: Forensic Analysis of Bosch's eBike Board Computers
Keyword: medical volume data
There is no result