【CS-part1】New submissions for Wed, 17 Apr 24

Keyword: volume render

Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes

Authors: Zehao Yu, Torsten Sattler, Andreas Geiger
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10772
Pdf link: https://arxiv.org/pdf/2404.10772
Abstract Recently, 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results, while allowing the rendering of high-resolution images in real-time. However, leveraging 3D Gaussians for surface reconstruction poses significant challenges due to the explicit and disconnected nature of 3D Gaussians. In this work, we present Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes. Our GOF is derived from ray-tracing-based volume rendering of 3D Gaussians, enabling direct geometry extraction from 3D Gaussians by identifying its levelset, without resorting to Poisson reconstruction or TSDF fusion as in previous work. We approximate the surface normal of Gaussians as the normal of the ray-Gaussian intersection plane, enabling the application of regularization that significantly enhances geometry. Furthermore, we develop an efficient geometry extraction method utilizing marching tetrahedra, where the tetrahedral grids are induced from 3D Gaussians and thus adapt to the scene's complexity. Our evaluations reveal that GOF surpasses existing 3DGS-based methods in surface reconstruction and novel view synthesis. Further, it compares favorably to, or even outperforms, neural implicit methods in both quality and speed.
Keyword: volumetric render

There is no result

Keyword: remote render

There is no result

Keyword: hybrid render

There is no result

Keyword: raycast

There is no result

Keyword: medical imaging

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation
Authors: Shehan Perera, Pouyan Navard, Alper Yilmaz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10156
Pdf link: https://arxiv.org/pdf/2404.10156
Abstract The adoption of Vision Transformers (ViTs) based architectures represents a significant advancement in 3D Medical Image (MI) segmentation, surpassing traditional Convolutional Neural Network (CNN) models by enhancing global contextual understanding. While this paradigm shift has significantly enhanced 3D segmentation performance, state-of-the-art architectures require extremely large and complex architectures with large scale computing resources for training and deployment. Furthermore, in the context of limited datasets, often encountered in medical imaging, larger models can present hurdles in both model generalization and convergence. In response to these challenges and to demonstrate that lightweight models are a valuable area of research in 3D medical imaging, we present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features. Additionally, SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features to produce highly accurate segmentation masks. The proposed memory efficient Transformer preserves the performance characteristics of a significantly larger model in a compact design. SegFormer3D democratizes deep learning for 3D medical image segmentation by offering a model with 33x less parameters and a 13x reduction in GFLOPS compared to the current state-of-the-art (SOTA). We benchmark SegFormer3D against the current SOTA models on three widely used datasets Synapse, BRaTs, and ACDC, achieving competitive results. Code: https://github.com/OSUPCVLab/SegFormer3D.git
Keyword: medical visualization

There is no result

Keyword: interactive volume

There is no result

Keyword: rendering

Plug-and-Play Acceleration of Occupancy Grid-based NeRF Rendering using VDB Grid and Hierarchical Ray Traversal
Authors: Yoshio Kato, Shuhei Tarashima
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10272
Pdf link: https://arxiv.org/pdf/2404.10272
Abstract Transmittance estimators such as Occupancy Grid (OG) can accelerate the training and rendering of Neural Radiance Field (NeRF) by predicting important samples that contributes much to the generated image. However, OG manages occupied regions in the form of the dense binary grid, in which there are many blocks with the same values that cause redundant examination of voxels' emptiness in ray-tracing. In our work, we introduce two techniques to improve the efficiency of ray-tracing in trained OG without fine-tuning. First, we replace the dense grids with VDB grids to reduce the spatial redundancy. Second, we use hierarchical digital differential analyzer (HDDA) to efficiently trace voxels in the VDB grids. Our experiments on NeRF-Synthetic and Mip-NeRF 360 datasets show that our proposed method successfully accelerates rendering NeRF-Synthetic dataset by 12% in average and Mip-NeRF 360 dataset by 4% in average, compared to a fast implementation of OG, NerfAcc, without losing the quality of rendered images.
EucliDreamer: Fast and High-Quality Texturing for 3D Models with Depth-Conditioned Stable Diffusion
Authors: Cindy Le, Congrui Hetang, Chendi Lin, Ang Cao, Yihui He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10279
Pdf link: https://arxiv.org/pdf/2404.10279
Abstract We present EucliDreamer, a simple and effective method to generate textures for 3D models given text prompts and meshes. The texture is parametrized as an implicit function on the 3D surface, which is optimized with the Score Distillation Sampling (SDS) process and differentiable rendering. To generate high-quality textures, we leverage a depth-conditioned Stable Diffusion model guided by the depth image rendered from the mesh. We test our approach on 3D models in Objaverse and conducted a user study, which shows its superior quality compared to existing texturing methods like Text2Tex. In addition, our method converges 2 times faster than DreamFusion. Through text prompting, textures of diverse art styles can be produced. We hope Euclidreamer proides a viable solution to automate a labor-intensive stage in 3D content creation.
SRGS: Super-Resolution 3D Gaussian Splatting
Authors: Xiang Feng, Yongbo He, Yubo Wang, Yan Yang, Zhenzhong Kuang, Yu Jun, Jianping Fan, Jiajun ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10318
Pdf link: https://arxiv.org/pdf/2404.10318
Abstract Recently, 3D Gaussian Splatting (3DGS) has gained popularity as a novel explicit 3D representation. This approach relies on the representation power of Gaussian primitives to provide a high-quality rendering. However, primitives optimized at low resolution inevitably exhibit sparsity and texture deficiency, posing a challenge for achieving high-resolution novel view synthesis (HRNVS). To address this problem, we propose Super-Resolution 3D Gaussian Splatting (SRGS) to perform the optimization in a high-resolution (HR) space. The sub-pixel constraint is introduced for the increased viewpoints in HR space, exploiting the sub-pixel cross-view information of the multiple low-resolution (LR) views. The gradient accumulated from more viewpoints will facilitate the densification of primitives. Furthermore, a pre-trained 2D super-resolution model is integrated with the sub-pixel constraint, enabling these dense primitives to learn faithful texture features. In general, our method focuses on densification and texture learning to effectively enhance the representation ability of primitives. Experimentally, our method achieves high rendering quality on HRNVS only with LR inputs, outperforming state-of-the-art methods on challenging datasets such as Mip-NeRF 360 and Tanks & Temples. Related codes will be released upon acceptance.
Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior
Authors: Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10394
Pdf link: https://arxiv.org/pdf/2404.10394
Abstract Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a novel neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360{\deg} canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Authors: Yu Li, Zhihua Wei, Han Jiang, Chuanyang Gong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10464
Pdf link: https://arxiv.org/pdf/2404.10464
Abstract Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving fine-tuning or auxiliary models usually require extensive memory and computational resources, rendering them less practical for deployment in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxififies LMs by altering their internal representations in the activation space with lower resource and time cost. Specifically, we leverage self-induced steering pairs to identify detoxification vectors through arithmetic operations in the activation space. During inference, detoxification is achieved by blending the detoxification vectors with the original representations. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on popular detoxification metrics, while also maintaining satisfactory generation quality and diversity. Furthermore, we extend our method to multiple LLMs, demonstrating its practicality and scalability. Warning: some example model outputs contain highly offensive or disturbing text.
AbsGS: Recovering Fine Details for 3D Gaussian Splatting
Authors: Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10484
Pdf link: https://arxiv.org/pdf/2404.10484
Abstract 3D Gaussian Splatting (3D-GS) technique couples 3D Gaussian primitives with differentiable rasterization to achieve high-quality novel view synthesis results while providing advanced real-time rendering performance. However, due to the flaw of its adaptive density control strategy in 3D-GS, it frequently suffers from over-reconstruction issue in intricate scenes containing high-frequency details, leading to blurry rendered images. The underlying reason for the flaw has still been under-explored. In this work, we present a comprehensive analysis of the cause of aforementioned artifacts, namely gradient collision, which prevents large Gaussians in over-reconstructed regions from splitting. To address this issue, we propose the novel homodirectional view-space positional gradient as the criterion for densification. Our strategy efficiently identifies large Gaussians in over-reconstructed regions, and recovers fine details by splitting. We evaluate our proposed method on various challenging datasets. The experimental results indicate that our approach achieves the best rendering quality with reduced or similar memory consumption. Our method is easy to implement and can be incorporated into a wide variety of most recent Gaussian Splatting-based methods. We will open source our codes upon formal publication. Our project page is available at: https://ty424.github.io/AbsGS.github.io/
Into the Fog: Evaluating Multiple Object Tracking Robustness
Authors: Nadezda Kirillova, M. Jehanzeb Mirza, Horst Possegger, Horst Bischof
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10534
Pdf link: https://arxiv.org/pdf/2404.10534
Abstract State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a pipeline for physic-based volumetric fog simulation in arbitrary real-world MOT dataset utilizing frame-by-frame monocular depth estimation and a fog formation optical model. Moreover, we enhance our simulation by rendering of both homogeneous and heterogeneous fog effects. We propose to use the dark channel prior method to estimate fog (smoke) color, which shows promising results even in night and indoor scenes. We present the leading tracking benchmark MOTChallenge (MOT17 dataset) overlaid by fog (smoke for indoor scenes) of various intensity levels and conduct a comprehensive evaluation of SOTA MOT methods, revealing their limitations under fog and fog-similar challenges.
Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks
Authors: Florian Barthel, Arian Beckmann, Wieland Morgenstern, Anna Hilsmann, Peter Eisert
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10625
Pdf link: https://arxiv.org/pdf/2404.10625
Abstract NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit representations based on neural networks are difficult to incorporate into explicit 3D scenes, such as VR environments or video games. 3D Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit 3D representation that can be rendered efficiently at high frame rates. In this work, we present a novel approach that combines the high rendering quality of NeRF-based 3D-aware GANs with the flexibility and computational advantages of 3DGS. By training a decoder that maps implicit NeRF representations to explicit 3D Gaussian Splatting attributes, we can integrate the representational diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting for the first time. Additionally, our approach allows for a high resolution GAN inversion and real-time GAN editing with 3D Gaussian Splatting scenes.
StyleCity: Large-Scale 3D Urban Scenes Stylization with Vision-and-Text Reference via Progressive Optimization
Authors: Yingshu Chen, Huajian Huang, Tuan-Anh Vu, Ka Chun Shum, Sai-Kit Yeung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10681
Pdf link: https://arxiv.org/pdf/2404.10681
Abstract Creating large-scale virtual urban scenes with variant styles is inherently challenging. To facilitate prototypes of virtual production and bypass the need for complex materials and lighting setups, we introduce the first vision-and-text-driven texture stylization system for large-scale urban scenes, StyleCity. Taking an image and text as references, StyleCity stylizes a 3D textured mesh of a large-scale urban scene in a semantics-aware fashion and generates a harmonic omnidirectional sky background. To achieve that, we propose to stylize a neural texture field by transferring 2D vision-and-text priors to 3D globally and locally. During 3D stylization, we progressively scale the planned training views of the input 3D scene at different levels in order to preserve high-quality scene content. We then optimize the scene style globally by adapting the scale of the style image with the scale of the training views. Moreover, we enhance local semantics consistency by the semantics-aware style loss which is crucial for photo-realistic stylization. Besides texture stylization, we further adopt a generative diffusion model to synthesize a style-consistent omnidirectional sky image, which offers a more immersive atmosphere and assists the semantic stylization process. The stylized neural texture field can be baked into an arbitrary-resolution texture, enabling seamless integration into conventional rendering pipelines and significantly easing the virtual production prototyping process. Extensive experiments demonstrate our stylized scenes' superiority in qualitative and quantitative performance and user preferences.
Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes
Authors: Zehao Yu, Torsten Sattler, Andreas Geiger
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10772
Pdf link: https://arxiv.org/pdf/2404.10772
Abstract Recently, 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results, while allowing the rendering of high-resolution images in real-time. However, leveraging 3D Gaussians for surface reconstruction poses significant challenges due to the explicit and disconnected nature of 3D Gaussians. In this work, we present Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes. Our GOF is derived from ray-tracing-based volume rendering of 3D Gaussians, enabling direct geometry extraction from 3D Gaussians by identifying its levelset, without resorting to Poisson reconstruction or TSDF fusion as in previous work. We approximate the surface normal of Gaussians as the normal of the ray-Gaussian intersection plane, enabling the application of regularization that significantly enhances geometry. Furthermore, we develop an efficient geometry extraction method utilizing marching tetrahedra, where the tetrahedral grids are induced from 3D Gaussians and thus adapt to the scene's complexity. Our evaluations reveal that GOF surpasses existing 3DGS-based methods in surface reconstruction and novel view synthesis. Further, it compares favorably to, or even outperforms, neural implicit methods in both quality and speed.
Keyword: cinematic rendering

There is no result

Keyword: volume data

There is no result

Keyword: remote visualization

There is no result

Keyword: direct volume rendering

There is no result

Keyword: mobile device

SparseDM: Toward Sparse Efficient Diffusion Models
Authors: Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10445
Pdf link: https://arxiv.org/pdf/2404.10445
Abstract Diffusion models have been extensively used in data generation tasks and are recognized as one of the best generative models. However, their time-consuming deployment, long inference time, and requirements on large memory limit their application on mobile devices. In this paper, we propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then use design progressive sparsity for model training in the fine-tuning stage, and switch the inference mask on and off, which supports a flexible choice of sparsity during inference according to the FID and MACs requirements. Experiments on four datasets conducted on a state-of-the-art Transformer-based diffusion model demonstrate that our method reduces MACs by $50\%$ while increasing FID by only 1.5 on average. Under other MACs conditions, the FID is also lower than 1$\sim$137 compared to other methods.
MobileNetV4 - Universal Models for the Mobile Ecosystem
Authors: Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10518
Pdf link: https://arxiv.org/pdf/2404.10518
Abstract We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms.
Keyword: transfer function

Dynamic Complex-Frequency Control of Grid-Forming Converters
Authors: Roger Domingo-Enrich, Xiuqiang He, Verena Häberle, Florian Dörfler
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2404.10071
Pdf link: https://arxiv.org/pdf/2404.10071
Abstract Complex droop control, alternatively known as dispatchable virtual oscillator control (dVOC), stands out for its unique capabilities in synchronization and voltage stabilization among existing control strategies for grid-forming converters. Complex droop control leverages the novel concept of ``complex frequency'', thereby establishing a coupled connection between active and reactive power inputs and frequency and rate-of-change-of voltage outputs. However, its reliance on static droop gains limits its ability to exhibit crucial dynamic response behaviors required in future power systems. To address this limitation, this paper introduces \textit{dynamic complex-frequency control}, upgrading static droop gains with dynamic transfer functions to enhance the richness and flexibility in dynamic responses for frequency and voltage control. Unlike existing approaches, the complex-frequency control framework treats frequency and voltage dynamics collectively, ensuring small-signal stability for frequency synchronization and voltage stabilization simultaneously. The control framework is validated through detailed numerical case studies on the IEEE nine-bus system, also showcasing its applicability in multi-converter setups.
On the Universality of Spatially Coupled LDPC Codes Over Intersymbol Interference Channels
Authors: Mgeni Makambi Mashauri, Alexandre Graell i Amat, Michael Lentmaier
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2404.10348
Pdf link: https://arxiv.org/pdf/2404.10348
Abstract In this paper, we derive the exact input/output transfer functions of the optimal a-posteriori probability channel detector for a general ISI channel with erasures. Considering three channel impulse responses of different memory as an example, we compute the BP and MAP thresholds for regular spatially coupled LDPC codes with joint iterative detection and decoding. When we compare the results with the thresholds of ISI channels with Gaussian noise we observe an apparent inconsistency, i.e., a channel which performs better with erasures performs worse with AWGN. We show that this anomaly can be resolved by looking at the thresholds from an entropy perspective. We finally show that with spatial coupling we can achieve the symmetric information rates of different ISI channels using the same code.
Keyword: retrieval

Using Long Short-term Memory (LSTM) to merge precipitation data over mountainous area in Sierra Nevada
Authors: Yihan Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Arxiv link: https://arxiv.org/abs/2404.10135
Pdf link: https://arxiv.org/pdf/2404.10135
Abstract Obtaining reliable precipitation estimation with high resolutions in time and space is of great importance to hydrological studies. However, accurately estimating precipitation is a challenging task over high mountainous complex terrain. The three widely used precipitation measurement approaches, namely rainfall gauge, precipitation radars, and satellite-based precipitation sensors, have their own pros and cons in producing reliable precipitation products over complex areas. One way to decrease the detection error probability and improve data reliability is precipitation data merging. With the rapid advancements in computational capabilities and the escalating volume and diversity of earth observational data, Deep Learning (DL) models have gained considerable attention in geoscience. In this study, a deep learning technique, namely Long Short-term Memory (LSTM), was employed to merge a radar-based and a satellite-based Global Precipitation Measurement (GPM) precipitation product Integrated Multi-Satellite Retrievals for GPM (IMERG) precipitation product at hourly scale. The merged results are compared with the widely used reanalysis precipitation product, Multi-Radar Multi-Sensor (MRMS), and assessed against gauge observational data from the California Data Exchange Center (CDEC). The findings indicated that the LSTM-based merged precipitation notably underestimated gauge observations and, at times, failed to provide meaningful estimates, showing predominantly near-zero values. Relying solely on individual Quantitative Precipitation Estimates (QPEs) without additional meteorological input proved insufficient for generating reliable merged QPE. However, the merged results effectively captured the temporal trends of the observations, outperforming MRMS in this aspect. This suggested that incorporating bias correction techniques could potentially enhance the accuracy of the merged product.
PD-L1 Classification of Weakly-Labeled Whole Slide Images of Breast Cancer
Authors: Giacomo Cignoni, Cristian Scatena, Chiara Frascarelli, Nicola Fusco, Antonio Giuseppe Naccarato, Giuseppe Nicoló Fanelli, Alina Sîrbu
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10175
Pdf link: https://arxiv.org/pdf/2404.10175
Abstract Specific and effective breast cancer therapy relies on the accurate quantification of PD-L1 positivity in tumors, which appears in the form of brown stainings in high resolution whole slide images (WSIs). However, the retrieval and extensive labeling of PD-L1 stained WSIs is a time-consuming and challenging task for pathologists, resulting in low reproducibility, especially for borderline images. This study aims to develop and compare models able to classify PD-L1 positivity of breast cancer samples based on WSI analysis, relying only on WSI-level labels. The task consists of two phases: identifying regions of interest (ROI) and classifying tumors as PD-L1 positive or negative. For the latter, two model categories were developed, with different feature extraction methodologies. The first encodes images based on the colour distance from a base color. The second uses a convolutional autoencoder to obtain embeddings of WSI tiles, and aggregates them into a WSI-level embedding. For both model types, features are fed into downstream ML classifiers. Two datasets from different clinical centers were used in two different training configurations: (1) training on one dataset and testing on the other; (2) combining the datasets. We also tested the performance with or without human preprocessing to remove brown artefacts Colour distance based models achieve the best performances on testing configuration (1) with artefact removal, while autoencoder-based models are superior in the remaining cases, which are prone to greater data variability.
How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior
Authors: Kevin Wu, Eric Wu, James Zou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10198
Pdf link: https://arxiv.org/pdf/2404.10198
Abstract Retrieval augmented generation (RAG) is often used to fix hallucinations and provide up-to-date knowledge for large language models (LLMs). However, in cases when the LLM alone incorrectly answers a question, does providing the correct retrieved content always fix the error? Conversely, in cases where the retrieved content is incorrect, does the LLM know to ignore the wrong information, or does it recapitulate the error? To answer these questions, we systematically analyze the tug-of-war between a LLM's internal knowledge (i.e. its prior) and the retrieved information in settings when they disagree. We test GPT-4 and other LLMs on question-answering abilities across datasets with and without reference documents. As expected, providing the correct retrieved information fixes most model mistakes (94% accuracy). However, when the reference document is perturbed with increasing levels of wrong values, the LLM is more likely to recite the incorrect, modified information when its internal prior is weaker but is more resistant when its prior is stronger. Similarly, we also find that the more the modified information deviates from the model's prior, the less likely the model is to prefer it. These results highlight an underlying tension between a model's prior knowledge and the information presented in reference documents.
Find The Gap: Knowledge Base Reasoning For Visual Question Answering
Authors: Elham J. Barezi, Parisa Kordjamshidi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.10226
Pdf link: https://arxiv.org/pdf/2404.10226
Abstract We analyze knowledge-based visual question answering, for which given a question, the models need to ground it into the visual modality and retrieve the relevant knowledge from a given large knowledge base (KB) to be able to answer. Our analysis has two folds, one based on designing neural architectures and training them from scratch, and another based on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models by explicit supervised retrieval of the relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform in the integration of visual and external knowledge, and multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA and to what extent it can replace the explicit KB? Our results demonstrate the positive impact of empowering task-specific and LLM models with supervised external and visual knowledge retrieval models. Our findings show that though LLMs are stronger in 1-hop reasoning, they suffer in 2-hop reasoning in comparison with our fine-tuned NN model even if the relevant information from both modalities is available to the model. Moreover, we observed that LLM models outperform the NN model for KB-related questions which confirms the effectiveness of implicit knowledge in LLMs however, they do not alleviate the need for external KB.
Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression
Authors: Jixiang Luo
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2404.10234
Pdf link: https://arxiv.org/pdf/2404.10234
Abstract The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, we proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. First we analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. Through the usage of simple adapter is to bridge the feature of Learned Image Compression(LIC) and Contrastive Language-Image Pretraining(CLIP) while retaining semantic fidelity and retrieval of multi-modal data. Experimental evaluations on Kodak datasets demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Authors: Jintao Sun, Zhedong Zheng, Gangyi Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2404.10292
Pdf link: https://arxiv.org/pdf/2404.10292
Abstract In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. Although the number of synthesized data can be infinite in theory, the scientific conundrum persists that how much generated data optimally fuels subsequent model training. We observe that only a subset of the data in these constructed datasets plays a decisive role. Therefore, we introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and WoRA (Weighted Low-Rank Adaptation) learning strategy for light fine-tuning. The filtering algorithm is based on the cross-modality relevance to remove the lots of coarse matching synthesis pairs. As the number of data decreases, we do not need to fine-tune the entire model. Therefore, we propose a WoRA learning strategy to efficiently update a minimal portion of model parameters. WoRA streamlines the learning process, enabling heightened efficiency in extracting knowledge from fewer, yet potent, data instances. Extensive experimentation validates the efficacy of pretraining, where our model achieves advanced and efficient retrieval performance on challenging real-world benchmarks. Notably, on the CUHK-PEDES dataset, we have achieved a competitive mAP of 67.02% while reducing model training time by 19.82%.
TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content
Authors: Avinash Anand, Raj Jaiswal, Pijush Bhuyan, Mohit Gupta, Siddhesh Bangar, Md. Modassir Imam, Rajiv Ratn Shah, Shin'ichi Satoh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10305
Pdf link: https://arxiv.org/pdf/2404.10305
Abstract The automatic recognition of tabular data in document images presents a significant challenge due to the diverse range of table styles and complex structures. Tables offer valuable content representation, enhancing the predictive capabilities of various systems such as search engines and Knowledge Graphs. Addressing the two main problems, namely table detection (TD) and table structure recognition (TSR), has traditionally been approached independently. In this research, we propose an end-to-end pipeline that integrates deep learning models, including DETR, CascadeTabNet, and PP OCR v2, to achieve comprehensive image-based table recognition. This integrated approach effectively handles diverse table styles, complex structures, and image distortions, resulting in improved accuracy and efficiency compared to existing methods like Table Transformers. Our system achieves simultaneous table detection (TD), table structure recognition (TSR), and table content recognition (TCR), preserving table structures and accurately extracting tabular data from document images. The integration of multiple models addresses the intricacies of table recognition, making our approach a promising solution for image-based table understanding, data extraction, and information retrieval applications. Our proposed approach achieves an IOU of 0.96 and an OCR Accuracy of 78%, showcasing a remarkable improvement of approximately 25% in the OCR Accuracy compared to the previous Table Transformer approach.
LLMs4OM: Matching Ontologies with Large Language Models
Authors: Hamed Babaei Giglou, Jennifer D'Souza, Sören Auer
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10317
Pdf link: https://arxiv.org/pdf/2404.10317
Abstract Ontology Matching (OM), is a critical task in knowledge integration, where aligning heterogeneous ontologies facilitates data interoperability and knowledge sharing. Traditional OM systems often rely on expert knowledge or predictive models, with limited exploration of the potential of Large Language Models (LLMs). We present the LLMs4OM framework, a novel approach to evaluate the effectiveness of LLMs in OM tasks. This framework utilizes two modules for retrieval and matching, respectively, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. Through comprehensive evaluations using 20 OM datasets from various domains, we demonstrate that LLMs, under the LLMs4OM framework, can match and even surpass the performance of traditional OM systems, particularly in complex matching scenarios. Our results highlight the potential of LLMs to significantly contribute to the field of OM.
Reasoning on Efficient Knowledge Paths:Knowledge Graph Guides Large Language Model for Domain Question Answering
Authors: Yuqi Wang, Boran Jiang, Yi Luo, Dawei He, Peng Cheng, Liangcai Gao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2404.10384
Pdf link: https://arxiv.org/pdf/2404.10384
Abstract Large language models (LLMs), such as GPT3.5, GPT4 and LLAMA2 perform surprisingly well and outperform human experts on many tasks. However, in many domain-specific evaluations, these LLMs often suffer from hallucination problems due to insufficient training of relevant corpus. Furthermore, fine-tuning large models may face problems such as the LLMs are not open source or the construction of high-quality domain instruction is difficult. Therefore, structured knowledge databases such as knowledge graph can better provide domain back- ground knowledge for LLMs and make full use of the reasoning and analysis capabilities of LLMs. In some previous works, LLM was called multiple times to determine whether the current triplet was suitable for inclusion in the subgraph when retrieving subgraphs through a question. Especially for the question that require a multi-hop reasoning path, frequent calls to LLM will consume a lot of computing power. Moreover, when choosing the reasoning path, LLM will be called once for each step, and if one of the steps is selected incorrectly, it will lead to the accumulation of errors in the following steps. In this paper, we integrated and optimized a pipeline for selecting reasoning paths from KG based on LLM, which can reduce the dependency on LLM. In addition, we propose a simple and effective subgraph retrieval method based on chain of thought (CoT) and page rank which can returns the paths most likely to contain the answer. We conduct experiments on three datasets: GenMedGPT-5k [14], WebQuestions [2], and CMCQA [21]. Finally, RoK can demonstrate that using fewer LLM calls can achieve the same results as previous SOTAs models.
VDTuner: Automated Performance Tuning for Vector Data Management Systems
Authors: Tiannuo Yang, Wen Hu, Wangqi Peng, Yusen Li, Jianguo Li, Gang Wang, Xiaoguang Liu
Subjects: Databases (cs.DB); Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2404.10413
Pdf link: https://arxiv.org/pdf/2404.10413
Abstract Vector data management systems (VDMSs) have become an indispensable cornerstone in large-scale information retrieval and machine learning systems like large language models. To enhance the efficiency and flexibility of similarity search, VDMS exposes many tunable index parameters and system parameters for users to specify. However, due to the inherent characteristics of VDMS, automatic performance tuning for VDMS faces several critical challenges, which cannot be well addressed by the existing auto-tuning methods. In this paper, we introduce VDTuner, a learning-based automatic performance tuning framework for VDMS, leveraging multi-objective Bayesian optimization. VDTuner overcomes the challenges associated with VDMS by efficiently exploring a complex multi-dimensional parameter space without requiring any prior knowledge. Moreover, it is able to achieve a good balance between search speed and recall rate, delivering an optimal configuration. Extensive evaluations demonstrate that VDTuner can markedly improve VDMS performance (14.12% in search speed and 186.38% in recall rate) compared with default setting, and is more efficient compared with state-of-the-art baselines (up to 3.57 times faster in terms of tuning time). In addition, VDTuner is scalable to specific user preference and cost-aware optimization objective. VDTuner is available online at https://github.com/tiannuo-yang/VDTuner.
The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
Authors: Gabriele Trivigno, Carlo Masone, Barbara Caputo, Torsten Sattler
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10438
Pdf link: https://arxiv.org/pdf/2404.10438
Abstract Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used to (1) obtain a more accurate pose estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, i.e., to provide a better starting point to a more expensive pose estimator, (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features / scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features / representations is truly necessary or whether similar results can be already achieved with more generic features. In this work, we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity, it achieves state-of-the-art results, demonstrating that one can easily build a pose refiner without the need for specific training. The code is at https://github.com/ga1i13o/mcloc_poseref
Spiral of Silences: How is Large Language Model Killing Information Retrieval? -- A Case Study on Open Domain Question Answering
Authors: Xiaoyang Chen, Ben He, Hongyu Lin, Xianpei Han, Tianshu Wang, Boxi Cao, Le Sun, Yingfei Sun
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2404.10496
Pdf link: https://arxiv.org/pdf/2404.10496
Abstract The practice of Retrieval-Augmented Generation (RAG), which integrates Large Language Models (LLMs) with retrieval systems, has become increasingly prevalent. However, the repercussions of LLM-derived content infiltrating the web and influencing the retrieval-generation feedback loop are largely uncharted territories. In this study, we construct and iteratively run a simulation pipeline to deeply investigate the short-term and long-term effects of LLM text on RAG systems. Taking the trending Open Domain Question Answering (ODQA) task as a point of entry, our findings reveal a potential digital "Spiral of Silence" effect, with LLM-generated text consistently outperforming human-authored content in search rankings, thereby diminishing the presence and impact of human contributions online. This trend risks creating an imbalanced information ecosystem, where the unchecked proliferation of erroneous LLM-generated content may result in the marginalization of accurate information. We urge the academic community to take heed of this potential issue, ensuring a diverse and authentic digital information landscape.
PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction
Authors: Sinisa Stekovic, Stefan Ainetter, Mattia D'Urso, Friedrich Fraundorfer, Vincent Lepetit
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.10620
Pdf link: https://arxiv.org/pdf/2404.10620
Abstract We propose PyTorchGeoNodes, a differentiable module for reconstructing 3D objects from images using interpretable shape programs. In comparison to traditional CAD model retrieval methods, the use of shape programs for 3D reconstruction allows for reasoning about the semantic properties of reconstructed objects, editing, low memory footprint, etc. However, the utilization of shape programs for 3D scene understanding has been largely neglected in past works. As our main contribution, we enable gradient-based optimization by introducing a module that translates shape programs designed in Blender, for example, into efficient PyTorch code. We also provide a method that relies on PyTorchGeoNodes and is inspired by Monte Carlo Tree Search (MCTS) to jointly optimize discrete and continuous parameters of shape programs and reconstruct 3D objects for input scenes. In our experiments, we apply our algorithm to reconstruct 3D objects in the ScanNet dataset and evaluate our results against CAD model retrieval-based reconstructions. Our experiments indicate that our reconstructions match well the input scenes while enabling semantic reasoning about reconstructed objects.
Watch Your Step: Optimal Retrieval for Continual Learning at Scale
Authors: Truman Hickok, Dhireesha Kudithipudi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10758
Pdf link: https://arxiv.org/pdf/2404.10758
Abstract One of the most widely used approaches in continual learning is referred to as replay. Replay methods support interleaved learning by storing past experiences in a replay buffer. Although there are methods for selectively constructing the buffer and reprocessing its contents, there is limited exploration of the problem of selectively retrieving samples from the buffer. Current solutions have been tested in limited settings and, more importantly, in isolation. Existing work has also not explored the impact of duplicate replays on performance. In this work, we propose a framework for evaluating selective retrieval strategies, categorized by simple, independent class- and sample-selective primitives. We evaluated several combinations of existing strategies for selective retrieval and present their performances. Furthermore, we propose a set of strategies to prevent duplicate replays and explore whether new samples with low loss values can be learned without replay. In an effort to match our problem setting to a realistic continual learning pipeline, we restrict our experiments to a setting involving a large, pre-trained, open vocabulary object detection model, which is fully fine-tuned on a sequence of 15 datasets.
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
Authors: Liyan Tang, Philippe Laban, Greg Durrett
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10774
Pdf link: https://arxiv.org/pdf/2404.10774
Abstract Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
Keyword: video retrieval

There is no result

Keyword: mobile

A Linear MPC with Control Barrier Functions for Differential Drive Robots
Authors: Ali Mohamed Ali, Chao Shen, Hashim A. Hashim
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2404.10018
Pdf link: https://arxiv.org/pdf/2404.10018
Abstract The need for fully autonomous mobile robots has surged over the past decade, with the imperative of ensuring safe navigation in a dynamic setting emerging as a primary challenge impeding advancements in this domain. In this paper, a Safety Critical Model Predictive Control based on Dynamic Feedback Linearization tailored to the application of differential drive robots with two wheels is proposed to generate control signals that result in obstacle-free paths. A barrier function introduces a safety constraint to the optimization problem of the Model Predictive Control (MPC) to prevent collisions. Due to the intrinsic nonlinearities of the differential drive robots, computational complexity while implementing a Nonlinear Model Predictive Control (NMPC) arises. To facilitate the real-time implementation of the optimization problem and to accommodate the underactuated nature of the robot, a combination of Linear Model Predictive Control (LMPC) and Dynamic Feedback Linearization (DFL) is proposed. The MPC problem is formulated on a linear equivalent model of the differential drive robot rendered by the DFL controller. The analysis of the closed-loop stability and recursive feasibility of the proposed control design is discussed. Numerical experiments illustrate the robustness and effectiveness of the proposed control synthesis in avoiding obstacles with respect to the benchmark of using Euclidean distance constraints. Keywords: Model Predictive Control, MPC, Autonomous Ground Vehicles, Nonlinearity, Dynamic Feedback Linearization, Optimal Control, Differential Robots.
CO-oPS: A Mobile App for Community Oversight of Privacy and Security
Authors: Mamtaj Akter, Leena Alghamdi, Dylan Gillespie, Nazmus Miazi, Jess Kropczynski, Heather Lipford, Pamela Wisniewski
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.10258
Pdf link: https://arxiv.org/pdf/2404.10258
Abstract Smartphone users install numerous mobile apps that require access to different information from their devices. Much of this information is very sensitive, and users often struggle to manage these accesses due to their lack of tech expertise and knowledge regarding mobile privacy. Thus, they often seek help from others to make decisions regarding their mobile privacy and security. We embedded these social processes in a mobile app titled "CO-oPS'' ("Community Oversight for Privacy and Security"). CO-oPS allows trusted community members to review one another's apps installed and permissions granted to those apps. Community members can provide feedback to one another regarding their privacy behaviors. Users are also allowed to hide some of their mobile apps that they do not like others to see, ensuring their personal privacy.
Multiple Mobile Target Detection and Tracking in Active Sonar Array Using a Track-Before-Detect Approach
Authors: Avi Abu, Nikola Miskovic, Oleg Chebotar, Neven Cukrov, Roee Diamant
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.10316
Pdf link: https://arxiv.org/pdf/2404.10316
Abstract We present an algorithm for detecting and tracking underwater mobile objects using active acoustic transmission of broadband chirp signals whose reflections are received by a hydrophone array. The method overcomes the problem of high false alarm rate by applying a track-before-detect ap- proach to the sequence of received reflections. A 2D time- space matrix is created for the reverberations received from each transmitted probe signal by performing delay and sum beamforming and pulse compression. The result is filtered by a 2D constant false alarm rate (CFAR) detector to identify reflection patterns corresponding to potential targets. Closely spaced signals for multiple probe transmissions are combined into blobs to avoid multiple detections of a single object. A track- before-detect method using a Nearly Constant Velocity (NCV) model is employed to track multiple objects. The position and velocity is estimated by the debiased converted measurement Kalman filter. Results are analyzed for simulated scenarios and for experiments at sea, where GPS tagged gilt-head seabream fish were tracked. Compared to two benchmark schemes, the results show a favorable track continuity and accuracy that is robust to the choice of detection threshold.
Learning Wireless Data Knowledge Graph for Green Intelligent Communications: Methodology and Experiments
Authors: Yongming Huang, Xiaohu You, Hang Zhan, Shiwen He, Ningning Fu, Wei Xu
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.10365
Pdf link: https://arxiv.org/pdf/2404.10365
Abstract Intelligent communications have played a pivotal role in shaping the evolution of 6G networks. Native artificial intelligence (AI) within green communication systems must meet stringent real-time requirements. To achieve this, deploying lightweight and resource-efficient AI models is necessary. However, as wireless networks generate a multitude of data fields and indicators during operation, only a fraction of them imposes significant impact on the network AI models. Therefore, real-time intelligence of communication systems heavily relies on a small but critical set of the data that profoundly influences the performance of network AI models. These challenges underscore the need for innovative architectures and solutions. In this paper, we propose a solution, termed the pervasive multi-level (PML) native AI architecture, which integrates the concept of knowledge graph (KG) into the intelligent operational manipulations of mobile networks, resulting in the establishment of a wireless data KG. Leveraging the wireless data KG, we characterize the massive and complex data collected from wireless communication networks and analyze the relationships among various data fields. The obtained graph of data field relations enables the on-demand generation of minimal and effective datasets, referred to as feature datasets, tailored to specific application requirements. Consequently, this architecture not only enhances AI training, inference, and validation processes but also significantly reduces resource wastage and overhead for communication networks. To implement this architecture, we have developed a specific solution comprising a spatio-temporal heterogeneous graph attention neural network model (STREAM) as well as a feature dataset generation algorithm. Experiments are conducted to validate the effectiveness of the proposed architecture.
SparseDM: Toward Sparse Efficient Diffusion Models
Authors: Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.10445
Pdf link: https://arxiv.org/pdf/2404.10445
Abstract Diffusion models have been extensively used in data generation tasks and are recognized as one of the best generative models. However, their time-consuming deployment, long inference time, and requirements on large memory limit their application on mobile devices. In this paper, we propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then use design progressive sparsity for model training in the fine-tuning stage, and switch the inference mask on and off, which supports a flexible choice of sparsity during inference according to the FID and MACs requirements. Experiments on four datasets conducted on a state-of-the-art Transformer-based diffusion model demonstrate that our method reduces MACs by $50\%$ while increasing FID by only 1.5 on average. Under other MACs conditions, the FID is also lower than 1$\sim$137 compared to other methods.
MobileNetV4 - Universal Models for the Mobile Ecosystem
Authors: Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10518
Pdf link: https://arxiv.org/pdf/2404.10518
Abstract We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms.
Exploring Post Quantum Cryptography with Quantum Key Distribution for Sustainable Mobile Network Architecture Design
Authors: Sanzida Hoque, Abdullah Aydeger, Engin Zeydan
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2404.10602
Pdf link: https://arxiv.org/pdf/2404.10602
Abstract The proliferation of mobile networks and their increasing importance to modern life, combined with the emerging threat of quantum computing, present new challenges and opportunities for cybersecurity. This paper addresses the complexity of protecting these critical infrastructures against future quantum attacks while considering operational sustainability. We begin with an overview of the current landscape, identify the main vulnerabilities in mobile networks, and evaluate existing security solutions with new post-quantum cryptography (PQC) methods. We then present a quantum-secure architecture with PQC and Quantum Key Distribution (QKD) tailored explicitly for sustainable mobile networks and illustrate its applicability with several use cases that emphasize the need for advanced protection measures in this new era. In addition, a comprehensive analysis of PQC algorithm families is presented, focusing on their suitability for integration in mobile environments, with particular attention to the trade-offs between energy consumption and security improvements. Finally, recommendations for strengthening mobile networks against quantum threats are provided through a detailed examination of current challenges and opportunities.
Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks
Authors: Florian Barthel, Arian Beckmann, Wieland Morgenstern, Anna Hilsmann, Peter Eisert
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.10625
Pdf link: https://arxiv.org/pdf/2404.10625
Abstract NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit representations based on neural networks are difficult to incorporate into explicit 3D scenes, such as VR environments or video games. 3D Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit 3D representation that can be rendered efficiently at high frame rates. In this work, we present a novel approach that combines the high rendering quality of NeRF-based 3D-aware GANs with the flexibility and computational advantages of 3DGS. By training a decoder that maps implicit NeRF representations to explicit 3D Gaussian Splatting attributes, we can integrate the representational diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting for the first time. Additionally, our approach allows for a high resolution GAN inversion and real-time GAN editing with 3D Gaussian Splatting scenes.
Keyword: smartphone

CO-oPS: A Mobile App for Community Oversight of Privacy and Security
Authors: Mamtaj Akter, Leena Alghamdi, Dylan Gillespie, Nazmus Miazi, Jess Kropczynski, Heather Lipford, Pamela Wisniewski
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.10258
Pdf link: https://arxiv.org/pdf/2404.10258
Abstract Smartphone users install numerous mobile apps that require access to different information from their devices. Much of this information is very sensitive, and users often struggle to manage these accesses due to their lack of tech expertise and knowledge regarding mobile privacy. Thus, they often seek help from others to make decisions regarding their mobile privacy and security. We embedded these social processes in a mobile app titled "CO-oPS'' ("Community Oversight for Privacy and Security"). CO-oPS allows trusted community members to review one another's apps installed and permissions granted to those apps. Community members can provide feedback to one another regarding their privacy behaviors. Users are also allowed to hide some of their mobile apps that they do not like others to see, ensuring their personal privacy.
A Phone-based Distributed Ambient Temperature Measurement System with An Efficient Label-free Automated Training Strategy
Authors: Dayin Chen, Xiaodan Shi, Haoran Zhang, Xuan Song, Dongxiao Zhang, Yuntian Chen, Jinyue Yan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.10401
Pdf link: https://arxiv.org/pdf/2404.10401
Abstract Enhancing the energy efficiency of buildings significantly relies on monitoring indoor ambient temperature. The potential limitations of conventional temperature measurement techniques, together with the omnipresence of smartphones, have redirected researchers' attention towards the exploration of phone-based ambient temperature estimation technology. Nevertheless, numerous obstacles remain to be addressed in order to achieve a practical implementation of this technology. This study proposes a distributed phone-based ambient temperature estimation system which enables collaboration between multiple phones to accurately measure the ambient temperature in each small area of an indoor space. Besides, it offers a secure, efficient, and cost-effective training strategy to train a new estimation model for each newly added phone, eliminating the need for manual collection of labeled data. This innovative training strategy can yield a high-performing estimation model for a new phone with just 5 data points, requiring only a few iterations. Meanwhile, by crowdsourcing, our system automatically provides accurate inferred labels for all newly collected data. We also highlight the potential of integrating federated learning into our system to ensure privacy protection at the end of this study. We believe this study has the potential to advance the practical application of phone-based ambient temperature measurement, facilitating energy-saving efforts in buildings.
Keyword: medical volume data

There is no result

Yukeaaa / arxiv-daily

【CS-part1】New submissions for Wed, 17 Apr 24 #1366

Keyword: volume render

Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes

Keyword: volumetric render

Keyword: remote render

Keyword: hybrid render

Keyword: raycast

Keyword: medical imaging

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Keyword: medical visualization

Keyword: interactive volume

Keyword: rendering

Plug-and-Play Acceleration of Occupancy Grid-based NeRF Rendering using VDB Grid and Hierarchical Ray Traversal

EucliDreamer: Fast and High-Quality Texturing for 3D Models with Depth-Conditioned Stable Diffusion

SRGS: Super-Resolution 3D Gaussian Splatting

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

AbsGS: Recovering Fine Details for 3D Gaussian Splatting

Into the Fog: Evaluating Multiple Object Tracking Robustness

Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks

StyleCity: Large-Scale 3D Urban Scenes Stylization with Vision-and-Text Reference via Progressive Optimization

Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes

Keyword: cinematic rendering

Keyword: volume data

Keyword: remote visualization

Keyword: direct volume rendering

Keyword: mobile device

SparseDM: Toward Sparse Efficient Diffusion Models

MobileNetV4 - Universal Models for the Mobile Ecosystem

Keyword: transfer function

Dynamic Complex-Frequency Control of Grid-Forming Converters

On the Universality of Spatially Coupled LDPC Codes Over Intersymbol Interference Channels

Keyword: retrieval

Using Long Short-term Memory (LSTM) to merge precipitation data over mountainous area in Sierra Nevada

PD-L1 Classification of Weakly-Labeled Whole Slide Images of Breast Cancer

How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression

From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search

TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content

LLMs4OM: Matching Ontologies with Large Language Models

Reasoning on Efficient Knowledge Paths:Knowledge Graph Guides Large Language Model for Domain Question Answering

VDTuner: Automated Performance Tuning for Vector Data Management Systems

The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement

Spiral of Silences: How is Large Language Model Killing Information Retrieval? -- A Case Study on Open Domain Question Answering

PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction

Watch Your Step: Optimal Retrieval for Continual Learning at Scale

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Keyword: video retrieval

Keyword: mobile

A Linear MPC with Control Barrier Functions for Differential Drive Robots

CO-oPS: A Mobile App for Community Oversight of Privacy and Security

Multiple Mobile Target Detection and Tracking in Active Sonar Array Using a Track-Before-Detect Approach

Learning Wireless Data Knowledge Graph for Green Intelligent Communications: Methodology and Experiments

SparseDM: Toward Sparse Efficient Diffusion Models

MobileNetV4 - Universal Models for the Mobile Ecosystem

Exploring Post Quantum Cryptography with Quantum Key Distribution for Sustainable Mobile Network Architecture Design

Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks

Keyword: smartphone

CO-oPS: A Mobile App for Community Oversight of Privacy and Security

A Phone-based Distributed Ambient Temperature Measurement System with An Efficient Label-free Automated Training Strategy

Keyword: medical volume data