Abstract
Detection of cyber attacks in smart power distribution grids with unbalanced configurations poses challenges due to the inherent nonlinear nature of these uncertain and stochastic systems. It originates from the intermittent characteristics of the distributed energy resources (DERs) generation and load variations. Moreover, the unknown behavior of cyber attacks, especially false data injection attacks (FDIAs) in the distribution grids with complex temporal correlations and the limited amount of labeled data increases the vulnerability of the grids and imposes a high risk in the secure and reliable operation of the grids. To address these challenges, this paper proposes an unsupervised adversarial autoencoder (AAE) model to detect FDIAs in unbalanced power distribution grids integrated with DERs, i.e., PV systems and wind generation. The proposed method utilizes long short-term memory (LSTM) in the structure of the autoencoder to capture the temporal dependencies in the time-series measurements and leverages the power of generative adversarial networks (GANs) for better reconstruction of the input data. The advantage of the proposed data-driven model is that it can detect anomalous points for the system operation without reliance on abstract models or mathematical representations. To evaluate the efficacy of the approach, it is tested on IEEE 13-bus and 123-bus systems with historical meteorological data (wind speed, ambient temperature, and solar irradiance) as well as historical real-world load data under three types of data falsification functions. The comparison of the detection results of the proposed model with other unsupervised learning methods verifies its superior performance in detecting cyber attacks in unbalanced power distribution grids.
KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking
Authors: Authors: Jiawei Zhang, Chejian Xu, Yu Gai, Freddy Lecue, Dawn Song, Bo Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
This paper introduces KnowHalu, a novel approach for detecting hallucinations in text generated by large language models (LLMs), utilizing step-wise reasoning, multi-formulation query, multi-form knowledge for factual checking, and fusion-based detection mechanism. As LLMs are increasingly applied across various domains, ensuring that their outputs are not hallucinated is critical. Recognizing the limitations of existing approaches that either rely on the self-consistency check of LLMs or perform post-hoc fact-checking without considering the complexity of queries or the form of knowledge, KnowHalu proposes a two-phase process for hallucination detection. In the first phase, it identifies non-fabrication hallucinations--responses that, while factually correct, are irrelevant or non-specific to the query. The second phase, multi-form based factual checking, contains five key steps: reasoning and query decomposition, knowledge retrieval, knowledge optimization, judgment generation, and judgment aggregation. Our extensive evaluations demonstrate that KnowHalu significantly outperforms SOTA baselines in detecting hallucinations across diverse tasks, e.g., improving by 15.65% in QA tasks and 5.50% in summarization tasks, highlighting its effectiveness and versatility in detecting hallucinations in LLM-generated content.
Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models
Authors: Authors: Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Yang, Hai Li
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
The problem of pre-training data detection for large language models (LLMs) has received growing attention due to its implications in critical issues like copyright violation and test data contamination. The current state-of-the-art approach, Min-K%, measures the raw token probability which we argue may not be the most informative signal. Instead, we propose Min-K%++ to normalize the token probability with statistics of the categorical distribution over the whole vocabulary, which accurately reflects the relative likelihood of the target token compared with other candidate tokens in the vocabulary. Theoretically, we back up our method by showing that the statistic it estimates is explicitly optimized during LLM training, thus serving as a reliable indicator for detecting training data. Empirically, on the WikiMIA benchmark, Min-K%++ outperforms the SOTA Min-K% by 6.2% to 10.5% in detection AUROC averaged over five models. On the more challenging MIMIR benchmark, Min-K%++ consistently improves upon Min-K% and performs on par with reference-based method, despite not requiring an extra reference model.
Foundation Models for Structural Health Monitoring
Abstract
Structural Health Monitoring (SHM) is a critical task for ensuring the safety and reliability of civil infrastructures, typically realized on bridges and viaducts by means of vibration monitoring. In this paper, we propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for SHM. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training, which, coupled with task-specific fine-tuning, allows them to outperform state-of-the-art traditional methods on diverse tasks, including Anomaly Detection (AD) and Traffic Load Estimation (TLE). We then extensively explore model size versus accuracy trade-offs and experiment with Knowledge Distillation (KD) to improve the performance of smaller Transformers, enabling their embedding directly into the SHM edge nodes. We showcase the effectiveness of our foundation models using data from three operational viaducts. For AD, we achieve a near-perfect 99.9% accuracy with a monitoring time span of just 15 windows. In contrast, a state-of-the-art method based on Principal Component Analysis (PCA) obtains its first good result (95.03% accuracy) only considering 120 windows. On two different TLE tasks, our models obtain state-of-the-art performance on multiple evaluation metrics (R$^2$ score, MAE% and MSE%). On the first benchmark, we achieve an R$^2$ score of 0.97 and 0.85 for light and heavy vehicle traffic, respectively, while the best previous approach stops at 0.91 and 0.84. On the second one, we achieve an R$^2$ score of 0.54 versus the 0.10 of the best existing method.
ASAP: Interpretable Analysis and Summarization of AI-generated Image Patterns at Scale
Abstract
Generative image models have emerged as a promising technology to produce realistic images. Despite potential benefits, concerns grow about its misuse, particularly in generating deceptive images that could raise significant ethical, legal, and societal issues. Consequently, there is growing demand to empower users to effectively discern and comprehend patterns of AI-generated images. To this end, we developed ASAP, an interactive visualization system that automatically extracts distinct patterns of AI-generated images and allows users to interactively explore them via various views. To uncover fake patterns, ASAP introduces a novel image encoder, adapted from CLIP, which transforms images into compact "distilled" representations, enriched with information for differentiating authentic and fake images. These representations generate gradients that propagate back to the attention maps of CLIP's transformer block. This process quantifies the relative importance of each pixel to image authenticity or fakeness, exposing key deceptive patterns. ASAP enables the at scale interactive analysis of these patterns through multiple, coordinated visualizations. This includes a representation overview with innovative cell glyphs to aid in the exploration and qualitative evaluation of fake patterns across a vast array of images, as well as a pattern view that displays authenticity-indicating patterns in images and quantifies their impact. ASAP supports the analysis of cutting-edge generative models with the latest architectures, including GAN-based models like proGAN and diffusion models like the latent diffusion model. We demonstrate ASAP's usefulness through two usage scenarios using multiple fake image detection benchmark datasets, revealing its ability to identify and understand hidden patterns in AI-generated images, especially in detecting fake human faces produced by diffusion-based techniques.
Transfer learning applications for anomaly detection in wind turbines
Authors: Authors: Cyriana M.A. Roelofs, Christian Gück, Stefan Faulstich
Abstract
Anomaly detection in wind turbines typically involves using normal behaviour models to detect faults early. However, training autoencoder models for each turbine is time-consuming and resource intensive. Thus, transfer learning becomes essential for wind turbines with limited data or applications with limited computational resources. This study examines how cross-turbine transfer learning can be applied to autoencoder-based anomaly detection. Here, autoencoders are combined with constant thresholds for the reconstruction error to determine if input data contains an anomaly. The models are initially trained on one year's worth of data from one or more source wind turbines. They are then fine-tuned using smaller amounts of data from another turbine. Three methods for fine-tuning are investigated: adjusting the entire autoencoder, only the decoder, or only the threshold of the model. The performance of the transfer learning models is compared to baseline models that were trained on one year's worth of data from the target wind turbine. The results of the tests conducted in this study indicate that models trained on data of multiple wind turbines do not improve the anomaly detection capability compared to models trained on data of one source wind turbine. In addition, modifying the model's threshold can lead to comparable or even superior performance compared to the baseline, whereas fine-tuning the decoder or autoencoder further enhances the models' performance.
Blessing or curse? A survey on the Impact of Generative AI on Fake News
Authors: Authors: Alexander Loth, Martin Kappes, Marc-Oliver Pahl
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Fake news significantly influence our society. They impact consumers, voters, and many other societal groups. While Fake News exist for a centuries, Generative AI brings fake news on a new level. It is now possible to automate the creation of masses of high-quality individually targeted Fake News. On the other end, Generative AI can also help detecting Fake News. Both fields are young but developing fast. This survey provides a comprehensive examination of the research and practical use of Generative AI for Fake News detection and creation in 2024. Following the Structured Literature Survey approach, the paper synthesizes current results in the following topic clusters 1) enabling technologies, 2) creation of Fake News, 3) case study social media as most relevant distribution channel, 4) detection of Fake News, and 5) deepfakes as upcoming technology. The article also identifies current challenges and open issues.
Linear Anchored Gaussian Mixture Model for Location and Width Computation of Objects in Thick Line Shape
Abstract
An accurate detection of the centerlines of linear objects is a challenging topic in many sensitive real-world applications such X-ray imaging, remote sensing and lane marking detection in road traffic. Model-based approaches using Hough and Radon transforms are often used but, are not recommended for thick line detection, whereas approaches based on image derivatives need further step-by-step processing, making their efficiency dependent on each step outcomes. In this paper, we aim to detect linear structures found in images by considering the 3D representation of the image gray levels as a finite mixture model of statistical distribution. The latter, which we named linear anchored Gaussian distribution could be parametrized by a scale value {\sigma} describing the linear structure thickness and a line equation, parametrized, in turn, by a radius \r{ho} and an orientation angle {\theta}, describing the linear structure centerline location. Expectation-Maximization (EM) algorithm is used for the mixture model parameter estimation, where a new paradigm, using the background subtraction for the likelihood function computation, is proposed. For the EM algorithm, two {\theta} parameter initialization schemes are used: the first one is based on a random choice of the first component of {\theta} vector, whereas the second is based on the image Hessian with a simultaneous computation of the mixture model components number. Experiments on real world images and synthetic images corrupted by blur and additive noise show the good performance of the proposed methods, where the algorithm using background subtraction and Hessian-based {\theta} initialization provides an outstanding accuracy of the linear structure detection despite irregular image background and presence of blur and noise.
Self-supervised 6-DoF Robot Grasping by Demonstration via Augmented Reality Teleoperation System
Authors: Authors: Xiwen Dengxiong, Xueting Wang, Shi Bai, Yunbo Zhang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Most existing 6-DoF robot grasping solutions depend on strong supervision on grasp pose to ensure satisfactory performance, which could be laborious and impractical when the robot works in some restricted area. To this end, we propose a self-supervised 6-DoF grasp pose detection framework via an Augmented Reality (AR) teleoperation system that can efficiently learn human demonstrations and provide 6-DoF grasp poses without grasp pose annotations. Specifically, the system collects the human demonstration from the AR environment and contrastively learns the grasping strategy from the demonstration. For the real-world experiment, the proposed system leads to satisfactory grasping abilities and learning to grasp unknown objects within three demonstrations.
Ego-Motion Aware Target Prediction Module for Robust Multi-Object Tracking
Authors: Authors: Navid Mahdian, Mohammad Jani, Amir M. Soufi Enayati, Homayoun Najjaran
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multi-object tracking (MOT) is a prominent task in computer vision with application in autonomous driving, responsible for the simultaneous tracking of multiple object trajectories. Detection-based multi-object tracking (DBT) algorithms detect objects using an independent object detector and predict the imminent location of each target. Conventional prediction methods in DBT utilize Kalman Filter(KF) to extrapolate the target location in the upcoming frames by supposing a constant velocity motion model. These methods are especially hindered in autonomous driving applications due to dramatic camera motion or unavailable detections. Such limitations lead to tracking failures manifested by numerous identity switches and disrupted trajectories. In this paper, we introduce a novel KF-based prediction module called the Ego-motion Aware Target Prediction (EMAP) module by focusing on the integration of camera motion and depth information with object motion models. Our proposed method decouples the impact of camera rotational and translational velocity from the object trajectories by reformulating the Kalman Filter. This reformulation enables us to reject the disturbances caused by camera motion and maximizes the reliability of the object motion model. We integrate our module with four state-of-the-art base MOT algorithms, namely OC-SORT, Deep OC-SORT, ByteTrack, and BoT-SORT. In particular, our evaluation on the KITTI MOT dataset demonstrates that EMAP remarkably drops the number of identity switches (IDSW) of OC-SORT and Deep OC-SORT by 73% and 21%, respectively. At the same time, it elevates other performance metrics such as HOTA by more than 5%. Our source code is available at https://github.com/noyzzz/EMAP.
LTRDetector: Exploring Long-Term Relationship for Advanced Persistent Threats Detection
Authors: Authors: Xiaoxiao Liu, Fan Xu, Nan Wang, Qinxin Zhao, Dalin Zhang, Xibin Zhao, Jiqiang Liu
Abstract
Advanced Persistent Threat (APT) is challenging to detect due to prolonged duration, infrequent occurrence, and adept concealment techniques. Existing approaches primarily concentrate on the observable traits of attack behaviors, neglecting the intricate relationships formed throughout the persistent attack lifecycle. Thus, we present an innovative APT detection framework named LTRDetector, implementing an end-to-end holistic operation. LTRDetector employs an innovative graph embedding technique to retain comprehensive contextual information, then derives long-term features from these embedded provenance graphs. During the process, we compress the data of the system provenance graph for effective feature learning. Furthermore, in order to detect attacks conducted by using zero-day exploits, we captured the system's regular behavior and detects abnormal activities without relying on predefined attack signatures. We also conducted extensive evaluations using five prominent datasets, the efficacy evaluation of which underscores the superiority of LTRDetector compared to existing state-of-the-art techniques.
SEPE-SQED: Symbolic Quick Error Detection by Semantically Equivalent Program Execution
Abstract
Symbolic quick error detection (SQED) has greatly improved efficiency in formal chip verification. However, it has a limitation in detecting single-instruction bugs due to its reliance on the self-consistency property. To address this, we propose a new variant called symbolic quick error detection by semantically equivalent program execution (SEPE-SQED), which utilizes program synthesis techniques to find sequences with equivalent meanings to original instructions. SEPE-SQED effectively detects single-instruction bugs by differentiating their impact on the original instruction and its semantically equivalent program (instruction sequence). To manage the search space associated with program synthesis, we introduce the CEGIS based on the highest priority first algorithm. The experimental results show that our proposed CEGIS approach improves the speed of generating the desired set of equivalent programs by 50% in time compared to previous methods. Compared to SQED, SEPE-SQED offers a wider variety of instruction combinations and can provide a shorter trace for triggering bugs in certain scenarios.
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Abstract
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.
MonoCD: Monocular 3D Object Detection with Complementary Depths
Abstract
Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However, the errors of existing multiple depths tend to have the same sign, which hinders them from neutralizing each other and limits the overall accuracy of combined depth. To alleviate this problem, we propose to increase the complementarity of depths with two novel designs. First, we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the correlation of depth predictions. Second, we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs, our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition, complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3d object detectors. Code is available at https://github.com/elvintanhust/MonoCD.
Learning Transferable Negative Prompts for Out-of-Distribution Detection
Abstract
Existing prompt learning methods have shown certain capabilities in Out-of-Distribution (OOD) detection, but the lack of OOD images in the target dataset in their training can lead to mismatches between OOD images and In-Distribution (ID) categories, resulting in a high false positive rate. To address this issue, we introduce a novel OOD detection method, named 'NegPrompt', to learn a set of negative prompts, each representing a negative connotation of a given class label, for delineating the boundaries between ID and OOD images. It learns such negative prompts with ID data only, without any reliance on external outlier data. Further, current methods assume the availability of samples of all ID classes, rendering them ineffective in open-vocabulary learning scenarios where the inference stage can contain novel ID classes not present during training. In contrast, our learned negative prompts are transferable to novel class labels. Experiments on various ImageNet benchmarks show that NegPrompt surpasses state-of-the-art prompt-learning-based OOD detection methods and maintains a consistent lead in hard OOD detection in closed- and open-vocabulary classification scenarios. Code is available at https://github.com/mala-lab/negprompt.
Simultaneous State Estimation and Contact Detection for Legged Robots by Multiple-Model Kalman Filtering
Authors: Authors: Marcel Menner, Karl Berntorp
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Abstract
This paper proposes an algorithm for combined contact detection and state estimation for legged robots. The proposed algorithm models the robot's movement as a switched system, in which different modes relate to different feet being in contact with the ground. The key element in the proposed algorithm is an interacting multiple-model Kalman filter, which identifies the currently-active mode defining contacts, while estimating the state. The rationale for the proposed estimation framework is that contacts (and contact forces) impact the robot's state and vice versa. This paper presents validation studies with a quadruped using (i) the high-fidelity simulator Gazebo for a comparison with ground truth values and a baseline estimator, and (ii) hardware experiments with the Unitree A1 robot. The simulation study shows that the proposed algorithm outperforms the baseline estimator, which does not simultaneous detect contacts. The hardware experiments showcase the applicability of the proposed algorithm and highlights the ability to detect contacts.
Design of Stickbug: a Six-Armed Precision Pollination Robot
Authors: Authors: Trevor Smith, Madhav Rijal, Christopher Tatsch, R. Michael Butts, Jared Beard, R. Tyler Cook, Andy Chu, Jason Gross, Yu Gu
Abstract
This work presents the design of Stickbug, a six-armed, multi-agent, precision pollination robot that combines the accuracy of single-agent systems with swarm parallelization in greenhouses. Precision pollination robots have often been proposed to offset the effects of a decreasing population of natural pollinators, but they frequently lack the required parallelization and scalability. Stickbug achieves this by allowing each arm and drive base to act as an individual agent, significantly reducing planning complexity. Stickbug uses a compact holonomic Kiwi drive to navigate narrow greenhouse rows, a tall mast to support multiple manipulators and reach plant heights, a detection model and classifier to identify Bramble flowers, and a felt-tipped end-effector for contact-based pollination. Initial experimental validation demonstrates that Stickbug can attempt over 1.5 pollinations per minute with a 50% success rate. Additionally, a Bramble flower perception dataset was created and is publicly available alongside Stickbug's software and design files.
A Methodology to Study the Impact of Spiking Neural Network Parameters considering Event-Based Automotive Data
Authors: Authors: Iqra Bano, Rachmad Vidya Wicaksana Putra, Alberto Marchisio, Muhammad Shafique
Abstract
Autonomous Driving (AD) systems are considered as the future of human mobility and transportation. Solving computer vision tasks such as image classification and object detection/segmentation, with high accuracy and low power/energy consumption, is highly needed to realize AD systems in real life. These requirements can potentially be satisfied by Spiking Neural Networks (SNNs). However, the state-of-the-art works in SNN-based AD systems still focus on proposing network models that can achieve high accuracy, and they have not systematically studied the roles of SNN parameters when used for learning event-based automotive data. Therefore, we still lack understanding of how to effectively develop SNN models for AD systems. Toward this, we propose a novel methodology to systematically study and analyze the impact of SNN parameters considering event-based automotive data, then leverage this analysis for enhancing SNN developments. To do this, we first explore different settings of SNN parameters that directly affect the learning mechanism (i.e., batch size, learning rate, neuron threshold potential, and weight decay), then analyze the accuracy results. Afterward, we propose techniques that jointly improve SNN accuracy and reduce training time. Experimental results show that our methodology can improve the SNN models for AD systems than the state-of-the-art, as it achieves higher accuracy (i.e., 86%) for the NCARS dataset, and it can also achieve iso-accuracy (i.e., ~85% with standard deviation less than 0.5%) while speeding up the training time by 1.9x. In this manner, our research work provides a set of guidelines for SNN parameter enhancements, thereby enabling the practical developments of SNN-based AD systems.
About Test-time training for outlier detection
Authors: Authors: Simon Klüttermann, Emmanuel Müller
Abstract
In this paper, we introduce DOUST, our method applying test-time training for outlier detection, significantly improving the detection performance. After thoroughly evaluating our algorithm on common benchmark datasets, we discuss a common problem and show that it disappears with a large enough test set. Thus, we conclude that under reasonable conditions, our algorithm can reach almost supervised performance even when no labeled outliers are given.
DQ-DETR: DETR with Dynamic Query for Tiny Object Detection
Abstract
Despite previous DETR-like methods having performed successfully in generic object detection, tiny object detection is still a challenging task for them since the positional information of object queries is not customized for detecting tiny objects, whose scale is extraordinarily smaller than general objects. Also, DETR-like methods using a fixed number of queries make them unsuitable for aerial datasets, which only contain tiny objects, and the numbers of instances are imbalanced between different images. Thus, we present a simple yet effective model, named DQ-DETR, which consists of three different components: categorical counting module, counting-guided feature enhancement, and dynamic query selection to solve the above-mentioned problems. DQ-DETR uses the prediction and density maps from the categorical counting module to dynamically adjust the number of object queries and improve the positional information of queries. Our model DQ-DETR outperforms previous CNN-based and DETR-like methods, achieving state-of-the-art mAP 30.2% on the AI-TOD-V2 dataset, which mostly consists of tiny objects.
Is CLIP the main roadblock for fine-grained open-world perception?
Authors: Authors: Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Fabrizio Falchi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time - a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings - i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
Factored Task and Motion Planning with Combined Optimization, Sampling and Learning
Abstract
In this thesis, we aim to improve the performance of TAMP algorithms from three complementary perspectives. First, we investigate the integration of discrete task planning with continuous trajectory optimization. Our main contribution is a conflict-based solver that automatically discovers why a task plan might fail when considering the constraints of the physical world. This information is then fed back into the task planner, resulting in an efficient, bidirectional, and intuitive interface between task and motion, capable of solving TAMP problems with multiple objects, robots, and tight physical constraints. In the second part, we first illustrate that, given the wide range of tasks and environments within TAMP, neither sampling nor optimization is superior in all settings. To combine the strengths of both approaches, we have designed meta-solvers for TAMP, adaptive solvers that automatically select which algorithms and computations to use and how to best decompose each problem to find a solution faster. In the third part, we combine deep learning architectures with model-based reasoning to accelerate computations within our TAMP solver. Specifically, we target infeasibility detection and nonlinear optimization, focusing on generalization, accuracy, compute time, and data efficiency. At the core of our contributions is a refined, factored representation of the trajectory optimization problems inside TAMP. This structure not only facilitates more efficient planning, encoding of geometric infeasibility, and meta-reasoning but also provides better generalization in neural architectures.
Intent Detection and Entity Extraction from BioMedical Literature
Authors: Authors: Ankan Mullick, Mukur Gupta, Pawan Goyal
Abstract
Biomedical queries have become increasingly prevalent in web searches, reflecting the growing interest in accessing biomedical literature. Despite recent research on large-language models (LLMs) motivated by endeavours to attain generalized intelligence, their efficacy in replacing task and domain-specific natural language understanding approaches remains questionable. In this paper, we address this question by conducting a comprehensive empirical evaluation of intent detection and named entity recognition (NER) tasks from biomedical text. We show that Supervised Fine Tuned approaches are still relevant and more effective than general-purpose LLMs. Biomedical transformer models such as PubMedBERT can surpass ChatGPT on NER task with only 5 supervised examples.
Evaluating LLMs at Detecting Errors in LLM Responses
Abstract
With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show: 1) Top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform much worse than humans. 2) Explanations by LLM-based error detectors lack reliability. 3) LLMs-based error detection is sensitive to small changes in prompts but remains challenging to improve. 4) Popular approaches to improving LLMs, including self-consistency and majority vote, do not improve the error detection performance. Our benchmark and code are provided at https://github.com/psunlpgroup/ReaLMistake.
Keyword: face recognition
If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces
Abstract
Recent advances in deep face recognition have spurred a growing demand for large, diverse, and manually annotated face datasets. Acquiring authentic, high-quality data for face recognition has proven to be a challenge, primarily due to privacy concerns. Large face datasets are primarily sourced from web-based images, lacking explicit user consent. In this paper, we examine whether and how synthetic face data can be used to train effective face recognition models with reduced reliance on authentic images, thereby mitigating data collection concerns. First, we explored the performance gap among recent state-of-the-art face recognition models, trained with synthetic data only and authentic (scarce) data only. Then, we deepened our analysis by training a state-of-the-art backbone with various combinations of synthetic and authentic data, gaining insights into optimizing the limited use of the latter for verification accuracy. Finally, we assessed the effectiveness of data augmentation approaches on synthetic and authentic data, with the same goal in mind. Our results highlighted the effectiveness of FR trained on combined datasets, particularly when combined with appropriate augmentation techniques.
Keyword: augmentation
On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models
Abstract
In this paper, we propose that small models may not need to absorb the cost of pre-training to reap its benefits. Instead, they can capitalize on the astonishing results achieved by modern, enormous models to a surprising degree. We observe that, when distilled on a task from a pre-trained teacher model, a small model can achieve or surpass the performance it would achieve if it was pre-trained then finetuned on that task. To allow this phenomenon to be easily leveraged, we establish a connection reducing knowledge distillation to modern contrastive learning, opening two doors: (1) vastly different model architecture pairings can work for the distillation, and (2) most contrastive learning algorithms rooted in the theory of Noise Contrastive Estimation can be easily applied and used. We demonstrate this paradigm using pre-trained teacher models from open-source model hubs, Transformer and convolution based model combinations, and a novel distillation algorithm that massages the Alignment/Uniformity perspective of contrastive learning by Wang & Isola (2020) into a distillation objective. We choose this flavor of contrastive learning due to its low computational cost, an overarching theme of this work. We also observe that this phenomenon tends not to occur if the task is data-limited. However, this can be alleviated by leveraging yet another scale-inspired development: large, pre-trained generative models for dataset augmentation. Again, we use an open-source model, and our rudimentary prompts are sufficient to boost the small model`s performance. Thus, we highlight a training method for small models that is up to 94% faster than the standard pre-training paradigm without sacrificing performance. For practitioners discouraged from fully utilizing modern foundation datasets for their small models due to the prohibitive scale, we believe our work keeps that door open.
A superconvergence result in the RBF-FD method
Authors: Authors: Andrej Kolar-Požun, Mitja Jančič, Gregor Kosec
Abstract
Radial Basis Function-generated Finite Differences (RBF-FD) is a meshless method that can be used to numerically solve partial differential equations. The solution procedure consists of two steps. First, the differential operator is discretised on given scattered nodes and afterwards, a global sparse matrix is assembled and inverted to obtain an approximate solution. Focusing on Polyharmonic Splines as our Radial Basis Functions (RBFs) of choice, appropriately augmented with monomials, it is well known that the truncation error of the differential operator approximation is determined by the degree of monomial augmentation. Naively, one might think that the solution error will have the same order of convergence. We present a superconvergence result that shows otherwise - for some augmentation degrees, order of convergence is higher than expected.
If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces
Abstract
Recent advances in deep face recognition have spurred a growing demand for large, diverse, and manually annotated face datasets. Acquiring authentic, high-quality data for face recognition has proven to be a challenge, primarily due to privacy concerns. Large face datasets are primarily sourced from web-based images, lacking explicit user consent. In this paper, we examine whether and how synthetic face data can be used to train effective face recognition models with reduced reliance on authentic images, thereby mitigating data collection concerns. First, we explored the performance gap among recent state-of-the-art face recognition models, trained with synthetic data only and authentic (scarce) data only. Then, we deepened our analysis by training a state-of-the-art backbone with various combinations of synthetic and authentic data, gaining insights into optimizing the limited use of the latter for verification accuracy. Finally, we assessed the effectiveness of data augmentation approaches on synthetic and authentic data, with the same goal in mind. Our results highlighted the effectiveness of FR trained on combined datasets, particularly when combined with appropriate augmentation techniques.
DeViDe: Faceted medical knowledge for improved medical vision-language pre-training
Abstract
Vision-language pre-training for chest X-rays has made significant strides, primarily by utilizing paired radiographs and radiology reports. However, existing approaches often face challenges in encoding medical knowledge effectively. While radiology reports provide insights into the current disease manifestation, medical definitions (as used by contemporary methods) tend to be overly abstract, creating a gap in knowledge. To address this, we propose DeViDe, a novel transformer-based method that leverages radiographic descriptions from the open web. These descriptions outline general visual characteristics of diseases in radiographs, and when combined with abstract definitions and radiology reports, provide a holistic snapshot of knowledge. DeViDe incorporates three key features for knowledge-augmented vision language alignment: First, a large-language model-based augmentation is employed to homogenise medical knowledge from diverse sources. Second, this knowledge is aligned with image information at various levels of granularity. Third, a novel projection layer is proposed to handle the complexity of aligning each image with multiple descriptions arising in a multi-label setting. In zero-shot settings, DeViDe performs comparably to fully supervised models on external datasets and achieves state-of-the-art results on three large-scale datasets. Additionally, fine-tuning DeViDe on four downstream tasks and six segmentation tasks showcases its superior performance across data from diverse distributions.
Keyword: detection
An Unsupervised Adversarial Autoencoder for Cyber Attack Detection in Power Distribution Grids
KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking
Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models
Foundation Models for Structural Health Monitoring
ASAP: Interpretable Analysis and Summarization of AI-generated Image Patterns at Scale
Transfer learning applications for anomaly detection in wind turbines
Blessing or curse? A survey on the Impact of Generative AI on Fake News
Linear Anchored Gaussian Mixture Model for Location and Width Computation of Objects in Thick Line Shape
Self-supervised 6-DoF Robot Grasping by Demonstration via Augmented Reality Teleoperation System
Ego-Motion Aware Target Prediction Module for Robust Multi-Object Tracking
LTRDetector: Exploring Long-Term Relationship for Advanced Persistent Threats Detection
SEPE-SQED: Symbolic Quick Error Detection by Semantically Equivalent Program Execution
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
MonoCD: Monocular 3D Object Detection with Complementary Depths
Learning Transferable Negative Prompts for Out-of-Distribution Detection
Simultaneous State Estimation and Contact Detection for Legged Robots by Multiple-Model Kalman Filtering
Design of Stickbug: a Six-Armed Precision Pollination Robot
A Methodology to Study the Impact of Spiking Neural Network Parameters considering Event-Based Automotive Data
About Test-time training for outlier detection
DQ-DETR: DETR with Dynamic Query for Tiny Object Detection
Is CLIP the main roadblock for fine-grained open-world perception?
Factored Task and Motion Planning with Combined Optimization, Sampling and Learning
Intent Detection and Entity Extraction from BioMedical Literature
Evaluating LLMs at Detecting Errors in LLM Responses
Keyword: face recognition
If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces
Keyword: augmentation
On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models
A superconvergence result in the RBF-FD method
If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces
DeViDe: Faceted medical knowledge for improved medical vision-language pre-training