Abstract
Cover songs are alternate versions of a song by a different artist. Long being a vital part of the music industry, cover songs significantly influence music culture and are commonly heard in public venues. The rise of online music platforms has further increased their prevalence, often as background music or video soundtracks. While current automatic identification methods serve adequately for original songs, they are less effective with cover songs, primarily because cover versions often significantly deviate from the original compositions. In this paper, we propose a novel method for cover song detection that utilizes the lyrics of a song. We introduce a new dataset for cover songs and their corresponding originals. The dataset contains 5078 cover songs and 2828 original songs. In contrast to other cover song datasets, it contains the annotated lyrics for the original song and the cover song. We evaluate our method on this dataset and compare it with multiple baseline approaches. Our results show that our method outperforms the baseline approaches.
Title:
DeTra: A Unified Model for Object Detection and Trajectory Forecasting
Authors: Sergio Casas, Ben Agro, Jiageng Mao, Thomas Gilles, Alexander Cui, Thomas Li, Raquel Urtasun
Abstract
The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates the union of the two tasks as a trajectory refinement problem, where the first pose is the detection (current time), and the subsequent poses are the waypoints of the multiple forecasts (future time). To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects directly from LiDAR point clouds and high-definition maps. We call this model DeTra, short for object Detection and Trajectory forecasting. In our experiments, we observe that \ourmodel{} outperforms the state-of-the-art on Argoverse 2 Sensor and Waymo Open Dataset by a large margin, across a broad range of metrics. Last but not least, we perform extensive ablation studies that show the value of refinement for this task, that every proposed component contributes positively to its performance, and that key design choices were made.
Title:
Rough Set improved Therapy-Based Metaverse Assisting System
Abstract
Chronic neck and shoulder pain (CNSP) is a major global public health issue. Traditional treatments like physiotherapy and rehabilitation have drawbacks, including high costs, low precision, and user discomfort. This paper presents an interactive system based on Cognitive Therapy Theory (CBT) for CNSP treatment. The system includes a pain detection module using EMG and IMU to monitor pain and optimize data with Rough Set theory, and a cognitive therapy module that processes this data further for CBT-based interventions, including massage and heating therapy. An experimental plan is outlined to evaluate the system's effectiveness and performance. The goal is to create an accessible device for CNSP therapy. Additionally, the paper explores the system's application in a metaverse environment, enhancing treatment immersion and personalization. The metaverse platform simulates treatment environments and responds to real-time patient data, allowing for continuous monitoring and adjustment of treatment plans.
Title:
Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs
Authors: Claire Jin, Sudha Rao, Xiangyu Peng, Portia Botchway, Jessica Quaye, Chris Brockett, Bill Dolan
Subjects: Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Abstract
Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.
Title:
A multi-core periphery perspective: Ranking via relative centrality
Abstract
Community and core-periphery are two widely studied graph structures, with their coexistence observed in real-world graphs (Rombach, Porter, Fowler \& Mucha [SIAM J. App. Math. 2014, SIAM Review 2017]). However, the nature of this coexistence is not well understood and has been pointed out as an open problem (Yanchenko \& Sengupta [Statistics Surveys, 2023]). Especially, the impact of inferring the core-periphery structure of a graph on understanding its community structure is not well utilized. In this direction, we introduce a novel quantification for graphs with ground truth communities, where each community has a densely connected part (the core), and the rest is more sparse (the periphery), with inter-community edges more frequent between the peripheries. Built on this structure, we propose a new algorithmic concept that we call relative centrality to detect the cores. We observe that core-detection algorithms based on popular centrality measures such as PageRank and degree centrality can show some bias in their outcome by selecting very few vertices from some cores. We show that relative centrality solves this bias issue and provide theoretical and simulation support, as well as experiments on real-world graphs. Core detection is known to have important applications with respect to core-periphery structures. In our model, we show a new application: relative-centrality-based algorithms can select a subset of the vertices such that it contains sufficient vertices from all communities, and points in this subset are better separable into their respective communities. We apply the methods to 11 biological datasets, with our methods resulting in a more balanced selection of vertices from all communities such that clustering algorithms have better performance on this set.
Title:
CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset
Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Abstract
In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (this https URL).
Title:
Classification of Non-native Handwritten Characters Using Convolutional Neural Network
Authors: F. A. Mamun, S. A. H. Chowdhury, J. E. Giti, H. Sarker
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
The use of convolutional neural networks (CNNs) has accelerated the progress of handwritten character classification/recognition. Handwritten character recognition (HCR) has found applications in various domains, such as traffic signal detection, language translation, and document information extraction. However, the widespread use of existing HCR technology is yet to be seen as it does not provide reliable character recognition with outstanding accuracy. One of the reasons for unreliable HCR is that existing HCR methods do not take the handwriting styles of non-native writers into account. Hence, further improvement is needed to ensure the reliability and extensive deployment of character recognition technologies for critical tasks. In this work, the classification of English characters written by non-native users is performed by proposing a custom-tailored CNN model. We train this CNN with a new dataset called the handwritten isolated English character (HIEC) dataset. This dataset consists of 16,496 images collected from 260 persons. This paper also includes an ablation study of our CNN by adjusting hyperparameters to identify the best model for the HIEC dataset. The proposed model with five convolutional layers and one hidden layer outperforms state-of-the-art models in terms of character recognition accuracy and achieves an accuracy of $\mathbf{97.04}$%. Compared with the second-best model, the relative improvement of our model in terms of classification accuracy is $\mathbf{4.38}$%.
Title:
TESTEVAL: Benchmarking Large Language Models for Test Case Generation
Authors: Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma
Abstract
Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths. We have open-sourced our dataset and benchmark pipelines at this https URL to contribute and accelerate future research on LLMs for software testing.
Title:
FOOD: Facial Authentication and Out-of-Distribution Detection with Short-Range FMCW Radar
Authors: Sabri Mustafa Kahya, Boran Hamdi Sivrikaya, Muhammet Sami Yavuz, Eckehard Steinbach
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract
This paper proposes a short-range FMCW radar-based facial authentication and out-of-distribution (OOD) detection framework. Our pipeline jointly estimates the correct classes for the in-distribution (ID) samples and detects the OOD samples to prevent their inaccurate prediction. Our reconstruction-based architecture consists of a main convolutional block with one encoder and multi-decoder configuration, and intermediate linear encoder-decoder parts. Together, these elements form an accurate human face classifier and a robust OOD detector. For our dataset, gathered using a 60 GHz short-range FMCW radar, our network achieves an average classification accuracy of 98.07% in identifying in-distribution human faces. As an OOD detector, it achieves an average Area Under the Receiver Operating Characteristic (AUROC) curve of 98.50% and an average False Positive Rate at 95% True Positive Rate (FPR95) of 6.20%. Also, our extensive experiments show that the proposed approach outperforms previous OOD detectors in terms of common OOD detection metrics.
Title:
Camera-Pose Robust Crater Detection from Chang'e 5
Authors: Matthew Rodda, Sofia McLeod, Ky Cuong Pham, Tat-Jun Chin
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
As space missions aim to explore increasingly hazardous terrain, accurate and timely position estimates are required to ensure safe navigation. Vision-based navigation achieves this goal through correlating impact craters visible through onboard imagery with a known database to estimate a craft's pose. However, existing literature has not sufficiently evaluated crater-detection algorithm (CDA) performance from imagery containing off-nadir view angles. In this work, we evaluate the performance of Mask R-CNN for crater detection, comparing models pretrained on simulated data containing off-nadir view angles and to pretraining on real-lunar images. We demonstrate pretraining on real-lunar images is superior despite the lack of images containing off-nadir view angles, achieving detection performance of 63.1 F1-score and ellipse-regression performance of 0.701 intersection over union. This work provides the first quantitative analysis of performance of CDAs on images containing off-nadir view angles. Towards the development of increasingly robust CDAs, we additionally provide the first annotated CDA dataset with off-nadir view angles from the Chang'e 5 Landing Camera.
Title:
Attention Fusion Reverse Distillation for Multi-Lighting Image Anomaly Detection
Abstract
This study targets Multi-Lighting Image Anomaly Detection (MLIAD), where multiple lighting conditions are utilized to enhance imaging quality and anomaly detection performance. While numerous image anomaly detection methods have been proposed, they lack the capacity to handle multiple inputs for a single sample, like multi-lighting images in MLIAD. Hence, this study proposes Attention Fusion Reverse Distillation (AFRD) to handle multiple inputs in MLIAD. For this purpose, AFRD utilizes a pre-trained teacher network to extract features from multiple inputs. Then these features are aggregated into fused features through an attention module. Subsequently, a corresponding student net-work is utilized to regress the attention fused features. The regression errors are denoted as anomaly scores during inference. Experiments on Eyecandies demonstrates that AFRD achieves superior MLIAD performance than other MLIAD alternatives, also highlighting the benefit of using multiple lighting conditions for anomaly detection.
Title:
Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
Authors: Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu
Abstract
The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to the following two main issues. Firstly, hardware failures are inevitable, leading to interruptions in the training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Secondly, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestions can greatly increase the waiting time for GPUs. To address these challenges, this paper introduces a communication-driven solution, namely the C4. The key insights of C4 are two folds. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
Title:
Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis
Authors: Xintong Wang, Mingqian Shi, Ye Wang
Subjects: Subjects:
Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Abstract
Mispronunciation Detection and Diagnosis (MDD) systems, leveraging Automatic Speech Recognition (ASR), face two main challenges in Mandarin Chinese: 1) The two-stage models create an information gap between the phoneme or tone classification stage and the MDD stage. 2) The scarcity of Mandarin MDD datasets limits model training. In this paper, we introduce a stateless RNN-T model for Mandarin MDD, utilizing HuBERT features with pitch embedding through a Pitch Fusion Block. Our model, trained solely on native speaker data, shows a 3% improvement in Phone Error Rate and a 7% increase in False Acceptance Rate over the state-of-the-art baseline in non-native scenarios
Title:
Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions
Abstract
The increasing complexity of foundational models underscores the necessity for explainability, particularly for fine-tuning, the most widely used training method for adapting models to downstream tasks. Instance attribution, one type of explanation, attributes the model prediction to each training example by an instance score. However, the robustness of instance scores, specifically towards dataset resampling, has been overlooked. To bridge this gap, we propose a notion of robustness on the sign of the instance score. We theoretically and empirically demonstrate that the popular leave-one-out-based methods lack robustness, while the Shapley value behaves significantly better, but at a higher computational cost. Accordingly, we introduce an efficient fine-tuning-free approximation of the Shapley value (FreeShap) for instance attribution based on the neural tangent kernel. We empirically demonstrate that FreeShap outperforms other methods for instance attribution and other data-centric applications such as data removal, data selection, and wrong label detection, and further generalize our scale to large language models (LLMs). Our code is available at this https URL.
Title:
A Recover-then-Discriminate Framework for Robust Anomaly Detection
Authors: Peng Xing, Dong Zhang, Jinhui Tang, Zechao li
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Anomaly detection (AD) has been extensively studied and applied in a wide range of scenarios in the recent past. However, there are still gaps between achieved and desirable levels of recognition accuracy for making AD for practical applications. In this paper, we start from an insightful analysis of two types of fundamental yet representative failure cases in the baseline model, and reveal reasons that hinder current AD methods from achieving a higher recognition accuracy. Specifically, by Case-1, we found that the main reasons detrimental to current AD methods is that the inputs to the recovery model contain a large number of detailed features to be recovered, which leads to the normal/abnormal area has-not/has been recovered into its original state. By Case-2, we surprisingly found that the abnormal area that cannot be recognized in image-level representations can be easily recognized in the feature-level representation. Based on the above observations, we propose a novel Recover-then-Discriminate (ReDi) framework for AD. ReDi takes a self-generated feature map and a selected prompted image as explicit input information to solve problems in case-1. Concurrently, a feature-level discriminative network is proposed to enhance abnormal differences between the recovered representation and the input representation. Extensive experimental results on two popular yet challenging AD datasets validate that ReDi achieves the new state-of-the-art accuracy.
Abstract
A novel approach for forest fire detection using image processing technique is proposed. A rule-based color model for fire pixel classification is used. The proposed algorithm uses RGB and YCbCr color space. The advantage of using YCbCr color space is that it can separate the luminance from the chrominance more effectively than RGB color space. The performance of the proposed algorithm is tested on two sets of images, one of which contains fire; the other contains fire-like regions. Standard methods are used for calculating the performance of the algorithm. The proposed method has both higher detection rate and lower false alarm rate. Since the algorithm is cheap in computation, it can be used for real-time forest fire detection.
Title:
UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection
Abstract
With the advancement of collaborative perception, the role of aerial-ground collaborative perception, a crucial component, is becoming increasingly important. The demand for collaborative perception across different perspectives to construct more comprehensive perceptual information is growing. However, challenges arise due to the disparities in the field of view between cross-domain agents and their varying sensitivity to information in images. Additionally, when we transform image features into Bird's Eye View (BEV) features for collaboration, we need accurate depth information. To address these issues, we propose a framework specifically designed for aerial-ground collaboration. First, to mitigate the lack of datasets for aerial-ground collaboration, we develop a virtual dataset named V2U-COO for our research. Second, we design a Cross-Domain Cross-Adaptation (CDCA) module to align the target information obtained from different domains, thereby achieving more accurate perception results. Finally, we introduce a Collaborative Depth Optimization (CDO) module to obtain more precise depth estimation results, leading to more accurate perception outcomes. We conduct extensive experiments on both our virtual dataset and a public dataset to validate the effectiveness of our framework. Our experiments on the V2U-COO dataset and the DAIR-V2X dataset demonstrate that our method improves detection accuracy by 6.1% and 2.7%, respectively.
Title:
UCDNet: Multi-UAV Collaborative 3D Object Detection Network by Reliable Feature Mapping
Abstract
Multi-UAV collaborative 3D object detection can perceive and comprehend complex environments by integrating complementary information, with applications encompassing traffic monitoring, delivery services and agricultural management. However, the extremely broad observations in aerial remote sensing and significant perspective differences across multiple UAVs make it challenging to achieve precise and consistent feature mapping from 2D images to 3D space in multi-UAV collaborative 3D object detection paradigm. To address the problem, we propose an unparalleled camera-based multi-UAV collaborative 3D object detection paradigm called UCDNet. Specifically, the depth information from the UAVs to the ground is explicitly utilized as a strong prior to provide a reference for more accurate and generalizable feature mapping. Additionally, we design a homologous points geometric consistency loss as an auxiliary self-supervision, which directly influences the feature mapping module, thereby strengthening the global consistency of multi-view perception. Experiments on AeroCollab3D and CoPerception-UAVs datasets show our method increases 4.7% and 10% mAP respectively compared to the baseline, which demonstrates the superiority of UCDNet.
Title:
LogiCode: an LLM-Driven Framework for Logical Anomaly Detection
Abstract
This paper presents LogiCode, a novel framework that leverages Large Language Models (LLMs) for identifying logical anomalies in industrial settings, moving beyond traditional focus on structural inconsistencies. By harnessing LLMs for logical reasoning, LogiCode autonomously generates Python codes to pinpoint anomalies such as incorrect component quantities or missing elements, marking a significant leap forward in anomaly detection technologies. A custom dataset "LOCO-Annotations" and a benchmark "LogiBench" are introduced to evaluate the LogiCode's performance across various metrics including binary classification accuracy, code generation success rate, and precision in reasoning. Findings demonstrate LogiCode's enhanced interpretability, significantly improving the accuracy of logical anomaly detection and offering detailed explanations for identified anomalies. This represents a notable shift towards more intelligent, LLM-driven approaches in industrial anomaly detection, promising substantial impacts on industry-specific applications.
Title:
Higher-order Structure Based Anomaly Detection on Attributed Networks
Abstract
Anomaly detection (such as telecom fraud detection and medical image detection) has attracted the increasing attention of people. The complex interaction between multiple entities widely exists in the network, which can reflect specific human behavior patterns. Such patterns can be modeled by higher-order network structures, thus benefiting anomaly detection on attributed networks. However, due to the lack of an effective mechanism in most existing graph learning methods, these complex interaction patterns fail to be applied in detecting anomalies, hindering the progress of anomaly detection to some extent. In order to address the aforementioned issue, we present a higher-order structure based anomaly detection (GUIDE) method. We exploit attribute autoencoder and structure autoencoder to reconstruct node attributes and higher-order structures, respectively. Moreover, we design a graph attention layer to evaluate the significance of neighbors to nodes through their higher-order structure differences. Finally, we leverage node attribute and higher-order structure reconstruction errors to find anomalies. Extensive experiments on five real-world datasets (i.e., ACM, Citation, Cora, DBLP, and Pubmed) are implemented to verify the effectiveness of GUIDE. Experimental results in terms of ROC-AUC, PR-AUC, and Recall@K show that GUIDE significantly outperforms the state-of-art methods.
Title:
Unsupervised representation learning with Hebbian synaptic and structural plasticity in brain-like feedforward neural networks
Authors: Naresh Ravichandran, Anders Lansner, Pawel Herman
Subjects: Subjects:
Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Abstract
Neural networks that can capture key principles underlying brain computation offer exciting new opportunities for developing artificial intelligence and brain-like computing algorithms. Such networks remain biologically plausible while leveraging localized forms of synaptic learning rules and modular network architecture found in the neocortex. Compared to backprop-driven deep learning approches, they provide more suitable models for deploying on neuromorphic hardware and have greater potential for scalability on large-scale computing clusters. The development of such brain-like neural networks depends on having a learning procedure that can build effective internal representations from data. In this work, we introduce and evaluate a brain-like neural network model capable of unsupervised representation learning. It builds on the Bayesian Confidence Propagation Neural Network (BCPNN), which has earlier been implemented as abstract as well as biophyscially detailed recurrent attractor neural networks explaining various cortical associative memory phenomena. Here we developed a feedforward BCPNN model to perform representation learning by incorporating a range of brain-like attributes derived from neocortical circuits such as cortical columns, divisive normalization, Hebbian synaptic plasticity, structural plasticity, sparse activity, and sparse patchy connectivity. The model was tested on a diverse set of popular machine learning benchmarks: grayscale images (MNIST, Fashion-MNIST), RGB natural images (SVHN, CIFAR-10), QSAR (MUV, HIV), and malware detection (EMBER). The performance of the model when using a linear classifier to predict the class labels fared competitively with conventional multi-layer perceptrons and other state-of-the-art brain-like neural networks.
Title:
Interpretable Multimodal Out-of-context Detection with Soft Logic Regularization
Authors: Huanhuan Ma, Jinghao Zhang, Qiang Liu, Shu Wu, Liang Wang
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
The rapid spread of information through mobile devices and media has led to the widespread of false or deceptive news, causing significant concerns in society. Among different types of misinformation, image repurposing, also known as out-of-context misinformation, remains highly prevalent and effective. However, current approaches for detecting out-of-context misinformation often lack interpretability and offer limited explanations. In this study, we propose a logic regularization approach for out-of-context detection called LOGRAN (LOGic Regularization for out-of-context ANalysis). The primary objective of LOGRAN is to decompose the out-of-context detection at the phrase level. By employing latent variables for phrase-level predictions, the final prediction of the image-caption pair can be aggregated using logical rules. The latent variables also provide an explanation for how the final result is derived, making this fine-grained detection method inherently explanatory. We evaluate the performance of LOGRAN on the NewsCLIPpings dataset, showcasing competitive overall results. Visualized examples also reveal faithful phrase-level predictions of out-of-context images, accompanied by explanations. This highlights the effectiveness of our approach in addressing out-of-context detection and enhancing interpretability.
Title:
EGOR: Efficient Generated Objects Replay for incremental object detection
Authors: Zijia An, Boyu Diao, Libo Huang, Ruiqi Liu, Zhulin An, Yongjun Xu
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Incremental object detection aims to simultaneously maintain old-class accuracy and detect emerging new-class objects in incremental data. Most existing distillation-based methods underperform when unlabeled old-class objects are absent in the incremental dataset. While the absence can be mitigated by generating old-class samples, it also incurs high computational costs. In this paper, we argue that the extra computational cost stems from the inconsistency between the detector and the generative model, along with redundant generation. To overcome this problem, we propose Efficient Generated Object Replay (EGOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We also propose augmented replay to reuse the objects in generated samples, thereby reducing the redundant generation. In addition, we propose high-response knowledge distillation focusing on the knowledge related to the old class, which transfers the knowledge in generated objects to the incremental detector. With the addition of the generated objects and losses, we observe a bias towards old classes in the detector. We balance the losses for old and new classes to alleviate the bias, thereby increasing the overall detection accuracy. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in the absence of old-class objects.
Title:
HateDebias: On the Diversity and Variability of Hate Speech Debiasing
Abstract
Hate speech on social media is ubiquitous but urgently controlled. Without detecting and mitigating the biases brought by hate speech, different types of ethical problems. While a number of datasets have been proposed to address the problem of hate speech detection, these datasets seldom consider the diversity and variability of bias, making it far from real-world scenarios. To fill this gap, we propose a benchmark, named HateDebias, to analyze the model ability of hate speech detection under continuous, changing environments. Specifically, to meet the diversity of biases, we collect existing hate speech detection datasets with different types of biases. To further meet the variability (i.e., the changing of bias attributes in datasets), we reorganize datasets to follow the continuous learning setting. We evaluate the detection accuracy of models trained on the datasets with a single type of bias with the performance on the HateDebias, where a significant performance drop is observed. To provide a potential direction for debiasing, we further propose a debiasing framework based on continuous learning and bias information regularization, as well as the memory replay strategies to ensure the debiasing ability of the model. Experiment results on the proposed benchmark show that the aforementioned method can improve several baselines with a distinguished margin, highlighting its effectiveness in real-world applications.
Title:
Seeing the Unseen: Visual Metaphor Captioning for Videos
Abstract
Metaphors are a common communication tool used in our day-to-day life. The detection and generation of metaphors in textual form have been studied extensively but metaphors in other forms have been under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. As of now, no probing studies have been done that involve complex language phenomena like metaphors with videos. Hence, we introduce a new VL task of describing the metaphors present in the videos in our work. To facilitate this novel task, we construct and release a manually created dataset with 705 videos and 2115 human-written captions, along with a new metric called Average Concept Distance (ACD), to automatically evaluate the creativity of the metaphors generated. We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task. We perform a comprehensive analysis of existing video language models on this task and publish our dataset, models, and benchmark results to enable further research.
Title:
Sexism Detection on a Data Diet
Authors: Rabiraj Bandyopadhyay, Dennis Assenmacher, Jose M.Alonso Moral, Claudia Wagner
Subjects: Subjects:
Computation and Language (cs.CL)
Abstract
There is an increase in the proliferation of online hate commensurate with the rise in the usage of social media. In response, there is also a significant advancement in the creation of automated tools aimed at identifying harmful text content using approaches grounded in Natural Language Processing and Deep Learning. Although it is known that training Deep Learning models require a substantial amount of annotated data, recent line of work suggests that models trained on specific subsets of the data still retain performance comparable to the model that was trained on the full dataset. In this work, we show how we can leverage influence scores to estimate the importance of a data point while training a model and designing a pruning strategy applied to the case of sexism detection. We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets and find, that in accordance with other work a large fraction of instances can be removed without significant performance drop. However, we also discover that the strategies for pruning data, previously successful in Natural Language Inference tasks, do not readily apply to the detection of harmful content and instead amplify the already prevalent class imbalance even more, leading in the worst-case to a complete absence of the hateful class.
Title:
Concept Drift Detection using Ensemble of Integrally Private Models
Authors: Ayush K. Varshney, Vicenc Torra
Subjects: Subjects:
Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
Deep neural networks (DNNs) are one of the most widely used machine learning algorithm. DNNs requires the training data to be available beforehand with true labels. This is not feasible for many real-world problems where data arrives in the streaming form and acquisition of true labels are scarce and expensive. In the literature, not much focus has been given to the privacy prospect of the streaming data, where data may change its distribution frequently. These concept drifts must be detected privately in order to avoid any disclosure risk from DNNs. Existing privacy models use concept drift detection schemes such ADWIN, KSWIN to detect the drifts. In this paper, we focus on the notion of integrally private DNNs to detect concept drifts. Integrally private DNNs are the models which recur frequently from different datasets. Based on this, we introduce an ensemble methodology which we call 'Integrally Private Drift Detection' (IPDD) method to detect concept drift from private models. Our IPDD method does not require labels to detect drift but assumes true labels are available once the drift has been detected. We have experimented with binary and multi-class synthetic and real-world data. Our experimental results show that our methodology can privately detect concept drift, has comparable utility (even better in some cases) with ADWIN and outperforms utility from different levels of differentially private models. The source code for the paper is available \hyperlink{this https URL}{here}.
Title:
RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
Abstract
The recent advancements in generative AI models, which can create realistic and human-like content, are significantly transforming how people communicate, create, and work. While the appropriate use of generative AI models can benefit the society, their misuse poses significant threats to data reliability and authentication. However, due to a lack of aligned multimodal datasets, effective and robust methods for detecting machine-generated content are still in the early stages of development. In this paper, we introduce RU-AI, a new large-scale multimodal dataset designed for the robust and efficient detection of machine-generated content in text, image, and voice. Our dataset is constructed from three large publicly available datasets: Flickr8K, COCO, and Places205, by combining the original datasets and their corresponding machine-generated pairs. Additionally, experimental results show that our proposed unified model, which incorporates a multimodal embedding module with a multilayer perceptron network, can effectively determine the origin of the data (i.e., original data samples or machine-generated ones) from RU-AI. However, future work is still required to address the remaining challenges posed by RU-AI. The source code and dataset are available at this https URL.
Title:
PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs
Authors: Binglei Lou, Richard Rademacher, David Boland, Philip H.W. Leong
Abstract
FPGAs have distinct advantages as a technology for deploying deep neural networks (DNNs) at the edge. Lookup Table (LUT) based networks, where neurons are directly modelled using LUTs, help maximize this promise of offering ultra-low latency and high area efficiency on FPGAs. Unfortunately, LUT resource usage scales exponentially with the number of inputs to the LUT, restricting PolyLUT to small LUT sizes. This work introduces PolyLUT-Add, a technique that enhances neuron connectivity by combining $A$ PolyLUT sub-neurons via addition to improve accuracy. Moreover, we describe a novel architecture to improve its scalability. We evaluated our implementation over the MNIST, Jet Substructure classification and Network Intrusion Detection benchmark and found that for similar accuracy, PolyLUT-Add achieves a LUT reduction of $1.3-7.7\times$ with a $1.2-2.2\times$ decrease in latency.
Title:
Faster Than Lies: Real-time Deepfake Detection using Binary Neural Networks
Authors: Lanzino Romeo, Fontana Federico, Diko Anxhelo, Marini Marco Raoul, Cinque Luigi
Abstract
Deepfake detection aims to contrast the spread of deep-generated media that undermines trust in online content. While existing methods focus on large and complex models, the need for real-time detection demands greater efficiency. With this in mind, unlike previous work, we introduce a novel deepfake detection approach on images using Binary Neural Networks (BNNs) for fast inference with minimal accuracy loss. Moreover, our method incorporates Fast Fourier Transform (FFT) and Local Binary Pattern (LBP) as additional channel features to uncover manipulation traces in frequency and texture domains. Evaluations on COCOFake, DFFD, and CIFAKE datasets demonstrate our method's state-of-the-art performance in most scenarios with a significant efficiency gain of up to a $20\times$ reduction in FLOPs during inference. Finally, by exploring BNNs in deepfake detection to balance accuracy and efficiency, this work paves the way for future research on efficient deepfake detection.
Title:
Nacala-Roof-Material: Drone Imagery for Roof Detection, Classification, and Segmentation to Support Mosquito-borne Disease Risk Assessment
Authors: Venkanna Babu Guthula, Stefan Oehmcke, Remigio Chilaule, Hui Zhang, Nico Lang, Ankit Kariryaa, Johan Mottelson, Christian Igel
Abstract
As low-quality housing and in particular certain roof characteristics are associated with an increased risk of malaria, classification of roof types based on remote sensing imagery can support the assessment of malaria risk and thereby help prevent the disease. To support research in this area, we release the Nacala-Roof-Material dataset, which contains high-resolution drone images from Mozambique with corresponding labels delineating houses and specifying their roof types. The dataset defines a multi-task computer vision problem, comprising object detection, classification, and segmentation. In addition, we benchmarked various state-of-the-art approaches on the dataset. Canonical U-Nets, YOLOv8, and a custom decoder on pretrained DINOv2 served as baselines. We show that each of the methods has its advantages but none is superior on all tasks, which highlights the potential of our dataset for future research in multi-task learning. While the tasks are closely related, accurate segmentation of objects does not necessarily imply accurate instance separation, and vice versa. We address this general issue by introducing a variant of the deep ordinal watershed (DOW) approach that additionally separates the interior of objects, allowing for improved object delineation and separation. We show that our DOW variant is a generic approach that improves the performance of both U-Net and DINOv2 backbones, leading to a better trade-off between semantic segmentation and instance segmentation.
Title:
From cryptomarkets to the surface web: Scouting eBay for counterfeits
Authors: Felix Soldner, Fabian Plum, Bennett Kleinberg, Shane D Johnson
Subjects: Subjects:
Social and Information Networks (cs.SI)
Abstract
Detecting counterfeits on online marketplaces is challenging, and current methods struggle with the volume of sales on platforms like eBay, while cryptomarkets openly sell counterfeits. Leveraging information from 453 cryptomarket counterfeits, we automated a search for corresponding products on eBay, utilizing image and text similarity metrics. We collected data twice over 4-months to analyze changes with an average of 159 eBay products per cryptomarket item, totaling 134k products. We found identical products, which would warrant further investigation as to whether they are counterfeits. Results indicate increasing difficulty finding similar products over time, moderated by product type and origin. Future improved versions of the current system could be used to examine possible connections between cryptomarket and surface web listings more closely and could hold practical value in supporting the detection of counterfeits on the surface web.
Title:
GANetic Loss for Generative Adversarial Networks with a Focus on Medical Applications
Authors: Shakhnaz Akhmedova, Nils Körber
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Generative adversarial networks (GANs) are machine learning models that are used to estimate the underlying statistical structure of a given dataset and as a result can be used for a variety of tasks such as image generation or anomaly detection. Despite their initial simplicity, designing an effective loss function for training GANs remains challenging, and various loss functions have been proposed aiming to improve the performance and stability of the generative models. In this study, loss function design for GANs is presented as an optimization problem solved using the genetic programming (GP) approach. Initial experiments were carried out using small Deep Convolutional GAN (DCGAN) model and the MNIST dataset, in order to search experimentally for an improved loss function. The functions found were evaluated on CIFAR10, with the best function, named GANetic loss, showing exceptionally better performance and stability compared to the losses commonly used for GAN training. To further evalute its general applicability on more challenging problems, GANetic loss was applied for two medical applications: image generation and anomaly detection. Experiments were performed with histopathological, gastrointestinal or glaucoma images to evaluate the GANetic loss in medical image generation, resulting in improved image quality compared to the baseline models. The GANetic Loss used for polyp and glaucoma images showed a strong improvement in the detection of anomalies. In summary, the GANetic loss function was evaluated on multiple datasets and applications where it consistently outperforms alternative loss functions. Moreover, GANetic loss leads to stable training and reproducible results, a known weak spot of GANs.
Title:
I2EDL: Interactive Instruction Error Detection and Localization
Authors: Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Yiming Wang
Subjects: Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract
In the Vision-and-Language Navigation in Continuous Environments (VLN-CE) task, the human user guides an autonomous agent to reach a target goal via a series of low-level actions following a textual instruction in natural language. However, most existing methods do not address the likely case where users may make mistakes when providing such instruction (e.g. "turn left" instead of "turn right"). In this work, we address a novel task of Interactive VLN in Continuous Environments (IVLN-CE), which allows the agent to interact with the user during the VLN-CE navigation to verify any doubts regarding the instruction errors. We propose an Interactive Instruction Error Detector and Localizer (I2EDL) that triggers the user-agent interaction upon the detection of instruction errors during the navigation. We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations. In such way, the agent is able to query the user for a timely correction, without demanding the user's cognitive load, as we locate the probable errors to a precise part of the instruction. We evaluate the proposed I2EDL on a dataset of instructions containing errors, and further devise a novel metric, the Success weighted by Interaction Number (SIN), to reflect both the navigation performance and the interaction effectiveness. We show how the proposed method can ask focused requests for corrections to the user, which in turn increases the navigation success, while minimizing the interactions.
Keyword: face recognition
There is no result
Keyword: augmentation
Title:
Enhancing Size Generalization in Graph Neural Networks through Disentangled Representation Learning
Authors: Zheng Huang, Qihui Yang, Dawei Zhou, Yujun Yan
Abstract
Although most graph neural networks (GNNs) can operate on graphs of any size, their classification performance often declines on graphs larger than those encountered during training. Existing methods insufficiently address the removal of size information from graph representations, resulting in sub-optimal performance and reliance on backbone models. In response, we propose DISGEN, a novel and model-agnostic framework designed to disentangle size factors from graph representations. DISGEN employs size- and task-invariant augmentations and introduces a decoupling loss that minimizes shared information in hidden representations, with theoretical guarantees for its effectiveness. Our empirical results show that DISGEN outperforms the state-of-the-art models by up to 6% on real-world datasets, underscoring its effectiveness in enhancing the size generalizability of GNNs. Our codes are available at: this https URL.
Title:
Cooperative Meta-Learning with Gradient Augmentation
Abstract
Model agnostic meta-learning (MAML) is one of the most widely used gradient-based meta-learning, consisting of two optimization loops: an inner loop and outer loop. MAML learns the new task from meta-initialization parameters with an inner update and finds the meta-initialization parameters in the outer loop. In general, the injection of noise into the gradient of the model for augmenting the gradient is one of the widely used regularization methods. In this work, we propose a novel cooperative meta-learning framework dubbed CML which leverages gradient-level regularization with gradient augmentation. We inject learnable noise into the gradient of the model for the model generalization. The key idea of CML is introducing the co-learner which has no inner update but the outer loop update to augment gradients for finding better meta-initialization parameters. Since the co-learner does not update in the inner loop, it can be easily deleted after meta-training. Therefore, CML infers with only meta-learner without additional cost and performance degradation. We demonstrate that CML is easily applicable to gradient-based meta-learning methods and CML leads to increased performance in few-shot regression, few-shot image classification and few-shot node classification tasks. Our codes are at this https URL.
Title:
Annotating FrameNet via Structure-Conditioned Language Generation
Authors: Xinyue Cui, Swabha Swayamdipta
Subjects: Subjects:
Computation and Language (cs.CL)
Abstract
Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remain understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective at training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits under higher resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges with automating linguistic annotation tasks.
Title:
Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments
Abstract
Forecasting indoor temperatures is important to achieve efficient control of HVAC systems. In this task, the limited data availability presents a challenge as most of the available data is acquired during standard operation where extreme scenarios and transitory regimes such as major temperature increases or decreases are de-facto excluded. Acquisition of such data requires significant energy consumption and a dedicated facility, hindering the quantity and diversity of available data. Cost related constraints however do not allow for continuous year-around acquisition. To address this, we investigate the efficacy of data augmentation techniques leveraging SoTA AI-based methods for synthetic data generation. Inspired by practical and experimental motivations, we explore fusion strategies of real and synthetic data to improve forecasting models. This approach alleviates the need for continuously acquiring extensive time series data, especially in contexts involving repetitive heating and cooling cycles in buildings. In our evaluation 1) we assess the performance of synthetic data generators independently, particularly focusing on SoTA AI-based methods; 2) we measure the utility of incorporating synthetically augmented data in a subsequent forecasting tasks where we employ a simple model in two distinct scenarios: 1) we first examine an augmentation technique that combines real and synthetically generated data to expand the training dataset, 2) we delve into utilizing synthetic data to tackle dataset imbalances. Our results highlight the potential of synthetic data augmentation in enhancing forecasting accuracy while mitigating training variance. Through empirical experiments, we show significant improvements achievable by integrating synthetic data, thereby paving the way for more robust forecasting models in low-data regime.
Title:
Semantic Segmentation on VSPW Dataset through Masked Video Consistency
Abstract
Pixel-level Video Understanding requires effectively integrating three-dimensional data in both spatial and temporal dimensions to learn accurate and stable semantic information from continuous frames. However, existing advanced models on the VSPW dataset have not fully modeled spatiotemporal relationships. In this paper, we present our solution for the PVUW competition, where we introduce masked video consistency (MVC) based on existing models. MVC enforces the consistency between predictions of masked frames where random patches are withheld. The model needs to learn the segmentation results of the masked parts through the context of images and the relationship between preceding and succeeding frames of the video. Additionally, we employed test-time augmentation, model aggeregation and a multimodal model-based post-processing method. Our approach achieves 67.27% mIoU performance on the VSPW dataset, ranking 2nd place in the PVUW2024 challenge VSS track.
Title:
Clarifying Myths About the Relationship Between Shape Bias, Accuracy, and Robustness
Authors: Zahra Golpayegani, Patrick St-Amant, Nizar Bouguila
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Deep learning models can perform well when evaluated on images from the same distribution as the training set. However, applying small perturbations in the forms of noise, artifacts, occlusions, blurring, etc. to a model's input image and feeding the model with out-of-distribution (OOD) data can significantly drop the model's accuracy, making it not applicable to real-world scenarios. Data augmentation is one of the well-practiced methods to improve model robustness against OOD data; however, examining which augmentation type to choose and how it affects the OOD robustness remains understudied. There is a growing belief that augmenting datasets using data augmentations that improve a model's bias to shape-based features rather than texture-based features results in increased OOD robustness for Convolutional Neural Networks trained on the ImageNet-1K dataset. This is usually stated as ``an increase in the model's shape bias results in an increase in its OOD robustness". Based on this hypothesis, some works in the literature aim to find augmentations with higher effects on model shape bias and use those for data augmentation. By evaluating 39 types of data augmentations on a widely used OOD dataset, we demonstrate the impact of each data augmentation on the model's robustness to OOD data and further show that the mentioned hypothesis is not true; an increase in shape bias does not necessarily result in higher OOD robustness. By analyzing the results, we also find some biases in the ImageNet-1K dataset that can easily be reduced using proper data augmentation. Our evaluation results further show that there is not necessarily a trade-off between in-domain accuracy and OOD robustness, and choosing the proper augmentations can help increase both in-domain accuracy and OOD robustness simultaneously.
Title:
A Novel Time Series-to-Image Encoding Approach for Weather Phenomena Classification
Authors: Christian Giannetti
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Rainfall estimation through the analysis of its impact on electromagnetic waves has sparked increasing interest in the research community. Recent studies have delved into its effects on cellular network performance, demonstrating the potential to forecast rainfall levels based on electromagnetic wave attenuation during precipitations. This paper aims to solve the problem of identifying the nature of specific weather phenomena from the received signal level (RSL) in 4G/LTE mobile terminals. Specifically, utilizing time-series data representing RSL, we propose a novel approach to encode time series as images and model the task as an image classification problem, which we finally address using convolutional neural networks (CNNs). The main benefit of the abovementioned procedure is the opportunity to utilize various data augmentation techniques simultaneously. This encompasses applying traditional approaches, such as moving averages, to the time series and enhancing the generated images. We have investigated various image data augmentation methods to identify the most effective combination for this scenario. In the upcoming sections, we will introduce the task of rainfall estimation and conduct a comprehensive analysis of the dataset used. Subsequently, we will formally propose a new approach for converting time series into images. To conclude, the paper's final section will present and discuss the experiments conducted, providing the reader with a brief yet comprehensive overview of the results.
Keyword: detection
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Keyword: face recognition
There is no result
Keyword: augmentation
Title:
Title:
Title:
Title:
Title:
Title:
Title: