New submissions for Fri, 4 Dec 20

Keyword: detection

Video Anomaly Detection by Estimating Likelihood of Representations

Authors: Yuqi Ouyang, Victor Sanchez
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2012.01468
Pdf link: https://arxiv.org/pdf/2012.01468
Abstract Video anomaly detection is a challenging task not only because it involves solving many sub-tasks such as motion representation, object localization and action recognition, but also because it is commonly considered as an unsupervised learning problem that involves detecting outliers. Traditionally, solutions to this task have focused on the mapping between video frames and their low-dimensional features, while ignoring the spatial connections of those features. Recent solutions focus on analyzing these spatial connections by using hard clustering techniques, such as K-Means, or applying neural networks to map latent features to a general understanding, such as action attributes. In order to solve video anomaly in the latent feature space, we propose a deep probabilistic model to transfer this task into a density estimation problem where latent manifolds are generated by a deep denoising autoencoder and clustered by expectation maximization. Evaluations on several benchmarks datasets show the strengths of our model, achieving outstanding performance on challenging datasets.
Differential Morphed Face Detection Using Deep Siamese Networks
Authors: Sobhan Soleymani, Baaria Chaudhary, Ali Dabouei, Jeremy Dawson, Nasser M. Nasrabadi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01541
Pdf link: https://arxiv.org/pdf/2012.01541
Abstract Although biometric facial recognition systems are fast becoming part of security applications, these systems are still vulnerable to morphing attacks, in which a facial reference image can be verified as two or more separate identities. In border control scenarios, a successful morphing attack allows two or more people to use the same passport to cross borders. In this paper, we propose a novel differential morph attack detection framework using a deep Siamese network. To the best of our knowledge, this is the first research work that makes use of a Siamese network architecture for morph attack detection. We compare our model with other classical and deep learning models using two distinct morph datasets, VISAPP17 and MorGAN. We explore the embedding space generated by the contrastive loss using three decision making frameworks using Euclidean distance, feature difference and a support vector machine classifier, and feature concatenation and a support vector machine classifier.
Mutual Information Maximization on Disentangled Representations for Differential Morph Detection
Authors: Sobhan Soleymani, Ali Dabouei, Fariborz Taherkhani, Jeremy Dawson, Nasser M. Nasrabadi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01542
Pdf link: https://arxiv.org/pdf/2012.01542
Abstract In this paper, we present a novel differential morph detection framework, utilizing landmark and appearance disentanglement. In our framework, the face image is represented in the embedding domain using two disentangled but complementary representations. The network is trained by triplets of face images, in which the intermediate image inherits the landmarks from one image and the appearance from the other image. This initially trained network is further trained for each dataset using contrastive representations. We demonstrate that, by employing appearance and landmark disentanglement, the proposed framework can provide state-of-the-art differential morph detection performance. This functionality is achieved by the using distances in landmark, appearance, and ID domains. The performance of the proposed framework is evaluated using three morph datasets generated with different methodologies.
Multimodal Contact Detection using Auditory and Force Features for Reliable Object Placing in Household Environments
Authors: Jaime Maldonado, Asil Kaan Bozcuoğlu, Christoph Zetzsche
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2012.01583
Pdf link: https://arxiv.org/pdf/2012.01583
Abstract Typical contact detection is based on the monitoring of a threshold value in the force and torque signals. The selection of a threshold is challenging for robots operating in unstructured or highly dynamic environments, such in a household setting, due to the variability of the characteristics of the objects that might be encountered. We propose a multimodal contact detection approach using time and frequency domain features which model the distinctive characteristics of contact events in the auditory and haptic modalities. In our approach the monitoring of force and torque thresholds is not necessary as detection is based on the characteristics of force and torque signals in the frequency domain together with the impact sound generated by the manipulation task. We evaluated our approach with a typical glass placing task in a household setting. Our experimental results show that robust contact detection (99.94% mean cross-validation accuracy) is possible independent of force/torque threshold values and suitable of being implemented for operation in highly dynamic scenarios.
SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change
Authors: Maurício Gruppi, Sibel Adali, Pin-Yu Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2012.01603
Pdf link: https://arxiv.org/pdf/2012.01603
Abstract This paper describes SChME (Semantic Change Detection with Model Ensemble), a method usedin SemEval-2020 Task 1 on unsupervised detection of lexical semantic change. SChME usesa model ensemble combining signals of distributional models (word embeddings) and wordfrequency models where each model casts a vote indicating the probability that a word sufferedsemantic change according to that feature. More specifically, we combine cosine distance of wordvectors combined with a neighborhood-based metric we named Mapped Neighborhood Distance(MAP), and a word frequency differential metric as input signals to our model. Additionally,we explore alignment-based methods to investigate the importance of the landmarks used in thisprocess. Our results show evidence that the number of landmarks used for alignment has a directimpact on the predictive performance of the model. Moreover, we show that languages that sufferless semantic change tend to benefit from using a large number of landmarks, whereas languageswith more semantic change benefit from a more careful choice of landmark number for alignment.
Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D
Authors: Ankit Goyal, Kaiyu Yang, Dawei Yang, Jia Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01634
Pdf link: https://arxiv.org/pdf/2012.01634
Abstract Understanding spatial relations (e.g., "laptop on table") in visual input is important for both humans and robots. Existing datasets are insufficient as they lack large-scale, high-quality 3D ground truth information, which is critical for learning spatial relations. In this paper, we fill this gap by constructing Rel3D: the first large-scale, human-annotated dataset for grounding spatial relations in 3D. Rel3D enables quantifying the effectiveness of 3D information in predicting spatial relations on large-scale human data. Moreover, we propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias. The 3D scenes in our dataset come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other. We empirically validate that minimally contrastive examples can diagnose issues with current relation detection models as well as lead to sample-efficient training. Code and data are available at https://github.com/princeton-vl/Rel3D.
Motion-based Camera Localization System in Colonoscopy Videos
Authors: Heming Yao, Ryan W. Stidham, Jonathan Gryak, Kayvan Najarian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2012.01690
Pdf link: https://arxiv.org/pdf/2012.01690
Abstract Optical colonoscopy is an essential diagnostic and prognostic tool for many gastrointestinal diseases including cancer screening and staging, intestinal bleeding, diarrhea, abdominal symptom evaluation, and inflammatory bowel disease assessment. However, the evaluation, classification, and quantification of findings on colonoscopy are subject to inter-observer variation. Automated assessment of colonoscopy is of interest considering the subjectivity present in qualitative human interpretations of colonoscopy findings. Localization of the camera is an essential element to consider when inferring the meaning and context of findings for diseases evaluated by colonoscopy. In this study, we proposed a camera localization system to estimate the approximate anatomic location of the camera and classify the anatomical colon segment the camera is in. The camera localization system starts with non-informative frame detection to remove frames without camera motion information. Then a self-training end-to-end convolutional neural network was built to estimate the camera motion. With the estimated camera motion, the camera trajectory can be derived, and the location index can be calculated. Based on the estimated location index, anatomical colon segment classification was performed by building the colon template. The algorithm was trained and validated using colonoscopy videos collected from routine clinical practice. From our results, the average accuracy of the classification is 0.759, which is substantially higher than the performance of using the location index built from other methods.
Feature-Based Software Design Pattern Detection
Authors: Najam Nazar, Aldeida Aleti
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2012.01708
Pdf link: https://arxiv.org/pdf/2012.01708
Abstract Software design patterns are standard solutions to common problems in software design and architecture. Knowing that a particular module implements a design pattern is a shortcut to design comprehension. Manually detecting design patterns is a time consuming and challenging task; therefore, researchers have proposed automatic design patterns detection techniques to facilitate software developers. However, these techniques show low performance for certain design patterns. In this work, we introduce an approach that improves the performance over the state-of-the-art by using code features with machine learning classifiers to automatically train a design pattern detection. We create a semantic representation of source code from the code features and the call graph, and apply the Word2Vec algorithm on the semantic representation to construct the word-space geometric model of the Java source code. DPD_F then uses a Machine Learning approach trained using the word-space model and identifies software design patterns with 78% Precision and 76% Recall. Additionally, we have compared our results with two existing design pattern detection approaches namely FeatureMaps & MARPLE-DPD. Empirical results demonstrate that our approach outperforms the benchmark approaches by 30\% and 10\% respectively in terms of Precision. The runtime performance also supports its practical applicability.
A Study on the Autoregressive and non-Autoregressive Multi-label Learning
Authors: Elham J. Barezi, Iacer Calixto, Kyunghyun Cho, Pascale Fung
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01711
Pdf link: https://arxiv.org/pdf/2012.01711
Abstract Extreme classification tasks are multi-label tasks with an extremely large number of labels (tags). These tasks are hard because the label space is usually (i) very large, e.g. thousands or millions of labels, (ii) very sparse, i.e. very few labels apply to each input document, and (iii) highly correlated, meaning that the existence of one label changes the likelihood of predicting all other labels. In this work, we propose a self-attention based variational encoder-model to extract the label-label and label-feature dependencies jointly and to predict labels for a given input. In more detail, we propose a non-autoregressive latent variable model and compare it to a strong autoregressive baseline that predicts a label based on all previously generated labels. Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature dependencies through latent variables, and compares favourably to the autoregressive baseline. We apply our models to four standard extreme classification natural language data sets, and one news videos dataset for automated label detection from a lexicon of semantic concepts. Experimental results show that although the autoregressive models, where use a given order of the labels for chain-order label prediction, work great for the small scale labels or the prediction of the highly ranked label, but our non-autoregressive model surpasses them by around 2% to 6% when we need to predict more labels, or the dataset has a larger number of the labels.
Learning Disentangled Intent Representations for Zero-shot Intent Detection
Authors: Qingyi Si, Yuanxin Liu, Peng Fu, Jiangnan Li, Zheng Lin, Weiping Wang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01721
Pdf link: https://arxiv.org/pdf/2012.01721
Abstract Zero-shot intent detection (ZSID) aims to deal with the continuously emerging intents without annotated training data. However, existing ZSID systems suffer from two limitations: 1) They are not good at modeling the relationship between seen and unseen intents, when the label names are given in the form of raw phrases or sentences. 2) They cannot effectively recognize unseen intents under the generalized intent detection (GZSID) setting. A critical factor behind these limitations is the representations of unseen intents, which cannot be learned in the training stage. To address this problem, we propose a class-transductive framework that utilizes unseen class labels to learn Disentangled Intent Representations (DIR). Specifically, we allow the model to predict unseen intents in the training stage, with the corresponding label names serving as input utterances. Under this framework, we introduce a multi-task learning objective, which encourages the model to learn the distinctions among intents, and a similarity scorer, which estimates the connections among intents more accurately based on the learned intent representations. Since the purpose of DIR is to provide better intent representations, it can be easily integrated with existing ZSID and GZSID methods. Experiments on two real-world datasets show that the proposed framework brings consistent improvement to the baseline systems, regardless of the model architectures or zero-shot learning strategies.
Parallel Residual Bi-Fusion Feature Pyramid Network for Accurate Single-Shot Object Detection
Authors: Ping-Yang Chen, Ming-Ching Chang, Jun-Wei Hsieh, Yong-Sheng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01724
Pdf link: https://arxiv.org/pdf/2012.01724
Abstract We propose the Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN) for fast and accurate single-shot object detection. Feature Pyramid (FP) is widely used in recent visual detection, however the top-down pathway of FP cannot preserve accurate localization due to pooling shifting. The advantage of FP is weaken as deeper backbones with more layers are used. To address this issue, we propose a new parallel FP structure with bi-directional (top-down and bottom-up) fusion and associated improvements to retain high-quality features for accurate localization. Our method is particularly suitable for detecting small objects. We provide the following design improvements: (1) A parallel bifusion FP structure with a Bottom-up Fusion Module (BFM) to detect both small and large objects at once with high accuracy. (2) A COncatenation and RE-organization (CORE) module provides a bottom-up pathway for feature fusion, which leads to the bi-directional fusion FP that can recover lost information from lower-layer feature maps. (3) The CORE feature is further purified to retain richer contextual information. Such purification is performed with CORE in a few iterations in both top-down and bottom-up pathways. (4) The adding of a residual design to CORE leads to a new Re-CORE module that enables easy training and integration with a wide range of (deeper or lighter) backbones. The proposed network achieves state-of-the-art performance on UAVDT17 and MS COCO datasets.
Dual Refinement Feature Pyramid Networks for Object Detection
Authors: Jialiang Ma, Bin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01733
Pdf link: https://arxiv.org/pdf/2012.01733
Abstract FPN is a common component used in object detectors, it supplements multi-scale information by adjacent level features interpolation and summation. However, due to the existence of nonlinear operations and the convolutional layers with different output dimensions, the relationship between different levels is much more complex, the pixel-wise summation is not an efficient approach. In this paper, we first analyze the design defects from pixel level and feature map level. Then, we design a novel parameter-free feature pyramid networks named Dual Refinement Feature Pyramid Networks (DRFPN) for the problems. Specifically, DRFPN consists of two modules: Spatial Refinement Block (SRB) and Channel Refinement Block (CRB). SRB learns the location and content of sampling points based on contextual information between adjacent levels. CRB learns an adaptive channel merging method based on attention mechanism. Our proposed DRFPN can be easily plugged into existing FPN-based models. Without bells and whistles, for two-stage detectors, our model outperforms different FPN-based counterparts by 1.6 to 2.2 AP on the COCO detection benchmark, and 1.5 to 1.9 AP on the COCO segmentation benchmark. For one-stage detectors, DRFPN improves anchor-based RetinaNet by 1.9 AP and anchor-free FCOS by 1.3 AP when using ResNet50 as backbone. Extensive experiments verifies the robustness and generalization ability of DRFPN. The code will be made publicly available.
Automatic Routability Predictor Development Using Neural Architecture Search
Authors: Jingyu Pan, Chen-Chia Chang, Tunhou Zhang, Zhiyao Xie, Jiang Hu, Weiyi Qi, Chung-Wei Lin, Rongjian Liang, Joydeep Mitra, Elias Fallon, Yiran Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01737
Pdf link: https://arxiv.org/pdf/2012.01737
Abstract The rise of machine learning technology inspires a boom of its applications in electronic design automation (EDA) and helps improve the degree of automation in chip designs. However, manually crafted machine learning models require extensive human expertise and tremendous engineering efforts. In this work, we leverage neural architecture search (NAS) to automatically develop high-quality neural architectures for routability prediction, which guides cell placement toward routable solutions. Experimental results demonstrate that the automatically generated neural architectures clearly outperform the manual solutions. Compared to the average case of manually designed models, NAS-generated models achieve $5.6\%$ higher Kendall's $\tau$ in predicting the number of nets with DRC violations and $1.95\%$ larger area under ROC curve (ROC-AUC) in DRC hotspots detection.
On Root Detection Strategies for Android Devices
Authors: Raphael Bialon
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2012.01812
Pdf link: https://arxiv.org/pdf/2012.01812
Abstract The Android operating system runs on the majority of smartphones nowadays. Its success is driven by its availability to a variety of smartphone hardware vendors on the one hand, and the customization possibilities given to its users on the other hand. While other big smartphone operating systems restrict user configuration to a given set of functionality, Android users can leverage the whole potential of their devices. This high degree of customization enabled by a process called rooting, where the users escalate their privileges to those of the operating system, introduces security, data integrity and privacy concerns. Several rooting detection mechanisms for Android devices already exist, aimed at different levels of detection. This paper introduces further strategies derived from the Linux ecosystem and outlines their usage on the Android platform. In addition, we present a novel remote rooting detection approach aimed at trust and integrity checks between devices in wireless networks.
D-Unet: A Dual-encoder U-Net for Image Splicing Forgery Detection and Localization
Authors: Xiuli Bi, Yanbin Liu, Bin Xiao, Weisheng Li, Chi-Man Pun, Guoyin Wang, Xinbo Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01821
Pdf link: https://arxiv.org/pdf/2012.01821
Abstract Recently, many detection methods based on convolutional neural networks (CNNs) have been proposed for image splicing forgery detection. Most of these detection methods focus on the local patches or local objects. In fact, image splicing forgery detection is a global binary classification task that distinguishes the tampered and non-tampered regions by image fingerprints. However, some specific image contents are hardly retained by CNN-based detection networks, but if included, would improve the detection accuracy of the networks. To resolve these issues, we propose a novel network called dual-encoder U-Net (D-Unet) for image splicing forgery detection, which employs an unfixed encoder and a fixed encoder. The unfixed encoder autonomously learns the image fingerprints that differentiate between the tampered and non-tampered regions, whereas the fixed encoder intentionally provides the direction information that assists the learning and detection of the network. This dual-encoder is followed by a spatial pyramid global-feature extraction module that expands the global insight of D-Unet for classifying the tampered and non-tampered regions more accurately. In an experimental comparison study of D-Unet and state-of-the-art methods, D-Unet outperformed the other methods in image-level and pixel-level detection, without requiring pre-training or training on a large number of forgery images. Moreover, it was stably robust to different attacks.
Make One-Shot Video Object Segmentation Efficient Again
Authors: Tim Meinhardt, Laura Leal-Taixe
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2012.01866
Pdf link: https://arxiv.org/pdf/2012.01866
Abstract Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video. In the semi-supervised setting, the first mask of each object is provided at test time. Following the one-shot principle, fine-tuning VOS methods train a segmentation model separately on each given object mask. However, recently the VOS community has deemed such a test time optimization and its impact on the test runtime as unfeasible. To mitigate the inefficiencies of previous fine-tuning approaches, we present efficient One-Shot Video Object Segmentation (e-OSVOS). In contrast to most VOS approaches, e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN. The one-shot test runtime and performance are optimized without a laborious and handcrafted hyperparameter search. To this end, we meta learn the model initialization and learning rates for the test time optimization. To achieve optimal learning behavior, we predict individual learning rates at a neuron level. Furthermore, we apply an online adaptation to address the common performance degradation throughout a sequence by continuously fine-tuning the model on previous mask predictions supported by a frame-to-frame bounding box propagation. e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS for one-shot fine-tuning methods while reducing the test runtime substantially. Code is available at https://github.com/dvl-tum/e-osvos.
Hotspot identification for Mapper graphs
Authors: Ciara Frances Loughrey, Nick Orr, Anna Jurek-Loughrey, Paweł Dłotko
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01868
Pdf link: https://arxiv.org/pdf/2012.01868
Abstract Mapper algorithm can be used to build graph-based representations of high-dimensional data capturing structurally interesting features such as loops, flares or clusters. The graph can be further annotated with additional colouring of vertices allowing location of regions of special interest. For instance, in many applications, such as precision medicine, Mapper graph has been used to identify unknown compactly localized subareas within the dataset demonstrating unique or unusual behaviours. This task, performed so far by a researcher, can be automatized using hotspot analysis. In this work we propose a new algorithm for detecting hotspots in Mapper graphs. It allows automatizing of the hotspot detection process. We demonstrate the performance of the algorithm on a number of artificial and real world datasets. We further demonstrate how our algorithm can be used for the automatic selection of the Mapper lens functions.
Label Enhanced Event Detection with Heterogeneous Graph Attention Networks
Authors: Shiyao Cui, Bowen Yu, Xin Cong, Tingwen Liu, Quangang Li, Jinqiao Shi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2012.01878
Pdf link: https://arxiv.org/pdf/2012.01878
Abstract Event Detection (ED) aims to recognize instances of specified types of event triggers in text. Different from English ED, Chinese ED suffers from the problem of word-trigger mismatch due to the uncertain word boundaries. Existing approaches injecting word information into character-level models have achieved promising progress to alleviate this problem, but they are limited by two issues. First, the interaction between characters and lexicon words is not fully exploited. Second, they ignore the semantic information provided by event labels. We thus propose a novel architecture named Label enhanced Heterogeneous Graph Attention Networks (L-HGAT). Specifically, we transform each sentence into a graph, where character nodes and word nodes are connected with different types of edges, so that the interaction between words and characters is fully reserved. A heterogeneous graph attention networks is then introduced to propagate relational message and enrich information interaction. Furthermore, we convert each label into a trigger-prototype-based embedding, and design a margin loss to guide the model distinguish confusing event labels. Experiments on two benchmark datasets show that our model achieves significant improvement over a range of competitive baseline methods.
Patch2Pix: Epipolar-Guided Pixel-Level Correspondences
Authors: Qunjie Zhou, Torsten Sattler, Laura Leal-Taixe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01909
Pdf link: https://arxiv.org/pdf/2012.01909
Abstract Deep learning has been applied to a classical matching pipeline which typically involves three steps: (i) local feature detection and description, (ii) feature matching, and (iii) outlier rejection. Recently emerged correspondence networks propose to perform those steps inside a single network but suffer from low matching resolution due to the memory bottleneck. In this work, we propose a new perspective to estimate correspondences in a detect-to-refine manner, where we first predict patch-level match proposals and then refine them. We present a novel refinement network Patch2Pix that refines match proposals by regressing pixel-level matches from the local regions defined by those proposals and jointly rejecting outlier matches with confidence scores, which is weakly supervised to learn correspondences that are consistent with the epipolar geometry of an input image pair. We show that our refinement network significantly improves the performance of correspondence networks on image matching, homography estimation, and localization tasks. In addition, we show that our learned refinement generalizes to fully-supervised methods without re-training, which leads us to state-of-the-art localization performance.
Teaching the Machine to Explain Itself using Domain Knowledge
Authors: Vladimir Balayan, Pedro Saleiro, Catarina Belém, Ludwig Krippahl, Pedro Bizarro
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2012.01932
Pdf link: https://arxiv.org/pdf/2012.01932
Abstract Machine Learning (ML) has been increasingly used to aid humans to make better and faster decisions. However, non-technical humans-in-the-loop struggle to comprehend the rationale behind model predictions, hindering trust in algorithmic decision-making systems. Considerable research work on AI explainability attempts to win back trust in AI systems by developing explanation methods but there is still no major breakthrough. At the same time, popular explanation methods (e.g., LIME, and SHAP) produce explanations that are very hard to understand for non-data scientist persona. To address this, we present JOEL, a neural network-based framework to jointly learn a decision-making task and associated explanations that convey domain knowledge. JOEL is tailored to human-in-the-loop domain experts that lack deep technical ML knowledge, providing high-level insights about the model's predictions that very much resemble the experts' own reasoning. Moreover, we collect the domain feedback from a pool of certified experts and use it to ameliorate the model (human teaching), hence promoting seamless and better suited explanations. Lastly, we resort to semantic mappings between legacy expert systems and domain taxonomies to automatically annotate a bootstrap training set, overcoming the absence of concept-based human annotations. We validate JOEL empirically on a real-world fraud detection dataset. We show that JOEL can generalize the explanations from the bootstrap dataset. Furthermore, obtained results indicate that human teaching can further improve the explanations prediction quality by approximately $13.57\%$.
Classifying Malware Using Function Representations in a Static Call Graph
Authors: Thomas Dalton, Mauritius Schmidtler, Alireza Hadj Khodabakhshi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01939
Pdf link: https://arxiv.org/pdf/2012.01939
Abstract We propose a deep learning approach for identifying malware families using the function call graphs of x86 assembly instructions. Though prior work on static call graph analysis exists, very little involves the application of modern, principled feature learning techniques to the problem. In this paper, we introduce a system utilizing an executable's function call graph where function representations are obtained by way of a recurrent neural network (RNN) autoencoder which maps sequences of x86 instructions into dense, latent vectors. These function embeddings are then modeled as vertices in a graph with edges indicating call dependencies. Capturing rich, node-level representations as well as global, topological properties of an executable file greatly improves malware family detection rates and contributes to a more principled approach to the problem in a way that deliberately avoids tedious feature engineering and domain expertise. We test our approach by performing several experiments on a Microsoft malware classification data set and achieve excellent separation between malware families with a classification accuracy of 99.41%.
Co-mining: Self-Supervised Learning for Sparsely Annotated Object Detection
Authors: Tiancai Wang, Tong Yang, Jiale Cao, Xiangyu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01950
Pdf link: https://arxiv.org/pdf/2012.01950
Abstract Object detectors usually achieve promising results with the supervision of complete instance annotations. However, their performance is far from satisfactory with sparse instance annotations. Most existing methods for sparsely annotated object detection either re-weight the loss of hard negative samples or convert the unlabeled instances into ignored regions to reduce the interference of false negatives. We argue that these strategies are insufficient since they can at most alleviate the negative effect caused by missing annotations. In this paper, we propose a simple but effective mechanism, called Co-mining, for sparsely annotated object detection. In our Co-mining, two branches of a Siamese network predict the pseudo-label sets for each other. To enhance multi-view learning and better mine unlabeled instances, the original image and corresponding augmented image are used as the inputs of two branches of the Siamese network, respectively. Co-mining can serve as a general training mechanism applied to most of modern object detectors. Experiments are performed on MS COCO dataset with three different sparsely annotated settings using two typical frameworks: anchor-based detector RetinaNet and anchor-free detector FCOS. Experimental results show that our Co-mining with RetinaNet achieves 1.4%~2.1% improvements compared with different baselines and surpasses existing methods under the same sparsely annotated setting.
IoT DoS and DDoS Attack Detection using ResNet
Authors: Faisal Hussain, Syed Ghazanfar Abbas, Muhammad Husnain, Ubaid Ullah Fayyaz, Farrukh Shahzad, Ghalib A. Shah
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2012.01971
Pdf link: https://arxiv.org/pdf/2012.01971
Abstract The network attacks are increasing both in frequency and intensity with the rapid growth of internet of things (IoT) devices. Recently, denial of service (DoS) and distributed denial of service (DDoS) attacks are reported as the most frequent attacks in IoT networks. The traditional security solutions like firewalls, intrusion detection systems, etc., are unable to detect the complex DoS and DDoS attacks since most of them filter the normal and attack traffic based upon the static predefined rules. However, these solutions can become reliable and effective when integrated with artificial intelligence (AI) based techniques. During the last few years, deep learning models especially convolutional neural networks achieved high significance due to their outstanding performance in the image processing field. The potential of these convolutional neural network (CNN) models can be used to efficiently detect the complex DoS and DDoS by converting the network traffic dataset into images. Therefore, in this work, we proposed a methodology to convert the network traffic data into image form and trained a state-of-the-art CNN model, i.e., ResNet over the converted data. The proposed methodology accomplished 99.99\% accuracy for detecting the DoS and DDoS in case of binary classification. Furthermore, the proposed methodology achieved 87\% average precision for recognizing eleven types of DoS and DDoS attack patterns which is 9\% higher as compared to the state-of-the-art.
Automated Artefact Relevancy Determination from Artefact Metadata and Associated Timeline Events
Authors: Xiaoyu Du, Quan Le, Mark Scanlon
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01972
Pdf link: https://arxiv.org/pdf/2012.01972
Abstract Case-hindering, multi-year digital forensic evidence backlogs have become commonplace in law enforcement agencies throughout the world. This is due to an ever-growing number of cases requiring digital forensic investigation coupled with the growing volume of data to be processed per case. Leveraging previously processed digital forensic cases and their component artefact relevancy classifications can facilitate an opportunity for training automated artificial intelligence based evidence processing systems. These can significantly aid investigators in the discovery and prioritisation of evidence. This paper presents one approach for file artefact relevancy determination building on the growing trend towards a centralised, Digital Forensics as a Service (DFaaS) paradigm. This approach enables the use of previously encountered pertinent files to classify newly discovered files in an investigation. Trained models can aid in the detection of these files during the acquisition stage, i.e., during their upload to a DFaaS system. The technique generates a relevancy score for file similarity using each artefact's filesystem metadata and associated timeline events. The approach presented is validated against three experimental usage scenarios.
Detection of False-Reading Attacks in the AMI Net-Metering System
Authors: Mahmoud M. Badr, Mohamed I. Ibrahem, Mohamed Mahmoud, Mostafa M. Fouda, Waleed Alasmary
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01983
Pdf link: https://arxiv.org/pdf/2012.01983
Abstract In smart grid, malicious customers may compromise their smart meters (SMs) to report false readings to achieve financial gains illegally. Reporting false readings not only causes hefty financial losses to the utility but may also degrade the grid performance because the reported readings are used for energy management. This paper is the first work that investigates this problem in the net-metering system, in which one SM is used to report the difference between the power consumed and the power generated. First, we prepare a benign dataset for the net-metering system by processing a real power consumption and generation dataset. Then, we propose a new set of attacks tailored for the net-metering system to create malicious dataset. After that, we analyze the data and we found time correlations between the net meter readings and correlations between the readings and relevant data obtained from trustworthy sources such as the solar irradiance and temperature. Based on the data analysis, we propose a general multi-data-source deep hybrid learning-based detector to identify the false-reading attacks. Our detector is trained on net meter readings of all customers besides data from the trustworthy sources to enhance the detector performance by learning the correlations between them. The rationale here is that although an attacker can report false readings, he cannot manipulate the solar irradiance and temperature values because they are beyond his control. Extensive experiments have been conducted, and the results indicate that our detector can identify the false-reading attacks with high detection rate and low false alarm.
Radar Artifact Labeling Framework (RALF): Method for Plausible Radar Detections in Datasets
Authors: Simon T. Isele, Marcel P. Schilling, Fabian E. Klein, Sascha Saralajew, J. Marius Zoellner
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01993
Pdf link: https://arxiv.org/pdf/2012.01993
Abstract Research on localization and perception for Autonomous Driving is mainly focused on camera and LiDAR datasets, rarely on radar data. Manually labeling sparse radar point clouds is challenging. For a dataset generation, we propose the cross sensor Radar Artifact Labeling Framework (RALF). Automatically generated labels for automotive radar data help to cure radar shortcomings like artifacts for the application of artificial intelligence. RALF provides plausibility labels for radar raw detections, distinguishing between artifacts and targets. The optical evaluation backbone consists of a generalized monocular depth image estimation of surround view cameras plus LiDAR scans. Modern car sensor sets of cameras and LiDAR allow to calibrate image-based relative depth information in overlapping sensing areas. K-Nearest Neighbors matching relates the optical perception point cloud with raw radar detections. In parallel, a temporal tracking evaluation part considers the radar detections' transient behavior. Based on the distance between matches, respecting both sensor and model uncertainties, we propose a plausibility rating of every radar detection. We validate the results by evaluating error metrics on semi-manually labeled ground truth dataset of $3.28\cdot10^6$ points. Besides generating plausible radar detections, the framework enables further labeled low-level radar signal datasets for applications of perception and Autonomous Driving learning tasks.
Fast Track: Synchronized Behavior Detection in Streaming Tensors
Authors: Jiabao Zhang, Shenghua Liu, Wenting Hou, Siddharth Bhatia, Huawei Shen, Wenjian Yu, Xueqi Cheng
Subjects: Databases (cs.DB); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2012.02006
Pdf link: https://arxiv.org/pdf/2012.02006
Abstract How can we track synchronized behavior in a stream of time-stamped tuples, such as mobile devices installing and uninstalling applications in the lockstep, to boost their ranks in the app store? We model such tuples as entries in a streaming tensor, which augments attribute sizes in its modes over time. Synchronized behavior tends to form dense blocks (i.e. subtensors) in such a tensor, signaling anomalous behavior, or interesting communities. However, existing dense block detection methods are either based on a static tensor, or lack an efficient algorithm in a streaming setting. Therefore, we propose a fast streaming algorithm, AugSplicing, which can detect the top dense blocks by incrementally splicing the previous detection with the incoming ones in new tuples, avoiding re-runs over all the history data at every tracking time step. AugSplicing is based on a splicing condition that guides the algorithm (Section 4). Compared to the state-of-the-art methods, our method is (1) effective to detect fraudulent behavior in installing data of real-world apps and find a synchronized group of students with interesting features in campus Wi-Fi data; (2) robust with splicing theory for dense block detection; (3) streaming and faster than the existing streaming algorithm, with closely comparable accuracy.
Context in Informational Bias Detection
Authors: Esther van den Berg, Katja Markert
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2012.02015
Pdf link: https://arxiv.org/pdf/2012.02015
Abstract Informational bias is bias conveyed through sentences or clauses that provide tangential, speculative or background information that can sway readers' opinions towards entities. By nature, informational bias is context-dependent, but previous work on informational bias detection has not explored the role of context beyond the sentence. In this paper, we explore four kinds of context for informational bias in English news articles: neighboring sentences, the full article, articles on the same event from other news publishers, and articles from the same domain (but potentially different events). We find that integrating event context improves classification performance over a very strong baseline. In addition, we perform the first error analysis of models on this task. We find that the best-performing context-inclusive model outperforms the baseline on longer sentences, and sentences from politically centrist articles.
SuperOCR: A Conversion from Optical Character Recognition to Image Captioning
Authors: Baohua Sun, Michael Lin, Hao Sha, Lin Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2012.02033
Pdf link: https://arxiv.org/pdf/2012.02033
Abstract Optical Character Recognition (OCR) has many real world applications. The existing methods normally detect where the characters are, and then recognize the character for each detected location. Thus the accuracy of characters recognition is impacted by the performance of characters detection. In this paper, we propose a method for recognizing characters without detecting the location of each character. This is done by converting the OCR task into an image captioning task. One advantage of the proposed method is that the labeled bounding boxes for the characters are not needed during training. The experimental results show the proposed method outperforms the existing methods on both the license plate recognition and the watermeter character recognition tasks. The proposed method is also deployed into a low-power (300mW) CNN accelerator chip connected to a Raspberry Pi 3 for on-device applications.
Characteristics of Reversible Circuits for Error Detection
Authors: Lukas Burgholzer, Robert Wille, Richard Kueng
Subjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2012.02037
Pdf link: https://arxiv.org/pdf/2012.02037
Abstract In this work, we consider error detection via simulation for reversible circuit architectures. We rigorously prove that reversibility augments the performance of this simple error detection protocol to a considerable degree. A single randomly generated input is guaranteed to unveil a single error with a probability that only depends on the size of the error, not the size of the circuit itself. Empirical studies confirm that this behavior typically extends to multiple errors as well. In conclusion, reversible circuits offer characteristics that reduce masking effects -- a desirable feature that is in stark contrast to irreversible circuit architectures.
A Multi-task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation
Authors: Ngan Le, Kashu Yamazaki, Dat Truong, Kha Gia Quach, Marios Savvides
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02073
Pdf link: https://arxiv.org/pdf/2012.02073
Abstract In recent years, deep neural networks have achieved state-of-the-art performance in a variety of recognition and segmentation tasks in medical imaging including brain tumor segmentation. We investigate that segmenting a brain tumor is facing to the imbalanced data problem where the number of pixels belonging to the background class (non tumor pixel) is much larger than the number of pixels belonging to the foreground class (tumor pixel). To address this problem, we propose a multi-task network which is formed as a cascaded structure. Our model consists of two targets, i.e., (i) effectively differentiate the brain tumor regions and (ii) estimate the brain tumor mask. The first objective is performed by our proposed contextual brain tumor detection network, which plays a role of an attention gate and focuses on the region around brain tumor only while ignoring the far neighbor background which is less correlated to the tumor. The second objective is built upon a 3D atrous residual network and under an encode-decode network in order to effectively segment both large and small objects (brain tumor). Our 3D atrous residual network is designed with a skip connection to enables the gradient from the deep layers to be directly propagated to shallow layers, thus, features of different depths are preserved and used for refining each other. In order to incorporate larger contextual information from volume MRI data, our network utilizes the 3D atrous convolution with various kernel sizes, which enlarges the receptive field of filters. Our proposed network has been evaluated on various datasets including BRATS2015, BRATS2017 and BRATS2018 datasets with both validation set and testing set. Our performance has been benchmarked by both region-based metrics and surface-based metrics. We also have conducted comparisons against state-of-the-art approaches.
SAFCAR: Structured Attention Fusion for Compositional Action Recognition
Authors: Tae Soo Kim, Gregory D. Hager
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02109
Pdf link: https://arxiv.org/pdf/2012.02109
Abstract We present a general framework for compositional action recognition -- i.e. action recognition where the labels are composed out of simpler components such as subjects, atomic-actions and objects. The main challenge in compositional action recognition is that there is a combinatorially large set of possible actions that can be composed using basic components. However, compositionality also provides a structure that can be exploited. To do so, we develop and test a novel Structured Attention Fusion (SAF) self-attention mechanism to combine information from object detections, which capture the time-series structure of an action, with visual cues that capture contextual information. We show that our approach recognizes novel verb-noun compositions more effectively than current state of the art systems, and it generalizes to unseen action categories quite efficiently from only a few labeled examples. We validate our approach on the challenging Something-Else tasks from the Something-Something-V2 dataset. We further show that our framework is flexible and can generalize to a new domain by showing competitive results on the Charades-Fewshot dataset.
Deep Inverse Sensor Models as Priors for evidential Occupancy Mapping
Authors: Daniel Bauer, Lars Kuhnert, Lutz Eckstein
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.02111
Pdf link: https://arxiv.org/pdf/2012.02111
Abstract With the recent boost in autonomous driving, increased attention has been paid on radars as an input for occupancy mapping. Besides their many benefits, the inference of occupied space based on radar detections is notoriously difficult because of the data sparsity and the environment dependent noise (e.g. multipath reflections). Recently, deep learning-based inverse sensor models, from here on called deep ISMs, have been shown to improve over their geometric counterparts in retrieving occupancy information. Nevertheless, these methods perform a data-driven interpolation which has to be verified later on in the presence of measurements. In this work, we describe a novel approach to integrate deep ISMs together with geometric ISMs into the evidential occupancy mapping framework. Our method leverages both the capabilities of the data-driven approach to initialize cells not yet observable for the geometric model effectively enhancing the perception field and convergence speed, while at the same time use the precision of the geometric ISM to converge to sharp boundaries. We further define a lower limit on the deep ISM estimate's certainty together with analytical proofs of convergence which we use to distinguish cells that are solely allocated by the deep ISM from cells already verified using the geometric approach.
Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline
Authors: Hazem Rashed, Eslam Mohamed, Ganesh Sistu, Varun Ravi Kumar, Ciaran Eising, Ahmad El-Sallab, Senthil Yogamani
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2012.02124
Pdf link: https://arxiv.org/pdf/2012.02124
Abstract Object detection is a comprehensively studied problem in autonomous driving. However, it has been relatively less explored in the case of fisheye cameras. The standard bounding box fails in fisheye cameras due to the strong radial distortion, particularly in the image's periphery. We explore better representations like oriented bounding box, ellipse, and generic polygon for object detection in fisheye images in this work. We use the IoU metric to compare these representations using accurate instance segmentation ground truth. We design a novel curved bounding box model that has optimal properties for fisheye distortion models. We also design a curvature adaptive perimeter sampling method for obtaining polygon vertices, improving relative mAP score by 4.9% compared to uniform sampling. Overall, the proposed polygon model improves mIoU relative accuracy by 40.3%. It is the first detailed study on object detection on fisheye cameras for autonomous driving scenarios to the best of our knowledge. The dataset comprising of 10,000 images along with all the object representations ground truth will be made public to encourage further research. We summarize our work in a short video with qualitative results at https://youtu.be/iLkOzvJpL-A.
Keyword: segmentation

Contour Transformer Network for One-shot Segmentation of Anatomical Structures
Authors: Yuhang Lu, Kang Zheng, Weijian Li, Yirui Wang, Adam P. Harrison, Chihung Lin, Song Wang, Jing Xiao, Le Lu, Chang-Fu Kuo, Shun Miao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01480
Pdf link: https://arxiv.org/pdf/2012.01480
Abstract Accurate segmentation of anatomical structures is vital for medical image analysis. The state-of-the-art accuracy is typically achieved by supervised learning methods, where gathering the requisite expert-labeled image annotations in a scalable manner remains a main obstacle. Therefore, annotation-efficient methods that permit to produce accurate anatomical structure segmentation are highly desirable. In this work, we present Contour Transformer Network (CTN), a one-shot anatomy segmentation method with a naturally built-in human-in-the-loop mechanism. We formulate anatomy segmentation as a contour evolution process and model the evolution behavior by graph convolutional networks (GCNs). Training the CTN model requires only one labeled image exemplar and leverages additional unlabeled data through newly introduced loss functions that measure the global shape and appearance consistency of contours. On segmentation tasks of four different anatomies, we demonstrate that our one-shot learning method significantly outperforms non-learning-based methods and performs competitively to the state-of-the-art fully supervised deep learning methods. With minimal human-in-the-loop editing feedback, the segmentation performance can be further improved to surpass the fully supervised methods.
From a Fourier-Domain Perspective on Adversarial Examples to a Wiener Filter Defense for Semantic Segmentation
Authors: Nikhil Kapoor, Andreas Bär, Serin Varghese, Jan David Schneider, Fabian Hüger, Peter Schlicht, Tim Fingscheidt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2012.01558
Pdf link: https://arxiv.org/pdf/2012.01558
Abstract Despite recent advancements, deep neural networks are not robust against adversarial perturbations. Many of the proposed adversarial defense approaches use computationally expensive training mechanisms that do not scale to complex real-world tasks such as semantic segmentation, and offer only marginal improvements. In addition, fundamental questions on the nature of adversarial perturbations and their relation to the network architecture are largely understudied. In this work, we study the adversarial problem from a frequency domain perspective. More specifically, we analyze discrete Fourier transform (DFT) spectra of several adversarial images and report two major findings: First, there exists a strong connection between a model architecture and the nature of adversarial perturbations that can be observed and addressed in the frequency domain. Second, the observed frequency patterns are largely image- and attack-type independent, which is important for the practical impact of any defense making use of such patterns. Motivated by these findings, we additionally propose an adversarial defense method based on the well-known Wiener filters that captures and suppresses adversarial frequencies in a data-driven manner. Our proposed method not only generalizes across unseen attacks but also beats five existing state-of-the-art methods across two models in a variety of attack settings.
Single-shot Path Integrated Panoptic Segmentation
Authors: Sukjun Hwang, Seoung Wug Oh, Seon Joo Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01632
Pdf link: https://arxiv.org/pdf/2012.01632
Abstract Panoptic segmentation, which is a novel task of unifying instance segmentation and semantic segmentation, has attracted a lot of attention lately. However, most of the previous methods are composed of multiple pathways with each pathway specialized to a designated segmentation task. In this paper, we propose to resolve panoptic segmentation in single-shot by integrating the execution flows. With the integrated pathway, a unified feature map called Panoptic-Feature is generated, which includes the information of both things and stuffs. Panoptic-Feature becomes more sophisticated by auxiliary problems that guide to cluster pixels that belong to the same instance and differentiate between objects of different classes. A collection of convolutional filters, where each filter represents either a thing or stuff, is applied to Panoptic-Feature at once, materializing the single-shot panoptic segmentation. Taking the advantages of both top-down and bottom-up approaches, our method, named SPINet, enjoys high efficiency and accuracy on major panoptic segmentation benchmarks: COCO and Cityscapes.
Learning Hyperbolic Representations for Unsupervised 3D Segmentation
Authors: Joy Hsu, Jeffrey Gu, Gong-Her Wu, Wah Chiu, Serena Yeung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01644
Pdf link: https://arxiv.org/pdf/2012.01644
Abstract There exists a need for unsupervised 3D segmentation on complex volumetric data, particularly when annotation ability is limited or discovery of new categories is desired. Using the observation that much of 3D volumetric data is innately hierarchical, we propose learning effective representations of 3D patches for unsupervised segmentation through a variational autoencoder (VAE) with a hyperbolic latent space and a proposed gyroplane convolutional layer, which better models the underlying hierarchical structure within a 3D image. We also introduce a hierarchical triplet loss and multi-scale patch sampling scheme to embed relationships across varying levels of granularity. We demonstrate the effectiveness of our hyperbolic representations for unsupervised 3D segmentation on a hierarchical toy dataset, BraTS whole tumor dataset, and cryogenic electron microscopy data.
Dual-Branch Network with Dual-Sampling Modulated Dice Loss for Hard Exudate Segmentation from Colour Fundus Images
Authors: Qing Liu, Haotian Liu, Yixiong Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01665
Pdf link: https://arxiv.org/pdf/2012.01665
Abstract Automated segmentation of hard exudates in colour fundus images is a challenge task due to issues of extreme class imbalance and enormous size variation. This paper aims to tackle these issues and proposes a dual-branch network with dual-sampling modulated Dice loss. It consists of two branches: large hard exudate biased learning branch and small hard exudate biased learning branch. Both of them are responsible for their own duty separately. Furthermore, we propose a dual-sampling modulated Dice loss for the training such that our proposed dual-branch network is able to segment hard exudates in different sizes. In detail, for the first branch, we use a uniform sampler to sample pixels from predicted segmentation mask for Dice loss calculation, which leads to this branch naturally be biased in favour of large hard exudates as Dice loss generates larger cost on misidentification of large hard exudates than small hard exudates. For the second branch, we use a re-balanced sampler to oversample hard exudate pixels and undersample background pixels for loss calculation. In this way, cost on misidentification of small hard exudates is enlarged, which enforces the parameters in the second branch fit small hard exudates well. Considering that large hard exudates are much easier to be correctly identified than small hard exudates, we propose an easy-to-difficult learning strategy by adaptively modulating the losses of two branches. We evaluate our proposed method on two public datasets and results demonstrate that ours achieves state-of-the-art performances.
Dual Refinement Feature Pyramid Networks for Object Detection
Authors: Jialiang Ma, Bin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01733
Pdf link: https://arxiv.org/pdf/2012.01733
Abstract FPN is a common component used in object detectors, it supplements multi-scale information by adjacent level features interpolation and summation. However, due to the existence of nonlinear operations and the convolutional layers with different output dimensions, the relationship between different levels is much more complex, the pixel-wise summation is not an efficient approach. In this paper, we first analyze the design defects from pixel level and feature map level. Then, we design a novel parameter-free feature pyramid networks named Dual Refinement Feature Pyramid Networks (DRFPN) for the problems. Specifically, DRFPN consists of two modules: Spatial Refinement Block (SRB) and Channel Refinement Block (CRB). SRB learns the location and content of sampling points based on contextual information between adjacent levels. CRB learns an adaptive channel merging method based on attention mechanism. Our proposed DRFPN can be easily plugged into existing FPN-based models. Without bells and whistles, for two-stage detectors, our model outperforms different FPN-based counterparts by 1.6 to 2.2 AP on the COCO detection benchmark, and 1.5 to 1.9 AP on the COCO segmentation benchmark. For one-stage detectors, DRFPN improves anchor-based RetinaNet by 1.9 AP and anchor-free FCOS by 1.3 AP when using ResNet50 as backbone. Extensive experiments verifies the robustness and generalization ability of DRFPN. The code will be made publicly available.
Make One-Shot Video Object Segmentation Efficient Again
Authors: Tim Meinhardt, Laura Leal-Taixe
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2012.01866
Pdf link: https://arxiv.org/pdf/2012.01866
Abstract Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video. In the semi-supervised setting, the first mask of each object is provided at test time. Following the one-shot principle, fine-tuning VOS methods train a segmentation model separately on each given object mask. However, recently the VOS community has deemed such a test time optimization and its impact on the test runtime as unfeasible. To mitigate the inefficiencies of previous fine-tuning approaches, we present efficient One-Shot Video Object Segmentation (e-OSVOS). In contrast to most VOS approaches, e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN. The one-shot test runtime and performance are optimized without a laborious and handcrafted hyperparameter search. To this end, we meta learn the model initialization and learning rates for the test time optimization. To achieve optimal learning behavior, we predict individual learning rates at a neuron level. Furthermore, we apply an online adaptation to address the common performance degradation throughout a sequence by continuously fine-tuning the model on previous mask predictions supported by a frame-to-frame bounding box propagation. e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS for one-shot fine-tuning methods while reducing the test runtime substantially. Code is available at https://github.com/dvl-tum/e-osvos.
A small note on variation in segmentation annotations
Authors: Silas Nyboe Ørting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2012.01975
Pdf link: https://arxiv.org/pdf/2012.01975
Abstract We report on the results of a small crowdsourcing experiment conducted at a workshop on machine learning for segmentation held at the Danish Bio Imaging network meeting 2020. During the workshop we asked participants to manually segment mitochondria in three 2D patches. The aim of the experiment was to illustrate that manual annotations should not be seen as the ground truth, but as a reference standard that is subject to substantial variation. In this note we show how the large variation we observed in the segmentations can be reduced by removing the annotators with worst pair-wise agreement. Having removed the annotators with worst performance, we illustrate that the remaining variance is semantically meaningful and can be exploited to obtain segmentations of cell boundary and cell interior.
Aerial Imagery Pixel-level Segmentation
Authors: Michael R. Heffels, Joaquin Vanschoren
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02024
Pdf link: https://arxiv.org/pdf/2012.02024
Abstract Aerial imagery can be used for important work on a global scale. Nevertheless, the analysis of this data using neural network architectures lags behind the current state-of-the-art on popular datasets such as PASCAL VOC, CityScapes and Camvid. In this paper we bridge the performance-gap between these popular datasets and aerial imagery data. Little work is done on aerial imagery with state-of-the-art neural network architectures in a multi-class setting. Our experiments concerning data augmentation, normalisation, image size and loss functions give insight into a high performance setup for aerial imagery segmentation datasets. Our work, using the state-of-the-art DeepLabv3+ Xception65 architecture, achieves a mean IOU of 70% on the DroneDeploy validation set. With this result, we clearly outperform the current publicly available state-of-the-art validation set mIOU (65%) performance with 5%. Furthermore, to our knowledge, there is no mIOU benchmark for the test set. Hence, we also propose a new benchmark on the DroneDeploy test set using the best performing DeepLabv3+ Xception65 architecture, with a mIOU score of 52.5%.
A Multi-task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation
Authors: Ngan Le, Kashu Yamazaki, Dat Truong, Kha Gia Quach, Marios Savvides
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02073
Pdf link: https://arxiv.org/pdf/2012.02073
Abstract In recent years, deep neural networks have achieved state-of-the-art performance in a variety of recognition and segmentation tasks in medical imaging including brain tumor segmentation. We investigate that segmenting a brain tumor is facing to the imbalanced data problem where the number of pixels belonging to the background class (non tumor pixel) is much larger than the number of pixels belonging to the foreground class (tumor pixel). To address this problem, we propose a multi-task network which is formed as a cascaded structure. Our model consists of two targets, i.e., (i) effectively differentiate the brain tumor regions and (ii) estimate the brain tumor mask. The first objective is performed by our proposed contextual brain tumor detection network, which plays a role of an attention gate and focuses on the region around brain tumor only while ignoring the far neighbor background which is less correlated to the tumor. The second objective is built upon a 3D atrous residual network and under an encode-decode network in order to effectively segment both large and small objects (brain tumor). Our 3D atrous residual network is designed with a skip connection to enables the gradient from the deep layers to be directly propagated to shallow layers, thus, features of different depths are preserved and used for refining each other. In order to incorporate larger contextual information from volume MRI data, our network utilizes the 3D atrous convolution with various kernel sizes, which enlarges the receptive field of filters. Our proposed network has been evaluated on various datasets including BRATS2015, BRATS2017 and BRATS2018 datasets with both validation set and testing set. Our performance has been benchmarked by both region-based metrics and surface-based metrics. We also have conducted comparisons against state-of-the-art approaches.
Towards Part-Based Understanding of RGB-D Scans
Authors: Alexey Bokhovkin, Vladislav Ishimtsev, Emil Bogomolov, Denis Zorin, Alexey Artemov, Evgeny Burnaev, Angela Dai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02094
Pdf link: https://arxiv.org/pdf/2012.02094
Abstract Recent advances in 3D semantic scene understanding have shown impressive progress in 3D instance segmentation, enabling object-level reasoning about 3D scenes; however, a finer-grained understanding is required to enable interactions with objects and their functional understanding. Thus, we propose the task of part-based scene understanding of real-world 3D environments: from an RGB-D scan of a scene, we detect objects, and for each object predict its decomposition into geometric part masks, which composed together form the complete geometry of the observed object. We leverage an intermediary part graph representation to enable robust completion as well as building of part priors, which we use to construct the final part mask predictions. Our experiments demonstrate that guiding part understanding through part graph to part prior-based predictions significantly outperforms alternative approaches to the task of semantic part completion.
Robust Instance Segmentation through Reasoning about Multi-Object Occlusion
Authors: Xiaoding Yuan, Adam Kortylewski, Yihong Sun, Alan Yuille
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02107
Pdf link: https://arxiv.org/pdf/2012.02107
Abstract Analyzing complex scenes with Deep Neural Networks is a challenging task, particularly when images contain multiple objects that partially occlude each other. Existing approaches to image analysis mostly process objects independently and do not take into account the relative occlusion of nearby objects. In this paper, we propose a deep network for multi-object instance segmentation that is robust to occlusion and can be trained from bounding box supervision only. Our work builds on Compositional Networks, which learn a generative model of neural feature activations to locate occluders and to classify objects based on their non-occluded parts. We extend their generative model to include multiple objects and introduce a framework for the efficient inference in challenging occlusion scenarios. In particular, we obtain feed-forward predictions of the object classes and their instance and occluder segmentations. We introduce an Occlusion Reasoning Module (ORM) that locates erroneous segmentations and estimates the occlusion ordering to correct them. The improved segmentation masks are, in turn, integrated into the network in a top-down manner to improve the image classification. Our experiments on the KITTI INStance dataset (KINS) and a synthetic occlusion dataset demonstrate the effectiveness and robustness of our model at multi-object instance segmentation under occlusion.
Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline
Authors: Hazem Rashed, Eslam Mohamed, Ganesh Sistu, Varun Ravi Kumar, Ciaran Eising, Ahmad El-Sallab, Senthil Yogamani
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2012.02124
Pdf link: https://arxiv.org/pdf/2012.02124
Abstract Object detection is a comprehensively studied problem in autonomous driving. However, it has been relatively less explored in the case of fisheye cameras. The standard bounding box fails in fisheye cameras due to the strong radial distortion, particularly in the image's periphery. We explore better representations like oriented bounding box, ellipse, and generic polygon for object detection in fisheye images in this work. We use the IoU metric to compare these representations using accurate instance segmentation ground truth. We design a novel curved bounding box model that has optimal properties for fisheye distortion models. We also design a curvature adaptive perimeter sampling method for obtaining polygon vertices, improving relative mAP score by 4.9% compared to uniform sampling. Overall, the proposed polygon model improves mIoU relative accuracy by 40.3%. It is the first detailed study on object detection on fisheye cameras for autonomous driving scenarios to the best of our knowledge. The dataset comprising of 10,000 images along with all the object representations ground truth will be made public to encourage further research. We summarize our work in a short video with qualitative results at https://youtu.be/iLkOzvJpL-A.
Keyword: human pose

Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation
Authors: Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, Pascal Fua
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01511
Pdf link: https://arxiv.org/pdf/2012.01511
Abstract In this paper, we introduce an unsupervised feature extraction method that exploits contrastive self-supervised (CSS) learning to extract rich latent vectors from single-view videos. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally-distant ones as negative pairs as in other CSS approaches, we explicitly separate each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and away features yields a rich latent space, well-suited for human pose estimation. Our approach outperforms other unsupervised single-view methods and match the performance of multi-view techniques.
Holistic 3D Human and Scene Mesh Estimation from Single View Images
Authors: Zhenzhen Weng, Serena Yeung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01591
Pdf link: https://arxiv.org/pdf/2012.01591
Abstract The 3D world limits the human body pose and the human body pose conveys information about the surrounding objects. Indeed, from a single image of a person placed in an indoor scene, we as humans are adept at resolving ambiguities of the human pose and room layout through our knowledge of the physical laws and prior perception of the plausible object and human poses. However, few computer vision models fully leverage this fact. In this work, we propose an end-to-end trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a set of comprehensive and sophisticated losses on all aspects of the estimations, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level, and performs joint optimization on the scene and human poses.
Keyword: super resolution

There is no result

Keyword: generation

Data-driven Analysis of Turbulent Flame Images
Authors: Rathziel Roncancio, Jupyoung Kim, Aly El Gamal, Jay P. Gore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01485
Pdf link: https://arxiv.org/pdf/2012.01485
Abstract Turbulent premixed flames are important for power generation using gas turbines. Improvements in characterization and understanding of turbulent flames continue particularly for transient events like ignition and extinction. Pockets or islands of unburned material are features of turbulent flames during these events. These features are directly linked to heat release rates and hydrocarbons emissions. Unburned material pockets in turbulent CH$_4$/air premixed flames with CO$_2$ addition were investigated using OH Planar Laser-Induced Fluorescence images. Convolutional Neural Networks (CNN) were used to classify images containing unburned pockets for three turbulent flames with 0%, 5%, and 10% CO$_2$ addition. The CNN model was constructed using three convolutional layers and two fully connected layers using dropout and weight decay. The CNN model achieved accuracies of 91.72%, 89.35% and 85.80% for the three flames, respectively.
TAN-NTM: Topic Attention Networks for Neural Topic Modeling
Authors: Madhur Panwar, Shashank Shailabh, Milan Aggarwal, Balaji Krishnamurthy
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2012.01524
Pdf link: https://arxiv.org/pdf/2012.01524
Abstract Topic models have been widely used to learn representations from text and gain insight into document corpora. To perform topic discovery, existing neural models use document bag-of-words (BoW) representation as input followed by variational inference and learn topic-word distribution through reconstructing BoW. Such methods have mainly focused on analysing the effect of enforcing suitable priors on document distribution. However, little importance has been given to encoding improved document features for capturing document semantics better. In this work, we propose a novel framework: TAN-NTM which models document as a sequence of tokens instead of BoW at the input layer and processes it through an LSTM whose output is used to perform variational inference followed by BoW decoding. We apply attention on LSTM outputs to empower the model to attend on relevant words which convey topic related cues. We hypothesise that attention can be performed effectively if done in a topic guided manner and establish this empirically through ablations. We factor in topic-word distribution to perform topic aware attention achieving state-of-the-art results with ~9-15 percentage improvement over score of existing SOTA topic models in NPMI coherence metric on four benchmark datasets - 20NewsGroup, Yelp, AGNews, DBpedia. TAN-NTM also obtains better document classification accuracy owing to learning improved document-topic features. We qualitatively discuss that attention mechanism enables unsupervised discovery of keywords. Motivated by this, we further show that our proposed framework achieves state-of-the-art performance on topic aware supervised generation of keyphrases on StackExchange and Weibo datasets.
Deep learning collocation method for solid mechanics: Linear elasticity, hyperelasticity, and plasticity as examples
Authors: Diab W. Abueidda, Qiyue Lu, Seid Koric
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2012.01547
Pdf link: https://arxiv.org/pdf/2012.01547
Abstract Deep learning and the collocation method are merged and used to solve partial differential equations describing structures' deformation. We have considered different types of materials: linear elasticity, hyperelasticity (neo-Hookean) with large deformation, and von Mises plasticity with isotropic and kinematic hardening. The performance of this deep collocation method (DCM) depends on the architecture of the neural network and the corresponding hyperparameters. The presented DCM is meshfree and avoids any spatial discretization, which is usually needed for the finite element method (FEM). We show that the DCM can capture the response qualitatively and quantitatively, without the need for any data generation using other numerical methods such as the FEM. Data generation usually is the main bottleneck in most data-driven models. The deep learning model is trained to learn the model's parameters yielding accurate approximate solutions. Once the model is properly trained, solutions can be obtained almost instantly at any point in the domain, given its spatial coordinates. Therefore, the deep collocation method is potentially a promising standalone technique to solve partial differential equations involved in the deformation of materials and structural systems as well as other physical phenomena.
Corrected subdivision approximation of piecewise smooth functions
Authors: Sergio Amat, David Levin, Juan Ruiz-Álvarez
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2012.01575
Pdf link: https://arxiv.org/pdf/2012.01575
Abstract Subdivision schemes are useful mathematical tools for the generation of curves and surfaces. The linear approaches suffer from Gibbs oscillations when approximating functions with singularities. On the other hand, when we analyze the convergence of nonlinear subdivision schemes the regularity of the limit function is smaller than in the linear case. The goal of this paper is to introduce a corrected implementation of linear interpolatory subdivision schemes addressing both properties at the same time: regularity and adaption to singularities with high order of accuracy. In both cases of point-value data and of cell-average data, we are able to construct a subdivision-based algorithm producing approximations with high precision and limit functions with high piecewise regularity and without diffusion nor oscillations in the presence of discontinuities.
Meta-Generating Deep Attentive Metric for Few-shot Classification
Authors: Lei Zhang, Fei Zhou, Wei Wei, Yanning Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01641
Pdf link: https://arxiv.org/pdf/2012.01641
Abstract Learning to generate a task-aware base learner proves a promising direction to deal with few-shot learning (FSL) problem. Existing methods mainly focus on generating an embedding model utilized with a fixed metric (eg, cosine distance) for nearest neighbour classification or directly generating a linear classier. However, due to the limited discriminative capacity of such a simple metric or classifier, these methods fail to generalize to challenging cases appropriately. To mitigate this problem, we present a novel deep metric meta-generation method that turns to an orthogonal direction, ie, learning to adaptively generate a specific metric for a new FSL task based on the task description (eg, a few labelled samples). In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task. Moreover, different from existing methods that utilize an uni-modal weight distribution conditioned on labelled samples for network generation, the proposed meta-learner establishes a multi-modal weight distribution conditioned on cross-class sample pairs using a tailored variational autoencoder, which can separately capture the specific inter-class discrepancy statistics for each class and jointly embed the statistics for all classes into metric generation. By doing this, the generated metric can be appropriately adapted to a new FSL task with pleasing generalization performance. To demonstrate this, we test the proposed method on four benchmark FSL datasets and gain surprisingly obvious performance improvement over state-of-the-art competitors, especially in the challenging cases, eg, improve the accuracy from 26.14% to 46.69% in the 20-way 1-shot task on miniImageNet, while improve the accuracy from 45.2% to 68.72% in the 5-way 1-shot task on FC100. Code is available: https://github.com/NWPUZhoufei/DAM.
YAP: Tool Support for Deriving Safety Controllers from Hazard Analysis and Risk Assessments
Authors: Mario Gleirscher (University of York)
Subjects: Software Engineering (cs.SE); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2012.01649
Pdf link: https://arxiv.org/pdf/2012.01649
Abstract Safety controllers are system or software components responsible for handling risk in many machine applications. This tool paper describes a use case and a workflow for YAP, a research tool for risk modelling and discrete-event safety controller design. The goal of this use case is to derive a safety controller from hazard analysis and risk assessment, to define a design space for this controller, and to select a verified optimal controller instance from this design space. We represent this design space as a stochastic model and use YAP for risk modelling and generation of parts of this stochastic model. For the controller verification and selection step, we use a stochastic model checker. The approach is illustrated by an example of a collaborative robot operated in a manufacturing work cell.
VICToRy: Visual Interactive Consistency Management in Tolerant Rule-based Systems
Authors: Nils Weidmann (Paderborn University, Germany), Anthony Anjorin (IAV GmbH Ingenieurgesellschaft Auto und Verkehr, Germany), James Cheney (University of Edinburgh, United Kingdom)
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2012.01655
Pdf link: https://arxiv.org/pdf/2012.01655
Abstract In the field of Model-Driven Engineering, there exist numerous tools that support various consistency management operations including model transformation, synchronisation and consistency checking. The supported operations, however, typically run completely in the background with only input and output made visible to the user. We argue that this often reduces both understandability and controllability. As a step towards improving this situation, we present VICToRy, a debugger for model generation and transformation based on Triple Graph Grammars, a well-known rule-based approach to bidirectional transformation. In addition to a fine-grained, step-by-step, interactive visualisation, VICToRy enables the user to actively explore and choose between multiple valid rule applications thus improving control and understanding.
A Novel 3D Non-Stationary Multi-Frequency Multi-Link Wideband MIMO Channel Model
Authors: Li Zhang, Xinyue Chen, Zihao Zhou, Cheng-Xiang Wang, Chun Pan, Yungui Wang
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2012.01728
Pdf link: https://arxiv.org/pdf/2012.01728
Abstract In this paper, a multi-frequency multi-link three-dimensional (3D) non-stationary wideband multiple-input multiple-output (MIMO) channel model is proposed. The spatial consistency and multi-frequency correlation are considered in parameters initialization of every single-link and different frequencies, including large scale parameters (LSPs) and small scale parameters (SSPs). Moreover, SSPs are time-variant and updated when scatterers and the receiver (Rx) are moving. The temporal evolution of clusters is modeled by birth and death processes. The single-link channel model which has considered the inter-correlation can be easily extended to multi-link channel model. Statistical properties, including spatial cross-correlation function (CCF), power delay profile (PDP), and correlation matrix collinearity (CMC) are investigated and compared with the 3rd generation partner project (3GPP) TR 38.901 and quasi deterministic radio channel generator (QuaDRiGa) channel models. Besides, the CCF is validated against measurement data.
DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances
Authors: Xiaodong Gu, Kang Min Yoo, Jung-Woo Ha
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01775
Pdf link: https://arxiv.org/pdf/2012.01775
Abstract Recent advances in pre-trained language models have significantly improved neural response generation. However, existing methods usually view the dialogue context as a linear sequence of tokens and learn to generate the next word through token-level self-attention. Such token-level encoding hinders the exploration of discourse-level coherence among utterances. This paper presents DialogBERT, a novel conversational response generation model that enhances previous PLM-based dialogue models. DialogBERT employs a hierarchical Transformer architecture. To efficiently capture the discourse-level coherence among utterances, we propose two training objectives, including masked utterance regression and distributed utterance order ranking in analogy to the original BERT training. Experiments on three multi-turn conversation datasets show that our approach remarkably outperforms the baselines, such as BART and DialoGPT, in terms of quantitative evaluation. The human evaluation suggests that DialogBERT generates more coherent, informative, and human-like responses than the baselines with significant margins.
Attributes Aware Face Generation with Generative Adversarial Networks
Authors: Zheng Yuan, Jie Zhang, Shiguang Shan, Xilin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.01782
Pdf link: https://arxiv.org/pdf/2012.01782
Abstract Recent studies have shown remarkable success in face image generations. However, most of the existing methods only generate face images from random noise, and cannot generate face images according to the specific attributes. In this paper, we focus on the problem of face synthesis from attributes, which aims at generating faces with specific characteristics corresponding to the given attributes. To this end, we propose a novel attributes aware face image generator method with generative adversarial networks called AFGAN. Specifically, we firstly propose a two-path embedding layer and self-attention mechanism to convert binary attribute vector to rich attribute features. Then three stacked generators generate $64 \times 64$, $128 \times 128$ and $256 \times 256$ resolution face images respectively by taking the attribute features as input. In addition, an image-attribute matching loss is proposed to enhance the correlation between the generated images and input attributes. Extensive experiments on CelebA demonstrate the superiority of our AFGAN in terms of both qualitative and quantitative evaluations.
Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training
Authors: Haohan Guo, Heng Lu, Na Hu, Chunlei Zhang, Shan Yang, Lei Xie, Dan Su, Dong Yu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2012.01837
Pdf link: https://arxiv.org/pdf/2012.01837
Abstract This paper describes an end-to-end adversarial singing voice conversion (EA-SVC) approach. It can directly generate arbitrary singing waveform by given phonetic posteriorgram (PPG) representing content, F0 representing pitch, and speaker embedding representing timbre, respectively. Proposed system is composed of three modules: generator $G$, the audio generation discriminator $D_{A}$, and the feature disentanglement discriminator $D_F$. The generator $G$ encodes the features in parallel and inversely transforms them into the target waveform. In order to make timbre conversion more stable and controllable, speaker embedding is further decomposed to the weighted sum of a group of trainable vectors representing different timbre clusters. Further, to realize more robust and accurate singing conversion, disentanglement discriminator $D_F$ is proposed to remove pitch and timbre related information that remains in the encoded PPG. Finally, a two-stage training is conducted to keep a stable and effective adversarial training process. Subjective evaluation results demonstrate the effectiveness of our proposed methods. Proposed system outperforms conventional cascade approach and the WaveNet based end-to-end approach in terms of both singing quality and singer similarity. Further objective analysis reveals that the model trained with the proposed two-stage training strategy can produce a smoother and sharper formant which leads to higher audio quality.
Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization
Authors: Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, Youxin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2012.01879
Pdf link: https://arxiv.org/pdf/2012.01879
Abstract This paper tackles automated categorization of Age-related Macular Degeneration (AMD), a common macular disease among people over 50. Previous research efforts mainly focus on AMD categorization with a single-modal input, let it be a color fundus image or an OCT image. By contrast, we consider AMD categorization given a multi-modal input, a direction that is clinically meaningful yet mostly unexplored. Contrary to the prior art that takes a traditional approach of feature extraction plus classifier training that cannot be jointly optimized, we opt for end-to-end multi-modal Convolutional Neural Networks (MM-CNN). Our MM-CNN is instantiated by a two-stream CNN, with spatially-invariant fusion to combine information from the fundus and OCT streams. In order to visually interpret the contribution of the individual modalities to the final prediction, we extend the class activation mapping (CAM) technique to the multi-modal scenario. For effective training of MM-CNN, we develop two data augmentation methods. One is GAN-based fundus / OCT image synthesis, with our novel use of CAMs as conditional input of a high-resolution image-to-image translation GAN. The other method is Loose Pairing, which pairs a fundus image and an OCT image on the basis of their classes instead of eye identities. Experiments on a clinical dataset consisting of 1,099 color fundus images and 1,290 OCT images acquired from 1,099 distinct eyes verify the effectiveness of the proposed solution for multi-modal AMD categorization.
Competition analysis on the over-the-counter credit default swap market
Authors: Louis Abraham
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Econometrics (econ.EM); Computational Finance (q-fin.CP)
Arxiv link: https://arxiv.org/abs/2012.01883
Pdf link: https://arxiv.org/pdf/2012.01883
Abstract We study two questions related to competition on the OTC CDS market using data collected as part of the EMIR regulation. First, we study the competition between central counterparties through collateral requirements. We present models that successfully estimate the initial margin requirements. However, our estimations are not precise enough to use them as input to a predictive model for CCP choice by counterparties in the OTC market. Second, we model counterpart choice on the interdealer market using a novel semi-supervised predictive task. We present our methodology as part of the literature on model interpretability before arguing for the use of conditional entropy as the metric of interest to derive knowledge from data through a model-agnostic approach. In particular, we justify the use of deep neural networks to measure conditional entropy on real-world datasets. We create the $\textit{Razor entropy}$ using the framework of algorithmic information theory and derive an explicit formula that is identical to our semi-supervised training objective. Finally, we borrow concepts from game theory to define $\textit{top-k Shapley values}$. This novel method of payoff distribution satisfies most of the properties of Shapley values, and is of particular interest when the value function is monotone submodular. Unlike classical Shapley values, top-k Shapley values can be computed in quadratic time of the number of features instead of exponential. We implement our methodology and report the results on our particular task of counterpart choice. Finally, we present an improvement to the $\textit{node2vec}$ algorithm that could for example be used to further study intermediation. We show that the neighbor sampling used in the generation of biased walks can be performed in logarithmic time with a quasilinear time pre-computation, unlike the current implementations that do not scale well.
An Empirical Study of Derivative-Free-Optimization Algorithms for Targeted Black-Box Attacks in Deep Neural Networks
Authors: Giuseppe Ughi, Vinayak Abrol, Jared Tanner
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01901
Pdf link: https://arxiv.org/pdf/2012.01901
Abstract We perform a comprehensive study on the performance of derivative free optimization (DFO) algorithms for the generation of targeted black-box adversarial attacks on Deep Neural Network (DNN) classifiers assuming the perturbation energy is bounded by an $\ell_\infty$ constraint and the number of queries to the network is limited. This paper considers four pre-existing state-of-the-art DFO-based algorithms along with the introduction of a new algorithm built on BOBYQA, a model-based DFO method. We compare these algorithms in a variety of settings according to the fraction of images that they successfully misclassify given a maximum number of queries to the DNN. The experiments disclose how the likelihood of finding an adversarial example depends on both the algorithm used and the setting of the attack; algorithms limiting the search of adversarial example to the vertices of the $\ell^\infty$ constraint work particularly well without structural defenses, while the presented BOBYQA based algorithm works better for especially small perturbation energies. This variance in performance highlights the importance of new algorithms being compared to the state-of-the-art in a variety of settings, and the effectiveness of adversarial defenses being tested using as wide a range of algorithms as possible.
Clustering-based Automatic Construction of Legal Entity Knowledge Base from Contracts
Authors: Fuqi Song, Éric de la Clergerie
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01942
Pdf link: https://arxiv.org/pdf/2012.01942
Abstract In contract analysis and contract automation, a knowledge base (KB) of legal entities is fundamental for performing tasks such as contract verification, contract generation and contract analytic. However, such a KB does not always exist nor can be produced in a short time. In this paper, we propose a clustering-based approach to automatically generate a reliable knowledge base of legal entities from given contracts without any supplemental references. The proposed method is robust to different types of errors brought by pre-processing such as Optical Character Recognition (OCR) and Named Entity Recognition (NER), as well as editing errors such as typos. We evaluate our method on a dataset that consists of 800 real contracts with various qualities from 15 clients. Compared to the collected ground-truth data, our method is able to recall 84\% of the knowledge.
Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs
Authors: Sangpyo Kim, Wonkyung Jung, Jaiyoung Park, Jung Ho Ahn
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2012.01968
Pdf link: https://arxiv.org/pdf/2012.01968
Abstract Homomorphic encryption (HE) draws huge attention as it provides a way of privacy-preserving computations on encrypted messages. Number Theoretic Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints in the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve 4.2x speedup on a modern GPU.
Detection of False-Reading Attacks in the AMI Net-Metering System
Authors: Mahmoud M. Badr, Mohamed I. Ibrahem, Mohamed Mahmoud, Mostafa M. Fouda, Waleed Alasmary
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01983
Pdf link: https://arxiv.org/pdf/2012.01983
Abstract In smart grid, malicious customers may compromise their smart meters (SMs) to report false readings to achieve financial gains illegally. Reporting false readings not only causes hefty financial losses to the utility but may also degrade the grid performance because the reported readings are used for energy management. This paper is the first work that investigates this problem in the net-metering system, in which one SM is used to report the difference between the power consumed and the power generated. First, we prepare a benign dataset for the net-metering system by processing a real power consumption and generation dataset. Then, we propose a new set of attacks tailored for the net-metering system to create malicious dataset. After that, we analyze the data and we found time correlations between the net meter readings and correlations between the readings and relevant data obtained from trustworthy sources such as the solar irradiance and temperature. Based on the data analysis, we propose a general multi-data-source deep hybrid learning-based detector to identify the false-reading attacks. Our detector is trained on net meter readings of all customers besides data from the trustworthy sources to enhance the detector performance by learning the correlations between them. The rationale here is that although an attacker can report false readings, he cannot manipulate the solar irradiance and temperature values because they are beyond his control. Extensive experiments have been conducted, and the results indicate that our detector can identify the false-reading attacks with high detection rate and low false alarm.
Radar Artifact Labeling Framework (RALF): Method for Plausible Radar Detections in Datasets
Authors: Simon T. Isele, Marcel P. Schilling, Fabian E. Klein, Sascha Saralajew, J. Marius Zoellner
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2012.01993
Pdf link: https://arxiv.org/pdf/2012.01993
Abstract Research on localization and perception for Autonomous Driving is mainly focused on camera and LiDAR datasets, rarely on radar data. Manually labeling sparse radar point clouds is challenging. For a dataset generation, we propose the cross sensor Radar Artifact Labeling Framework (RALF). Automatically generated labels for automotive radar data help to cure radar shortcomings like artifacts for the application of artificial intelligence. RALF provides plausibility labels for radar raw detections, distinguishing between artifacts and targets. The optical evaluation backbone consists of a generalized monocular depth image estimation of surround view cameras plus LiDAR scans. Modern car sensor sets of cameras and LiDAR allow to calibrate image-based relative depth information in overlapping sensing areas. K-Nearest Neighbors matching relates the optical perception point cloud with raw radar detections. In parallel, a temporal tracking evaluation part considers the radar detections' transient behavior. Based on the distance between matches, respecting both sensor and model uncertainties, we propose a plausibility rating of every radar detection. We validate the results by evaluating error metrics on semi-manually labeled ground truth dataset of $3.28\cdot10^6$ points. Besides generating plausible radar detections, the framework enables further labeled low-level radar signal datasets for applications of perception and Autonomous Driving learning tasks.
Ontology-based and User-focused Automatic Text Summarization (OATS): Using COVID-19 Risk Factors as an Example
Authors: Po-Hsu Allen Chen, Amy Leibrand, Jordan Vasko, Mitch Gauthier
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2012.02028
Pdf link: https://arxiv.org/pdf/2012.02028
Abstract This paper proposes a novel Ontology-based and user-focused Automatic Text Summarization (OATS) system, in the setting where the goal is to automatically generate text summarization from unstructured text by extracting sentences containing the information that aligns to the user's focus. OATS consists of two modules: ontology-based topic identification and user-focused text summarization; it first utilizes an ontology-based approach to identify relevant documents to user's interest, and then takes advantage of the answers extracted from a question answering model using questions specified from users for the generation of text summarization. To support the fight against the COVID-19 pandemic, we used COVID-19 risk factors as an example to demonstrate the proposed OATS system with the aim of helping the medical community accurately identify relevant scientific literature and efficiently review the information that addresses risk factors related to COVID-19.
Recursive Tree Grammar Autoencoders
Authors: Benjamin Paassen, Irena Koprinska, Kalina Yacef
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2012.02097
Pdf link: https://arxiv.org/pdf/2012.02097
Abstract Machine learning on tree data has been mostly focused on trees as input. Much less research has covered trees as output, like in chemical molecule optimization or hint generation for intelligent tutoring systems. In this work, we propose recursive tree grammar autoencoders (RTG-AEs), which encode trees via a bottom-up parser and decode trees via a tree grammar, both controlled by a neural network that minimizes the variational autoencoder loss. The resulting encoding and decoding functions can then be employed in subsequent tasks, such as optimization and time series prediction. Our key message is that combining grammar knowledge with recursive processing improves performance compared to using either grammar knowledge or non-sequential processing, but not both. In particular, we show experimentally that our model improves autoencoding error, training time, and optimization score on four benchmark datasets compared to baselines from the literature.
BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling
Authors: Jing Su, Qingyun Dai, Frank Guerin, Mian Zhou
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2012.02128
Pdf link: https://arxiv.org/pdf/2012.02128
Abstract Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.
Throughput and Capacity Evaluation of 5G New Radio Non-Terrestrial Networks with LEO Satellites
Authors: Jonas Sedin, Luca Feltrin, Xingqin Lin
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2012.02136
Pdf link: https://arxiv.org/pdf/2012.02136
Abstract A non-terrestrial network (NTN), a term coined by the 3rd Generation Partnership Project (3GPP), refers to a network utilizing airborne or spaceborne payload for communication. The use of NTN has the potential of facilitating providing connectivity to underserved areas. This has motivated the work in 3GPP on evolving the fifth generation (5G) wireless access technology, known as new radio (NR), to support NTN. The broadband opportunities promised by NTN with low Earth orbit (LEO) satellites have attracted much attention, but the performance of LEO NTN using 5G NR has not been well studied. In this paper, we address this gap by analyzing and evaluating the throughput and capacity performance of LEO NTN. The evaluation results show that the downlink capacity of a LEO satellite in S band with 30 MHz bandwidth serving handheld terminal is about 600 Mbps and the downlink capacity of a LEO satellite in Ka band with 400 MHz bandwidth serving very small aperture terminal (VSAT) is about 7 Gbps. For a LEO NTN similar to the Kuiper project proposed by Amazon, we find that, due to the large cell sizes in the LEO NTN, the area capacity density is moderate: 1-10 kbps/km$^2$ in the S band downlink and 14-120 kbps/km$^2$ in the Ka band downlink depending on latitude.

dajinstory / daily-arxiv-noti

New submissions for Fri, 4 Dec 20 #34

Keyword: detection

Video Anomaly Detection by Estimating Likelihood of Representations

Differential Morphed Face Detection Using Deep Siamese Networks

Mutual Information Maximization on Disentangled Representations for Differential Morph Detection

Multimodal Contact Detection using Auditory and Force Features for Reliable Object Placing in Household Environments

SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change

Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D

Motion-based Camera Localization System in Colonoscopy Videos

Feature-Based Software Design Pattern Detection

A Study on the Autoregressive and non-Autoregressive Multi-label Learning

Learning Disentangled Intent Representations for Zero-shot Intent Detection

Parallel Residual Bi-Fusion Feature Pyramid Network for Accurate Single-Shot Object Detection

Dual Refinement Feature Pyramid Networks for Object Detection

Automatic Routability Predictor Development Using Neural Architecture Search

On Root Detection Strategies for Android Devices

D-Unet: A Dual-encoder U-Net for Image Splicing Forgery Detection and Localization

Make One-Shot Video Object Segmentation Efficient Again

Hotspot identification for Mapper graphs

Label Enhanced Event Detection with Heterogeneous Graph Attention Networks

Patch2Pix: Epipolar-Guided Pixel-Level Correspondences

Teaching the Machine to Explain Itself using Domain Knowledge

Classifying Malware Using Function Representations in a Static Call Graph

Co-mining: Self-Supervised Learning for Sparsely Annotated Object Detection

IoT DoS and DDoS Attack Detection using ResNet

Automated Artefact Relevancy Determination from Artefact Metadata and Associated Timeline Events

Detection of False-Reading Attacks in the AMI Net-Metering System

Radar Artifact Labeling Framework (RALF): Method for Plausible Radar Detections in Datasets

Fast Track: Synchronized Behavior Detection in Streaming Tensors

Context in Informational Bias Detection

SuperOCR: A Conversion from Optical Character Recognition to Image Captioning

Characteristics of Reversible Circuits for Error Detection

A Multi-task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation

SAFCAR: Structured Attention Fusion for Compositional Action Recognition

Deep Inverse Sensor Models as Priors for evidential Occupancy Mapping

Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline

Keyword: segmentation

Contour Transformer Network for One-shot Segmentation of Anatomical Structures

From a Fourier-Domain Perspective on Adversarial Examples to a Wiener Filter Defense for Semantic Segmentation

Single-shot Path Integrated Panoptic Segmentation

Learning Hyperbolic Representations for Unsupervised 3D Segmentation

Dual-Branch Network with Dual-Sampling Modulated Dice Loss for Hard Exudate Segmentation from Colour Fundus Images

Dual Refinement Feature Pyramid Networks for Object Detection

Make One-Shot Video Object Segmentation Efficient Again

A small note on variation in segmentation annotations

Aerial Imagery Pixel-level Segmentation

A Multi-task Contextual Atrous Residual Network for Brain Tumor Detection & Segmentation

Towards Part-Based Understanding of RGB-D Scans

Robust Instance Segmentation through Reasoning about Multi-Object Occlusion

Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline

Keyword: human pose

Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation

Holistic 3D Human and Scene Mesh Estimation from Single View Images

Keyword: super resolution

Keyword: generation

Data-driven Analysis of Turbulent Flame Images

TAN-NTM: Topic Attention Networks for Neural Topic Modeling

Deep learning collocation method for solid mechanics: Linear elasticity, hyperelasticity, and plasticity as examples

Corrected subdivision approximation of piecewise smooth functions

Meta-Generating Deep Attentive Metric for Few-shot Classification

YAP: Tool Support for Deriving Safety Controllers from Hazard Analysis and Risk Assessments

VICToRy: Visual Interactive Consistency Management in Tolerant Rule-based Systems

A Novel 3D Non-Stationary Multi-Frequency Multi-Link Wideband MIMO Channel Model

DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances

Attributes Aware Face Generation with Generative Adversarial Networks

Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training

Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization

Competition analysis on the over-the-counter credit default swap market

An Empirical Study of Derivative-Free-Optimization Algorithms for Targeted Black-Box Attacks in Deep Neural Networks

Clustering-based Automatic Construction of Legal Entity Knowledge Base from Contracts

Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

Detection of False-Reading Attacks in the AMI Net-Metering System

Radar Artifact Labeling Framework (RALF): Method for Plausible Radar Detections in Datasets

Ontology-based and User-focused Automatic Text Summarization (OATS): Using COVID-19 Risk Factors as an Example

Recursive Tree Grammar Autoencoders

BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling

Throughput and Capacity Evaluation of 5G New Radio Non-Terrestrial Networks with LEO Satellites