Abstract
Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency across cross-modality inputs, such as video/audio or video/text pairs. Although convenient to formulate and leverage in practice, such cross-modality alignment (CMA) provides only weak and noisy supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in the commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame; and the semantic misalignment would only be more unpredictable for raw videos from the internet. We conjecture that this might cause conflicts and biases among modalities, and may hence prevent CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training using only instructional videos, there exist strong gradient conflicts between different CMA losses within the same (video, audio, text) triplet, indicating that they are a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment: modifying different CMA loss gradients for each sample triplet, so that their gradient directions are more aligned; and (ii) gradient-based curriculum learning: leveraging the gradient conflict information as an indicator of sample noisiness, to develop a curriculum learning strategy that prioritizes training on less noisy sample triplets. Applying those techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated non-narrative Youtube8M dataset to further improve the state of the art.
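To make the idea of a gradient conflict concrete, here is a minimal sketch under stated assumptions; it is not the paper's implementation. It assumes the two CMA loss gradients for one sample triplet have been flattened into plain lists, measures conflict with cosine similarity, and realigns a gradient by projecting out its conflicting component. That projection is a PCGrad-style rule swapped in for illustration, since the abstract does not state the exact update. All function names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equally sized gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; negative values indicate conflicting gradients."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)

def realign(g, ref):
    """Assumed rule: if g conflicts with ref, remove g's component along ref."""
    d = dot(g, ref)
    if d >= 0.0:
        return list(g)  # no conflict, leave the gradient unchanged
    scale = d / dot(ref, ref)
    return [x - scale * r for x, r in zip(g, ref)]

def curriculum_order(triplet_grads):
    """Order sample indices from least to most conflicting.

    triplet_grads holds one (video-audio gradient, video-text gradient)
    pair per sample triplet.
    """
    scores = [cosine(g_va, g_vt) for g_va, g_vt in triplet_grads]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical gradients of the video-audio and video-text losses.
g_va, g_vt = [1.0, 0.0], [-1.0, 1.0]
conflict = cosine(g_va, g_vt)    # negative: the two losses disagree
g_va_fixed = realign(g_va, g_vt)  # now orthogonal to g_vt
```

After realignment the video-audio gradient no longer opposes the video-text gradient, and `curriculum_order` gives one possible way to visit the least conflicting triplets first.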
This is not the End: Rethinking Serverless Function Termination
Authors: Kalev Alpernas, Aurojit Panda, Mooly Sagiv
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
Abstract
Elastic scaling is one of the central benefits provided by serverless platforms, and requires that they scale resources up and down in response to changing workloads. Serverless platforms scale down resources by terminating previously launched instances (which are containers or processes). The serverless programming model ensures that terminating instances is safe assuming all application code running on the instance has either completed or timed out. Safety thus depends on the serverless platform correctly determining that application processing is complete. In this paper, we start with the observation that current serverless platforms do not account for pending asynchronous I/O operations when determining whether application processing is complete. These platforms are thus unsafe when executing programs that use asynchronous I/O, and incorrectly deciding that application processing has terminated can result in data inconsistency when these platforms are used. We show that the reason for this problem is that current serverless semantics couple termination and response generation in serverless applications. We address this problem by proposing an extension to current semantics that decouples response generation and termination, and demonstrate the efficacy and benefits of our proposal by extending OpenWhisk, an open source serverless platform.
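The hazard described in this abstract can be illustrated with a small asyncio sketch. This is an assumption-laden toy, not OpenWhisk's API or the authors' extension: `handler`, `run_coupled`, and `run_decoupled` are hypothetical names, a dictionary stands in for external storage, and task cancellation stands in for the platform terminating an instance.

```python
import asyncio

async def slow_write(store, key, value):
    """Stand-in for an asynchronous I/O call, such as a database write."""
    await asyncio.sleep(0.01)
    store[key] = value

async def handler(store, pending):
    """Hypothetical function: starts a write and responds without awaiting it."""
    pending.append(asyncio.create_task(slow_write(store, "order", "paid")))
    return "200 OK"

async def run_coupled(store):
    """Coupled semantics (assumed model): terminate once the response exists."""
    pending = []
    response = await handler(store, pending)
    for task in pending:
        task.cancel()  # models instance teardown; the pending write is lost
    await asyncio.gather(*pending, return_exceptions=True)
    return response

async def run_decoupled(store):
    """Decoupled semantics (assumed model): respond, then wait for pending I/O."""
    pending = []
    response = await handler(store, pending)
    await asyncio.gather(*pending)  # termination is deferred until I/O completes
    return response

coupled_store, decoupled_store = {}, {}
coupled_response = asyncio.run(run_coupled(coupled_store))
decoupled_response = asyncio.run(run_decoupled(decoupled_store))
```

Both runs return the same response to the caller, but only the decoupled run leaves the write in the store, which is the kind of data inconsistency the abstract warns about.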
GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting
Authors: Alexander Cui, Sergio Casas, Kelvin Wong, Simon Suo, Raquel Urtasun
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract
The task of motion forecasting is critical for self-driving vehicles (SDVs) to be able to plan a safe maneuver. Towards this goal, modern approaches reason about the map, the agents' past trajectories and their interactions in order to produce accurate forecasts. The predominant approach has been to encode the map and other agents in the reference frame of each target agent. However, this approach is computationally expensive for multi-agent prediction as inference needs to be run for each agent. To tackle the scaling challenge, the solution thus far has been to encode all agents and the map in a shared coordinate frame (e.g., the SDV frame). However, this is sample inefficient and vulnerable to domain shift (e.g., when the SDV visits uncommon states). In contrast, in this paper, we propose an efficient shared encoding for all agents and the map without sacrificing accuracy or generalization. Towards this goal, we leverage pair-wise relative positional encodings to represent geometric relationships between the agents and the map elements in a heterogeneous spatial graph. This parameterization allows us to be invariant to scene viewpoint, and save online computation by re-using map embeddings computed offline. Our decoder is also viewpoint agnostic, predicting agent goals on the lane graph to enable diverse and context-aware multimodal prediction. We demonstrate the effectiveness of our approach on the urban Argoverse 2 benchmark as well as a novel highway dataset.
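The viewpoint invariance claimed for pair-wise relative encodings can be checked with a short sketch. It assumes each agent or map element is summarized by a 2D pose `(x, y, heading)` and that the pairwise feature is the pose of one element expressed in the other's frame; the paper's actual encoding may differ, and both function names are hypothetical.

```python
import math

def relative_pose(src, dst):
    """Pose of dst expressed in the frame of src: (dx, dy, dheading)."""
    sx, sy, sh = src
    dx_w, dy_w = dst[0] - sx, dst[1] - sy
    # Rotate the world-frame offset by -sh to express it in src's frame.
    c, s = math.cos(-sh), math.sin(-sh)
    dx = c * dx_w - s * dy_w
    dy = s * dx_w + c * dy_w
    # Wrap the heading difference to (-pi, pi].
    dh = math.atan2(math.sin(dst[2] - sh), math.cos(dst[2] - sh))
    return (dx, dy, dh)

def change_viewpoint(pose, tx, ty, theta):
    """Re-express a pose after rotating the scene by theta and shifting by (tx, ty)."""
    x, y, h = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty, h + theta)

# Hypothetical agent and lane-graph node poses in some world frame.
agent = (2.0, 1.0, 0.3)
lane_node = (5.0, 4.0, 1.2)

before = relative_pose(agent, lane_node)
after = relative_pose(
    change_viewpoint(agent, 10.0, -3.0, 2.0),
    change_viewpoint(lane_node, 10.0, -3.0, 2.0),
)
```

`before` and `after` agree up to floating-point error, so a model that consumes only such pairwise features sees the same input regardless of the shared coordinate frame the scene is expressed in.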
Keyword: calibration
Graph-Based Multi-Camera Soccer Player Tracker
Authors: Jacek Komorowski, Grzegorz Kurzejamski
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The paper presents a multi-camera tracking method intended for tracking soccer players in long shot video recordings from multiple calibrated cameras installed around the playing field. The large distance to the camera makes it difficult to visually distinguish individual players, which adversely affects the performance of traditional solutions relying on the appearance of tracked objects. Our method focuses on individual player dynamics and interactions between neighboring players to improve tracking performance. To overcome the difficulty of reliably merging detections from multiple cameras in the presence of calibration errors, we propose a novel tracking approach, where the tracker operates directly on raw detection heat maps from multiple cameras. Our model is trained on a large synthetic dataset generated using the Google Research Football Environment and fine-tuned using real-world data to reduce costs involved with ground truth preparation.
Efficient Extrinsic Calibration of Multi-Sensor 3D LiDAR Systems for Autonomous Vehicles using Static Objects Information
Authors: Brahayam Ponton, Magda Ferri, Lars Koenig, Marcus Bartels
Abstract
For an autonomous vehicle, the ability to sense its surroundings and to build an overall representation of the environment by fusing different sensor data streams is fundamental. To this end, the poses of all sensors need to be accurately determined. Traditional calibration methods are based on: 1) using targets specifically designed for calibration purposes in controlled environments, 2) optimizing a quality metric of the point clouds collected while traversing an unknown but static environment, or 3) optimizing the match among per-sensor incremental motion observations along a motion path fulfilling special requirements. In real scenarios, however, the online applicability of these methods can be limited, as such scenarios are typically highly dynamic, contain degenerate paths, and require fast computations. In this paper, we propose an approach that tackles some of these challenges by formulating the calibration problem as a joint but structured optimization problem of all sensor calibrations that takes as input a summary of the point cloud information consisting of ground points and pole detections. We demonstrate the efficiency and quality of the results of the proposed approach in a set of experiments with LiDAR simulation and real data from an urban trip.
HoloLens 2 Sensor Streaming
Abstract
We present a HoloLens 2 server application for streaming device data via TCP in real time. The server can stream data from the four grayscale cameras, depth sensor, IMU, front RGB camera, microphone, head tracking, eye tracking, and hand tracking. Each sent data frame has a timestamp and, optionally, the instantaneous pose of the device in 3D space. The server allows downloading device calibration data, such as camera intrinsics, and can be integrated into Unity projects as a plugin, with support for basic upstream capabilities. To achieve real time video streaming at full frame rate, we leverage the video encoding capabilities of the HoloLens 2. Finally, we present a Python library for receiving and decoding the data, which includes utilities that facilitate passing the data to other libraries. The source code, Python demos, and precompiled binaries are available at https://github.com/jdibenes/hl2ss.
Keyword: out of distribution detection
There is no result
Keyword: out-of-distribution detection
There is no result
Keyword: expected calibration error
There is no result
Keyword: overconfident
There is no result
Keyword: overconfidence
There is no result
Keyword: confidence
There is no result
Keyword: scaling
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
This is not the End: Rethinking Serverless Function Termination
GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting
Keyword: calibration
Graph-Based Multi-Camera Soccer Player Tracker
Efficient Extrinsic Calibration of Multi-Sensor 3D LiDAR Systems for Autonomous Vehicles using Static Objects Information
HoloLens 2 Sensor Streaming