Latent factor models for imputation and temporal modeling of urban sound data

Environmental noise has been shown to adversely affect quality of life in urban environments. Noise is typically reported event by event by individual citizens, and the response from city officials can take days in large cities such as NYC. Sounds of NYC (SONYC) is an NYC-focused solution that uses a network of low-cost wireless recording devices to capture continuous, real-time audio across the city, providing a time-continuous, spatially distributed set of audio recordings for further research on urban sound reporting, analysis, and enforcement. Over the lifetime of the SONYC project, more than 50 years' worth of usable audio data has been collected, but sensor downtime introduces discontinuities and empty segments in the data. Additionally, gaining insight into the temporal dynamics of the system can deepen our general understanding of the city’s soundscape. To address both points, we propose to model the dynamics of the urban soundscape in SONYC data for the following tasks:

Impute missing and/or low-quality data that break the uniformity of the existing data
Evaluate the latent model's ability to capture longer-term temporal structure, using the proxy task of auditory scene classification
Explore the interpretability of the latent-space model through exploratory data analysis

Currently, we have convenient access to SONYC data from 2017 across about 40 sensors (of known location) in three forms: timestamped (and encrypted) raw audio (10-second clips, sampled at roughly uniform and widely spaced intervals); deep audio embeddings extracted from these clips using models trained on a self-supervised audio-visual correspondence task (known as OpenL3); and predictions from an urban sound tagging model. We propose learning a latent-state dynamical system using Kalman filtering and Kalman-filtering-inspired methods. Our primary interest is in methods that incorporate nonlinear dynamics into the Kalman filtering framework via deep learning, such as Deep Kalman Filters [7], Kalman Variational Autoencoders, and other deep latent-state dynamics models. For comparison, we may also investigate traditional Kalman filters. In our case, the observations will be the OpenL3 embeddings. We propose the following experiments for the given tasks:

Reconstruct portions of held-out data using filtering methods, evaluated on the mean L2 distance between the reconstructed and held-out audio data.
Use our latent-state model as a feature extractor for auditory scene classification on the TAU Urban Acoustic Scenes 2019 dataset [8], evaluated by comparing test accuracy with results from the associated DCASE 2019 Challenge task.
Using the latent-state model as a feature extractor, perform clustering on the dataset and qualitatively interpret the resulting clusters. One way to do so is to study whether the distributions of predicted urban sound tags in each cluster are meaningful.
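
The first experiment can be illustrated with a minimal sketch of the traditional (linear) Kalman filter baseline mentioned above. It is not the Deep Kalman Filter or any code from this project. The function names `kalman_filter_impute` and `mean_l2_distance` are hypothetical, and the model parameters `A`, `C`, `Q`, `R`, `mu0`, and `P0` are assumed to be given, whereas in practice they would have to be learned from data. Missing observations are marked with NaN: the filter skips the update step for them, and its one-step-ahead prediction serves as the imputed value.

```python
import numpy as np


def kalman_filter_impute(y, A, C, Q, R, mu0, P0):
    """Run a linear Kalman filter over y (T x d_obs).

    Rows of y containing NaN are treated as missing: the update step is
    skipped and only the prediction is propagated. Returns the one-step-ahead
    predicted observations (T x d_obs), usable as imputed values.
    """
    mu, P = mu0.copy(), P0.copy()
    y_pred = np.empty_like(y, dtype=float)
    for t in range(y.shape[0]):
        if t > 0:
            # Predict: propagate the latent mean and covariance one step.
            mu = A @ mu
            P = A @ P @ A.T + Q
        y_pred[t] = C @ mu
        if not np.isnan(y[t]).any():
            # Update: correct the prediction with the observed embedding.
            S = C @ P @ C.T + R                # innovation covariance
            K = np.linalg.solve(S, C @ P).T    # Kalman gain P C^T S^{-1}
            mu = mu + K @ (y[t] - C @ mu)
            P = (np.eye(len(mu)) - K @ C) @ P
    return y_pred


def mean_l2_distance(y_true, y_hat):
    """Mean Euclidean distance between corresponding rows."""
    return float(np.mean(np.linalg.norm(y_true - y_hat, axis=1)))


# Toy usage: a constant 1-D signal observed directly, with one step missing.
y = np.ones((6, 1))
y_obs = y.copy()
y_obs[3] = np.nan                              # simulate sensor downtime
I = np.eye(1)
y_pred = kalman_filter_impute(y_obs, A=I, C=I, Q=1e-4 * I, R=1e-6 * I,
                              mu0=np.zeros(1), P0=I)
held_out = np.isnan(y_obs).any(axis=1)
score = mean_l2_distance(y[held_out], y_pred[held_out])
```

Here `score` is the mean L2 distance on the held-out step only. With real data, `y` would hold the OpenL3 embeddings for one sensor, ordered by timestamp, and the held-out rows would be masked before filtering.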

auroracramer / sonyc-kalman