The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol, and the statistical significance and robustness of the benchmark.