Audio token completion using VALL-E for impaired audio

WWW As a researcher who has recently completed a literature review (summarized here), I would like to see whether atypical speech can be synthesized as an audio completion task using audio tokens extracted from neural codecs. Two models, VALL-E and VALL-E X, were recently released, with open source implementations that are not by the original authors: https://github.com/Plachtaa/VALL-E-X/tree/master

AC The goal of this ticket is to explore whether audio prompt completion is a suitable task for our use case. A few unknowns:

How can we keep the audio completion grounded so that it does not hallucinate content?
How can we get accurate speech recognition transcriptions, and/or find ways to condition the generation so that it produces the intended content?
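
One possible way to make the grounding question measurable is to transcribe each generated completion with an ASR system and compare it against the text we intended to synthesize. The sketch below is only an illustration under that assumption: it takes transcripts as plain strings (the ASR step is left out), and the names `word_error_rate`, `pick_grounded`, and the `max_wer` threshold of 0.2 are hypothetical choices, not part of VALL-E X or any existing pipeline.

```python
from typing import List, Optional, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: empty reference matches only empty output.
        return 0.0 if not hyp else 1.0

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def pick_grounded(
    target_text: str,
    candidate_transcripts: List[str],
    max_wer: float = 0.2,  # hypothetical acceptance threshold
) -> Tuple[Optional[int], float]:
    """Return (index, WER) of the best candidate, or (None, WER) if none is close enough."""
    scored = [
        (word_error_rate(target_text, transcript), index)
        for index, transcript in enumerate(candidate_transcripts)
    ]
    best_wer, best_index = min(scored)
    if best_wer <= max_wer:
        return best_index, best_wer
    return None, best_wer


if __name__ == "__main__":
    # Transcripts would come from an ASR model run on each sampled completion.
    transcripts = ["open a window", "open the door"]
    print(pick_grounded("open the door", transcripts))
```

Sampling several completions and keeping only those whose transcript stays close to the target text is a filter, not a fix: it depends on the ASR system being accurate on the speech in question, which is itself one of the unknowns listed above.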

SlangLab-NU / torgo_vc