Can you please tell me how the dataset should be structured to feed into this model, and which features you extracted from the audio, out of the options such as mel spectrogram, filterbank, MFCC, etc.?
How do I map each recording to its corresponding transcript characters?