Closed flassTer closed 5 years ago
Sure, it is possible, but you'll need a labeled speech dataset whose transcripts contain all of these symbols. Of course, the larger the dataset, the better. Normalized LibriSpeech doesn't have such characters. Another option is to train a punctuation prediction model on a large text corpus and then use it to post-process the ASR transcriptions.
@vsl9, thank you for the response. However, if I use the comma symbol, wouldn't it break the CSV file that is read during training/validation?
We use pandas for parsing CSV files. It should work fine with commas as long as a transcript that contains a comma is enclosed in double quotes:
wav_filename,transcript
0000.wav,there is no comma
0001.wav,"here is a comma , symbol"
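If you want to check this yourself, here is a minimal sketch that parses the same sample with `pandas.read_csv` using its default settings. The in-memory `csv_text` string and the variable names are only for illustration, and the exact arguments the training pipeline passes to pandas are not shown here, so treat this as an approximation of what happens at load time rather than the actual loader code.

```python
import io

import pandas as pd

# Same sample as above, held in memory instead of a file (illustration only).
csv_text = (
    "wav_filename,transcript\n"
    "0000.wav,there is no comma\n"
    '0001.wav,"here is a comma , symbol"\n'
)

# With default settings, a double-quoted field may contain the delimiter.
df = pd.read_csv(io.StringIO(csv_text))
transcripts = df["transcript"].tolist()
print(transcripts)

# Writing the frame back out: to_csv quotes fields that contain a comma
# by default, so the file stays parseable after a round trip.
out = io.StringIO()
df.to_csv(out, index=False)
round_trip = pd.read_csv(io.StringIO(out.getvalue()))
```

If you generate the CSV from a script, writing it with `DataFrame.to_csv` (or Python's `csv` module) instead of joining strings by hand should take care of the quoting for you.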
Thank you, @vsl9.
Hello everyone, is it possible to add new symbols to the vocabulary? For example, I can see that the "toy set" vocabulary contains the whole alphabet a-z plus a blank character. Is it possible to just add "!", "?", ",", and "." to this text file, so that sentiment analysis after transcription can be more accurate?
Thank you.