NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0
1.54k stars, 371 forks

Vocabulary symbols #443

Closed: flassTer closed 5 years ago

flassTer commented 5 years ago

Hello everyone, is it possible to add new symbols to the vocabulary? For example, I can see that the "toy set" vocabulary contains the whole alphabet a-z plus a blank character. Is it possible to just add "!", "?", ",", "." to this text file, so that sentiment analysis after transcription can be more accurate?

Thank you.

vsl9 commented 5 years ago

Sure, it is possible. But you'll need a labeled speech dataset whose transcripts contain all these symbols. Of course, the larger the dataset, the better. Normalized LibriSpeech doesn't have such characters. Another option is to train a punctuation prediction model on a large text corpus and then use it to post-process ASR transcriptions.
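
Before retraining, it may help to check whether the transcripts actually contain the new symbols. The snippet below is only a rough sketch: `out_of_vocab_chars`, the sample transcripts, and the in-memory vocabulary sets are made up for illustration and are not part of OpenSeq2Seq, and it assumes the "blank character" in the vocabulary is a space.

```python
from collections import Counter
import string


def out_of_vocab_chars(transcripts, vocab):
    """Count characters in the transcripts that are not in the vocabulary."""
    counts = Counter()
    for text in transcripts:
        counts.update(ch for ch in text if ch not in vocab)
    return counts


# Assumed base vocabulary: a-z plus a space (hypothetical, for illustration).
base_vocab = set(string.ascii_lowercase) | {" "}
new_symbols = set("!?,.")
extended_vocab = base_vocab | new_symbols

# Hypothetical transcripts standing in for a labeled dataset.
transcripts = [
    "there is no comma",
    "here is a comma , symbol",
    "is it possible ?",
]

# Characters the model could not emit with the base vocabulary.
missing_before = out_of_vocab_chars(transcripts, base_vocab)
# After extending the vocabulary nothing should be left over.
missing_after = out_of_vocab_chars(transcripts, extended_vocab)

# New symbols with no occurrences have no training examples at all.
symbol_counts = Counter(ch for t in transcripts for ch in t if ch in new_symbols)
unseen_symbols = sorted(new_symbols - set(symbol_counts))

print(dict(missing_before), dict(missing_after), unseen_symbols)
```

Any symbol that shows up in `unseen_symbols` is in the vocabulary but never in the data, so the model has nothing to learn it from.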

flassTer commented 5 years ago

@vsl9, thank you for the response. However, if I use the comma symbol, wouldn't it mess up the CSV file that is read during train/validation mode?

vsl9 commented 5 years ago

We use pandas for parsing CSV files. It should work fine with commas if a transcript containing a comma is enclosed in double quotes:

wav_filename,transcript
0000.wav,there is no comma
0001.wav,"here is a comma , symbol"
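
A quick way to confirm this is to parse the same rows with `pandas.read_csv` from an in-memory string. This is just a minimal sketch using pandas defaults. It does not reproduce how OpenSeq2Seq itself calls pandas, and the variable names are illustrative.

```python
import io

import pandas as pd

# The same three lines as in the example above, held in memory.
csv_text = (
    "wav_filename,transcript\n"
    "0000.wav,there is no comma\n"
    '0001.wav,"here is a comma , symbol"\n'
)

# With default settings, a double-quoted field may contain the delimiter.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.loc[1, "transcript"])

# Writing the frame back out quotes only the fields that need it,
# so the file can be read again without losing the comma.
round_trip = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(round_trip.equals(df))
```

The quoted row still parses into two columns, and the comma stays inside the transcript. If the CSV is generated with `DataFrame.to_csv`, the quotes are added automatically for fields that contain a comma.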

flassTer commented 5 years ago

Thank you, @vsl9.