NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

How to get text for speech2text? #528

Closed ghost closed 4 years ago

ghost commented 4 years ago

When I run the interactive_infer script for speech2text, it gives a float array, not text. How can I get text instead? Can anyone help me with this urgently?

WillemGooderham1 commented 4 years ago

If you are trying to get text for many audio files, the best option is to use infer mode, which generates a text file containing all of the transcriptions. If you need to use interactive infer, look through the sparse_tensor_to_chars and infer functions in speech2text.py and the get_interactive_infer function in utils.py.
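To illustrate what sparse_tensor_to_chars has to do, here is a minimal sketch of greedy CTC decoding: collapse the per-timestep probability array into text by taking the argmax at each step, merging repeats, and dropping blanks. The vocab here is a stand-in toy alphabet, not OpenSeq2Seq's actual vocab.txt; the blank-at-the-end layout is an assumption for illustration.

```python
import numpy as np

# Hypothetical vocabulary: index -> character; blank is appended at the end.
vocab = list("abc ")
blank_id = len(vocab)

def greedy_ctc_decode(logits):
    """Collapse a (T x V+1) probability array into a string:
    argmax each timestep, merge consecutive repeats, drop blanks."""
    ids = np.argmax(logits, axis=1)
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# Toy example: 5 timesteps over 5 classes (4 chars + blank).
logits = np.array([
    [0.9, 0.0, 0.0, 0.0, 0.1],  # 'a'
    [0.9, 0.0, 0.0, 0.0, 0.1],  # repeated 'a' -> merged
    [0.0, 0.0, 0.0, 0.0, 1.0],  # blank -> dropped
    [0.0, 0.8, 0.1, 0.0, 0.1],  # 'b'
    [0.0, 0.0, 0.9, 0.0, 0.1],  # 'c'
])
print(greedy_ctc_decode(logits))  # -> "abc"
```

This is the greedy variant; the beam-search decoder with a language model does the same collapsing while scoring multiple candidate prefixes.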

ghost commented 4 years ago

I tried both infer and interactive_infer mode, with one and with several wave files, but it only gives a probability distribution, not the actual transcription. I think my decoder settings are wrong. These are the infer_params I set:

```python
infer_params = {
    "data_layer": Speech2TextDataLayer,
    "data_layer_params": {
        "backend": "librosa",
        "num_audio_features": 96,
        "input_type": "spectrogram",
        "vocab_file": "open_seq2seq/test_utils/toy_speech_data/vocab.txt",
        "dataset_files": [
            "data/test.csv",
        ],
        "shuffle": False,
    },
}
```

What did I get wrong?

WillemGooderham1 commented 4 years ago

Have you trained your own model, or are you using the released NVIDIA one? Which architecture are you using: Jasper, DeepSpeech2, or Wave2Letter+? What are your decoder_params?

ghost commented 4 years ago

I trained my own model and used DeepSpeech2.

WillemGooderham1 commented 4 years ago

For a similar model, my parameters for interactive infer and infer are as follows:

```python
infer_params = {
    "data_layer": Speech2TextDataLayer,
    "data_layer_params": {
        "dataset_files": [
            "/ATC_DATA/ldc_test_clean.csv",
        ],
        "shuffle": False,
    },
}

interactive_infer_params = {
    "data_layer": Speech2TextDataLayer,
    "data_layer_params": {
        "num_audio_features": 64,
        "input_type": "spectrogram",
        "vocab_file": "./Resources/DeepSpeech2/vocab.txt",
        "dataset_files": [],
        "shuffle": False,
    },
}
```

And my decoder params were like this:

```python
"decoder": FullyConnectedCTCDecoder,
"decoder_params": {
    "use_language_model": True,

    # params for decoding the sequence with language model
    "beam_width": 512,
    "alpha": 2.0,
    "beta": 1.0,

    "decoder_library_path": "./resources/DeepSpeech2/Packages/libctc_decoder_with_kenlm.so",
    "lm_path": "./resources/DeepSpeech2/lm/ds2-lm.binary",
    "trie_path": "./resources/DeepSpeech2/lm/ds2-lm.trie",
    "alphabet_config_path": "./resources/DeepSpeech2/vocab.txt",
},
"loss": CTCLoss,
"loss_params": {},
```
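For context on alpha and beta in the decoder params: in DeepSpeech-style beam-search decoders with a KenLM language model, each candidate is commonly scored as the acoustic log-probability plus alpha times the language-model log-probability plus beta times the word count. A minimal sketch with illustrative (made-up) score values:

```python
# Weights taken from the decoder_params above.
alpha, beta = 2.0, 1.0

def candidate_score(log_p_ctc, log_p_lm, word_count):
    """DeepSpeech-style rescoring: acoustic score, plus a weighted
    language-model score, plus a word-insertion bonus."""
    return log_p_ctc + alpha * log_p_lm + beta * word_count

# Example: a three-word candidate with hypothetical log-probabilities.
print(candidate_score(-4.2, -1.5, 3))  # -4.2 + 2.0*(-1.5) + 1.0*3 = -4.2
```

Raising alpha trusts the language model more; raising beta counteracts the LM's bias toward fewer, longer words.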

If your parameters differ, try changing them. If the output still doesn't work, then without more information I'm not sure how to help you.

ghost commented 4 years ago

Thanks for your advice. I'll try it with your settings as a reference.