csukuangfj opened 1 year ago
Here is the command for testing
./build/bin/sherpa-ncnn \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/tokens.txt \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param \
./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin \
./test-files_en_speech_jfk_11s.wav \
1 \
greedy_search
And here is the result
Disable fp16 for Zipformer encoder
Don't Use GPU. has_gpu: 0, config.use_vulkan_compute: 1
RecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=ModelConfig(encoder_param="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param", encoder_bin="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin", decoder_param="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param", decoder_bin="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin", joiner_param="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param", joiner_bin="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin", tokens="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/tokens.txt", encoder num_threads=4, decoder num_threads=4, joiner num_threads=4), decoder_config=DecoderConfig(method="greedy_search", num_active_paths=4), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.4, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=False)
wav filename: ./test-files_en_speech_jfk_11s.wav
wav duration (s): 11
Started!
Done!
Recognition result for ./test-files_en_speech_jfk_11s.wav
text: AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY
timestamps: 0.8 1.28 1.44 1.68 1.8 1.92 2 2.12 2.2 2.36 2.52 2.8 4 4.2 4.44 5.76 6.08 6.32 6.6 6.84 7.08 7.36 7.64 8.64 8.8 9.04 9.32 9.6 9.8 10 10.16 10.44 10.76
Elapsed seconds: 1.150 s
Real time factor (RTF): 1.150 / 11.000 = 0.105
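The real-time factor on the last line is simply elapsed processing time divided by audio duration (values below 1 mean faster than real time). A quick sanity check of the numbers from the log above:

```python
# Reproduce the RTF computation from the log output above.
elapsed_s = 1.150    # "Elapsed seconds" from the log
duration_s = 11.0    # "wav duration (s)" from the log

rtf = elapsed_s / duration_s
print(f"RTF: {elapsed_s:.3f} / {duration_s:.3f} = {rtf:.3f}")
# → RTF: 1.150 / 11.000 = 0.105
```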
Note: the above test was run on macOS, but it can also be run on a Raspberry Pi.
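For anyone curious about the EndpointConfig shown in the log above: an endpoint is detected when any one of the three rules fires. The sketch below is a minimal reimplementation of that semantics for illustration only, not the actual sherpa-ncnn code; the function names are made up, while the field names and thresholds are taken from the config dump:

```python
from dataclasses import dataclass


@dataclass
class EndpointRule:
    # Field names mirror the EndpointConfig in the log above.
    must_contain_nonsilence: bool
    min_trailing_silence: float   # seconds of silence after the last token
    min_utterance_length: float   # total utterance length in seconds


def rule_fires(rule: EndpointRule, decoded_something: bool,
               trailing_silence: float, utterance_length: float) -> bool:
    """A rule fires only when all of its conditions hold."""
    if rule.must_contain_nonsilence and not decoded_something:
        return False
    return (trailing_silence >= rule.min_trailing_silence
            and utterance_length >= rule.min_utterance_length)


# Thresholds copied from the config dump in the log.
RULES = [
    EndpointRule(False, 2.4, 0),   # rule1: long silence, even without speech
    EndpointRule(True, 1.4, 0),    # rule2: shorter silence once speech was decoded
    EndpointRule(False, 0, 20),    # rule3: utterance longer than 20 s
]


def is_endpoint(decoded_something: bool, trailing_silence: float,
                utterance_length: float) -> bool:
    return any(rule_fires(r, decoded_something, trailing_silence,
                          utterance_length) for r in RULES)
```

For example, 1.5 s of trailing silence ends the utterance only if some speech was already decoded (rule2), whereas 2.5 s of silence ends it unconditionally (rule1).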
I will test the new models soon, thanks for mentioning them 👍
Did a quick test run; the results are definitely much better! 😎👍
Some examples:
Old: PLAY HARD WIFE HERSELF DESTRUCTS BY THE TALLICA
New: PLAY HARD WIRE TO SELF DISTRACTS BY METELICA
(pretty close)
Old: WHOM HE WAY WHO THE TRAIN
New: SHOW ME THE WAY FROM NEW YORK TO CHICAGO WITH THE TRAIN
(nailed it)
Old: SAID WHEN HE WILL DECREASE
New: SAID THE TWO TO TWENTY ONE DEGREES
Original: "Set the heater to 21 degrees" 😑
Do you have instructions on how to include language models, or maybe a way to add/emphasize custom vocabulary somehow (dynamic graph etc.)?
The model
small-2023-01-09
is not our best-performing model. Please have a look at our latest streaming zipformer models at https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/zipformer-transucer-models.html
They can achieve a reasonable WER even without an LM and are quite fast.