How to improve the accuracy of speech recognition?

seepine commented 10 months ago

I set up a vosk-server using Docker and downloaded the 1GB+ models for both Chinese (cn) and Japanese (ja). However, for every audio file I have tried, including audio from the internet and my own recordings, the recognized text is completely unrelated to the actual content.

This is the Node.js WebSocket client I am using:

const websocket = require('ws');
const fs = require("fs");
const ws = new websocket('ws://192.168.100.110:2710/asr');

ws.on('open', function open() {
    console.log('open');
    var readStream = fs.createReadStream('test-cn.wav');
    readStream.on('data', function (chunk) {
        ws.send(chunk);
    });
    readStream.on('end', function () {
        ws.send('{"eof" : 1}');
    });
});

ws.on('message', function incoming(data) {
    const message = data.toString('utf8');
    if (!message) {
        return
    }
    const obj = JSON.parse(message)
    if (obj.text) {
        console.log('recv', obj);
    }
});

ws.on('close', function close() {
    process.exit()
});

nshmyrev commented 10 months ago

The audio format is probably wrong. It must be strictly 16 kHz, 16-bit, mono. If the format is wrong, you get garbage.

For Chinese, FunASR should have the best accuracy: https://github.com/alibaba-damo-academy/FunASR/tree/main

seepine commented 10 months ago

I converted the file with ffmpeg, so it should be strictly 16 kHz, 16-bit, mono:

ffmpeg -i "input.m4a" -ac 1 -ar 16000 -acodec pcm_s16le test-ja.wav
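
To double-check what a converted file actually contains, the WAV header can be inspected before streaming. The following is a minimal sketch, assuming a well-formed little-endian RIFF/WAVE file; `readWavFormat` and `isVoskFriendly` are hypothetical helper names, not part of vosk or the `ws` package.

```javascript
// Sketch: locate the "fmt " chunk of a RIFF/WAVE buffer and report its format.
// Assumes a well-formed little-endian RIFF file; helper names are hypothetical.
function readWavFormat(buf) {
    if (buf.length < 12 ||
        buf.toString('ascii', 0, 4) !== 'RIFF' ||
        buf.toString('ascii', 8, 12) !== 'WAVE') {
        throw new Error('not a RIFF/WAVE file');
    }
    let offset = 12;
    // Walk the chunk list until the "fmt " chunk is found.
    while (offset + 8 <= buf.length) {
        const id = buf.toString('ascii', offset, offset + 4);
        const size = buf.readUInt32LE(offset + 4);
        if (id === 'fmt ') {
            if (offset + 8 + 16 > buf.length) {
                throw new Error('truncated fmt chunk');
            }
            return {
                audioFormat: buf.readUInt16LE(offset + 8),    // 1 means PCM
                channels: buf.readUInt16LE(offset + 10),
                sampleRate: buf.readUInt32LE(offset + 12),
                bitsPerSample: buf.readUInt16LE(offset + 22),
            };
        }
        // Chunk bodies are padded to an even number of bytes.
        offset += 8 + size + (size % 2);
    }
    throw new Error('fmt chunk not found');
}

// True when the file is PCM, mono, 16-bit, at the expected sample rate.
function isVoskFriendly(fmt, expectedRate = 16000) {
    return fmt.audioFormat === 1 && fmt.channels === 1 &&
        fmt.bitsPerSample === 16 && fmt.sampleRate === expectedRate;
}
```

For example, `isVoskFriendly(readWavFormat(fs.readFileSync('test-ja.wav')))` should return `true` for a file produced by the ffmpeg command above. This only confirms the file format; it says nothing about the sample rate the server itself expects.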

seepine commented 10 months ago

@nshmyrev Hi, is it a problem with the model? The demo at https://alphacephei.com/cn/ accurately recognizes what I say, but the 1.3GB model I downloaded from https://alphacephei.com/vosk/models/vosk-model-cn-0.22.zip does not work well.

nshmyrev commented 10 months ago

You can also try 8 kHz; maybe your vosk-server is configured for 8 kHz.

seepine commented 10 months ago

> You can also try 8 kHz; maybe your vosk-server is configured for 8 kHz.

Yes, it only works at 8 kHz.

nshmyrev commented 10 months ago

@seepine You'd better reconfigure both to 16 kHz; accuracy will be better. Use the VOSK_SAMPLE_RATE environment variable for the server.
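
On the client side, the sample rate can also be announced explicitly before any audio is sent, which keeps client and server from silently disagreeing. The following is a minimal sketch, assuming the server accepts a `{"config": {"sample_rate": N}}` text message the way the vosk-server example clients send it. `sendAudio` is a hypothetical helper, and `ws` is any object with a `send()` method, such as an open client from the `ws` package.

```javascript
// Sketch: announce the sample rate, stream the audio in chunks, then signal EOF.
// Assumption: the server honors a {"config": {"sample_rate": N}} text message.
// `sendAudio` is a hypothetical helper; `ws` only needs a send() method.
function sendAudio(ws, audio, sampleRate, chunkSize = 8000) {
    // Tell the server which rate the following audio uses.
    ws.send(JSON.stringify({ config: { sample_rate: sampleRate } }));
    // Send the audio bytes in fixed-size binary chunks.
    for (let start = 0; start < audio.length; start += chunkSize) {
        ws.send(audio.subarray(start, start + chunkSize));
    }
    // Same end-of-stream marker as in the client posted above.
    ws.send('{"eof" : 1}');
}
```

Called from the `open` handler as `sendAudio(ws, fs.readFileSync('test-cn.wav'), 16000)`, this replaces the read-stream logic in the original client. The rate passed here should match both the audio file and the server configuration.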

alphacep / vosk-api

How to improve the accuracy of speech recognition? #1431