espnet / espnet_onnx

Onnx wrapper for espnet inference model
MIT License

Decoding speed and accuracy on the transformed onnx model #42

Open yangyi0818 opened 2 years ago

yangyi0818 commented 2 years ago

Hi, thanks for sharing the espnet_onnx system!

I ran into two problems when I tried to run inference with your code. The acoustic model was trained by myself on our own dataset; the AM architecture is a typical Conformer. I downloaded this code in June.

First, decoding is too slow. When decoding with torch, the RTF is around 2.32; with the exported onnx model it becomes around 20.

Second, the CER with the torch version is 7.8%, while with onnx it becomes 10.6%, which seems wrong.

Here are my scripts and configs:

export.py

import sys
sys.path.append('espnet-master')
sys.path.append('espnet-master/espnet_tts_frontend-master')
sys.path.append('espnet_onnx-master/espnet_onnx/export/asr')
import torch

from export_asr import ModelExport
from espnet2.bin.asr_inference import Speech2Text

if __name__ == '__main__':
    m = ModelExport(cache_dir=sys.argv[5])

    # export from the trained model: build the torch Speech2Text instance first
    speech2text = Speech2Text(
        asr_train_config=sys.argv[1],
        asr_model_file=sys.argv[2],
        lm_train_config=sys.argv[3],
        lm_file=sys.argv[4],
    )

    # convert to onnx and also produce a quantized copy
    m.export(model=speech2text, tag_name='speech2text', quantize=True)
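
For reference, following the sys.argv indices above, the script is invoked as: python export.py <asr_train_config> <asr_model_file> <lm_train_config> <lm_file> <cache_dir> (the angle-bracket names are descriptive placeholders, not actual file names).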

And I get an onnx dir structured like:

asr/onnx/speech2text/
    config.yaml
    feats_stats.npz
    full/
    quantize/

The test wavs are given as a filelist, structured as:

bigfar_001_000001 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000001.wav
bigfar_001_000002 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000002.wav
bigfar_001_000003 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000003.wav
bigfar_001_000004 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000004.wav
bigfar_001_000005 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000005.wav
bigfar_001_000006 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000006.wav
...

The decoding process is:

decode.py

import sys
sys.path.append('espnet_onnx-master/espnet_onnx/asr')

import time
import threading
import librosa
import os
from tqdm import tqdm
from asr_model import Speech2Text

if __name__ == '__main__':
    # step1: load onnx model
    speech2text = Speech2Text(tag_name='speech2text', model_dir=sys.argv[3])

    # step2: ASR over the filelist
    f = open(sys.argv[1])
    lines = f.readlines()
    for line in tqdm(lines):
        with open(os.path.join(sys.argv[2], 'hyp_flush_1process.trn'), 'a') as fout:
            wav_name = line.split(' ')[0].strip()
            processing_wav = line.split(' ')[1].strip()

            start = time.time()
            y, sr = librosa.load(processing_wav, sr=16000)
            nbest = speech2text(y)
            asr_result = nbest[0][0]
            end = time.time()

            # write the hypothesis in trn format: "tokens<TAB>(utt-utt)"
            for j in range(len(asr_result)):
                fout.write(asr_result[j])
                if j != len(asr_result) - 1:
                    fout.write(' ')
            fout.write('\t')
            fout.write('(')
            fout.write(wav_name)
            fout.write('-')
            fout.write(wav_name)
            fout.write(')')
            fout.write('\n')

            print('processing: ', processing_wav)
            print('Result:     ', asr_result)
            print('Time:       ', end - start, 's')
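
As a side note on the RTF numbers, here is a minimal sketch of how the overall RTF could be computed over the filelist as total decoding time divided by total audio duration; it reuses lines and speech2text exactly as defined in decode.py above, and the accumulator names are just illustrative:

import time
import librosa

total_audio_sec = 0.0   # accumulated duration of the processed audio
total_decode_sec = 0.0  # accumulated wall-clock decoding time

for line in lines:
    wav_name, wav_path = line.strip().split(maxsplit=1)
    y, sr = librosa.load(wav_path, sr=16000)
    total_audio_sec += len(y) / sr

    start = time.time()
    nbest = speech2text(y)
    total_decode_sec += time.time() - start

print('RTF:', total_decode_sec / total_audio_sec)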

Furthermore, I noticed you mentioned in a recent issue that there may be some problems with the Conformer AM for ASR; has that been fixed?

Looking forward to your reply!

Masao-Someki commented 2 years ago

Hi @yangyi0818, thank you for reporting the issue! About the first point, I would like to know the following information:

- What is your device? CPU or GPU?
- Am I right that your model was constructed with Conformer encoder and Transformer decoder?
- Did you use LM for the inference?
- There are two Conformer blocks in ESPnet, the legacy and the latest versions. Which block did you use?
- I see quantization is applied to your model. Did you execute your quantized model on GPU?

And about the second point, I would like to know the following information:

The latest Conformer-related issue is not yet fixed, and I'm trying to solve it!

yangyi0818 commented 2 years ago

Hi @Masao-Someki! Thank you for your kind reply! Here are my answers.

About the first point:

Q: What is your device? CPU or GPU?
A: CPU.

Q: Am I right that your model was constructed with Conformer encoder and Transformer decoder?
A: Yes.

Q: Did you use LM for the inference?
A: Yes, it is a Transformer-structured LM.

Q: There are two Conformer blocks in ESPnet, the legacy and the latest versions. Which block did you use?
A: Our AM was trained last year, so it is probably the legacy one.

Q: I see quantization is applied to your model. Did you execute your quantized model on GPU?
A: It is true that I set quantize=True in export.py, but I have only tried the unquantized model on CPU.

About the second point: yes, I checked the weights and I also tried different configurations, but it didn't seem to help much. Here are the results:

weights: {ctc: 0.3, decoder: 0.7, length_bonus: 0.0, lm: 0.3}  # cer=10.8% (this is the same configuration as inference with torch)
weights: {ctc: 0.3, decoder: 0.7, length_bonus: 0.0, lm: 1.0}  # cer=10.8%
weights: {ctc: 0.3, decoder: 1.0, length_bonus: 0.0, lm: 0.1}  # cer=11.6%
weights: {ctc: 0.5, decoder: 0.5, length_bonus: 0.0, lm: 1.0}  # cer=10.7%
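
To narrow down where the torch/onnx gap comes from, one option is to decode the same utterance with both backends and compare the hypotheses directly. Below is a minimal sketch under the assumption that espnet_onnx is importable as a package (otherwise reuse the sys.path setup from decode.py); the config, model, and wav paths are placeholders for the files passed to export.py:

import librosa
from espnet2.bin.asr_inference import Speech2Text as TorchSpeech2Text
from espnet_onnx import Speech2Text as OnnxSpeech2Text

# Placeholder paths: substitute the same files that were given to export.py
torch_s2t = TorchSpeech2Text(
    asr_train_config='asr_train_config.yaml',
    asr_model_file='asr_model.pth',
    lm_train_config='lm_train_config.yaml',
    lm_file='lm.pth',
)
onnx_s2t = OnnxSpeech2Text(tag_name='speech2text')

# Decode one utterance with both backends and print the 1-best hypotheses
y, sr = librosa.load('sample.wav', sr=16000)
print('torch:', torch_s2t(y)[0][0])
print('onnx :', onnx_s2t(y)[0][0])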

Masao-Someki commented 2 years ago

Thank you! About the RTF, it may be a problem with the frontend process. If you are using the default frontend, which contains stft and logmel, is it possible to check the performance difference between the torch frontend and the onnx frontend? I recently found that espnet_onnx's frontend is a bit slower than the ESPnet version, and I am now considering converting this whole process to onnx. If the frontend is causing this problem, I will have to do that quickly.
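
As a rough way to get the torch side of that comparison, the espnet2 default frontend (stft + logmel) can be timed on its own. A minimal sketch, assuming a 16 kHz model and default frontend settings (n_fft, hop_length, n_mels would need to match the training config); the wav path is a placeholder, and the onnx-side frontend could be timed the same way once pulled out of the espnet_onnx Speech2Text object:

import time
import torch
import librosa
from espnet2.asr.frontend.default import DefaultFrontend

# Load one utterance and wrap it as a (batch, samples) tensor
y, sr = librosa.load('sample.wav', sr=16000)
speech = torch.from_numpy(y).unsqueeze(0)
lengths = torch.tensor([speech.shape[1]])

# Default stft + logmel frontend; override n_fft/hop_length/n_mels to match training
frontend = DefaultFrontend(fs=16000)

start = time.time()
feats, feats_lens = frontend(speech, lengths)
print('torch frontend:', time.time() - start, 's for', len(y) / sr, 's of audio')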

rookie0607 commented 2 years ago


What is your torch version?

yangyi0818 commented 2 years ago

Hi @rookie0607, my torch version is 1.7.1 and my onnx version is 1.7.0.

joazoa commented 2 years ago

In relation to the slow speed, can you check how many cores are loaded when you run inference with onnx? I suspect it could be related. @Masao-Someki I notice that all CPU cores are in use when I do CPU inference. Is there a way to avoid this other than setting taskset 1? I tried export OMP_NUM_THREADS=1 but had no luck.

Masao-Someki commented 2 years ago

@joazoa You can limit the number of threads with onnxruntime's session options (inter_op_num_threads and intra_op_num_threads).

Currently, there is no script to limit the number of threads in espnet_onnx, so you may need to modify the inference code like this:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.inter_op_num_threads = 1  # threads across independent operators
sess_options.intra_op_num_threads = 1  # threads within a single operator

# pass sess_options where the InferenceSession is created inside espnet_onnx
self.encoder = ort.InferenceSession(
    self.config.quantized_model_path,
    providers=providers,
    sess_options=sess_options,
)
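
For reference, intra_op_num_threads controls parallelism inside a single operator and inter_op_num_threads controls parallelism across independent operators, so setting both to 1 keeps each session essentially single-threaded. The same sess_options would presumably need to be passed to every InferenceSession that espnet_onnx creates (decoder, ctc, lm, and so on), not only the encoder shown above.
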
joazoa commented 2 years ago

@Masao-Someki thank you!