ESPnet without PyTorch!
Utility library to easily export, quantize, and optimize ESPnet models to the ONNX format. There is no need to install PyTorch or ESPnet on your machine if you already have exported files!
A demonstration notebook is now available on Google Colab!
espnet_onnx can be installed with pip:
pip install espnet_onnx
To export models yourself, you additionally need torch>=1.11.0, espnet, espnet_model_zoo, and onnx.
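A minimal sketch of installing those extras with pip (exact versions other than torch are not pinned by this README):
pip install 'torch>=1.11.0' espnet espnet_model_zoo onnx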
onnx==1.12.0 might cause some errors. If you get an error during inference or export, please consider downgrading the onnx version.
To set up a development environment from source:
git clone git@github.com:espnet/espnet_onnx.git
cd tools
make venv export
. tools/venv/bin/activate
# Please refer to the official PyTorch installation guide.
pip install torch
cd tools
git clone https://github.com/s3prl/s3prl
cd s3prl
pip install .
cd tools
git clone --single-branch --branch espnet_v1.1 https://github.com/b-flo/warp-transducer.git
cd warp-transducer
mkdir build
# Please set WITH_OMP to ON or OFF.
cd build && cmake -DWITH_OMP="ON" .. && make
cd pytorch_binding && python3 -m pip install -e .
If you want to work on the optimization, you also need a development build of onnxruntime. Please clone the onnxruntime repository.
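For example (this is the standard upstream onnxruntime repository; build instructions are in its own documentation):
git clone https://github.com/microsoft/onnxruntime.git
cd onnxruntime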
Since espnet==202308 (the latest at the v0.2.0 release) requires protobuf<=3.20.1 while the latest onnx requires protobuf>=3.20.2, the installation may fail.
In that case, install espnet==202308 first, update protobuf to 3.20.3, and then install the other libraries.
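A minimal sketch of that installation order (the last line is one plausible set of "other libraries"; adjust it to what you actually need):
pip install espnet==202308
pip install protobuf==3.20.3
pip install onnx espnet_model_zoo espnet_onnx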
espnet_onnx can export pretrained models published on espnet_model_zoo. By default, exported files are stored in ${HOME}/.cache/espnet_onnx/<tag_name>.
from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ASRModelExport
m = ASRModelExport()
# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('<tag name>', quantize=True)
# export from trained model
speech2text = Speech2Text(args)
m.export(speech2text, '<tag name>', quantize=True)
You can also export a model from a zipped file; the archive has to contain a meta.yaml.
from espnet_onnx.export import ASRModelExport
m = ASRModelExport()
m.export_from_zip(
'path/to/the/zipfile',
tag_name='tag_name_for_zipped_model',
quantize=True
)
Export options can be adjusted with set_export_config; for example, to change the maximum sequence length:
from espnet_onnx.export import ASRModelExport
m = ASRModelExport()
# Set maximum sequence length to 3000
m.set_export_config(max_seq_len=3000)
m.export_from_zip(
'path/to/the/zipfile',
tag_name='tag_name_for_zipped_model',
)
You can also optimize your model by passing the optimize option. If you want to fully optimize your model, you need to install the custom version of onnxruntime from here. Please read this document for more details.
from espnet_onnx.export import ASRModelExport
m = ASRModelExport()
m.export_from_zip(
'path/to/the/zipfile',
tag_name='tag_name_for_zipped_model',
optimize=True,
quantize=True
)
The export can also be run from the command line:
python -m espnet_onnx.export \
--model_type asr \
--input ${path_to_zip} \
--tag transformer_lm \
--apply_optimize \
--apply_quantize
tag_name or model_dir is used to load the onnx files. tag_name has to be defined in tag_config.yaml.
import librosa
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name='<tag name>')
# speech2text = Speech2Text(model_dir='path to the onnx directory')
y, sr = librosa.load('sample.wav', sr=16000)
nbest = speech2text(y)
For streaming ASR, use the StreamingSpeech2Text class. The length of each speech chunk has to equal StreamingSpeech2Text.hop_size.
from espnet_onnx import StreamingSpeech2Text
stream_asr = StreamingSpeech2Text(tag_name)
# start streaming asr
stream_asr.start()
while streaming:
wav = <some code to get wav>
assert len(wav) == stream_asr.hop_size
stream_text = stream_asr(wav)[0][0]
# You can get the non-streaming asr result with the end function
nbest = stream_asr.end()
You can also simulate the streaming model on a wav file with the simulate function. Passing True as the second argument prints the intermediate streaming results, as in the following code.
import librosa
from espnet_onnx import StreamingSpeech2Text
stream_asr = StreamingSpeech2Text(tag_name)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = stream_asr.simulate(y, True)
# Processing audio with 6 processes.
# Result at position 0 :
# Result at position 1 :
# Result at position 2 : this
# Result at position 3 : this is
# Result at position 4 : this is a
# Result at position 5 : this is a
print(nbest[0][0])
# 'this is a pen'
If you installed the custom version of onnxruntime, you can run the optimized model for inference. You don't have to change any of the code above; if the model was optimized, espnet_onnx automatically loads the optimized version.
You can also use only the hubert model as your frontend.
from espnet_onnx.export import ASRModelExport
# export your model
tag_name = 'ESPnet pretrained model with hubert'
m = ASRModelExport()
m.export_from_pretrained(tag_name, optimize=True)
# load only the frontend model
from espnet_onnx.asr.frontend import Frontend
frontend = Frontend.get_frontend(tag_name)
# use the model in your application
import librosa
import numpy as np
y, sr = librosa.load('wav file')
# y: (B, T)
# y_len: (B,)
feats = frontend(y[None,:], np.array([len(y)]))
If torch is installed in your environment, you can use the frontend during training.
from espnet_onnx.asr.frontend import TorchFrontend
frontend = TorchFrontend.get_frontend(tag_name) # load pretrained frontend model
# use the model while training
import librosa
import torch
y, sr = librosa.load('wav file')
# You need to place your data on GPU,
# and specify the output shape in tuple
y = torch.Tensor(y).unsqueeze(0).to('cuda') # (1, wav_length)
output_shape = (batch_size, feat_length, feats_dims)  # replace with your own batch/feature dimensions
feats = frontend(y, y.size(1), output_shape)
A TTS model can be exported in the same way with TTSModelExport:
from espnet2.bin.tts_inference import Text2Speech
from espnet_onnx.export import TTSModelExport
m = TTSModelExport()
# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('<tag name>', quantize=True)
# export from trained model
text2speech = Text2Speech(args)
m.export(text2speech, '<tag name>', quantize=True)
Inference with the exported TTS model:
from espnet_onnx import Text2Speech
tag_name = 'kan-bayashi/ljspeech_vits'
text2speech = Text2Speech(tag_name, use_quantized=True)
text = 'Hello world!'
output_dict = text2speech(text) # inference with onnx model.
wav = output_dict['wav']
Install the dependency.
First, you need the onnxruntime-gpu library instead of onnxruntime. Please follow this article to select and install the correct version of onnxruntime-gpu for your CUDA version.
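A minimal sketch of that swap (the exact onnxruntime-gpu version has to match your CUDA installation, as described in the article above):
pip uninstall -y onnxruntime
pip install onnxruntime-gpu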
Inference on GPU
Now you can speed up inference with the GPU. All you need to do is select the correct providers and pass them to the Speech2Text or StreamingSpeech2Text instance. See this article for more information about providers.
import librosa
from espnet_onnx import Speech2Text
PROVIDERS = ['CUDAExecutionProvider']
tag_name = 'some_tag_name'
speech2text = Speech2Text(
tag_name,
providers=PROVIDERS
)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = speech2text(y) # runs on GPU.
Note that some quantized models are not supported for GPU computation. If you get an error with a quantized model, please try the non-quantized model.
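A minimal sketch of that fallback, assuming Speech2Text accepts the same use_quantized flag shown for Text2Speech above:
import librosa
from espnet_onnx import Speech2Text
# Load the non-quantized model for GPU inference (use_quantized is an assumption here).
speech2text = Speech2Text(
    tag_name='some_tag_name',
    providers=['CUDAExecutionProvider'],
    use_quantized=False,
)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = speech2text(y)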
To avoid the cache problem, I modified some scripts from the original ESPnet implementation:
Added <blank> before <sos>.
Gave some torch.zeros() arrays to the model.
Removed the first token in post-processing (the blank token).
Replaced make_pad_mask with a new implementation that can be converted into onnx format (see the sketch after this list).
Removed extend_pe() from the positional encoding module. The length of pe is 512 by default.
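For the make_pad_mask change, here is a minimal, hypothetical sketch of an ONNX-friendly pad mask; it is not necessarily the exact implementation used in espnet_onnx, just an illustration of building the mask from arange and comparison ops that export cleanly.
import torch

def make_pad_mask(lengths: torch.Tensor, max_len: int = None) -> torch.Tensor:
    # Returns True at padded positions, built only from arange/comparison ops
    # so that the export does not hit data-dependent Python control flow.
    if max_len is None:
        max_len = int(lengths.max())
    seq_range = torch.arange(max_len, device=lengths.device)  # (max_len,)
    return seq_range.unsqueeze(0) >= lengths.unsqueeze(1)     # (batch, max_len)

# Example: make_pad_mask(torch.tensor([3, 1]), max_len=4) ->
# tensor([[False, False, False,  True],
#         [False,  True,  True,  True]])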
ASR: Supported architecture for ASR
TTS: Supported architecture for TTS
ASR: Developer's Guide
Copyright (c) 2022 Masao Someki
Released under the MIT license
Masao Someki
contact: masao.someki@gmail.com