flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Python "Hello World.wav" example #614

Open abhaygargab opened 4 years ago

abhaygargab commented 4 years ago

Thank you for providing such a thorough guide to installing the python bindings for wav2letter. I have been trying to follow https://github.com/facebookresearch/wav2letter/wiki/Python-bindings but could not work out how to evaluate on a test file. Could you please provide a python script that goes through all the steps, including loading a pretrained model, reading test files and generating the transcription?

tlikhomanenko commented 4 years ago

Hi @abhaygargab,

Thanks, your feedback is very helpful!

Here is an example of how we construct the decoder and call it to generate transcriptions for the input data: https://github.com/facebookresearch/wav2letter/blob/master/bindings/python/examples/decoder_example.py (feedback about the comments/docs in this file is welcome). To run this example:

cd wav2letter/bindings/python
python examples/decoder_example.py ../../src/decoder/test

To prepare your own data you need to do the following steps:

After this you are ready to run decoder_example.py and get transcriptions.

Keep in mind that this decoder python binding will only work with wav2letter CTC/ASG models; we don't support seq2seq decoding for now. You can also use any model trained in a python framework: you just need to prepare T, N, the emissions and the transitions in the right format for the decoder (numpy arrays of specific shapes), as sketched below.
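
For an AM trained in python, a minimal sketch of that preparation could look like the following (log_probs and the attributes of the decode results are assumptions for illustration only; the decode call mirrors the pattern used in decoder_example.py and may need adjusting to the current bindings):

import numpy as np

# assumption: "log_probs" is the per-frame output of any python-trained AM
# (T frames x N tokens), e.g. model(features).detach().cpu().numpy() for pytorch;
# "decoder" is constructed exactly as in decoder_example.py
emissions = np.ascontiguousarray(log_probs, dtype=np.float32)  # decoder expects float32
T, N = emissions.shape
results = decoder.decode(emissions.ctypes.data, T, N)          # hypotheses sorted by score
print(results[0].score)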

luweishuang commented 4 years ago

I would like a python audio input example too.

tlikhomanenko commented 4 years ago

@luweishuang could you be more precise about what example you need and what is missing from my description above?

luweishuang commented 4 years ago

@tlikhomanenko perhaps my statement is not clear. Actually I want to know how to use an audio file as input to bindings/python and get an ASR result directly in the python environment, without using "sampleId.bin".

tlikhomanenko commented 4 years ago

@luweishuang Are you training the AM with w2l, or do you train in python with pytorch/tensorflow?

We don't have python support for AM training or for converting a model from the w2l bin format into a python format. So you can only serialize predictions from C++ and then load them in python (as I described above, or by writing your own script that calls forward and dumps the emission matrix). If your AM model is trained in python then you just run forward, obtain the emission matrix and use it in the decoder.

luweishuang commented 4 years ago

@tlikhomanenko I trained the AM model using wav2letter and got "sampleId.bin" by running build/Test. Because Test does greedy-path decoding and I want to do lexicon-free beam-search decoding, I built LexiconFreeDecoder into bindings/python and call it in python as in the attached decoder.zip. I get an incomprehensible result such as "# score=1.0493672130552311e+40 prediction='# 锜 韜 兴 芹 酋 豁 摔 嶝 課 隊 嗆 邸 投 蛭 讥 酸 荇 蹒 峇 胗 珈 升 纂 骹 偏 酋 痔 贖 謠 耗 務 賀 起 杰 卒 酋 璘 鉅 課 寮 莴 #'". Is anything wrong in my py script? Thanks.

tlikhomanenko commented 4 years ago

One thing I see is that the silence index and the blank index are not set correctly (since you have the CTC criterion); it should be:

sil_idx = token_dict.get_index("|")
blank_idx = token_dict.get_index("#")

# and then
decoder = LexiconFreeDecoder(opts, lm, sil_idx, blank_idx, transitions)

So the output will contain a lot of blanks, and to get the final transcription from the token sequence you need to first remove repetitions and then remove blanks.
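
A minimal sketch of that post-processing on the decoded token index sequence (with blank_idx as set above):

def ctc_collapse(tokens, blank_idx):
    # remove consecutive repetitions first, then drop the blank tokens
    collapsed = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    return [t for t in collapsed if t != blank_idx]

# e.g. with blank_idx = 3: [5, 5, 3, 5, 3, 3, 7] -> [5, 5, 7]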

Also, we use the tokens as the dictionary for the LM, see https://github.com/facebookresearch/wav2letter/blob/master/Decode.cpp#L216 (if your lexicon file is the same as the token dict then it is fine and you can simplify the code, but if the lexicon has a different ordering, this is wrong).

Let me know if this fixes your issue.

luweishuang commented 4 years ago

@tlikhomanenko I did as you suggested and it doesn't solve my problem. I have decided to debug wav2letter/Decoder.cpp first, and once I get a correct result from it I will debug bindings/python again.

tlikhomanenko commented 4 years ago

# corresponding to struct EmissionUnit, now read the emissions
emissions = numpy.frombuffer(raw_data[:T * (N + 1) * 4], dtype=numpy.float32)

Update: it should be

emissions = numpy.frombuffer(raw_data[:T * N * 4], dtype=numpy.float32)

lds-gt commented 4 years ago

If I've understood it right: there is no possibility to use the available models (e.g. from here, using acoustic_model.bin, tds_streaming.arch, decoder_options.json, feature_extractor.bin, language_model.bin, lexicon.txt, tokens.txt) in the python bindings? It would make for a clean pipeline to use such an inference model (just its describing data) from the python bindings. Is it possible that there will be some extensions in this direction in the future?

tlikhomanenko commented 4 years ago

@lds-gt could you specify what you want to do?

If you have a w2l model then the inference pipeline from python is simple to set up, see "How can I use this from Python" in https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples. You don't even need the python bindings for that.

lds-gt commented 4 years ago

Thank you for the advice. I have already seen the Inference-Run-Examples, but I was thinking of something different. An already trained w2l model can be described by a set of binary and architecture files; in the inference examples there is a set of descriptive files (acoustic_model.bin, tds_streaming.arch, decoder_options.json, feature_extractor.bin, language_model.bin, lexicon.txt, tokens.txt). From my point of view it would be nice to use w2l, written in C++, as the backend but to fully drive the transcription with an arbitrary architecture (given by a set of already trained w2l files) from python, without using subprocesses as shown in the examples. Is something like that planned for the future?

tlikhomanenko commented 4 years ago

Your example is still not clear to me. Do you have an idea of the interfaces/functions in C++ and python that you are considering?

dmzubr commented 4 years ago

I suppose you mean a high-level abstraction above inference, something like the following.

from w2l import W2lInference

w2l_inference_obj = W2lInference(am_file, arch_file, decoder_options, <[other inference artifacts]>)
w2l_inference_obj.init()
# ...
# We should have an option to keep the model in memory to avoid any delays when the service is called
# ...
# Here we finally receive file data to transcribe, so perform the transcribe operation
transcriptions = w2l_inference_obj.transcribe(files_paths=[...])
# Or with raw audio data
transcriptions = w2l_inference_obj.transcribe(files_bytes=[[], []])

It would be convenient to use that kind of approach behind a REST service, for example: ten lines of code with Flask and we get an almost out-of-the-box REST transcription service with a w2l backend.
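
For illustration only, a sketch of such a service assuming the hypothetical W2lInference class from the pseudo code above (neither the class nor the file names exist in w2l today):

from flask import Flask, request, jsonify
from w2l import W2lInference  # hypothetical module from the proposal above

app = Flask(__name__)
engine = W2lInference("acoustic_model.bin", "tds_streaming.arch", "decoder_options.json")
engine.init()  # load the model once and keep it in memory

@app.route("/transcribe", methods=["POST"])
def transcribe():
    audio_bytes = request.get_data()  # raw audio in the request body
    text = engine.transcribe(files_bytes=[audio_bytes])[0]
    return jsonify({"transcription": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)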

Such an option would be great. The purpose is to hide all the boilerplate of the initialization process in this theoretical module. In general the inference module is also a good place to provide this functionality, but right now it requires converting trained models, and there is no ASG option, for example. So there are still some limitations.

tlikhomanenko commented 4 years ago

@dmzubr

What is the problem then with just writing a simple wrapper over "How can I use this from Python" in https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples?

The models will in any case be in the binary C++ format, so you could not use pytorch models with the example you showed.

About limitations: we have plans to extend inference for other models too in the future.

cc @avidov @vineelpratap

dmzubr commented 4 years ago

Generally there is no problem with your suggestion to develop a custom example like any of those here: https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples.

I didn't mean that we need models compatible with other runtimes (torch, for example). It's obvious that the infrastructure of the w2l runtime differs a lot, and such incompatibility is entirely reasonable given the speed of the w2l runtime. :)

I mean another aspect. Usually a project's Python bindings aim to make the entry threshold a little lower (versus C++), and this is reasonable for w2l too. But right now the bindings only support the research functionality of the framework; by "research" I mean features related to analysing intermediate artifacts of the inference process.

And, IMO, it would be great to have a Python API that makes it possible to use the w2l backend with a couple of lines of code. Why not C++? Simply to lower the entry threshold for using the framework for inference.

Whether the framework actually needs this or not is a decision for the contributors, of course...

Anyway, thanks to all of your team for the great and efficient project!

tlikhomanenko commented 4 years ago

@dmzubr thanks for clarification!

I agree on the point of lowering the entry threshold for using the framework. However, there has to be a balance between simplicity and flexibility. In particular, for the inference pipeline we have discussed a python wrapper: there could be many different implementations depending on what you want at the end, such as just a wrapper, a web service, or something else. That is why we kept it as "How can I use this from Python" in https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples, so people have the flexibility to build any particular wrapper on top, depending on their pipeline.

From my experience of using C++ libs from python (packages like ffmpeg and other processing tools), I think exposing them as a command line process that takes input from stdin and returns output on stdout is the simplest design, with huge flexibility to plug it into any pipeline one might have.
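
As an illustration of that design, a minimal sketch of such a wrapper (the binary name and the way it consumes stdin are placeholders; substitute the actual inference example binary and flags from the Inference-Run-Examples page):

import subprocess

def transcribe_file(audio_path, asr_binary="./streaming_asr_example", extra_args=()):
    # feed the input to the C++ binary on stdin and read the transcription from stdout
    proc = subprocess.run(
        [asr_binary, *extra_args],
        input=audio_path + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.strip()

print(transcribe_file("hello_world.wav"))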

We don't use python much in our research, so it is hard to say which design would be most appropriate here. But we always welcome suggestions and pull requests :), especially on the python side, since people use it more often than we do.

So your suggestion for the API is the W2lInference pseudo code above (construct it once with the model artifacts, call init(), then call transcribe() on file paths or raw audio data), right? Could you add more comments on your use-case (web service, running locally, etc.)? Any additional detail on the exact API you have in mind would be helpful.

dmzubr commented 4 years ago

Sure, the command line process with stdin/stdout is the most common case!

Yes, the pseudo code above describes my vision of the target state. To make this suggestion more helpful and useful, the API needs to be designed in detail, so I suggest creating a separate issue with the details on this topic.

From the point of view of using w2l in a real production environment: right now I have a C# wrapper on top of the Decode sample, and it works perfectly with a custom arch trained on a custom Russian dataset with a WER of about 10. But the specifics of the current project do not require ad-hoc transcription. Anyway, it is a .NET Core service in Docker that receives a transcription "job" from AMQP and returns the transcription result back to AMQP. This approach is not so easy to adapt to specific tasks in other usage contexts.

But I suppose the most common and easiest way to use w2l as an out-of-the-box service is a REST interface, in the form factor of a docker container. So I'll describe this in detail in the issue with the proposed Python bindings improvements. In general, the approach of the nvidia NeMo ASR service example seems very convenient.

Btw, Python is not my first language either, so my vision may differ from best practices there.

tlikhomanenko commented 4 years ago

@dmzubr

Thanks for your comments and for sharing your thoughts and experience! Yep, let's create another issue so others can provide their input / API suggestions on this too, so we can consider it in the future, or people may even contribute pull requests.

light42 commented 3 years ago

@tlikhomanenko My LexiconDecoder raises an error when I run decoder = LexiconDecoder(options, trie, lm, sil_idx, blank_idx, unk_idx, transitions, is_token_lm).

When I checked the error, it said the transitions parameter expects a List[float]. When I replace it with a one-dimensional list of any size it works, but I'm still not sure what the correct value for the transitions variable should be.

Also, you said that wav2letter doesn't support a seq2seq decoder yet in the python bindings. Since I trained my model using seq2seq_tds, does that mean LexiconDecoder can't use my model for decoding?

tlikhomanenko commented 3 years ago

Hey @light42!

For transitions some list is expected; it can be empty. This parameter mainly corresponds to ASG-trained models, where you have a trained transition matrix between each pair of tokens; if you have some prior you can use that too.
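
For example (a sketch; asg_transitions is a placeholder for however you load the trained matrix from your model):

import numpy as np

# for a CTC-trained model no transition scores are used, so an empty list is fine
transitions = []

# for an ASG-trained model, flatten the trained N x N transition matrix into a plain
# list of floats ("asg_transitions" is a placeholder for however you load it):
# transitions = np.asarray(asg_transitions, dtype=np.float32).flatten().tolist()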

About seq2seq: yes, we don't support it, because you need to run the forward pass of the AM decoder network. So if you trained an s2s tds model you cannot run decoding from python.

philipag commented 3 years ago

@dmzubr Could you share your C# wrapper? There are probably lots of people, including myself, who would like to do something similar.

dmzubr commented 3 years ago

Hello! The current implementation is compatible only with the old w2l version (before the migration to the flashlight repository) and depends heavily on our communication infrastructure (RabbitMQ, EasyNetQ).

I'm working on a version that will be compatible with the current main version (in flashlight) and may publish it as a NuGet package or in some other form.

I need to think about it.

I will ping you with any news on this.