elan-ev / vosk-cli

Apache License 2.0

Recasepunc #9

Open vJan00 opened 2 years ago

vJan00 commented 2 years ago

This branch adds a method to vosk-cli to load and use punctuation models. The model (checkpoint file) must be placed in a directory named after the matching three-character country code followed by -punctuation, in the same location as the language model.

Example: /usr/share/vosk/language/***-punctuation

In particular, it:

vJan00 commented 2 years ago

I'm working on a solution to the things you mentioned :)

vJan00 commented 2 years ago

I reworked almost all of the things you mentioned, @lkiesow. Maybe give it a go, and if you find something else, just tell me :)

vJan00 commented 2 years ago

~/videos/sintel_trailer-1080p.mp4

Could you send me the file? I can't find one in English or German that causes such an error.

Linting also still complains about some things.

So far everything is fixed. One import must be ignored by the linter, because it is needed by the transcribe.py file but has to be in __init__.py. Otherwise it does not work.

Related to this, I'm also wondering if we really want to copy this file or if we want to work with upstream to get it packaged in pypi (if it isn't already) and just include it as a dependency. Do you have any thoughts on that or any reasoning why you went for copying it?

It was the easiest way, just as a Python module. I don't think upstream intends to publish this on PyPI in the future, unless we take care of it ourselves.

Trying to use this, this threw me off. Why modify the path users specified using the command line parameter -p?

Because the model_path method looks for a folder and not a file, and the files are all named checkpoint.
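To illustrate the folder-versus-file point, here is a minimal sketch of how a checkpoint path could be derived from a model folder. The function name punctuation_checkpoint and the example code 'eng' are hypothetical and only for illustration; the -punctuation suffix and the checkpoint file name follow what is described in this thread.

```python
from pathlib import Path


def punctuation_checkpoint(base_dir, code):
    """Return the checkpoint file inside the '<code>-punctuation' folder.

    Hypothetical helper: `base_dir` is the folder that also holds the
    language model, `code` is the three-character code of the model.
    """
    model_dir = Path(base_dir) / f"{code}-punctuation"
    # The lookup targets a folder, so the fixed file name is appended here.
    return model_dir / "checkpoint"


# 'eng' is an assumed example value, not taken from the thread.
path = punctuation_checkpoint("/usr/share/vosk/language", "eng")
```

Users would then only ever pass the folder, and the file name stays an implementation detail.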

lkiesow commented 2 years ago

You will find the media file at: https://data.lkiesow.io/opencast/test-media/ I used the vosk-model-en-us-0.22 and https://github.com/benob/recasepunc/releases/download/0.3/en.23000

one import must be ignored, because this is needed in the transcribe.py file but must be in __init__.py. Otherwise it does not work.

That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.

vJan00 commented 2 years ago

You will find the media file at: https://data.lkiesow.io/opencast/test-media/

Seemingly fixed in the meantime.

That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.

Yes, it's a general problem with the model loader that recasepunc uses.

For explanation: recasepunc uses Torch to load models. Torch uses the unpickler, which tries to dynamically find classes in the module whose name is saved in the checkpoint file. All checkpoint files provided by Alpha Cephei are built with the __main__ module saved in them, and that is the module (no matter what) where the unpickler looks for the import. Because of distutils and our setup.py, the __main__ module is auto-generated and cannot be customised for imports. Just look into the virtual environment that runs vosk-cli and you will find it there. So the only workaround I found was to swap the __main__ module while running transcribe.py and to insert the import into the __init__.py of the referenced module (in this case voskcli).

That's what these lines are doing (recasepunc loads the model when calling CasePuncPredictor):

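# Temporarily let voskcli stand in for __main__ so the unpickler finds the classes
old_main = sys.modules['__main__']
sys.modules['__main__'] = voskcli
predictor = CasePuncPredictor(punc + '/checkpoint')
# Restore the real __main__ module afterwards
sys.modules['__main__'] = old_main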
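The same swap can be sketched as a context manager, so that __main__ is restored even if loading raises. This is only an illustration under assumptions: it uses plain pickle instead of Torch and recasepunc, and the names main_module_as, Config, training_script and loader_package are hypothetical stand-ins (loader_package plays the role of voskcli).

```python
import contextlib
import pickle
import sys
import types


@contextlib.contextmanager
def main_module_as(module):
    """Temporarily make `module` answer to the name '__main__'."""
    old_main = sys.modules['__main__']
    sys.modules['__main__'] = module
    try:
        yield
    finally:
        # Restore the real __main__ even if loading fails.
        sys.modules['__main__'] = old_main


# Hypothetical stand-in for a class defined in a training script.
class Config:
    def __init__(self, lang):
        self.lang = lang


# pickle records this module name next to the class name.
Config.__module__ = '__main__'

# Stand-in for the script that wrote the checkpoint.
training_script = types.ModuleType('training_script')
training_script.Config = Config
with main_module_as(training_script):
    payload = pickle.dumps(Config('en'))

# Stand-in for the package that loads it and re-exports the class.
loader_package = types.ModuleType('loader_package')
loader_package.Config = Config
with main_module_as(loader_package):
    restored = pickle.loads(payload)
```

The unpickler resolves '__main__' through sys.modules at load time, which is why exposing the class on the substituted module is enough for the lookup to succeed.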