alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Can't open model with non-ascii path on Windows #1072

Open rloibman opened 2 years ago

rloibman commented 2 years ago

Hi, when trying to create a model I'm having the following issue:

image

I'm using Jupyter Notebook, here is my full code:

import ffmpeg
import subprocess
import os
import sys
from moviepy.editor import VideoFileClip
from vosk import Model, KaldiRecognizer
import pyaudio

base_dir = r'C:\Users\JoséRubenLoibman\Downloads\Python\Aulas\Video_to_Text'
base_dir = base_dir.replace("\\", "/") + "/"
mp4_file = base_dir + "Galina_Loibman_Tape1.mp4"
mp3_file = base_dir + "Galina_Loibman_Tape1.mp3"

def convert_video_to_audio_moviepy(video_file, output_ext="mp3"):
    """Converts video to audio using MoviePy library
    that uses `ffmpeg` under the hood"""
    filename, ext = os.path.splitext(video_file)
    clip = VideoFileClip(video_file)
    clip.audio.write_audiofile(f"{filename}.{output_ext}")

if __name__ == "__main__":
    vf = sys.argv[1]
    convert_video_to_audio_moviepy(mp4_file)

model = Model(lang = "ru")

nshmyrev commented 2 years ago

You need to run the same code in a console.

rloibman commented 2 years ago

I see. Why can't I run it from code in Python? What is the command that I have to run in the console? Thanks!

nshmyrev commented 2 years ago

I see. Why can't I run it from code in Python?

Because you do not see debug messages

What is the command that I have to run in the console?

python3 test.py

rloibman commented 2 years ago

Ok, thanks!

I ran it in Anaconda Prompt and got this:

image

I tried again after renaming the folder of the model from "vosk-model-ru-0.22" to "ru", and I got this:

image

I guess there is some problem with the path where the model has to be saved, maybe because of my Windows username? It's "JoséRubenLoibman".

How can I fix that?

nshmyrev commented 2 years ago

This must be fixed on the C side.

nshmyrev commented 2 years ago

Meanwhile, you can load the model by path, using a pure ASCII path in the filesystem.
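
A minimal sketch of loading by path, assuming the Python binding's `Model` accepts a `model_path` argument and that the model archive has already been unpacked into an ASCII-only directory. The helper names and the example directory are hypothetical:

```python
def is_ascii_path(path) -> bool:
    """Return True when every character of the path is plain ASCII."""
    return str(path).isascii()


def load_model(model_dir):
    """Load a Vosk model from an explicit, ASCII-only directory.

    Assumes vosk is installed and that Model takes a model_path argument.
    """
    if not is_ascii_path(model_dir):
        raise ValueError(f"model path contains non-ASCII characters: {model_dir}")
    # Imported lazily so the path check above can be used without vosk installed.
    from vosk import Model
    return Model(model_path=str(model_dir))


# Hypothetical usage, after unpacking the model by hand:
# model = load_model("C:/vosk-models/vosk-model-ru-0.22")
```

Failing early on a non-ASCII path gives a readable Python error instead of a failure inside the native library.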

rloibman commented 2 years ago

Thanks for your response! But I didn't understand how I can load the model by path. Could you be more specific, please? Thanks again.

PeturDarriPeturs commented 1 year ago

I'm running into this issue as well. The workaround of just using a pure ASCII path works, but I'd like to be able to store the model in the AppData folder, which sits below the user folder, and the user folder name can contain non-ASCII characters.

PeturDarriPeturs commented 1 year ago

Looking at the source, it looks like this needs to be fixed in Kaldi itself, since Vosk is using Kaldi functions to read the files in the model. Alternatively, the file reads could be implemented in Vosk instead.

What are your thoughts on this @nshmyrev? This issue is sort of preventing me from deploying Vosk on Windows machines, because I can't guarantee the user name is pure ASCII and I want to avoid putting the model file in a root directory.

Should I open an issue in Kaldi instead?

PeturDarriPeturs commented 1 year ago

Another solution would be to allow loading models from memory instead of the filesystem, but looking at the source, that might be easier said than done.

nshmyrev commented 1 year ago

@PeturDarriPeturs it is a Vosk issue, not really a Kaldi one.

For a quick fix, you can set the cache path to a pure ASCII folder here:

https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L18

You can use something other than Path.home().
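
As a rough illustration of that quick fix, the sketch below picks the first ASCII-only directory from a list of candidates instead of relying on `Path.home()` alone. The function name and the candidate folders are assumptions made for illustration; how the chosen directory is wired into the linked line depends on the installed version of the package.

```python
from pathlib import Path


def pick_ascii_cache_dir(candidates):
    """Return the first candidate directory whose full path is pure ASCII, else None."""
    for candidate in candidates:
        if str(candidate).isascii():
            return Path(candidate)
    return None


# Hypothetical usage: try the per-user cache first, then a shared fallback folder.
# cache_dir = pick_ascii_cache_dir([
#     Path.home() / "AppData/Local/vosk",
#     Path("C:/Users/Public/Documents/vosk"),
# ])
```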

nshmyrev commented 1 year ago

Or we can encode the path to the native charset instead of UTF-8 here:

https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L52

nshmyrev commented 1 year ago

Something like https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding
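
To see why the encoding matters, here is a small sketch of what happens when UTF-8 bytes are read back as a legacy single-byte code page. It assumes the C side interprets narrow strings as cp1252, which is only one possible Windows ANSI code page, and the function name is made up for illustration. Note also that on Python 3.6 and later `sys.getfilesystemencoding()` normally reports `utf-8` on Windows, so the code page is passed in explicitly here.

```python
def as_seen_by_ansi_api(path: str, ansi_codec: str = "cp1252") -> str:
    """Show how a UTF-8 encoded path looks when its bytes are decoded as an ANSI code page."""
    utf8_bytes = path.encode("utf-8")  # the bytes a UTF-8 encoding binding would send
    # Bytes that are undefined in the code page are replaced rather than raising.
    return utf8_bytes.decode(ansi_codec, errors="replace")


# Hypothetical usage: the accented character comes back as two unrelated characters,
# so the resulting name no longer matches any directory on disk.
# garbled = as_seen_by_ansi_api("C:/Users/JoséRubenLoibman/.cache/vosk")
```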

PeturDarriPeturs commented 1 year ago

I'm using the C# bindings. I could try changing the marshalling type for the string parameter in this function:

[global::System.Runtime.InteropServices.DllImport("libvosk", EntryPoint="vosk_model_new")]
  public static extern global::System.IntPtr new_Model(string jarg1);

But I didn't think to try that because I was sure the change would need to be done in the C API. I can find plenty of threads talking about not being able to open non-ASCII paths on Windows with C/C++ without some workaround.

PeturDarriPeturs commented 1 year ago

@nshmyrev I've tried a couple of different marshalling types for the path parameter (LPStr, LPWStr, LPUTF8Str, BStr), but they all result in crashes when given a non-ASCII path.

I'm not convinced there's any kind of modification that can be done to the string before it's passed to the C API that will fix this issue. I think the C API needs to be changed to use another function to read the files.

For example, this function in Kaldi is used by Vosk to load config files: https://github.com/kaldi-asr/kaldi/blob/ac29a6ff09823d1cbb4814da60360c966f33cd0d/src/util/parse-options.cc#L461 But this thread suggests you can't open non-ASCII paths in Windows with std::ifstream. There's std::wifstream for that.

nshmyrev commented 1 year ago

You should be able to use ISO8859-1 paths then. The question is where to convert them; it can be done on the C# side.

nshmyrev commented 1 year ago

But this thread suggests you can't open non-ASCII paths in Windows with std::ifstream. There's std::wifstream for that

This is wrong I think

PeturDarriPeturs commented 1 year ago

@nshmyrev I'll try encoding the path to ISO8859-1 before passing it. That should fix the paths I've been having issues with (accented characters). But are you sure it would work for any UTF-8 character? What about Cyrillic characters?

nshmyrev commented 1 year ago

For Cyrillic there is cp1251. In general, one can figure out the Windows file encoding with some API; it is just a few more steps to create a file path string in the proper encoding.
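
A sketch of those steps in Python, under the assumption that the relevant API is the Win32 `GetACP` call (reached through `ctypes`) and that Python's `cpNNNN` codec names match the reported code page. The helper names are hypothetical, and the encoder returns `None` when a character has no representation in the chosen code page:

```python
import ctypes
import locale
import sys


def ansi_codec_name() -> str:
    """Return a codec name for the active ANSI code page (Windows) or the locale default."""
    if sys.platform == "win32":
        return f"cp{ctypes.windll.kernel32.GetACP()}"
    return locale.getpreferredencoding(False)


def encode_for_ansi(path: str, codec: str):
    """Encode the path in the given code page, or return None if a character does not fit."""
    try:
        return path.encode(codec)
    except UnicodeEncodeError:
        return None


# Hypothetical usage:
# raw = encode_for_ansi(model_dir, ansi_codec_name())
# if raw is None, the path cannot be expressed in the system code page at all.
```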

PeturDarriPeturs commented 1 year ago

@nshmyrev Encoding to ISO8859-1 works when the path only contains characters in that set, but doesn't work for any characters outside that set. What if the path contains both Latin (accented) and Cyrillic characters, for example? I don't see how this solution could scale.

Here's another thread talking about this limitation in Windows and how you need to use the wide/16-bit variations of IO functions to support UTF-8 file paths on Windows: https://stackoverflow.com/questions/30829364/open-utf8-encoded-filename-in-c-windows
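
Along the same lines, a wide-character Windows call can be used from Python to ask for the 8.3 short form of a path, which is often ASCII-only and can then be handed to a narrow-character API. This is a sketch under assumptions: it relies on the Win32 `GetShortPathNameW` function, it only helps when short names are enabled on the volume and the path already exists, and the function name `short_path` is made up for illustration.

```python
import ctypes
import sys


def short_path(path: str) -> str:
    """Return the 8.3 short form of an existing path on Windows; unchanged elsewhere."""
    if sys.platform != "win32":
        return path
    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_short.restype = ctypes.c_uint32
    # First call with an empty buffer asks for the required size (including the NUL).
    needed = get_short(path, None, 0)
    if needed == 0:
        return path  # path does not exist or the call failed
    buffer = ctypes.create_unicode_buffer(needed)
    if get_short(path, buffer, needed) == 0:
        return path
    return buffer.value


# Hypothetical usage: check the result before trusting it, since a short name
# is not guaranteed to be pure ASCII.
# candidate = short_path(model_dir)
# usable = candidate.isascii()
```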

Sharcoux commented 1 year ago

For Node users:

const fs = require('fs')
const path = require('path')

/**
 * Convert a path to a DOS compatible path.
 * This pattern is named as 8.3 filename (https://en.wikipedia.org/wiki/8.3_filename)
 * MS official docs: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-fscc/18e63b13-ba43-4f5f-a5b7-11e871b71f14
 * @param {string} filePath The original path with non ASCII characters
 * @returns {string} The path in ASCII and 8.3 filename format
 */
function convertToDOSName(filePath) {
  // Regex to match all non-ASCII characters
  // eslint-disable-next-line no-control-regex
  const accentPattern = /[^\u0000-\u007F]/g

  const convertedPath = filePath.split('\\').reduce((fullPath, relativePath) => {
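    // Use `search` rather than `test`: `test` on a global regex keeps state in `lastIndex` between calls
    if (relativePath.search(accentPattern) === -1) return fullPath + '\\' + relativePath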
    const entries = fs.readdirSync(fullPath)

    // Read all entries in the directory and sort them by creation date
    // Truncate the name to 6 characters and convert it to uppercase without accents
    const sortedEntries = entries
      .map(entry => {
        const entryPath = path.resolve(fullPath, entry)
        const stats = fs.statSync(entryPath)
        return { name: entry, createdAt: stats.birthtimeMs }
      })
      .sort((a, b) => a.createdAt - b.createdAt)
      .map(entry => {
        return {
          ...entry,
          dosName: entry.name.replace(accentPattern, '').toUpperCase().slice(0, 6),
        }
      })

    // We just care for the path that we need to access.
    // So we get the index of the part that we are looking for.
    // The directories truncated name's indexes are given according to creation date (older first).
    const indexOfPartName = sortedEntries.findIndex(entry => entry.name === relativePath)
    const repeatedNames = sortedEntries.filter(
      entry => entry.dosName === sortedEntries[indexOfPartName].dosName
    )
    const indexInRepeatedName =
      repeatedNames.findIndex(entry => entry.name === relativePath) + 1

    return (
      fullPath + '\\' + sortedEntries[indexOfPartName].dosName + `~${indexInRepeatedName}`
    )
  })

  return convertedPath
}

PeturDarriPeturs commented 1 year ago

I ended up working around this issue by storing the model in the C:/Users/Public/Documents folder, which doesn't require admin privileges to write to.

m1adow commented 1 month ago

@nshmyrev Hello. Is there any update on this? I'm still having this issue in the C# project.