Open rloibman opened 2 years ago
You need to run the same script in the console
I see. Why can't I run it from code in Python? And what command do I have to run in the console? Thanks!
> Why can't I run it from code in Python?
Because you do not see debug messages
> What command do I have to run in the console?
python3 test.py
Ok, thanks!
I run in Anaconda Prompt and got this:
I tried again after renaming the folder of the model from "vosk-model-ru-0.22" to "ru", and I got this:
I guess there is some problem with the path where the model is saved, maybe because of my Windows username? It's "JoséRubenLoibman".
How can I fix that?
This must be fixed on the C side.
Meanwhile, you can load the model by path, using a pure ASCII path in the filesystem.
Thanks for your response! But I didn't understand how I can load the model by path. Could you be more specific, please? Thanks again.
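A minimal sketch of what loading by path could look like in Python, assuming the unpacked model folder has been copied to a location whose full path is pure ASCII. The folder `C:/vosk-models/ru` and the helper names are hypothetical; only `vosk.Model` comes from the library, and it is called here with the model directory as its first argument.

```python
def is_ascii_path(path):
    """Return True if every character in the path is plain ASCII."""
    return str(path).isascii()


def load_model_from_path(model_dir):
    """Load a Vosk model from an explicit directory instead of the default cache."""
    if not is_ascii_path(model_dir):
        raise ValueError(f"Model path contains non-ASCII characters: {model_dir}")
    # Imported lazily so the ASCII check can be used without vosk installed.
    from vosk import Model
    return Model(str(model_dir))


# Usage (hypothetical folder; copy the unpacked model there first):
# model = load_model_from_path("C:/vosk-models/ru")
```

Checking the path up front turns the failure into a readable Python error rather than a failure inside the native library.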
I'm running into this issue as well. The workaround of just using a pure ASCII path works, but I'd like to be able to store the model in the AppData folder, which is below the user folder, and the user folder name can contain non-ASCII characters.
Looking at the source, it looks like this needs to be fixed in Kaldi itself, since Vosk is using Kaldi functions to read the files in the model. Or alternatively implement the file reads in Vosk instead.
What are your thoughts on this @nshmyrev? This issue is sort of preventing me from deploying Vosk on Windows machines, because I can't guarantee the user name is pure ASCII and I want to avoid putting the model file in a root directory.
Should I open an issue in Kaldi instead?
Another solution would be to allow loading models from memory instead of the filesystem, but looking at the source, that might be easier said than done.
@PeturDarriPeturs it is a Vosk issue, not really a Kaldi one.
For a quick fix, you can set the cache path to a pure ASCII folder here:
https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L18
You can use something other than Path.home().
Or we can encode the path to the native charset instead of UTF-8 here:
https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L52
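To illustrate the second suggestion, here is a small sketch of how the same path turns into different bytes depending on the encoding. The helper `encode_path` is hypothetical and is not part of Vosk; `locale.getpreferredencoding(False)` is used as an assumption for what "native charset" means, and its result varies by machine.

```python
import locale


def encode_path(path, encoding=None):
    """Encode a path to bytes using the given or the platform's preferred encoding."""
    if encoding is None:
        # For example 'cp1252' on a Western European Windows install (machine dependent).
        encoding = locale.getpreferredencoding(False)
    return path.encode(encoding)


path = "C:/Users/JoséRubenLoibman/.cache/vosk/ru"
utf8_bytes = encode_path(path, "utf-8")      # 'é' becomes two bytes: 0xC3 0xA9
cp1252_bytes = encode_path(path, "cp1252")   # 'é' becomes one byte: 0xE9
print(utf8_bytes)
print(cp1252_bytes)
```

A C function that interprets its argument in the ANSI code page would read the two UTF-8 bytes as two unrelated characters, which would explain why the path is not found.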
I'm using the C# bindings. I could try changing the marshalling type for the string parameter in this function:
[global::System.Runtime.InteropServices.DllImport("libvosk", EntryPoint="vosk_model_new")]
public static extern global::System.IntPtr new_Model(string jarg1);
But I didn't think to try that because I was sure the change would need to be done in the C API. I can find plenty of threads about not being able to open non-ASCII paths on Windows with C/C++ without some workaround.
@nshmyrev I've tried a couple of different marshalling types for the path parameter (LPStr, LPWStr, LPUTF8Str, BStr), but they all result in crashes when given a non-ASCII path.
I'm not convinced there's any kind of modification that can be done to the string before it's passed to the C API that will fix this issue. I think the C API needs to be changed to use another function to read the files.
For example, this function in Kaldi is used by Vosk to load config files:
https://github.com/kaldi-asr/kaldi/blob/ac29a6ff09823d1cbb4814da60360c966f33cd0d/src/util/parse-options.cc#L461
But this thread suggests you can't open non-ASCII paths in Windows with std::ifstream. There's std::wifstream for that.
You should be able to use ISO8859-1 paths then. The question is where to convert them; it can be done on the C# side.
> But this thread suggests you can't open non-ASCII paths in Windows with std::ifstream. There's std::wifstream for that
This is wrong I think
@nshmyrev I'll try encoding the path to ISO8859-1 before passing it. That should fix the paths I've been having issues with (accented characters). But are you sure it would work for any UTF-8 character? What about Cyrillic characters?
For Cyrillic there is cp1251. In general, one can figure out the Windows file encoding with some API; it is just a few more steps to create a filepath string in the proper encoding.
@nshmyrev Encoding to ISO8859-1 works when the path only contains characters in that set, but doesn't work for any characters outside that set. What if the path contains both Latin (accented) and Cyrillic characters, for example? I don't see how this solution could scale.
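The limitation described above can be checked with a short sketch. The helper `can_encode` and the sample names are made up for illustration; the codecs are the ones mentioned in this thread plus UTF-8.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A Latin name, a Cyrillic name, and a folder name that mixes both scripts.
for name in ("José", "Привет", "José_Привет"):
    results = {enc: can_encode(name, enc) for enc in ("iso8859-1", "cp1251", "utf-8")}
    print(name, results)
```

The mixed name fails in both single-byte code pages, so converting to one legacy encoding only helps when every character of the path happens to belong to that code page.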
Here's another thread talking about this limitation in Windows and how you need to use the wide/16-bit variations of IO functions to support UTF-8 file paths on Windows: https://stackoverflow.com/questions/30829364/open-utf8-encoded-filename-in-c-windows
For Node users:
const fs = require('fs')
const path = require('path')

/**
 * Convert a path to a DOS-compatible path.
 * This pattern is known as an 8.3 filename (https://en.wikipedia.org/wiki/8.3_filename)
 * MS official docs: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-fscc/18e63b13-ba43-4f5f-a5b7-11e871b71f14
 * @param {string} filePath The original path with non-ASCII characters
 * @returns {string} The path in ASCII and 8.3 filename format
 */
function convertToDOSName(filePath) {
  // Regex to match a non-ASCII character. It has no `g` flag on purpose: a global
  // regex keeps `lastIndex` between `.test()` calls and can miss later matches.
  // eslint-disable-next-line no-control-regex
  const accentPattern = /[^\u0000-\u007F]/
  // Global variant, used only by `.replace()` to strip every non-ASCII character
  // eslint-disable-next-line no-control-regex
  const accentPatternGlobal = /[^\u0000-\u007F]/g
  const convertedPath = filePath.split('\\').reduce((fullPath, relativePath) => {
    if (!accentPattern.test(relativePath)) return fullPath + '\\' + relativePath
    // Trailing separator so that a bare drive such as `C:` means the drive root
    const parentDir = fullPath + '\\'
    const entries = fs.readdirSync(parentDir)
    // Read all entries in the directory and sort them by creation date
    // Truncate the name to 6 characters and convert it to uppercase without accents
    const sortedEntries = entries
      .map(entry => {
        const entryPath = path.resolve(parentDir, entry)
        const stats = fs.statSync(entryPath)
        return { name: entry, createdAt: stats.birthtimeMs }
      })
      .sort((a, b) => a.createdAt - b.createdAt)
      .map(entry => {
        return {
          ...entry,
          dosName: entry.name.replace(accentPatternGlobal, '').toUpperCase().slice(0, 6),
        }
      })
    // We only care about the path component that we need to access,
    // so we get the index of the part that we are looking for.
    // Truncated names are numbered according to creation date (older first).
    const indexOfPartName = sortedEntries.findIndex(entry => entry.name === relativePath)
    const repeatedNames = sortedEntries.filter(
      entry => entry.dosName === sortedEntries[indexOfPartName].dosName
    )
    const indexInRepeatedName =
      repeatedNames.findIndex(entry => entry.name === relativePath) + 1
    return (
      fullPath + '\\' + sortedEntries[indexOfPartName].dosName + `~${indexInRepeatedName}`
    )
  })
  return convertedPath
}
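For Python users, a similar result can probably be had by asking Windows for the short name directly instead of reconstructing it. This is only a sketch: `to_short_path` is a hypothetical helper that calls the Win32 function `GetShortPathNameW` through `ctypes`. It assumes the path already exists and that 8.3 name creation is enabled on the volume; if it is disabled, Windows may simply return the long path again.

```python
import ctypes
import os


def to_short_path(long_path):
    """Return the 8.3 short form of an existing path on Windows, else the input unchanged."""
    if os.name != "nt":
        return long_path  # short names are a Windows-only concept
    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_short.restype = ctypes.c_uint32
    # With an empty buffer the call returns the required size, including the NUL.
    needed = get_short(long_path, None, 0)
    if needed == 0:
        return long_path  # path does not exist or the call failed
    buffer = ctypes.create_unicode_buffer(needed)
    if get_short(long_path, buffer, needed) == 0:
        return long_path
    return buffer.value


# Usage (hypothetical model folder under a non-ASCII user name):
# ascii_path = to_short_path(r"C:\Users\JoséRubenLoibman\.cache\vosk\ru")
```

Because the operating system supplies the name, this avoids guessing the `~1`, `~2` suffix from creation dates.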
I ended up working around this issue by storing the model in the C:/Users/Public/Documents folder, which doesn't require admin privileges to write to.
@nshmyrev Hello. Is there any update on this? I'm still having this issue in the C# project.
Hi, when trying to create a model I'm having the following issue:
I'm using Jupyter Notebook, here is my full code: