dictation-toolbox / natlink

Natlink provides the interface between Dragon and python
Other
25 stars 17 forks source link

NatLink does not correctly handle not ASCII characters for natlink.execScript, natlink.playString, dictation_begin_callback #113

Closed mdbridge closed 2 years ago

mdbridge commented 2 years ago

For example, natlink.playString("naïve") produces naïve. A similar issue exists with natlink.execScript.

discovered that dictation_begin_callback is also broken in a similar way -- naïve turning into naïve

" the non ascii issue is tedious and not very easy to solve. ": this shouldn't be too hard; just translate the Unicode into Windows-1252, changing unencodable characters to .something like '?'. until this is fixed Vocola/vortex are essentially unusable, especially for people using foreign languages

quintijn commented 2 years ago

I had indeed quite some trouble tackling non ascii characters. I think the readwritefile.py module in the natlink repository solves reading strings from a file, both for utf-8 and latin-1 or cp-1252 encondings. Passing this on to playString and execScript should be done in the C++ code. For the moment the main focus is to get the installer procedure and repository structure as good as possible.

dougransom commented 2 years ago

Was this introduced in natlight-light? or did it exist prior to to switch to natlink-light?

quintijn commented 2 years ago

I was thinking about investigating this too,

I will try to find out! Still have a python 2 version running, on my "normal" computer...

dougransom commented 2 years ago

well we had a python3 natlink working last fall, then we switched to natlink light. I have no idea what needs to be fixed in the C++ — i don't really understand what happens in natlink.

Does something simple need to happen in these functions?

BOOL CDragonCode::playString( const char * pszKeys, DWORD dwFlags )

or BOOL CDragonCode::execScript( const char * pszScript, PCCHAR * ppWords, const char * pszComment )?

@mdbridge can you help with the C++?

dougransom commented 2 years ago

Also, there are some conditional compiles.

My suspicion is that the problem will manifest only on dragon15 but not on dragon13. Do either of you have a dragon13 to test this?

`

ifdef UNICODE

    /*int size_needed = ::MultiByteToWideChar( CP_UTF8, 0, pszKeys, -1, NULL, 0 );
    CPointerChar pszKeysW = new TCHAR[ size_needed ];
    ::MultiByteToWideChar( CP_UTF8, 0, pszKeys, -1, pszKeysW, size_needed );*/
    CComBSTR bstrKeys( pszKeys );
    rc = m_pIDgnSSvcOutputEvent->PlayString(
        bstrKeys,   // string to send
        dwFlags,        // flags
        0xFFFFFFFF,     // delay (-1 for app specific delay)
        dwClientCode,   // to identify which WM_PLAYBACK is ours
        &dwNumUndo );   // not used (number of backspaces needed to undo)
#else
    rc = m_pIDgnSSvcOutputEvent->PlayString(
        pszKeys,      // string to send
        dwFlags,      // flags
        0xFFFFFFFF,   // delay (-1 for app specific delay)
        dwClientCode, // to identify which WM_PLAYBACK is ours
        &dwNumUndo ); // not used (number of backspaces needed to undo)
#endif

`

dougransom commented 2 years ago

::MultiByteToWideChar( CP_UTF8, 0, pszKeys, -1, pszKeysW, size_needed );/ CComBSTR bstrKeys( pszKeys );

Is the culprit CP_UTF8 in the first argument (code page)?

There are several options here https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte but which would be correct?

dougransom commented 2 years ago

oh it is confusing, somebody commented out code and left it in years ago.

MultiByteToWideChar is not actually called. Unicode is relying on CComBstr default behavior which probably results in ansi code page.

I think this problem has always been in dragon15 versions of natlink.

So maybe something like this in playString? https://github.com/dictation-toolbox/natlink/blob/master/NatlinkSource/DragonCode.cpp

if this is the correct approach what is the choice for code_page?

` int code_page=CP_ACP (or CP_THREAD_ACP, or 1252)?

    int size_needed = ::MultiByteToWideChar( CP_UTF8, 0, pszKeys, -1, NULL, 0 );
    BSTR pszKeysW = new TCHAR[ size_needed ];
    ::MultiByteToWideChar( CP_UTF8, 0, pszKeys, -1, pszKeysW, size_needed );

    CComBSTR bstrKeys();
            bstrKeys.attach(pszKeysW);

    rc = m_pIDgnSSvcOutputEvent->PlayString(
        bstrKeys,   // string to send
        dwFlags,        // flags
        0xFFFFFFFF,     // delay (-1 for app specific delay)
        dwClientCode,   // to identify which WM_PLAYBACK is ours
        &dwNumUndo );   // not used (number of backspaces needed to undo)

`

If this looks like the right approach, execScript, should be changed.

There are a bunch of places where calls to ::MultiByteToWideChar( are commented out. Possible it was just thought CComBstr would take care of it.

Is it possible to write a test in natlinkcore/tests that would reproduce the problem?

LexiconCode commented 2 years ago

Do either of you have a dragon13 to test this?

I do and could test.

dougransom commented 2 years ago

I have created the following branches in my forks to work on this.

https://github.com/dougransom/natlink/tree/natlink_issue_113

https://github.com/dougransom/natlinkcore/tree/natlink_issue_113

dougransom commented 2 years ago

unit test for failing playString added in https://github.com/dougransom/natlinkcore/commit/18c65590c589fb8413e61cde51cf31a85eca149d

dougransom commented 2 years ago

added some simple unittests for execscript as well.

Now in file
https://github.com/dougransom/natlinkcore/blob/natlink_issue_113/tests/manual_test_dragoncode.py

quintijn commented 2 years ago

I can confirm, that in the python2.7 version of Natlink, with Dragon 15, the error is also present.

natlink.playString("na\xc3ve") (which is i with dots) produces

naA~eve (tilde on top of A)

quintijn commented 2 years ago

Great, solves my problem with env variable NATLINK_USERDIR too. Thanks Doug. We should incorporate the unittests of natlink itself in this too. Do we do this in natlinkcore repo or in the natlink repo? And if in the natlink repo, which branch are we now on?

mdbridge commented 2 years ago

Quintijn Hoogenboom @.***> writes:

I can confirm, that in the python2.7 version of Natlink, with Dragon 15, the error is also present.

natlink.playString("na\xc3ve") (which is i with dots) produces

naA~eve (tilde on top of A)

No, that's that's the correct behavior.  \xc3 in the windows-1252

encoding is an A with a ~ on top -- see https://en.wikipedia.org/wiki/Windows-1252.

Naive w/ 2 dots on the i is na\xefve in Windows-1252.

Remember that playString like many of the other Dragon routines

takes Windows-1252 encoded data.

dougransom commented 2 years ago

Note for now code for tests mentioned above (and natinkcore) is in a new repo:

https://github.com/dictation-toolbox/natlinkcore/blob/natlink_issue_113/tests/manual_test_dragoncode.py

dougransom commented 2 years ago

Ok, i have figured this out for displayText and the Script for execScript. I am not going to fix for the arguments of execScript unless someone actually uses it, as it is a fair bit more investigation, though the approach is the same: _natlinkcore.pyd must accept windows-1252 encoded data for stuff being send on to natlink. easy to to the encoding in init.py.

dougransom commented 2 years ago

I don't see teh issue with the callback.

dougransom commented 2 years ago

i think it is fixed in https://github.com/dictation-toolbox/natlink/commit/0aa24b8934176b32d1a287f750c91a9752e39a74.

I have only tested through running code in a python shell — not in vocola or unimaro etc. I'll wait till someone tries this before i push to DT:master.

mdbridge commented 2 years ago

Doug Ransom @.***> writes:

I don't see teh issue with the callback.

If you dictate say "naïve" using a dictation grammar like vortex

does, the wrong characters are sent to the dictation_change_callback. That callback is set by

~/voice/Vocola_development/Vocola_2/build/Vocola/to_MacroSystem/_vortex.py:113: def init(self, handler): self.my_handle = None self.handler = handler self.underlying_DictObj = natlink.DictObj() self.underlying_DictObj.setBeginCallback (handler.dictation_begin_callback) self.underlying_DictObj.setChangeCallback(handler.dictation_change_callback)

~/voice/Vocola_development/Vocola_2/build/Vocola/to_MacroSystem/_vortex.py:387: def dictation_change_callback(self, deletion_start, deletion_end, new_text, selection_start, selection_end):

In particular, if I remember correctly, new_text has the incorrect characters.

dougransom commented 2 years ago

I think i know what is going on. I am fixing stuff without really understanding the big picture I am afraid.

I think: CDictationObject::TextChanged is called when text is dicated. The new text is grabbed. the text probably is in windws-12whatever encoding, but treated as unicode without conversion and sent to Python.

quintijn commented 2 years ago

This is fixed, by Doug, in the upcoming release of Natlink. Also the recognitionMimic call seems to work now.

Several natlink functions are tested (succesfully) in unittestNatlink...