extrabacon / python-shell

Run Python scripts from Node.js with simple (but efficient) inter-process communication through stdio

How do you get Windows 11 Shell to work with UTF-8 Hanzi Chinese Characters #301

Closed hockyy closed 10 months ago

hockyy commented 11 months ago

There is this one shit OS called "Windows 11" which won't let my shell communicate in Chinese characters (Hanzi). At this point I'm too lazy to support my app any further on this OS 😭

>>> import locale
>>> locale.getencoding()
'cp1252'

Setting the encoding option doesn't help, and adding os.environ["PYTHONUTF8"] = "1" doesn't help either. I give up.
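A quick way to see what the interpreter is actually using is to print the encodings it picked at startup. The sketch below is only a diagnostic, not part of the original report; the helper name `encoding_snapshot` is made up for illustration. It also shows that assigning `PYTHONUTF8` from inside a running script leaves the UTF-8 mode flag unchanged.

```python
import locale
import os
import sys


def encoding_snapshot():
    """Collect the encoding settings of the running interpreter (hypothetical helper)."""
    return {
        "utf8_mode": sys.flags.utf8_mode,                  # 1 only if enabled at startup
        "locale": locale.getpreferredencoding(False),      # e.g. cp1252 on many Windows setups
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


before = encoding_snapshot()
os.environ["PYTHONUTF8"] = "1"  # set after startup, so this interpreter ignores it
after = encoding_snapshot()

print("before:", before)
print("after: ", after)
```

If `utf8_mode` is 0 and `stdin`/`stdout` report a legacy code page, the script is exchanging bytes with Node in that code page rather than in UTF-8.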

Environment: Windows 11, Node v18.15.0, Python 3.11.3

hockyy commented 11 months ago
let pyshell = new PythonShell(cantoneseScriptAppDataPath);
ipcMain.handle('tokenizeUsingPyCantonese', async (event, sentence) => {
  pyshell.send(sentence);
  return new Promise((resolve, reject) => {
    pyshell.once('message', function (message) {
      resolve(JSON.parse(message));
    });
  });
});
hockyy commented 11 months ago
import json
import pycantonese
import re
import os
os.environ["PYTHONUTF8"] = "1"
def separate_jyutping(jyutping_string):
    # Regular expression to match Jyutping syllables
    pattern = re.compile(r'([a-z]+\d)')
    return pattern.findall(jyutping_string)

# Tokenize a sentence with PyCantonese and return the result as a JSON string
def generate_json(sentence):
    # Parse the sentence for POS and Jyutping
    # print(sentence)
    parsed_sentence = pycantonese.parse_text(sentence)
    # print(parsed_sentence)

    # Initialize the result list
    result = []
    # Loop through each word and its details
    for word in parsed_sentence[0].tokens():
        # Default: a single entry covering the whole word
        separation_dict = [{"main": word.word, "jyutping": word.jyutping}]
        if word.jyutping:
            separation = separate_jyutping(word.jyutping)
            # Pair each character with its syllable when the counts match
            if len(separation) == len(word.word):
                separation_dict = [
                    {"main": char, "jyutping": syllable}
                    for char, syllable in zip(word.word, separation)
                ]

        word_dict = {
            "origin": word.word,
            "pos": word.pos,
            "jyutping": word.jyutping,
            "separation" : separation_dict
        }
        result.append(word_dict)

    # Convert the result to JSON format
    return json.dumps(result, ensure_ascii=False)

if __name__ == "__main__":
    while True:
        sentence = input()
        jsonRes = generate_json(sentence)
        print(jsonRes)
hockyy commented 11 months ago

My expectations of Windows are low, but this is holy ..

Almenon commented 10 months ago

Can you post the error here?

hockyy commented 10 months ago

Can you post the error here?

Happy new year!

No error, but when I receive the message event, it shows symbols like this:

[Screenshot: Screenshot_20240101_091335_Chrome]

I'm in Macau right now; I will post the screenshot when I get back to my apartment.

hockyy commented 10 months ago

The message was just full of that symbol; it seems the shell somehow uses the local encoding.
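For readers unfamiliar with this kind of garbling, here is a small illustration of the general mechanism, assuming one side writes UTF-8 bytes and the other side decodes them as cp1252. Whether this is exactly what happened in the setup above is an assumption; the function name `misdecode` is made up for the example.

```python
def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with a different codec."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


original = "香港人"
garbled = misdecode(original)

# ascii() escapes non-ASCII characters so the output is safe on any console.
print(ascii(original))
print(ascii(garbled))
print(len(original), "characters became", len(garbled))
```

Each Hanzi character takes three bytes in UTF-8, and a single-byte code page turns every one of those bytes into its own unrelated symbol.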

Everything works perfectly on macOS and Debian-based Linux.

Almenon commented 10 months ago

Read over https://docs.python.org/3/library/os.html#utf8-mode. At the bottom it says "The Python UTF-8 Mode can only be enabled at the Python startup". You're trying to enable it from inside Python, but at that point Python has already started up. Instead, you can enable UTF-8 mode at Python startup by passing interpreter options through python-shell.
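The "startup only" rule can be checked without python-shell. The sketch below starts child interpreters with Python's own `subprocess` module and asks each one whether UTF-8 mode is on; the helper name `probe` and the `PROBE` snippet are illustrative, not part of python-shell.

```python
import os
import subprocess
import sys

# One-liner run in each child: report the UTF-8 mode flag and stdout encoding.
PROBE = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding)"


def probe(extra_args=(), extra_env=None):
    """Start a child interpreter and return [utf8_mode, stdout_encoding] as strings."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # avoid an unrelated override of stdout's encoding
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        capture_output=True, text=True, env=env, check=True,
    )
    return result.stdout.split()


via_flag = probe(extra_args=["-X", "utf8"])       # command-line option
via_env = probe(extra_env={"PYTHONUTF8": "1"})    # variable set before startup

print("-X utf8      ->", via_flag)
print("PYTHONUTF8=1 ->", via_env)
```

Both routes work because the setting is in place before the child interpreter initializes its standard streams, which is the same thing the `-X utf8` interpreter option achieves when python-shell launches the script.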

For example, save these Python and TypeScript files to the same directory and run index.ts. You can do so directly with ts-node.

# test.py
print('香港人')
// index.ts
import {PythonShell} from 'python-shell';

let options = {
  pythonOptions: ['-X', 'utf8'],
};

PythonShell.run('test.py', options).then(messages=>{
  // messages is an array of the lines the script printed during execution
  console.log('results: %j', messages);
});
>ts-node index.ts
results: ["香港人"]
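If changing the interpreter options is not possible, a different approach (not the one suggested in this thread) is to switch the text streams to UTF-8 from inside the script using `TextIOWrapper.reconfigure()`, available since Python 3.7. The sketch below demonstrates it on an in-memory stream so it can run anywhere; the helper name `force_utf8` is made up for the example.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure(); return the stream."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


# Stand-in for a stdout that was opened with a legacy code page.
raw = io.BytesIO()
legacy_stream = io.TextIOWrapper(raw, encoding="cp1252", newline="\n")

force_utf8(legacy_stream)
legacy_stream.write("香港人\n")
legacy_stream.flush()

print(raw.getvalue())  # the bytes that would travel down the pipe
```

In a real script the calls would be `force_utf8(sys.stdin)` and `force_utf8(sys.stdout)` at the very top, before anything is read from stdin, since the encoding of a stream cannot be changed once data has been read from it.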

I also suggest reading over an article I wrote recently, https://medium.com/@almenon214/learn-unicode-in-y-minutes-60a8b2cef1d9. Let me know if it helps out :)

hockyy commented 10 months ago


@Almenon it works! hahah amazing

Almenon commented 10 months ago

Awesome! Let me know when you add Mandarin support to your app. I would be interested in testing it out.

hockyy commented 10 months ago

lol haha @Almenon I'm too lazy for it, so I'm not going to do it in the near future, but it will surely happen.