fxsjy / jieba

结巴 (Jieba) Chinese word segmentation
MIT License

How to change the decoder #992

Open XilaBro opened 1 year ago

XilaBro commented 1 year ago

I am currently trying to use Jieba in combination with Learning With Texts. What I am attempting to do is have Jieba insert a space between each "word" on the command line. For example, 我想飞去北京 would be broken down into 我,想,飞,去,北京. What I tried initially was `python -m jieba -d ' ' input.txt > output.txt`, but it would just keep printing "Prefix dict has been built successfully". I then tried `python -m jieba -a file1 > file2` and got the error below:

```
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xilab\AppData\Local\Temp\jieba.cache
Loading model cost 1.173 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "C:\Users\xilab\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xilab\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\xilab\lib\site-packages\jieba\__main__.py", line 52, in <module>
    ln = fp.readline()
  File "C:\Users\xilab\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 11: character maps to <undefined>
```

What do you guys think? Sorry for the poor formatting; this is my first post.
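The traceback can be reproduced without jieba at all: on Windows, Python's default text encoding is often cp1252, and byte 0x9d, which appears inside multi-byte UTF-8 sequences (for example the curly quote `”`, encoded as `b'\xe2\x80\x9d'`), has no mapping in cp1252. A minimal sketch of the failure mode, using an illustrative string rather than the actual file contents:

```python
# Byte 0x9d occurs inside the UTF-8 encoding of a curly quote, but
# is undefined in cp1252, so decoding with cp1252 raises exactly the
# kind of error seen in the traceback above.
utf8_bytes = "我想飞去北京”".encode("utf-8")

try:
    # This mimics what happens when the file is read with the
    # Windows locale encoding instead of UTF-8.
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as exc:
    print(exc)  # 'charmap' codec can't decode byte 0x9d ...

# Decoding with the correct encoding works fine:
print(utf8_bytes.decode("utf-8"))
```

So the error is not about jieba's segmentation at all; it is raised before segmentation, while reading the input file.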

manother commented 1 year ago

Your email has been received~ (auto-reply)

brynne8 commented 1 year ago

According to your description, it seems your input file is not UTF-8 encoded, which prevents Jieba from decoding and segmenting it properly. I would recommend:

Convert your input.txt file to UTF-8 encoding. As mentioned in Jieba's readme, "The input string can be an unicode/str object, or a str/bytes object which is encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectly decoded as UTF-8." So UTF-8 is preferred.

Hope this helps you resolve the issue with using Jieba. Let me know if you have any other questions!
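The conversion step above can be sketched in a few lines of Python: read the file with its current encoding and write it back as UTF-8. The file names here are hypothetical, and the source encoding is assumed to be GBK; substitute your own paths and encoding. The sketch creates its own sample file in a temp directory so it is self-contained:

```python
import os
import tempfile

# Hypothetical demo: simulate a non-UTF-8 (here GBK) input file,
# then re-encode it as UTF-8.  In practice, point `src` at your
# real input.txt and pick the encoding it was actually saved in.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "input.txt")
dst = os.path.join(tmpdir, "input-utf8.txt")

with open(src, "w", encoding="gbk") as f:    # simulate the old file
    f.write("我想飞去北京")

with open(src, encoding="gbk") as f:         # read with the old encoding
    text = f.read()
with open(dst, "w", encoding="utf-8") as f:  # write back as UTF-8
    f.write(text)
```

After a conversion like this, the jieba command line should be able to read the file without hitting the cp1252 decoder.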

XilaBro commented 1 year ago

Hey AlexanderMisel, it turns out I've still got problems with it.

Initially, I tried putting the text in a Word .docx so I could choose the encoding, but I got the same problem. For the .docx I selected UTF-8, and the .txt reports it is UTF-8 with BOM. Unfortunately, I still get the same error.

```
C:\Users\xilab>python -m jieba -d'' "C:\Users\xilab\Desktop\g.txt" > "C:\Users\xilab\Desktop\eb.txt"
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xilab\AppData\Local\Temp\jieba.cache
Loading model cost 0.578 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "C:\Users\xilab\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xilab\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\xilab\lib\site-packages\jieba\__main__.py", line 52, in <module>
    ln = fp.readline()
  File "C:\Users\xilab\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 14: character maps to <undefined>
```

Any thoughts? Thank you.
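One detail worth checking, given the "UTF-8 BOM" mentioned above: a file saved as "UTF-8 with BOM" starts with the bytes `EF BB BF`. Python's plain `utf-8` codec keeps that marker as a stray `\ufeff` character at the start of the text, while the `utf-8-sig` codec strips it. A small sketch of the difference:

```python
# "UTF-8 with BOM" files begin with the byte-order mark EF BB BF.
bom_file_bytes = b"\xef\xbb\xbf" + "我想飞去北京".encode("utf-8")

# Plain utf-8 keeps the BOM as a stray \ufeff character ...
decoded_plain = bom_file_bytes.decode("utf-8")
print(decoded_plain.startswith("\ufeff"))  # True

# ... while utf-8-sig strips it, which is what you want before
# handing the text to a segmenter.
decoded_sig = bom_file_bytes.decode("utf-8-sig")
print(decoded_sig)  # 我想飞去北京
```

So even once the cp1252 error is gone, saving the file as UTF-8 *without* BOM (or reading it with `encoding="utf-8-sig"`) avoids a leftover `\ufeff` being treated as part of the first word.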