geek-ai / Texygen

A text generation benchmarking platform
MIT License

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence #14

Closed c1a1o1 closed 6 years ago

c1a1o1 commented 6 years ago

C:\Users\caocao\Anaconda3\python.exe D:/work/zhaiyao/Texygen-master/Texygen-master/main.py -g seqgan -t real -d data/shi.txt

2018-06-05 11:24:43.458330: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

2018-06-05 11:24:45.339293: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties: name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176 pciBusID: 0000:02:00.0 totalMemory: 4.00GiB freeMemory: 3.34GiB

2018-06-05 11:24:45.361035: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)

Traceback (most recent call last):
  File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 85, in <module>
    parse_cmd(sys.argv[1:])
  File "D:/work/zhaiyao/Texygen-master/Texygen-master/main.py", line 73, in parse_cmd
    gan_func(opt_arg['-d'])
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 300, in train_real
    wi_dict, iw_dict = self.init_real_trainng(data_loc)
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\models\seqgan\Seqgan.py", line 264, in init_real_trainng
    self.sequence_length, self.vocab_size = text_precess(data_loc)
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 75, in text_precess
    train_tokens = get_tokenlized(train_text_loc)
  File "D:\work\zhaiyao\Texygen-master\Texygen-master\utils\text_process.py", line 50, in get_tokenlized
    for text in raw:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 2: illegal multibyte sequence

c1a1o1 commented 6 years ago

Running `main.py -g textgan -t cfg` works well!

Yaoming95 commented 6 years ago

The Chinese dataset works fine on Ubuntu. The error seems to come from Windows using GBK or ASCII as its default encoding instead of UTF-8. Run `sys.getdefaultencoding()` (after `import sys`); if it doesn't return "utf-8", try adding this to main.py: ` import sys

reload(sys)

sys.setdefaultencoding('utf-8') `
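
To see which encoding is actually in play, here is a minimal diagnostic sketch. It assumes Python 3, where `sys.getdefaultencoding()` always returns 'utf-8' and a plain `open()` call without an `encoding` argument decodes with the locale's preferred encoding instead. The helper name `describe_encodings` is made up for illustration and is not part of Texygen.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings this interpreter reports (hypothetical helper)."""
    return {
        # Encoding used for implicit str/bytes conversions.
        "default": sys.getdefaultencoding(),
        # Encoding that open() falls back to when no encoding= is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

On a Chinese-locale Windows machine the "preferred" entry is likely to be a GBK code page, which would be consistent with the 'gbk' codec named in the traceback.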

c1a1o1 commented 6 years ago

try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

Thank you very much!

WuJ1n9 commented 4 years ago

> try:
>     reload(sys)
>     sys.setdefaultencoding('utf-8')
> except:
>     pass
>
> Thank you very much!

Hi, I find that this code only works in Python 2. How can the problem be solved in Python 3? Thanks.
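
One possible Python 3 approach, sketched under the assumption that the data file is saved as UTF-8: pass `encoding="utf-8"` explicitly wherever the file is opened, so decoding no longer depends on the Windows locale. The traceback points at the file loop in `get_tokenlized` in `utils/text_process.py`, but the actual code there is not shown in this thread, so `read_lines_utf8` and `demo` below are hypothetical stand-ins and not the project's functions.

```python
import os
import tempfile


def read_lines_utf8(path):
    """Read a text file line by line, decoding explicitly as UTF-8."""
    # encoding="utf-8" overrides the locale default (GBK on many Chinese
    # Windows setups), which is what triggers the UnicodeDecodeError.
    with open(path, encoding="utf-8") as raw:
        return [line.rstrip("\n") for line in raw]


def demo():
    """Write a small UTF-8 sample file and read it back (illustration only)."""
    sample = "床前明月光\n疑是地上霜\n"
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", suffix=".txt", delete=False
    ) as tmp:
        tmp.write(sample)
        path = tmp.name
    try:
        return read_lines_utf8(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

If editing the source is not an option, Python 3.7 and later also offer UTF-8 mode (`python -X utf8 main.py ...` or the environment variable `PYTHONUTF8=1`), which makes `open()` default to UTF-8. Either way, this only helps if the data file really is UTF-8 encoded.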