X94521 / Math_mnbvc

MNBVC math Q&A data cleaning

UnicodeDecodeError: 'gbk' codec can't decode byte 0x8e in position 85: illegal multibyte sequence #1

Open EurekaTesla opened 5 days ago

EurekaTesla commented 5 days ago

Traceback (most recent call last):
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 384, in <module>
    p.run(engine_map_qa, 'math_qa.json')
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 334, in run
    for line in stream:
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\transformations.py", line 262, in flat_map_impl
    for element in sequence:
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 83, in read_corpus
    yield from self.read_file(file_path, file_type)
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 63, in read_file
    return seq.jsonl(file_path)
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\streams.py", line 239, in jsonl
    return self(input_file).map(jsonapi.loads).cache(delete_lineage=True)
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\pipeline.py", line 231, in cache
    self._base_sequence = list(self._evaluate())
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\io.py", line 68, in __iter__
    yield from file_content
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8e in position 85: illegal multibyte sequence

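Hello, I get the error above when running the script on Windows. Is this an encoding problem caused by functional.seq.jsonl() on Windows? How can I fix it? Do I need to modify the functional.seq.jsonl() library code?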

X94521 commented 5 days ago

This looks like a file encoding problem: the encoding has to be specified explicitly when the file is opened.
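To make the cause concrete, here is a minimal sketch using only the standard library. It assumes the data file is actually UTF-8 and that the Windows locale encoding is GBK (cp936), which is what the 'gbk' in the error message suggests but is not confirmed in this thread. `try_decode` is a hypothetical helper written for this illustration.

```python
import locale

# Encoding that open() falls back to on Python 3.10 when no `encoding`
# argument is given. On a Chinese-locale Windows install this is
# typically 'cp936' (GBK); on most Linux/macOS systems it is UTF-8.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# UTF-8 bytes for one Chinese character, as they would appear in a UTF-8 file.
utf8_bytes = "中".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


print(try_decode(utf8_bytes, "utf-8"))  # decodes cleanly
print(try_decode(utf8_bytes, "gbk"))    # same bytes, wrong codec
```

If the first line prints something other than UTF-8 on your machine, any `open()` call without an explicit `encoding` will read UTF-8 files with the wrong codec.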

For files in jsonl format,

return seq.jsonl(file_path)

can be changed to:

import json
with open(file_path, 'r', encoding='utf-8') as f:  # or 'gbk' etc., depending on the actual encoding of your file
    return [json.loads(line) for line in f if line.strip()]  # skip blank lines
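If the files are large, a generator keeps the line-by-line behaviour instead of building the whole list at once. The following is a sketch under that assumption; `iter_jsonl` and the sample records are hypothetical and not part of format_data.py.

```python
import json
import os
import tempfile
from typing import Any, Iterator


def iter_jsonl(file_path: str, encoding: str = "utf-8") -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the file."""
    with open(file_path, "r", encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


# Usage: round-trip a small UTF-8 file that contains Chinese text.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"q": "一加一等于几?", "a": "2"}\n')
        f.write('{"q": "二乘三等于几?", "a": "6"}\n')
    records = list(iter_jsonl(sample_path))

print(records)
```

Because the encoding is passed explicitly, the result no longer depends on the operating system's locale.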

Similarly, for files in json format,

return seq.json(file_path)

can be changed to:

with open(file_path, 'r', encoding='utf-8') as f:  # or 'gbk' etc., depending on the actual encoding of your file
    return json.load(f)
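When the corpus mixes encodings and you do not know which one a given file uses, one option is to try a short list of candidates in order. This is only a sketch: `load_json_any` is a hypothetical helper, and the candidate list is an assumption. The strict codec goes first, because GBK can sometimes decode bytes that are not really GBK and silently produce garbled text.

```python
import json
import os
import tempfile
from typing import Any, Sequence


def load_json_any(file_path: str,
                  encodings: Sequence[str] = ("utf-8-sig", "gbk")) -> Any:
    """Parse a JSON file, trying each candidate encoding in order."""
    with open(file_path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        return json.loads(text)
    raise ValueError(
        f"could not decode {file_path!r} with any of {list(encodings)}")


# Usage: the same record stored once as UTF-8 and once as GBK.
record = {"q": "数学"}
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.json")
    gbk_path = os.path.join(tmp, "gbk.json")
    payload = json.dumps(record, ensure_ascii=False)
    with open(utf8_path, "wb") as f:
        f.write(payload.encode("utf-8"))
    with open(gbk_path, "wb") as f:
        f.write(payload.encode("gbk"))
    from_utf8 = load_json_any(utf8_path)
    from_gbk = load_json_any(gbk_path)

print(from_utf8, from_gbk)
```

"utf-8-sig" is used rather than "utf-8" so that files saved with a byte order mark, which some Windows editors add, are also accepted.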