Open EurekaTesla opened 5 days ago
Traceback (most recent call last):
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 384, in <module>
    p.run(engine_map_qa, 'math_qa.json')
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 334, in run
    for line in stream:
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\transformations.py", line 262, in flat_map_impl
    for element in sequence:
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 83, in read_corpus
    yield from self.read_file(file_path, file_type)
  File "D:\projects\4-LLM\datasets\Math_mnbvc\format_data.py", line 63, in read_file
    return seq.jsonl(file_path)
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\streams.py", line 239, in jsonl
    return self(input_file).map(jsonapi.loads).cache(delete_lineage=True)
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\pipeline.py", line 231, in cache
    self._base_sequence = list(self._evaluate())
  File "C:\Users\15139\AppData\Roaming\Python\Python310\site-packages\functional\io.py", line 68, in __iter__
    yield from file_content
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8e in position 85: illegal multibyte sequence
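Hello, I get the error above when running the script on Windows. Is this an encoding problem caused by functional.seq.jsonl() on Windows? How can I fix it? Do I need to modify the functional.seq.jsonl() library code?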
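This looks like an encoding problem: the file is being decoded with the system default codec (gbk here), so the file encoding needs to be specified explicitly.
For jsonl files,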
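To illustrate why the traceback mentions 'gbk': when no encoding is passed, Python 3.10's open() falls back to the locale's preferred encoding, which on a Chinese-language Windows install is typically gbk (cp936). The sketch below is only a demonstration; the helper name `decodes_to` and the sample text are made up for illustration and are not part of the project or of the functional library.

```python
import locale


def decodes_to(data: bytes, encoding: str, expected: str) -> bool:
    """Return True only if `data` decodes cleanly to `expected` under `encoding`."""
    try:
        return data.decode(encoding) == expected
    except UnicodeDecodeError:
        # Same error class as in the traceback above.
        return False


# Hypothetical sample text, stored as UTF-8 the way the dataset files presumably are.
text = "数学"
utf8_bytes = text.encode("utf-8")

# The codec open() uses in Python 3.10 when `encoding` is omitted.
print("default text encoding:", locale.getpreferredencoding(False))

# Decoding with the right codec round-trips; decoding UTF-8 bytes as gbk
# either raises UnicodeDecodeError or silently produces the wrong characters.
print("utf-8:", decodes_to(utf8_bytes, "utf-8", text))
print("gbk:  ", decodes_to(utf8_bytes, "gbk", text))
```

Whether a mismatch raises or just yields mojibake depends on the particular bytes, which is why the error can show up in one file and not another.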
return seq.jsonl(file_path)
can be changed to:
import json
with open(file_path, 'r', encoding='utf-8') as f:  # or 'gbk', etc., depending on the actual encoding of your file
    return [json.loads(line) for line in f]
Likewise, for json files,
return seq.json(file_path)
can be changed to:
with open(file_path, 'r', encoding='utf-8') as f:  # or 'gbk', etc., depending on the actual encoding of your file
    return json.load(f)
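If the corpus mixes files in different encodings, one possible approach is to try a short list of candidate encodings in order. This is a minimal sketch using only the standard library; the function name `read_jsonl` and the candidate list ("utf-8-sig", then "gbk") are assumptions for illustration, not something taken from format_data.py or from the functional library.

```python
import json
from pathlib import Path


def read_jsonl(path, encodings=("utf-8-sig", "gbk")):
    """Parse a JSON Lines file, trying each candidate encoding in order.

    "utf-8-sig" reads plain UTF-8 and also strips a leading BOM if present.
    The candidate list is an assumption; adjust it to the files you have.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        # Split on "\n" only; blank lines (e.g. a trailing newline) are skipped.
        return [json.loads(line) for line in text.split("\n") if line.strip()]
    raise ValueError(f"could not decode {path!r} with any of {encodings}")
```

Strict UTF-8 decoding goes first because it rejects most gbk-encoded Chinese text, whereas gbk accepts many byte sequences that are not really gbk, so trying it first could return garbled strings without raising.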