e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License
922 stars, 123 forks

"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position X" #26

Closed by ares0027 2 months ago

ares0027 commented 3 months ago

I tried creating new text files, and I tried converting them to UTF-8, ANSI, etc. with Notepad++, including Windows-1252 and Windows-1254...

Traceback (most recent call last):
  File "C:\llm-train\augmentoolkit\processing.py", line 417, in <module>
    asyncio.run(main())
  File "C:\Users\baran\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\baran\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "C:\llm-train\augmentoolkit\processing.py", line 73, in main
    control_flow_functions.create_pretraining_set(
  File "C:\llm-train\augmentoolkit\augmentoolkit\control_flow_functions\control_flow_functions.py", line 1430, in create_pretraining_set
    file_contents = file.read()
  File "C:\Users\baran\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 102038: character maps to <undefined>

e-p-armstrong commented 3 months ago

Can I see your config and (if you can share it) the input file?

One solution might be to track down that character and delete it.

juanjopc commented 3 months ago

Hi,

I found a solution for the UnicodeDecodeError problem. The fix is to change line 1430 in the file augmentoolkit/control_flow_functions/control_flow_functions.py to use UTF-8 encoding when opening the file.

The original line is:

    with open(file_path, "r") as file:
        file_contents = file.read()

It should be changed to:

    with open(file_path, "r", encoding="utf-8") as file:
        file_contents = file.read()

I made this change and it fixed the problem for me. I hope this helps others who are having the same issue.
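
For context, a minimal sketch of why the explicit encoding matters. The traceback shows Python decoding with cp1252, and cp1252 has no character assigned to byte 0x81, while the same byte is a normal part of some UTF-8 sequences. The function names and the sample text below are made up for illustration and are not taken from augmentoolkit.

```python
import os
import tempfile


def read_text_utf8(file_path):
    """Read a whole text file, decoding it as UTF-8 on every platform."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding, which is cp1252 on many Windows installations.
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()


def round_trip_demo():
    """Write UTF-8 bytes containing 0x81, then try decoding them two ways."""
    text = "caf\u0101"  # U+0101 encodes to the UTF-8 bytes 0xC4 0x81
    raw = text.encode("utf-8")

    # cp1252 assigns no character to byte 0x81, so this decode fails.
    try:
        raw.decode("cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.txt")
        with open(path, "wb") as handle:
            handle.write(raw)
        return read_text_utf8(path), cp1252_failed


if __name__ == "__main__":
    print(round_trip_demo())
```

Note that this only helps when the input really is UTF-8. A file saved in another encoding would still raise a `UnicodeDecodeError`. In that case, passing `errors="replace"` to `open()` is one way to keep going, at the cost of losing the undecodable characters.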

e-p-armstrong commented 3 months ago

@juanjopc Awesome that you were able to fix the issue! Would you be willing to make a PR to add this? If not, I can go ahead and do it myself; it seems simple enough. Thank you!

e-p-armstrong commented 2 months ago

Merged! Thanks for the contribution!