camel-ai / camel

🐫 CAMEL: Finding the Scaling Law of Agents. A multi-agent framework. https://www.camel-ai.org
Apache License 2.0

[BUG] Pre-commit license check fails due to encoding issues (GBK vs UTF-8) #238

Open Appointat opened 1 year ago

Appointat commented 1 year ago

Required prerequisites

What version of camel are you using?

0.1.0

System information

>>> import sys, camel
>>> print(sys.version, sys.platform)
3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] win32
>>> print(camel.__version__)
0.1.0

Problem description

Title: Encoding issues with UTF-8 and GBK across different systems

I am using camel version 0.1.0 on a Windows system. When running the pre-commit checks, the update_license.py script fails with an error. The cause appears to be an encoding mismatch: my system defaults to GBK, while the files the script reads are encoded in UTF-8.

This issue specifically occurs when the script attempts to open and read from a file. The error message received is:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9d in position 145: illegal multibyte sequence

The expected behavior is for the script to successfully read from the file and execute the pre-commit checks. However, due to the encoding mismatch, this is not happening.

Reproducible example code

Python Snippets:

The traceback points to the line in update_license.py that reads a file, apparently with the system default encoding. A generic example of the same pattern:

with open("file.txt") as f:
    content = f.read()

If file.txt is encoded in UTF-8 and contains non-ASCII characters, but the system defaults to GBK, this can raise a UnicodeDecodeError.
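The failure and the fix can be reproduced on any platform with a small sketch. It assumes illustrative file content (not taken from the repository) and passes encoding="gbk" explicitly to imitate a system whose default encoding is GBK:

```python
import os
import tempfile

# Illustrative content (an assumption, not from the repository): the closing
# quotation mark U+201D is stored in UTF-8 as the bytes E2 80 9D.
text = "Apache License\u201d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # encoding="gbk" imitates a bare open(path) on a system that defaults to GBK.
    try:
        with open(path, encoding="gbk") as f:
            f.read()
        gbk_error = None
    except UnicodeDecodeError as exc:
        gbk_error = exc

    # Naming the encoding explicitly makes the read independent of the locale.
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(type(gbk_error).__name__)
print(content == text)
```

Reading with encoding="utf-8" succeeds regardless of the system default, which is why an explicit encoding is the usual remedy for this class of error.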

Command Lines:

This issue is encountered when running pre-commit checks using the update_license.py script:

python update_license.py

Extra Dependencies:

No additional dependencies are necessary to reproduce this issue. However, ensure you are using the correct version of Python and that you have all necessary packages installed.

Steps to Reproduce:

  1. Create or prepare a UTF-8 encoded file that contains non-ASCII characters.
  2. On a Windows machine whose default encoding is GBK, run the update_license.py script.
  3. When the script opens and reads the file, observe the UnicodeDecodeError.

Traceback

> git -c user.useConfigOnly=true commit --quiet --allow-empty-message --file -
Format code..............................................................Passed
Sort imports.............................................................Passed
Check PEP8...............................................................Passed
Check License............................................................Failed
- hook id: check-license
- exit code: 1

Traceback (most recent call last):
  File "****\camel\licenses\update_license.py", line 118, in <module>
    update_license_in_directory(
  File "****\camel\licenses\update_license.py", line 93, in update_license_in_directory
    if update_license_in_file(
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "****\camel\licenses\update_license.py", line 42, in update_license_in_file
    content = f.read()
              ^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9d in position 145: illegal multibyte sequence


Expected behavior

The script should successfully read from the file, regardless of the encoding used. It should handle different types of encodings without raising an error, and should carry out the pre-commit checks seamlessly.

Additional context

Potential Solution:

I suggest that the script be modified to explicitly use UTF-8 encoding when opening files, irrespective of the system defaults. This can help avoid such issues in the future, especially considering that UTF-8 is widely used across many systems and platforms.

Another option is to provide a way for users to specify the encoding that should be used by the script. This can be in the form of a command-line argument or a configuration file setting.
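A minimal sketch of both options, assuming a hypothetical --encoding flag and a hypothetical read_source helper. Neither name is taken from update_license.py, which may not accept any arguments today:

```python
import argparse
from pathlib import Path


def read_source(path, encoding="utf-8"):
    """Read a file with an explicit encoding instead of the locale default."""
    return Path(path).read_text(encoding=encoding)


def build_parser():
    parser = argparse.ArgumentParser(description="License header check (sketch)")
    # Hypothetical flag: defaults to UTF-8 but lets a user override it.
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="encoding used to read source files",
    )
    return parser


# An empty argument list selects the default; a real script would call
# parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args([])
print(args.encoding)
```

Defaulting to UTF-8 covers the common case, while the flag leaves room for repositories that store files in another encoding.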

Impact:

This issue can disrupt workflows, especially for users working on Windows systems. It can prevent successful execution of pre-commit checks, which can lead to overlooked errors or inconsistencies in the code.

Additional Context:

This issue stems from the fact that Python's default text encoding depends on the system locale. On Windows with a Simplified Chinese locale the default is GBK (code page 936), while Linux and macOS typically default to UTF-8. Given that UTF-8 is widely used and is a standard on many systems, it may be beneficial to align the script's encoding handling with this standard.

Obs01ete commented 1 year ago

Hm.. I develop on Windows and do not encounter this error. I normally run stuff in Anaconda powershell console.

Appointat commented 1 year ago

Thanks. I tried pre-commit on Linux, and it works there too. The error is probably caused by my Windows environment or its default settings. I will report back here once I resolve it.

Appointat commented 1 year ago

Hi, I found this solution works for me on Windows 11:

On Windows, Python takes its default text encoding from the system locale, which is why files are read as GBK here. You can override this for Python alone without changing the Windows locale. This is a somewhat advanced operation and may affect all Python programs on your system.

In Python 3.7 and later versions, you can globally set Python to use UTF-8 encoding by default in Windows environments by setting the PYTHONUTF8 environment variable to 1.

Here are the steps to do that:

  1. Press Win+X, and select System.
  2. Click on About, then on the right, select System info.
  3. In the list on the left, choose Advanced system settings.
  4. In the System Properties dialog, select Environment Variables.
  5. In the Environment Variables dialog, click New, then enter PYTHONUTF8 as the variable name and 1 as the value.

Then click OK and close all dialog boxes.

Restart your command prompt or PowerShell window. Python will then use UTF-8 as the default character encoding.

Please note that this method will change the default encoding method for all Python programs. If some programs depend on GBK or other encodings, unpredictable problems may occur. You need to ensure you understand the impact of this operation and know how to restore settings if something goes wrong.
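To confirm that the variable took effect, a short check can be run in the new terminal. This is only a sketch; the printed values depend on the machine and are not asserted here:

```python
import locale
import sys

# sys.flags.utf8_mode is 1 when Python runs in UTF-8 mode (Python 3.7+).
utf8_mode = bool(sys.flags.utf8_mode)

# Locale-derived encoding that open() falls back to when no encoding is given.
default_encoding = locale.getpreferredencoding(False)

print("UTF-8 mode enabled:", utf8_mode)
print("Default text encoding:", default_encoding)
```

If UTF-8 mode is reported as disabled, the terminal was probably opened before the variable was saved.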

kuang-da commented 1 year ago

It seems unrelated to the project itself but more like a common pitfall for contributors with a Chinese-English working environment. Maybe introducing a docker container or VSCode Dev container in the workflow could eliminate such issues from the root.

lightaime commented 1 year ago

Introducing a docker container sounds great. Thanks @kuang-da for the suggestion!

yiyiyi0817 commented 9 months ago

Thank you very much for your method! I am also a Windows 11 user and a contributor with a Chinese-English working environment. I hit the same error, and your solution fixed it. By the way, as a temporary fix, running 'set PYTHONUTF8=1' in Command Prompt (or '$env:PYTHONUTF8 = "1"' in PowerShell) before 'git commit ...' also works for the current session.
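The per-session effect can be illustrated with a small sketch that sets PYTHONUTF8 only for one child process, leaving the rest of the system untouched. The child command here is just a probe, not the pre-commit hook itself:

```python
import os
import subprocess
import sys

# Copy the current environment and enable UTF-8 mode for the child only.
env = dict(os.environ, PYTHONUTF8="1")

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)

# The child reports 1 when it started in UTF-8 mode.
child_utf8_mode = result.stdout.strip()
print(child_utf8_mode)
```

Because the variable lives only in the child's environment, programs that rely on GBK elsewhere are not affected.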

Appointat commented 9 months ago

@yiyiyi0817 Glad to hear that it was helpful to you.