Support Chinese characters in prompt generation stage

Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

Other

11.69k stars 1.08k forks

System Info

pandasai == 2.0.43
python == 3.11

🐛 Describe the bug

I was trying to use the Field Descriptions feature to help LLMs understand my dataset. My approach is to write a data description function that builds a dictionary of information about the dataset, then pass it to pandasai through Field Descriptions like this:

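from pandasai.connectors import PandasConnector

# preview_data is my own helper; it returns a {column name: description} dict
data = preview_data(df)
# define a connector
connector = PandasConnector({"original_df": df}, name='My Connector', field_descriptions=data)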

Part of my data looks like this:

{'时间': 'The 时间 column contains string values. The unique values are: 2023-6-14, 2022-4-22, 2022-11-5.'}
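For readers who want to reproduce this, here is a minimal sketch of what a helper like `preview_data` might look like. The original function is not shown in this issue, so the body, the `max_values` parameter, and the sample DataFrame are all assumptions made for illustration; only the shape of the returned dictionary follows the example above.

```python
import pandas as pd


def preview_data(df: pd.DataFrame, max_values: int = 3) -> dict[str, str]:
    """Hypothetical helper: build one description string per column."""
    descriptions = {}
    for column in df.columns:
        series = df[column]
        # Label string columns as "string"; fall back to the dtype name otherwise.
        kind = "string" if pd.api.types.is_string_dtype(series) else str(series.dtype)
        # Keep only the first few distinct values so the description stays short.
        unique_values = [str(v) for v in series.dropna().unique()[:max_values]]
        descriptions[column] = (
            f"The {column} column contains {kind} values. "
            f"The unique values are: {', '.join(unique_values)}."
        )
    return descriptions


# Illustrative data only; the real dataset has 28 rows and 18 columns.
df = pd.DataFrame({"时间": ["2023-6-14", "2022-4-22", "2022-11-5", "2022-4-22"]})
data = preview_data(df)
print(data["时间"])
```

The dictionary keys are ordinary Python `str` objects here, so the column name is intact at this stage.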

As you can see, there are some Chinese characters, but in the prompt_generation stage the Chinese characters are not decoded, so the prompt looks like this:

dfs[0]:
  name: My Connector
  description: null
  type: pd.DataFrame
  rows: 28
  columns: 18
  schema:
    fields:
    - name: "\u65F6\u95F4"
      type: object
      samples:
      - 2022-4-22
      - 2022-11-5
      - 2023-6-14

The escaped name "\u65F6\u95F4" makes the LLM much more confused.
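The double-quoted `"\u65F6\u95F4"` form matches what PyYAML emits by default for non-ASCII strings. The sketch below assumes the schema block is serialized with PyYAML's `yaml.dump`, which this issue does not confirm; the `schema` dictionary is a made-up stand-in for whatever pandasai builds internally. It shows that passing `allow_unicode=True` keeps the characters readable.

```python
import yaml  # PyYAML, a third-party package

# Stand-in for the schema section of the prompt (assumed structure).
schema = {"fields": [{"name": "时间", "type": "object"}]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = yaml.dump(schema, sort_keys=False)

# With allow_unicode=True the characters are written as-is.
readable = yaml.dump(schema, sort_keys=False, allow_unicode=True)

print(escaped)
print(readable)

# Both forms load back to the same data, so nothing is lost, only obscured.
print(yaml.safe_load(escaped) == yaml.safe_load(readable))
```

If the serializer is indeed PyYAML, the change would belong wherever pandasai calls `yaml.dump` for the prompt, not in user code.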

Is there any way to solve this problem? Any suggestions would be appreciated!

To address the issue with Chinese characters not being properly decoded during the prompt generation stage in pandasai, it's crucial to understand the encoding of your original data source and ensure it's UTF-8 or compatible. Additionally, knowing the Python version you're using and whether this issue occurs with other non-ASCII characters can help pinpoint the problem. A minimal, reproducible example of the code causing this issue would also be invaluable for finding a solution. This information will enable a more targeted approach to solving the decoding issue with Chinese characters.
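One quick way to tell an encoding fault from plain escaping is to decode the escaped name and compare it with the original column name. The following is a small sketch using only the standard library; the variable names are illustrative, and `json.loads` is used only because a JSON string literal understands the same `\uXXXX` escapes.

```python
import json

# The field name exactly as it appears in the generated prompt.
escaped_name = r"\u65F6\u95F4"

# Wrap it in quotes to form a JSON string literal, then decode the escapes.
decoded = json.loads(f'"{escaped_name}"')
print(decoded)

# If this is True, the name survived intact and was only escaped on output.
print(decoded == "时间")

# The name also round-trips through UTF-8 without error.
print(decoded.encode("utf-8").decode("utf-8") == decoded)
```

If the decoded value equals the original column name, the source data is not corrupted and the problem lies in how the prompt is serialized.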

**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Support Chinese characters in prompt generation stage #1168
