argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[BUG] UnicodeEncodeError when Running Quickstart on Windows #777

Open cyberjj999 opened 1 month ago

cyberjj999 commented 1 month ago

Describe the bug

I followed the installation instructions in the latest documentation (https://distilabel.argilla.io/latest/sections/getting_started/installation/) and ran the code from the quickstart section, but hit an encoding error.

My code and error output are listed below.

To Reproduce

I installed distilabel and set up a .env file, loaded with python-dotenv, to provide my OpenAI key. Then I ran the code from the quickstart section.

Code to reproduce

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )

    load_dataset >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    # distiset.push_to_hub(repo_id="distilabel-example")

Expected behaviour

Get a functional, working output.

Actual behaviour

Got this error output:

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\JJ\AppData\Local\Programs\Python\Python311\Lib\logging\__init__.py", line 1113, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\JJ\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4be' in position 31: character maps to <undefined>
Call stack:
  File "C:\Users\JJ\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 995, in _bootstrap
    self._bootstrap_inner()
  File "C:\Users\JJ\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "c:\Users\JJ\OneDrive - SMU\Desktop\Temp Workspace\experiment-lab\Distilabel-Experiment\venv\Lib\site-packages\ipykernel\ipkernel.py", line 766, in run_closure
    _threading_Thread_run(self)
  File "C:\Users\JJ\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
Message: "💾 Loading `_BatchManager` from cache: 'C:\\Users\\JJ\\.cache\\distilabel\\pipelines\\simple-text-generation-pipeline\\0e5461be2a14da48b8c1f6d7b018b4199649b7e7\\batch_manager.json'"
Arguments: None
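
For reference, the encoding failure in the traceback can be reproduced in isolation, without distilabel. This is a minimal sketch that assumes the console stream is using the cp1252 code page, as the `encodings\cp1252.py` frame suggests. The helper name `can_encode` is hypothetical and only used for illustration.

```python
# Minimal sketch: reproduce the encoding failure from the traceback without distilabel.
# `can_encode` is a hypothetical helper, not part of distilabel or the logging module.

FLOPPY_DISK = "\U0001f4be"  # the emoji at the start of the logged message


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding`, False otherwise."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


message = FLOPPY_DISK + " Loading `_BatchManager` from cache"

# cp1252 has no mapping for the emoji, so encoding fails.
print("cp1252:", can_encode(message, "cp1252"))
# UTF-8 can represent any Unicode code point, so encoding succeeds.
print("utf-8:", can_encode(message, "utf-8"))
```

If this is indeed the cause, the rest of the message is irrelevant: any log line containing a character outside cp1252 would trigger the same "Logging error" output.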

Any support is greatly appreciated! Thank you!

gabrielmbmb commented 1 month ago

Hi @cyberjj999, thanks for reporting the issue! We will investigate it.
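
As a possible stopgap while this is investigated, and assuming the error comes from a console stream encoded as cp1252 (as the traceback suggests), a logging handler can write to a stream whose error policy escapes unencodable characters instead of raising. The sketch below uses only the standard library. The in-memory buffer stands in for the real console, the logger name `encoding-demo` is hypothetical, and none of this has been tested against distilabel's own logging setup.

```python
import io
import logging

# Stand-in for a Windows console: a cp1252 text stream over an in-memory buffer.
# errors="backslashreplace" escapes characters cp1252 cannot represent
# instead of raising UnicodeEncodeError.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="cp1252", errors="backslashreplace")

# Hypothetical logger used only for this demonstration.
logger = logging.getLogger("encoding-demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo isolated from the root logger

handler = logging.StreamHandler(stream)
logger.addHandler(handler)

# The same kind of message that triggered the error in the report.
logger.info("\U0001f4be Loading `_BatchManager` from cache")

handler.flush()
stream.flush()
output = buffer.getvalue().decode("cp1252")
print(output)
```

The emoji is written as an escape sequence rather than the glyph, so this trades readability for robustness. Running Python in UTF-8 mode (for example by setting the `PYTHONUTF8=1` environment variable before starting the interpreter) is another option that may be worth trying.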