georgian-io / LLM-Finetuning-Toolkit

Toolkit for fine-tuning, ablating and unit-testing open-source LLMs.
Apache License 2.0
766 stars 91 forks

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f648' in position 0: character maps to <undefined> #174

Closed jeetendraabvv closed 3 months ago

jeetendraabvv commented 4 months ago

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f648' in position 0: character maps to <undefined>

During handling of the above exception, another exception occurred:

+--------------------- Traceback (most recent call last) ---------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\llmtune\cli\toolkit.py:122 in run
119 config = yaml.safe_load(file)
120 config = Config(**config)
121
> 122 run_one_experiment(config, config_path)
123
124
125 @generate_app.command("config")
+-------------------------------- locals ---------------------------------+
config = Config(
save_dir='./experiment/',
ablation=AblationConfig(
use_ablate=False,
study_name='ablation'
),
data=DataConfig(
file_type='json',
path='alpaca_data_cleaned.json',
prompt='Below is an instruction that describes a task. Write a response that appropriat'+89,
prompt_stub='{output}',
train_size=500,
test_size=25,
train_test_split_seed=42
),
model=ModelConfig(
hf_model_ckpt='facebook/opt-125m',
device_map='auto',
torch_dtype='bfloat16',
attn_implementation=None,
quantize=True,
bitsandbytes=BitsAndBytesConfig(
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_skip_modules=None,
llm_int8_enable_fp32_cpu_offload=False,
llm_int8_has_fp16_weight=False,
load_in_4bit=True,
bnb_4bit_compute_dtype='bfloat16',
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True
)
),
lora=LoraConfig(
r=32,
task_type='CAUSAL_LM',
lora_alpha=64,
bias='none',
lora_dropout=0.1,
target_modules='all-linear',
fan_in_fan_out=False,
modules_to_save=None,
layers_to_transform=None,
layers_pattern=None
),
training=TrainingConfig(
training_args=TrainingArgs(
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
optim='paged_adamw_32bit',
logging_steps=1,
learning_rate=0.0002,
bf16=True,
tf32=True,
fp16=False,
max_grad_norm=0.3,
warmup_ratio=0.03,
lr_scheduler_type='constant',
save_steps=500
),
sft_args=SftArgs(
max_seq_length=1024,
neftune_noise_alpha=None
)
),
inference=InferenceConfig(
max_length=None,
max_new_tokens=256,
min_length=0,
min_new_tokens=None,
early_stopping=False,
max_time=None,
do_sample=True,
num_beams=1,
num_beam_groups=1,
penalty_alpha=None,
use_cache=True,
temperature=0.8,
top_k=50,
top_p=0.9,
typical_p=1.0,
epsilon_cutoff=0.0,
eta_cutoff=0.0,
diversity_penalty=0.0,
repetition_penalty=1.0,
encoder_repetition_penalty=1.0,
length_penalty=1.0,
no_repeat_ngram_size=0,
bad_words_ids=None,
force_words_ids=None,
renormalize_logits=False
),
qa=QaConfig(
llm_tests=[
'jaccard_similarity',
'dot_product',
'rouge_score',
'word_overlap',
'verb_percent',
'adjective_percent',
'noun_percent',
'summary_length'
]
)
)
config_path = './config.yml'
configs = [
{
'save_dir': './experiment/',
'ablation': {'use_ablate': False},
'data': {
'file_type': 'json',
'path': 'alpaca_data_cleaned.json',
'prompt': 'Below is an instruction that describes a task. Write a response that appropriat'+89,
'prompt_stub': '{output}',
'test_size': 25,
'train_size': 500,
'train_test_split_seed': 42
},
'model': {
'hf_model_ckpt': 'facebook/opt-125m',
'torch_dtype': 'bfloat16',
'quantize': True,
'bitsandbytes': {
'load_in_4bit': True,
'bnb_4bit_compute_dtype': 'bfloat16',
'bnb_4bit_quant_type': 'nf4'
}
},
'lora': {
'task_type': 'CAUSAL_LM',
'r': 32,
'lora_alpha': 64,
'lora_dropout': 0.1,
'target_modules': 'all-linear'
},
'training': {
'training_args': {
'num_train_epochs': 1,
'per_device_train_batch_size': 4,
'gradient_accumulation_steps': 4,
'gradient_checkpointing': True,
'optim': 'paged_adamw_32bit',
'logging_steps': 1,
'learning_rate': 0.0002,
'bf16': True,
'tf32': True,
'max_grad_norm': 0.3,
... +2
},
'sft_args': {'max_seq_length': 1024}
},
'inference': {
'max_new_tokens': 256,
'use_cache': True,
'do_sample': True,
'top_p': 0.9,
'temperature': 0.8
},
'qa': {
'llm_tests': [
'jaccard_similarity',
'dot_product',
'rouge_score',
'word_overlap',
'verb_percent',
'adjective_percent',
'noun_percent',
'summary_length'
]
}
}
]
dir_helper = <llmtune.utils.save_utils.DirectoryHelper object at
0x000002B9A12FC550>
file = <_io.TextIOWrapper
name='experiment\eH6u\config\config.yml' mode='r'
encoding='cp1252'>
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\llmtune\cli\toolkit.py:46 in run_one_experiment
43 # Loading Data -------------------------------
44 RichUI.before_dataset_creation()
45
> 46 with RichUI.during_dataset_creation("Injecting Values into Prompt
47 dataset_generator = DatasetGenerator(**config.data.model_dump
48
49 _ = dataset_generator.train_columns
+-------------------------------- locals ---------------------------------+
config = Config(
save_dir='./experiment/',
ablation=AblationConfig(
use_ablate=False,
study_name='ablation'
),
data=DataConfig(
file_type='json',
path='alpaca_data_cleaned.json',
prompt='Below is an instruction that describes a task. Write a response that appropriat'+89,
prompt_stub='{output}',
train_size=500,
test_size=25,
train_test_split_seed=42
),
model=ModelConfig(
hf_model_ckpt='facebook/opt-125m',
device_map='auto',
torch_dtype='bfloat16',
attn_implementation=None,
quantize=True,
bitsandbytes=BitsAndBytesConfig(
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_skip_modules=None,
llm_int8_enable_fp32_cpu_offload=False,
llm_int8_has_fp16_weight=False,
load_in_4bit=True,
bnb_4bit_compute_dtype='bfloat16',
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True
)
),
lora=LoraConfig(
r=32,
task_type='CAUSAL_LM',
lora_alpha=64,
bias='none',
lora_dropout=0.1,
target_modules='all-linear',
fan_in_fan_out=False,
modules_to_save=None,
layers_to_transform=None,
layers_pattern=None
),
training=TrainingConfig(
training_args=TrainingArgs(
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
optim='paged_adamw_32bit',
logging_steps=1,
learning_rate=0.0002,
bf16=True,
tf32=True,
fp16=False,
max_grad_norm=0.3,
warmup_ratio=0.03,
lr_scheduler_type='constant',
save_steps=500
),
sft_args=SftArgs(
max_seq_length=1024,
neftune_noise_alpha=None
)
),
inference=InferenceConfig(
max_length=None,
max_new_tokens=256,
min_length=0,
min_new_tokens=None,
early_stopping=False,
max_time=None,
do_sample=True,
num_beams=1,
num_beam_groups=1,
penalty_alpha=None,
use_cache=True,
temperature=0.8,
top_k=50,
top_p=0.9,
typical_p=1.0,
epsilon_cutoff=0.0,
eta_cutoff=0.0,
diversity_penalty=0.0,
repetition_penalty=1.0,
encoder_repetition_penalty=1.0,
length_penalty=1.0,
no_repeat_ngram_size=0,
bad_words_ids=None,
force_words_ids=None,
renormalize_logits=False
),
qa=QaConfig(
llm_tests=[
'jaccard_similarity',
'dot_product',
'rouge_score',
'word_overlap',
'verb_percent',
'adjective_percent',
'noun_percent',
'summary_length'
]
)
)
config_path = './config.yml'
dir_helper = <llmtune.utils.save_utils.DirectoryHelper object at
0x000002B9A12FD210>
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\llmtune\ui\rich_ui.py:28 in __exit__
25 return self # This allows you to use variables from this con
26
27 def __exit__(self, exc_type, exc_val, exc_tb):
> 28 self.task.__exit__(exc_type, exc_val, exc_tb) # Cleanly exit
29
30
31 class LiveContext:
+-------------------------------- locals ---------------------------------+
exc_tb = <traceback object at 0x000002B9A108ED00>
exc_type = <class 'UnicodeEncodeError'>
exc_val = UnicodeEncodeError('charmap', '\U0001f648 ', 0, 1, 'character maps
to <undefined>')
self = <llmtune.ui.rich_ui.StatusContext object at
0x000002B9A12EFD90>
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\status.py:106 in __exit__
103 exc_val: Optional[BaseException],
104 exc_tb: Optional[TracebackType],
105 ) -> None:
> 106 self.stop()
107
108
109 if __name__ == "__main__": # pragma: no cover
+-------------------------------- locals ---------------------------------+
exc_tb = <traceback object at 0x000002B9A108ED00>
exc_type = <class 'UnicodeEncodeError'>
exc_val = UnicodeEncodeError('charmap', '\U0001f648 ', 0, 1, 'character maps
to <undefined>')
self = <rich.status.Status object at 0x000002B9FFA44A50>
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\status.py:91 in stop
88
89 def stop(self) -> None:
90 """Stop the spinner animation."""
> 91 self._live.stop()
92
93 def rich(self) -> RenderableType:
94 return self.renderable
+------------------------- locals -------------------------+
self = <rich.status.Status object at 0x000002B9FFA44A50>
+----------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\live.py:147 in stop
144 self._refresh_thread = None
145 # allow it to fully render on the last even if overflow
146 self.vertical_overflow = "visible"
> 147 with self.console:
148 try:
149 if not self._alt_screen and not self.console.is_j
150 self.refresh()
+----------------------- locals -----------------------+
self = <rich.live.Live object at 0x000002B9A12FC3D0>
+------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\console.py:865 in __exit__
862
863 def __exit__(self, exc_type: Any, exc_value: Any, traceback: Any
864 """Exit buffer context."""
> 865 self._exit_buffer()
866
867 def begin_capture(self) -> None:
868 """Begin capturing console output. Call :meth:end_capture
+---------------------- locals ----------------------+
exc_type = None
exc_value = None
self =
traceback = None
+----------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\console.py:823 in _exit_buffer
820 def _exit_buffer(self) -> None:
821 """Leave buffer context, and render content if required."""
822 self._buffer_index -= 1
> 823 self._check_buffer()
824
825 def set_live(self, live: "Live") -> None:
826 """Set Live instance. Used by Live context manager.
+------------------- locals --------------------+
self =
+-----------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\console.py:2027 in _check_buffer
2024 if self.no_color and self._color_system:
2025 buffer = list(Segment.remove_color(b
2026
> 2027 legacy_windows_render(buffer, LegacyWind
2028 else:
2029 # Either a non-std stream on legacy Wind
2030 text = self._render_buffer(self._buffer[
+-------------------------------- locals ---------------------------------+
buffer = [
Segment(
'Generating train split: ',
Style()
),
Segment(
'0',
Style(
color=Color(
'cyan',
ColorType.STANDARD,
number=6
),
bold=True,
italic=False
)
),
Segment(' examples ', Style()),
Segment('[', Style(bold=True)),
Segment(
'00:00',
Style(
color=Color(
'bright_green',
ColorType.STANDARD,
number=10
),
bold=True
)
),
Segment(', ? examples/s', Style()),
Segment(']', Style(bold=True)),
Segment('\n'),
Segment(
'\U0001f648 ',
Style(
color=Color(
'green',
ColorType.STANDARD,
number=2
)
)
),
Segment(
' Injecting Values into Prompt',
Style()
),
... +6
]
fileno = 1
legacy_windows_render = <function legacy_windows_render at
0x000002B9A12F2B60>
LegacyWindowsTerm = <class
'rich._win32_console.LegacyWindowsTerm'>
self =
use_legacy_windows_render = True
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\_windows_renderer.py:17 in legacy_windows_render
14 for text, style, control in buffer:
15 if not control:
16 if style:
> 17 term.write_styled(text, style)
18 else:
19 term.write_text(text)
20 else:
+-------------------------------- locals ---------------------------------+
buffer = [
Segment('Generating train split: ', Style()),
Segment(
'0',
Style(
color=Color(
'cyan',
ColorType.STANDARD,
number=6
),
bold=True,
italic=False
)
),
Segment(' examples ', Style()),
Segment('[', Style(bold=True)),
Segment(
'00:00',
Style(
color=Color(
'bright_green',
ColorType.STANDARD,
number=10
),
bold=True
)
),
Segment(', ? examples/s', Style()),
Segment(']', Style(bold=True)),
Segment('\n'),
Segment(
'\U0001f648 ',
Style(
color=Color(
'green',
ColorType.STANDARD,
number=2
)
)
),
Segment(' Injecting Values into Prompt', Style()),
... +6
]
control = None
style = Style(color=Color('green', ColorType.STANDARD, number=2))
term = <rich._win32_console.LegacyWindowsTerm object at
0x000002B9A108EC90>
text = '\U0001f648 '
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\_win32_console.py:442 in write_styled
439 SetConsoleTextAttribute(
440 self._handle, attributes=ctypes.c_ushort(fore | (back <<
441 )
> 442 self.write_text(text)
443 SetConsoleTextAttribute(self._handle, attributes=self._defaul
444
445 def move_cursor_to(self, new_position: WindowsCoordinates) -> Non
+-------------------------------- locals ---------------------------------+
back = 0
bgcolor = None
color = Color('green', ColorType.STANDARD, number=2)
fore = 2
self = <rich._win32_console.LegacyWindowsTerm object at
0x000002B9A108EC90>
style = Style(color=Color('green', ColorType.STANDARD, number=2))
text = '\U0001f648 '
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\rich\_win32_console.py:403 in write_text
400 Args:
401 text (str): The text to write to the console
402 """
> 403 self.write(text)
404 self.flush()
405
406 def write_styled(self, text: str, style: Style) -> None:
+-------------------------------- locals ---------------------------------+
self = <rich._win32_console.LegacyWindowsTerm object at
0x000002B9A108EC90>
text = '\U0001f648 '
+-------------------------------------------------------------------------+
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\encodings\cp1252.py:19 in encode
16
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
> 19 return codecs.charmap_encode(input,self.errors,encoding_table
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
+-------------------------------- locals ---------------------------------+
final = False
input = '\U0001f648 '
self = <encodings.cp1252.IncrementalEncoder object at
0x000002B9A5964D50>
+-------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f648' in position 0: character maps to <undefined>

Exception ignored in: <function tqdm.__del__ at 0x000002B9DEC44A40>
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\tqdm\std.py", line 1148, in __del__
    self.close()
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\tqdm\std.py", line 1277, in close
    if self.last_print_t < self.start_t + self.delay:
       ^^^^^^^^^^^^^^^^^
AttributeError: 'tqdm' object has no attribute 'last_print_t'

!set PYTHONIOENCODINGS='utf-8'

Click to add a cell.
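(For context: the crash reduces to the legacy Windows console stream being cp1252-encoded, which has no mapping for the '\U0001f648' emoji Rich's spinner prints. A minimal sketch of the failure and the usual standard-library workaround; the reconfigure call and environment variable are stock Python, not part of the toolkit:)

import sys

# The failure in a nutshell: cp1252 cannot represent the emoji Rich prints.
try:
    "\U0001f648".encode("cp1252")
except UnicodeEncodeError as exc:
    print(exc)  # 'charmap' codec can't encode character '\U0001f648' ...

# Workaround (Python 3.7+): force UTF-8 on stdout before Rich writes anything.
sys.stdout.reconfigure(encoding="utf-8")
# Or set it from the shell before launching (note the spelling:
# PYTHONIOENCODING, not PYTHONIOENCODINGS as typed above):
#   set PYTHONIOENCODING=utf-8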

jeetendraabvv commented 4 months ago

Getting the above problem during execution of the default config.yml. I am a beginner; please help me resolve this issue.

benjaminye commented 4 months ago

Hi @jeetendraabvv. I'm seeing "Click to add a cell." in your output. Are you running this inside a notebook? tqdm might not work as intended inside an interactive session such as Jupyter.

If that's the case, I would encourage you to run the CLI directly from the command line; hopefully that solves the problem. If not, let me know!
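(One quick way to confirm which kind of stream you are writing to is the diagnostic sketch below; it uses only the standard library. Run it both in the notebook and in a plain console to compare:)

import sys

# Inside a notebook, stdout is usually not a real terminal; on a legacy
# Windows console it may report a code page such as cp1252 instead of utf-8.
print(sys.stdout.encoding, sys.stdout.isatty())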

jeetendraabvv commented 4 months ago

Thank you @benjaminye. After following your suggestion it worked, but it generated a new error:

RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback):

    CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

I tried the following commands to resolve the issue but did not get success:

PS C:\Users\Administrator> pip install bitsandbytes
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/
Requirement already satisfied: bitsandbytes in c:\programdata\anaconda3\envs\llm_tkit\lib\site-packages (0.42.0)
Requirement already satisfied: scipy in c:\programdata\anaconda3\envs\llm_tkit\lib\site-packages (from bitsandbytes) (1.13.0)
Requirement already satisfied: numpy<2.3,>=1.22.4 in c:\programdata\anaconda3\envs\llm_tkit\lib\site-packages (from scipy->bitsandbytes) (1.26.4)

PS C:\Users\Administrator> python -m bitsandbytes
False

===================================BUG REPORT===================================
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\cuda_setup\main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

  warn(msg)
The following directories listed in your path were found to be non-existent: {WindowsPath('C')}
C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\cuda_setup\main.py:167: UserWarning: C:\ProgramData\anaconda3\envs\llm_tkit did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
The following directories listed in your path were found to be non-existent: {WindowsPath('http'), WindowsPath('/localhost'), WindowsPath('8888')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
DEBUG: Possible options found for libcudart.so: set()
CUDA SETUP: PyTorch settings found: CUDA_VERSION=121, Highest Compute Capability: 8.6.
CUDA SETUP: To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Loading binary C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.so...
argument of type 'WindowsPath' is not iterable
CUDA SETUP: Problem: The main issue seems to be that the main CUDA runtime library was not detected.
CUDA SETUP: Solution 1: To solve the issue the libcudart.so location needs to be added to the LD_LIBRARY_PATH variable
CUDA SETUP: Solution 1a): Find the cuda runtime library via: find / -name libcudart.so 2>/dev/null
CUDA SETUP: Solution 1b): Once the library is found add it to the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:FOUND_PATH_FROM_1a
CUDA SETUP: Solution 1c): For a permanent solution add the export from 1b into your .bashrc file, located at ~/.bashrc
CUDA SETUP: Solution 2: If no library was found in step 1a) you need to install CUDA.
CUDA SETUP: Solution 2a): Download CUDA install script: wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/cuda_install.sh
CUDA SETUP: Solution 2b): Install desired CUDA version to desired location. The syntax is bash cuda_install.sh CUDA_VERSION PATH_TO_INSTALL_INTO.
CUDA SETUP: Solution 2b): For example, "bash cuda_install.sh 113 ~/local/" will download CUDA 11.3 and install into the folder ~/local
Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\research\__init__.py", line 1, in <module>
    from . import nn
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\research\nn\__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\research\nn\modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\optim\__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "C:\ProgramData\anaconda3\envs\llm_tkit\Lib\site-packages\bitsandbytes\cextension.py", line 20, in <module>
    raise RuntimeError('''
RuntimeError: CUDA Setup failed despite GPU being available. Please run the following command to get more information:
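(A quick way to see what PyTorch itself detects, independent of bitsandbytes, is the minimal diagnostic sketch below; all calls are stock torch APIs:)

import torch

# If is_available() prints False, bitsandbytes' CUDA setup will fail even
# though a GPU is physically present in the machine.
print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # can PyTorch actually reach the GPU?
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))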

benjaminye commented 4 months ago

Can you send us the output after running transformers-cli env?

Do you know if your machine is set up to run CUDA (a good way to test is by running nvidia-smi)? If not, please install it by following the instructions here: https://developer.nvidia.com/cuda-12-1-0-download-archive.

In any case, I recommend using Linux to run this; on Windows you can use WSL.

jeetendraabvv commented 4 months ago

PS C:\Users\Administrator\Pictures\lll_tk> transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

jeetendraabvv commented 4 months ago

Can we use bitsandbytes >= 0.43.0 with llm_toolkit?

benjaminye commented 3 months ago

I think this might have to do with CUDA not being found on your system. Can you run nvidia-smi and take a screenshot?

benjaminye commented 3 months ago

Closed due to inactivity. Please feel free to reopen this issue if assistance is still required.