MNeMoNiCuZ / joy-caption-batch

A batch captioning tool for joy_caption
MIT License
66 stars, 6 forks

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 412: invalid start byte #1

Closed: futureflix87 closed this issue 1 month ago

futureflix87 commented 1 month ago

During a training run, I found that the model had written a non-UTF-8 character into a caption: the byte 0x92, which is a kind of apostrophe (as in "'s"). I removed the invalid character and training went fine.

Traceback (most recent call last):
  File "x:\Users\user\OneTrainer\modules\ui\TrainUI.py", line 544, in training_thread_function
    trainer.train()
  File "x:\Users\user\OneTrainer\modules\trainer\GenericTrainer.py", line 546, in train
    for epoch_step, batch in enumerate(step_tqdm):
  File "x:\Users\user\OneTrainer\venv\lib\site-packages\tqdm\std.py", line 1181, in __iter__
    for obj in iterable:
  File "x:\Users\user\OneTrainer\venv\lib\site-packages\torch\utils\data\dataloader.py", line 631, in __next__
    data = self._next_data()
  File "x:\Users\user\OneTrainer\venv\lib\site-packages\torch\utils\data\dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "x:\Users\user\OneTrainer\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\LoadingPipeline.py", line 120, in __next__
    item = self.output_module.get_next_item()
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\OutputPipelineModule.py", line 40, in get_next_item
    item[output_name] = self._get_previous_item(self.current_variation, input_name, self.current_index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 100, in _get_previous_item
    item = module.get_item(index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\AspectBatchSorting.py", line 90, in get_item
    item[name] = self._get_previous_item(self.current_variation, name, index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\VariationSorting.py", line 130, in get_item
    value = self._get_previous_item(variation, requested_name, in_index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\Tokenize.py", line 36, in get_item
    text = self._get_previous_item(variation, self.in_name, index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\ShuffleTags.py", line 34, in get_item
    text = self._get_previous_item(variation, self.text_in_name, index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\SelectRandomText.py", line 25, in get_item
    texts = self._get_previous_item(variation, self.texts_in_name, index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\SelectInput.py", line 32, in get_item
    out = self._get_previous_item(variation, in_name, index)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\LoadMultipleTexts.py", line 32, in get_item
    texts = [line.strip() for line in f]
  File "x:\users\user\onetrainer\venv\src\mgds\src\mgds\pipelineModules\LoadMultipleTexts.py", line 32, in <listcomp>
    texts = [line.strip() for line in f]
  File "x:\Users\user\AppData\Local\Programs\Python\Python310\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 412: invalid start byte
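
For anyone hitting the same error, here is a rough sketch of how the affected caption files could be found and repaired in bulk instead of by hand. It assumes the captions are .txt files in a single folder and that the bad bytes come from Windows-1252 (cp1252), where 0x92 is the right single quotation mark. The function names find_non_utf8 and reencode_as_utf8 are made up for this example and are not part of joy-caption-batch or OneTrainer.

```python
from pathlib import Path


def find_non_utf8(folder, pattern="*.txt"):
    """Return (path, byte_offset) for every file that is not valid UTF-8."""
    bad = []
    for path in sorted(Path(folder).glob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte,
            # the same "position" reported in the traceback message.
            bad.append((path, err.start))
    return bad


def reencode_as_utf8(path, source_encoding="cp1252"):
    """Rewrite one file as UTF-8, ASSUMING it was saved as source_encoding."""
    path = Path(path)
    # Raises UnicodeDecodeError if a byte is undefined in source_encoding.
    text = path.read_bytes().decode(source_encoding)
    # Writing bytes avoids any newline translation on Windows.
    path.write_bytes(text.encode("utf-8"))
```

Typical use would be to call find_non_utf8 on the dataset folder first, inspect the reported files, and only then call reencode_as_utf8 on each one, since the cp1252 guess may not hold for every file.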

MNeMoNiCuZ commented 1 month ago

Thanks for the report! I've made the script enforce UTF-8 encoding when saving the text files, in https://github.com/MNeMoNiCuZ/joy-caption-batch/commit/11982107b03b699ca678e7d1ff37d155304045c9

So hopefully this is now fixed!
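
For readers who want the general idea without opening the commit, here is a minimal sketch of what enforcing UTF-8 on save can look like in Python. It is an illustration, not the code from the commit: the function name save_caption and the convention of writing the caption next to the image are assumptions made for this example.

```python
from pathlib import Path


def save_caption(image_path, caption):
    """Write the caption to a .txt file beside the image, always as UTF-8."""
    caption_path = Path(image_path).with_suffix(".txt")
    # Without encoding=..., open() falls back to the locale encoding
    # (often cp1252 on Windows), which is how a byte like 0x92 can end up
    # in a file that a UTF-8 reader later fails to decode.
    with open(caption_path, "w", encoding="utf-8") as f:
        f.write(caption)
    return caption_path
```

With the encoding set explicitly, a curly apostrophe in the caption is written as a multi-byte UTF-8 sequence, which a reader opening the file as UTF-8 can decode.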