RUCAIBox / TextBox

TextBox 2.0 is a text generation library with pre-trained language models
https://github.com/RUCAIBox/TextBox
MIT License
1.07k stars, 117 forks

[🐛BUG] UnicodeDecodeError #364

Closed lz99316 closed 11 months ago

lz99316 commented 11 months ago

Describe the bug
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 1105: illegal multibyte sequence

How to reproduce
C:\Users\dell>python ./TextBox/run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base

Log
06 Oct 20:08 INFO 66 parameters found.

General Hyper Parameters:

gpu_id: 0
use_gpu: True
device: cpu
seed: 2020
reproducibility: True
cmd: ./TextBox/run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base
filename: BART-samsum-2023-Oct-06_20-08-24
saved_dir: saved/
state: INFO
wandb: online

Training Hyper Parameters:

do_train: True
do_valid: True
optimizer: adamw
adafactor_kwargs: {'lr': 0.001, 'scale_parameter': False, 'relative_step': False, 'warmup_init': False}
optimizer_kwargs: {}
valid_steps: 1
valid_strategy: epoch
stopping_steps: 2
epochs: 50
learning_rate: 3e-05
train_batch_size: 4
grad_clip: 0.1
accumulation_steps: 48
disable_tqdm: False
resume_training: True

Evaluation Hyper Parameters:

do_test: True
lower_evaluation: True
multiref_strategy: max
bleu_max_ngrams: 4
bleu_type: nltk
smoothing_function: 0
corpus_bleu: False
rouge_max_ngrams: 2
rouge_type: files2rouge
meteor_type: pycocoevalcap
chrf_type: m-popovic
distinct_max_ngrams: 4
inter_distinct: True
unique_max_ngrams: 4
self_bleu_max_ngrams: 4
tgt_lang: en
metrics: ['rouge']
eval_batch_size: 16
corpus_meteor: True

Model Hyper Parameters:

model: BART
model_name: bart
model_path: facebook/bart-base
config_kwargs: {}
tokenizer_kwargs: {'use_fast': True}
generation_kwargs: {'num_beams': 5, 'no_repeat_ngram_size': 3, 'early_stopping': True}
efficient_kwargs: {}
efficient_methods: []
efficient_unfreeze_model: False
label_smoothing: 0.1

Dataset Hyper Parameters:

dataset: samsum
data_path: dataset/samsum
tgt_lang: en
src_len: 1024
tgt_len: 128
truncate: tail
metrics_for_best_model: ['rouge-1', 'rouge-2', 'rouge-l']
prefix_prompt: Summarize:

Unrecognized Hyper Parameters:

find_unused_parameters: False
load_type: from_pretrained
tokenizer_add_tokens: []

================================================================================
06 Oct 20:08 INFO Pretrain type: pretrain disabled
Traceback (most recent call last):
  File "C:\Users\dell\TextBox\run_textbox.py", line 12, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={'model_path': 'facebook/bart-base'})
  File "C:\Users\dell\TextBox\textbox\quick_start\quick_start.py", line 20, in run_textbox
    experiment = Experiment(model, dataset, config_file_list, config_dict)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dell\TextBox\textbox\quick_start\experiment.py", line 56, in __init__
    self._init_data(self.get_config(), self.accelerator)
  File "C:\Users\dell\TextBox\textbox\quick_start\experiment.py", line 82, in _init_data
    train_data, valid_data, test_data = data_preparation(config, tokenizer)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dell\TextBox\textbox\data\utils.py", line 23, in data_preparation
    train_dataset = AbstractDataset(config, 'train')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dell\TextBox\textbox\data\abstract_dataset.py", line 25, in __init__
    self.source_text = load_data(source_filename, max_length=self.quick_test)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\dell\TextBox\textbox\data\misc.py", line 25, in load_data
    for line in fin:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 1105: illegal multibyte sequence

StevenTang1998 commented 11 months ago

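This is caused by reading Chinese text on Windows. We strongly recommend running TextBox on Ubuntu; we have not tested it on Windows.

As a temporary workaround, you can modify the code here: https://github.com/RUCAIBox/TextBox/blob/2.0.0/textbox/data/misc.py#L22 (see https://blog.csdn.net/ProgramNovice/article/details/126712944 for details).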
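A minimal sketch of what such a workaround might look like, assuming the loader opens the data file without an explicit encoding (the traceback suggests this, but the thread does not show the actual code). The function name `load_data_utf8` and its body are hypothetical and are not TextBox's implementation; only the `max_length` parameter name is taken from the traceback.

```python
from typing import List


def load_data_utf8(path: str, max_length: int = 0) -> List[str]:
    """Read a text file line by line, always decoding it as UTF-8.

    Hypothetical stand-in for a loader such as load_data in
    textbox/data/misc.py; not the library's actual code.
    """
    lines = []
    # Passing encoding="utf-8" overrides the locale default
    # (for example gbk on a Chinese-locale Windows install).
    with open(path, "r", encoding="utf-8") as fin:
        for line in fin:
            lines.append(line.rstrip("\n"))
            # Assumption: max_length == 0 means "read everything".
            if max_length and len(lines) >= max_length:
                break
    return lines
```

An alternative that needs no code change is Python's UTF-8 mode (Python 3.7+): run the script with `python -X utf8 ...` or set the environment variable `PYTHONUTF8=1`, so that `open()` defaults to UTF-8 regardless of the Windows locale.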

lz99316 commented 11 months ago

If I want to use a dataset that I have already downloaded, which file do I need to modify so that the local dataset is used?

StevenTang1998 commented 11 months ago

Please refer to https://github.com/RUCAIBox/TextBox/blob/2.0.0/asset/dataset.md#new-dataset