csebuetnlp / banglabert

This repository contains the official release of the model "BanglaBERT" and the associated downstream fine-tuning code and datasets introduced in the paper "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla", accepted to Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022).

UnicodeEncodeError: 'charmap' codec can't encode characters in position 1210-1213: character maps to <undefined> #8

Closed: arbitropy closed this issue 9 months ago

arbitropy commented 10 months ago

I am trying to run the example evaluation given in this repo on my local machine, but I am getting the error below. What can I do? Here is the full output:

```
$ python ./question_answering/question_answering.py --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./question_answering/sample_inputs/" --output_dir "./question_answering/outputs/" --per_device_eval_batch_size=24 --overwrite_output_dir --do_predict
D:\Work\anaconda3\envs\banglabert\lib\site-packages\torch\cuda\__init__.py:104: UserWarning: NVIDIA GeForce RTX 3050 Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the NVIDIA GeForce RTX 3050 Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
10/02/2023 23:42:15 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
10/02/2023 23:42:15 - INFO - __main__ - Training/evaluation parameters TrainingArguments(_n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=True, do_train=False, eval_accumulation_steps=None, eval_steps=None, evaluation_strategy=IntervalStrategy.NO, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, greater_is_better=None, group_by_length=False, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir=./question_answering/outputs/runs\Oct02_23-42-14_WIN-849LR3KF8F4, logging_first_step=False, logging_steps=500, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=3.0, output_dir=./question_answering/outputs/, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=24, per_device_train_batch_size=8, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=outputs, push_to_hub_organization=None, push_to_hub_token=None, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None, run_name=./question_answering/outputs/, save_on_each_node=False, save_steps=500, save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0)
10/02/2023 23:42:15 - WARNING - datasets.builder - Using custom data configuration default-431d55e0f1c961a4
10/02/2023 23:42:15 - INFO - datasets.builder - Overwrite dataset info from restored data version.
10/02/2023 23:42:15 - INFO - datasets.info - Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0
10/02/2023 23:42:15 - WARNING - datasets.builder - Reusing dataset qa_dataset_builder (C:\Users\Administrator\.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0)
10/02/2023 23:42:15 - INFO - datasets.info - Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets\qa_dataset_builder\default-431d55e0f1c961a4\0.0.0
100%|██████████| 3/3 [00:00<00:00, 57.23it/s]
[INFO|configuration_utils.py:561] 2023-10-02 23:42:16,323 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:16,326 >> Model config ElectraConfig { "architectures": [ "ElectraForPreTraining" ], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "embedding_size": 768, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "electra", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "position_embedding_type": "absolute", "summary_activation": "gelu", "summary_last_dropout": 0.1, "summary_type": "first", "summary_use_proj": true, "transformers_version": "4.11.0.dev0", "type_vocab_size": 2, "vocab_size": 32000 }
[INFO|configuration_utils.py:561] 2023-10-02 23:42:17,541 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:17,544 >> Model config ElectraConfig { "architectures": [ "ElectraForPreTraining" ], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "embedding_size": 768, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "electra", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "position_embedding_type": "absolute", "summary_activation": "gelu", "summary_last_dropout": 0.1, "summary_type": "first", "summary_use_proj": true, "transformers_version": "4.11.0.dev0", "type_vocab_size": 2, "vocab_size": 32000 }
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,760 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/vocab.txt from cache at C:\Users\Administrator/.cache\huggingface\transformers\65e95b847336b6bf69b37fdb8682a97e822799adcd9745dcf9bf44cfe4db1b9a.8f92ca2cf7e2eaa550b10c40331ae9bf0f2e40abe3b549f66a3d7f13bfc6de47
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,760 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,760 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,761 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/special_tokens_map.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\7820dfc553e8dfb8a1e82042b7d0d691c7a7cd1e30ed2974218f696e81c5f3b1.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
[INFO|tokenization_utils_base.py:1739] 2023-10-02 23:42:23,761 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer_config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\76fa87a0ec9c34c9b15732bf7e06bced447feff46287b8e7d246a55d301784d7.b4f59cefeba4296760d2cf1037142788b96f2be40230bf6393d2fba714562485
[INFO|configuration_utils.py:561] 2023-10-02 23:42:25,458 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:25,460 >> Model config ElectraConfig { "architectures": [ "ElectraForPreTraining" ], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "embedding_size": 768, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "electra", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "position_embedding_type": "absolute", "summary_activation": "gelu", "summary_last_dropout": 0.1, "summary_type": "first", "summary_use_proj": true, "transformers_version": "4.11.0.dev0", "type_vocab_size": 2, "vocab_size": 32000 }
[INFO|configuration_utils.py:561] 2023-10-02 23:42:26,158 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at C:\Users\Administrator/.cache\huggingface\transformers\60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
[INFO|configuration_utils.py:598] 2023-10-02 23:42:26,161 >> Model config ElectraConfig { "architectures": [ "ElectraForPreTraining" ], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "embedding_size": 768, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "electra", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "position_embedding_type": "absolute", "summary_activation": "gelu", "summary_last_dropout": 0.1, "summary_type": "first", "summary_use_proj": true, "transformers_version": "4.11.0.dev0", "type_vocab_size": 2, "vocab_size": 32000 }
[INFO|file_utils.py:1665] 2023-10-02 23:42:26,817 >> https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to C:\Users\Administrator\.cache\huggingface\transformers\tmpe23b9o4n
Downloading: 100%|██████████| 443M/443M [05:26<00:00, 1.36MB/s]
[INFO|file_utils.py:1669] 2023-10-02 23:47:54,041 >> storing https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin in cache at C:\Users\Administrator/.cache\huggingface\transformers\913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[INFO|file_utils.py:1677] 2023-10-02 23:47:54,042 >> creating metadata file for C:\Users\Administrator/.cache\huggingface\transformers\913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[INFO|modeling_utils.py:1279] 2023-10-02 23:47:54,044 >> loading weights file https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin from cache at C:\Users\Administrator/.cache\huggingface\transformers\913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
[WARNING|modeling_utils.py:1516] 2023-10-02 23:47:54,822 >> Some weights of the model checkpoint at csebuetnlp/banglabert were not used when initializing ElectraForQuestionAnswering: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias']
```

abhik1505040 commented 10 months ago

Your Python installation seems to be using Windows-1252 encoding by default instead of UTF-8. Try invoking your Python scripts like `python -X utf8 <path_to_script> <arguments>` to see whether it fixes the issue.
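
For reference, here is a minimal sketch of two ways to force UTF-8 mode on Windows; the script path and arguments are copied from the command in the original report, and this is not part of the repo's documented usage:

```
:: Option 1: one-off run with UTF-8 mode enabled via the -X flag
python -X utf8 ./question_answering/question_answering.py --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./question_answering/sample_inputs/" --output_dir "./question_answering/outputs/" --per_device_eval_batch_size=24 --overwrite_output_dir --do_predict

:: Option 2: enable UTF-8 mode for the whole cmd session (PEP 540, Python 3.7+)
set PYTHONUTF8=1
python ./question_answering/question_answering.py --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./question_answering/sample_inputs/" --output_dir "./question_answering/outputs/" --per_device_eval_batch_size=24 --overwrite_output_dir --do_predict
```

Either form makes Python default to UTF-8 for file I/O, which avoids the `charmap` (Windows-1252) encode error when writing Bangla text to the output files.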

arbitropy commented 10 months ago

This fixed it, but now I get a new error:

```
10/03/2023 22:21:08 - INFO - __main__ - *** Predict ***
[INFO|trainer.py:521] 2023-10-03 22:21:08,957 >> The following columns in the test set don't have a corresponding argument in `ElectraForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping.
[INFO|trainer.py:2181] 2023-10-03 22:21:09,046 >> ***** Running Prediction *****
[INFO|trainer.py:2183] 2023-10-03 22:21:09,047 >> Num examples = 2614
[INFO|trainer.py:2186] 2023-10-03 22:21:09,047 >> Batch size = 24
100%|██| 109/109 [05:48<00:00, 3.17s/it]
10/03/2023 22:28:21 - INFO - utils - Post-processing 2504 example predictions split into 2614 features.
100%|█| 2504/2504 [00:05<00:00, 492.78it/s]
10/03/2023 22:28:26 - INFO - utils - Saving predictions to ./question_answering/outputs/predict_predictions.json.
10/03/2023 22:28:26 - INFO - utils - Saving nbest_preds to ./question_answering/outputs/predict_nbest_predictions.json.
Traceback (most recent call last):
  File "./question_answering/question_answering.py", line 617, in <module>
    main()
  File "./question_answering/question_answering.py", line 599, in main
    results = trainer.predict(predict_dataset, predict_examples)
  File "D:\Work\anaconda3\envs\banglabert\question_answering\utils.py", line 427, in predict
    metrics = self.compute_metrics(predictions)
  File "./question_answering/question_answering.py", line 550, in compute_metrics
    return metric.compute(predictions=p.predictions, references=p.label_ids)
  File "C:\Users\Administrator\.cache\huggingface\modules\datasets_modules\metrics\squad\513bf9facd7f12b0871a3d74c6999c866ce28196c9cdb151dcf934848655d77e\evaluate.py", line 67, in evaluate
    exact_match += metric_max_over_ground_truths(exact_match_score, prediction, ground_truths)
  File "C:\Users\Administrator\.cache\huggingface\modules\datasets_modules\metrics\squad\513bf9facd7f12b0871a3d74c6999c866ce28196c9cdb151dcf934848655d77e\evaluate.py", line 52, in metric_max_over_ground_truths
    return max(scores_for_ground_truths)
ValueError: max() arg is an empty sequence
100%|██| 109/109 [05:56<00:00, 3.27s/it]
```

Edit: the output JSON files were generated correctly in the output folder, so I can't work out what is actually missing.

abhik1505040 commented 9 months ago

If you are using the sample dataset provided with this repo, you need to add the flag `--allow_null_ans`.
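
For example, the prediction command from the original report would become the following; this is a sketch that also keeps the `-X utf8` workaround from the earlier comment and has not been verified here:

```
python -X utf8 ./question_answering/question_answering.py --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./question_answering/sample_inputs/" --output_dir "./question_answering/outputs/" --per_device_eval_batch_size=24 --overwrite_output_dir --do_predict --allow_null_ans
```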