aiforsec / CyNER

Cyber Security concepts extracted from unstructured threat intelligence reports using Named Entity Recognition
MIT License

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7701: character maps to <undefined> #12

Open Tf-arch opened 6 months ago

Tf-arch commented 6 months ago

I got this error when calling model.train():

2024-01-03 21:14:47 INFO initialize network
2024-01-03 21:14:47 INFO create new checkpoint
2024-01-03 21:14:47 INFO removed incomplete checkpoint .ckpt
2024-01-03 21:14:47 INFO checkpoint: .ckpt
2024-01-03 21:14:47 INFO - [arg] dataset: dataset/mitre
2024-01-03 21:14:47 INFO - [arg] transformers_model: xlm-roberta-base
2024-01-03 21:14:47 INFO - [arg] random_seed: 1
2024-01-03 21:14:47 INFO - [arg] lr: 5e-06
2024-01-03 21:14:47 INFO - [arg] epochs: 20
2024-01-03 21:14:47 INFO - [arg] warmup_step: 0
2024-01-03 21:14:47 INFO - [arg] weight_decay: 1e-07
2024-01-03 21:14:47 INFO - [arg] batch_size: 32
2024-01-03 21:14:47 INFO - [arg] max_seq_length: 128
2024-01-03 21:14:47 INFO - [arg] fp16: False
2024-01-03 21:14:47 INFO - [arg] max_grad_norm: 1
2024-01-03 21:14:47 INFO - [arg] lower_case: False
2024-01-03 21:14:47 INFO target dataset: ['dataset/mitre']
2024-01-03 21:14:47 INFO data_name: dataset/mitre
2024-01-03 21:14:47 INFO formatting custom dataset from dataset/mitre
2024-01-03 21:14:47 INFO found following files: {'test': 'test.txt', 'train': 'train.txt', 'valid': 'valid.txt'}
2024-01-03 21:14:47 INFO note that files should be named as either valid.txt, test.txt, or train.txt
Traceback (most recent call last):
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\c3.py", line 11, in <module>
    model.train()
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\transformers_ner.py", line 52, in train
    trainer.train(monitor_validation=True)
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\tner\model.py", line 292, in train
    self.__setup_model_data(self.args.dataset, self.args.lower_case)
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\tner\model.py", line 142, in __setup_model_data
    self.dataset_split, self.label_to_id, self.language, self.unseen_entity_set = get_dataset_ner(
                                                                                  ^^^^^^^^^^^^^^^^
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\tner\get_dataset.py", line 153, in get_dataset_ner
    data_split_all, label_to_id, language, ues = get_dataset_ner_single(d, **param)
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\tner\get_dataset.py", line 359, in get_dataset_ner_single
    data_split_all, unseen_entity_set, label_to_id = decode_all_files(
                                                     ^^^^^^^^^^^^^^^^^
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\tner\get_dataset.py", line 459, in decode_all_files
    label_to_id, unseen_entity_set, data_dict = decode_file(
                                                ^^^^^^^^^^^^
  File "C:\Users\talia\OneDrive\Desktop\New folder (3)\CyNER-main\CyNER-main\cyner\tner\get_dataset.py", line 397, in decode_file
    for n, line in enumerate(f):
  File "C:\Users\talia\anaconda3\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7701: character maps to <undefined>
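
The last frames of the traceback show the dataset file being decoded with cp1252, which has no character assigned to byte 0x9d. One plausible cause, not confirmed in this thread, is that the file is UTF-8 encoded and contains a character such as U+201D (right double quotation mark), whose UTF-8 bytes are E2 80 9D, while open() is called without an encoding and falls back to the Windows default. The sketch below reproduces that situation on a throwaway file. read_lines and the sample text are hypothetical stand-ins, not code from CyNER.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default.

    Hypothetical helper for illustration; it is not part of CyNER or tner.
    """
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Stand-in for a dataset file: U+201D is stored as bytes E2 80 9D in UTF-8.
text = "said \u201d O\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    sample_path = tmp.name

# Decoding as cp1252 (the Windows default seen in the traceback) hits byte 0x9d.
try:
    read_lines(sample_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print(err)

# Decoding the same bytes as UTF-8 succeeds.
utf8_lines = read_lines(sample_path)
os.remove(sample_path)
```

If that is what is happening here, passing encoding="utf-8" to the open() call that decode_file uses in cyner\tner\get_dataset.py (near line 397 per the traceback; the exact call is an assumption, since that source is not shown here) would likely avoid the error. Running Python in UTF-8 mode, with `python -X utf8` or the environment variable PYTHONUTF8=1, is another option that does not require editing the package.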