aiforsec / CyNER

Cyber Security concepts extracted from unstructured threat intelligence reports using Named Entity Recognition
MIT License

Exception: EOF while parsing a string at line 1 column 8862550 #4

Open MrAsimZahid opened 2 years ago

MrAsimZahid commented 2 years ago

Following the CyNER Demo.ipynb, I tried to train the model, but I get this error.

# Training code (requires the cyner package on the path)
import cyner

cfg = {'checkpoint_dir': '.ckpt',
        'dataset': 'dataset/mitre',
        'transformers_model': 'xlm-roberta-large',
        'lr': 5e-6,
        'epochs': 20,
        'max_seq_length': 128}
model = cyner.TransformersNER(cfg)
model.train()

Output

2022-06-08 12:52:49 INFO     *** initialize network ***
2022-06-08 12:52:50 INFO     create new checkpoint
2022-06-08 12:52:50 INFO     checkpoint: .ckpt
2022-06-08 12:52:50 INFO      - [arg] dataset: dataset/mitre
2022-06-08 12:52:50 INFO      - [arg] transformers_model: xlm-roberta-base
2022-06-08 12:52:50 INFO      - [arg] random_seed: 1
2022-06-08 12:52:50 INFO      - [arg] lr: 5e-06
2022-06-08 12:52:50 INFO      - [arg] epochs: 20
2022-06-08 12:52:50 INFO      - [arg] warmup_step: 0
2022-06-08 12:52:50 INFO      - [arg] weight_decay: 1e-07
2022-06-08 12:52:50 INFO      - [arg] batch_size: 32
2022-06-08 12:52:50 INFO      - [arg] max_seq_length: 128
2022-06-08 12:52:50 INFO      - [arg] fp16: False
2022-06-08 12:52:50 INFO      - [arg] max_grad_norm: 1
2022-06-08 12:52:50 INFO      - [arg] lower_case: False
2022-06-08 12:52:50 INFO     target dataset: ['dataset/mitre']
2022-06-08 12:52:50 INFO     data_name: dataset/mitre
2022-06-08 12:52:50 INFO     formatting custom dataset from dataset/mitre
2022-06-08 12:52:50 INFO     found following files: {'test': 'test.txt', 'train': 'train.txt', 'valid': 'valid.txt'}
2022-06-08 12:52:50 INFO     note that files should be named as either `valid.txt`, `test.txt`, or `train.txt` 
2022-06-08 12:52:50 INFO     dataset dataset/mitre/test.txt: 747 entries
2022-06-08 12:52:50 INFO     dataset dataset/mitre/train.txt: 2810 entries
2022-06-08 12:52:50 INFO     dataset dataset/mitre/valid.txt: 812 entries
Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Input In [2], in <cell line: 8>()
      1 cfg = {'checkpoint_dir': '.ckpt',
      2         'dataset': 'dataset/mitre',
      3         'transformers_model': 'xlm-roberta-large',
      4         'lr': 5e-6,
      5         'epochs': 20,
      6         'max_seq_length': 128}
      7 model = cyner.TransformersNER(cfg)
----> 8 model.train()

File ~/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/CyNER/cyner/transformers_ner.py:52, in TransformersNER.train(self)
     34 cache_dir = config.get('cache_dir', None)
     36 trainer = TrainTransformersNER(checkpoint_dir=checkpoint_dir,
     37                                 dataset=dataset,
     38                                 transformers_model=transformers_model,
   (...)
     49                                 num_worker=num_worker,
     50                                 cache_dir=cache_dir)
---> 52 trainer.train(monitor_validation=True)

File ~/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/CyNER/cyner/tner/model.py:292, in TrainTransformersNER.train(self, monitor_validation, batch_size_validation, max_seq_length_validation)
    290 if self.args.is_trained:
    291     logging.warning('finetuning model, that has been already finetuned')
--> 292 self.__setup_model_data(self.args.dataset, self.args.lower_case)
    293 writer = SummaryWriter(log_dir=self.args.checkpoint_dir)
    295 data_loader = {'train': self.__setup_loader('train', self.args.batch_size, self.args.max_seq_length)}

File ~/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/CyNER/cyner/tner/model.py:155, in TrainTransformersNER.__setup_model_data(self, dataset, lower_case)
    145     config = transformers.AutoConfig.from_pretrained(
    146         self.args.transformers_model,
    147         num_labels=len(self.label_to_id),
    148         id2label=self.id_to_label,
    149         label2id=self.label_to_id,
    150         cache_dir=self.cache_dir)
    152     self.model = transformers.AutoModelForTokenClassification.from_pretrained(
    153         self.args.transformers_model, config=config)
--> 155     self.transforms = Transforms(self.args.transformers_model, cache_dir=self.cache_dir)
    157 # optimizer
    158 no_decay = ["bias", "LayerNorm.weight"]

File ~/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/CyNER/cyner/tner/tokenizer.py:38, in Transforms.__init__(self, transformer_tokenizer, cache_dir)
     36 def __init__(self, transformer_tokenizer: str, cache_dir: str = None):
     37     """ NER specific transform pipeline """
---> 38     self.tokenizer = transformers.AutoTokenizer.from_pretrained(transformer_tokenizer, cache_dir=cache_dir)
     39     self.pad_ids = {"labels": PAD_TOKEN_LABEL_ID, "input_ids": self.tokenizer.pad_token_id, "__default__": 0}
     40     self.prefix = self.__sp_token_prefix()

File ~/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:546, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    544 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    545 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 546     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    547 else:
    548     if tokenizer_class_py is not None:

File ~/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1780, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1777     else:
   1778         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1780 return cls._from_pretrained(
   1781     resolved_vocab_files,
   1782     pretrained_model_name_or_path,
   1783     init_configuration,
   1784     *init_inputs,
   1785     use_auth_token=use_auth_token,
   1786     cache_dir=cache_dir,
   1787     **kwargs,
   1788 )

File ~/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1915, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1913 # Instantiate tokenizer.
   1914 try:
-> 1915     tokenizer = cls(*init_inputs, **init_kwargs)
   1916 except OSError:
   1917     raise OSError(
   1918         "Unable to load vocabulary from file. "
   1919         "Please check that the provided vocabulary is accessible and not corrupted."
   1920     )

File ~/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py:139, in XLMRobertaTokenizerFast.__init__(self, vocab_file, tokenizer_file, bos_token, eos_token, sep_token, cls_token, unk_token, pad_token, mask_token, **kwargs)
    123 def __init__(
    124     self,
    125     vocab_file=None,
   (...)
    135 ):
    136     # Mask token behave like a normal word, i.e. include the space before it
    137     mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
--> 139     super().__init__(
    140         vocab_file,
    141         tokenizer_file=tokenizer_file,
    142         bos_token=bos_token,
    143         eos_token=eos_token,
    144         sep_token=sep_token,
    145         cls_token=cls_token,
    146         unk_token=unk_token,
    147         pad_token=pad_token,
    148         mask_token=mask_token,
    149         **kwargs,
    150     )
    152     self.vocab_file = vocab_file
    153     self.can_save_slow_tokenizer = False if not self.vocab_file else True

File ~/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py:109, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    106     fast_tokenizer = tokenizer_object
    107 elif fast_tokenizer_file is not None and not from_slow:
    108     # We have a serialization from tokenizers which let us directly build the backend
--> 109     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    110 elif slow_tokenizer is not None:
    111     # We need to convert a slow tokenizer to build the backend
    112     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: EOF while parsing a string at line 1 column 8862550

I'm unable to pinpoint where the problem is occurring. Could you help me with this? Thank you.
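
For what it's worth, the failure happens before training starts, while the tokenizer is loaded, so it can be reproduced without CyNER at all (a minimal sketch, assuming the same conda environment):

# The same call that fails inside cyner/tner/tokenizer.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')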

tilusnet commented 2 years ago

Try with transformers==4.15.0; that's what I've got.
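
For reference, assuming a pip-managed environment, that pin is just:

pip install transformers==4.15.0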

MrAsimZahid commented 2 years ago

It worked perfectly. Thank you so much!

MrAsimZahid commented 2 years ago

I'm facing the same issue again, but in a different section of the code. Could you please help with a resolution? @tilusnet

Traceback (most recent call last):
  File "/home/yasir/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/v2/CyNER/run.py", line 10, in <module>
    model.train()
  File "/home/yasir/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/v2/CyNER/cyner/transformers_ner.py", line 52, in train
    trainer.train(monitor_validation=True)
  File "/home/yasir/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/v2/CyNER/cyner/tner/model.py", line 292, in train
    self.__setup_model_data(self.args.dataset, self.args.lower_case)
  File "/home/yasir/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/v2/CyNER/cyner/tner/model.py", line 155, in __setup_model_data
    self.transforms = Transforms(self.args.transformers_model, cache_dir=self.cache_dir)
  File "/home/yasir/Documents/Projects/Blog_Data_Extraction/prototype/test_CyNER/v2/CyNER/cyner/tner/tokenizer.py", line 38, in __init__
    self.tokenizer = transformers.AutoTokenizer.from_pretrained(transformer_tokenizer, cache_dir=cache_dir)
  File "/home/yasir/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 550, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/yasir/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1747, in from_pretrained
    return cls._from_pretrained(
  File "/home/yasir/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1882, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/yasir/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 139, in __init__
    super().__init__(
  File "/home/yasir/anaconda3/envs/blogsIntel/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 108, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: EOF while parsing a string at line 1 column 8862550
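
A possible lead, in case the version pin alone is not enough: the exception is raised while the fast tokenizer backend parses tokenizer.json, and an EOF in the middle of a string usually points at a truncated file in the local Hugging Face cache rather than at the code itself. A minimal sketch of a workaround, assuming a corrupted cache entry is the culprit (force_download is a standard from_pretrained argument that bypasses the cached copy):

from transformers import AutoTokenizer

# Re-download the tokenizer files instead of reusing the (possibly truncated) cached copy
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', force_download=True)

If that loads cleanly, deleting the stale entry under ~/.cache/huggingface/ and re-running the training script should be enough.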