aiforsec / CyNER

Cyber Security concepts extracted from unstructured threat intelligence reports using Named Entity Recognition
MIT License

model.get_entities predict output is different from the demo #2

Open JeJe-LIAO opened 2 years ago

JeJe-LIAO commented 2 years ago

Hi, I'm having trouble with model.get_entities. My code is the same as the demo's (screenshots were attached in the original issue), and I'm not sure what's going on. :( Running the same code sometimes produces different output, and both outputs differ from the demo's.

tilusnet commented 2 years ago

I confirm this problem. I even retrained a model and still get the same issue.

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
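
This warning is the key clue: loading the base xlm-roberta-base checkpoint into a token-classification model attaches a freshly, randomly initialized classifier layer, so predictions will vary from run to run until the model is fine-tuned. A minimal sketch of how the warning arises (the num_labels value is an illustrative placeholder, not taken from CyNER):

from transformers import AutoModelForTokenClassification

# Loading a base (masked-LM) checkpoint into a token-classification model
# discards the LM head and adds a new classifier layer with random weights,
# which triggers exactly the warning quoted above.
model = AutoModelForTokenClassification.from_pretrained(
    'xlm-roberta-base', num_labels=9)  # num_labels is a placeholder
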
tilusnet commented 2 years ago

@xashru could you give us a clue as to what could have changed since you wrote the code?

Is it something about the underlying Roberta model?

tilusnet commented 2 years ago

@zxcy if you are still looking at this -

I managed to get the code working after understanding it better.

The possible causes for the error:

  1. The codebase has changed since the demo was written
  2. The demo's examples seem to hint that they use the upstream pretrained models, but their predictions show labels from the fine-tuned CyNER model

The second point makes me believe that the transformer_model argument in cyner.CyNER(transformer_model='xlm-roberta-large', ...) refers to a locally fine-tuned CyNER model, not a Hugging Face base model.

My solution therefore was:

  1. Train (finetune) the transformer as per the demo
  2. Reload this model and run the predictions on it

The trick is that transformer_model can also be a path to a checkpoint directory; in my case, .ckpt.
To summarise:

import cyner

cfg = {'checkpoint_dir': '.ckpt',
       'dataset': 'dataset/mitre',
       'transformers_model': 'xlm-roberta-large',
       'lr': 5e-6,
       'epochs': 20,
       'max_seq_length': 128}
model = cyner.TransformersNER(cfg)
model.train()

# NB: the trained `model` object cannot be used for prediction directly;
# reload the fine-tuned checkpoint through the CyNER wrapper instead:
model4 = cyner.CyNER(transformer_model='.ckpt', use_heuristic=False, flair_model=None)

text = ('Proofpoint report mentions that the German-language messages were '
        'turned off once the UK messages were established, indicating a '
        'conscious effort to spread FluBot '
        '446833e3f8b04d4c3c2d2288e456328266524e396adbfeba3769d00727481e80 '
        'in Android phones.')
entities = model4.get_entities(text)

for e in entities:
    print(e)

Output:

2022-06-07 11:19:30 INFO     *** initialize network ***
[(0, 246)]
(0, 246)
Proofpoint report mentions that the German-language messages were turned off once the UK messages were established, indicating a conscious effort to spread FluBot 446833e3f8b04d4c3c2d2288e456328266524e396adbfeba3769d00727481e80 in Android phones.
Mention: Proofpoint, Class: Organization, Start: 0, End: 10, Confidence: 0.78
Mention: FluBot, Class: Malware, Start: 156, End: 162, Confidence: 0.88
Mention: 446833e3f8b04d4c3c2d2288e456328266524e396adbfeba3769d00727481e80, Class: Indicator, Start: 163, End: 227, Confidence: 0.95
Mention: Android, Class: System, Start: 231, End: 238, Confidence: 0.98
MrAsimZahid commented 2 years ago

@tilusnet while training I faced this issue. Could you help with this?

https://github.com/aiforsec/CyNER/issues/4

Thank you.

vivianamarquez commented 1 year ago

@tilusnet Thanks for that explanation, I was able to get the same output as the demo using your steps. However, I have noticed that the cfg dictionary ignores the transformers_model parameter and always uses xlm-roberta-base instead. Any idea why it does that? Thank you!

tilusnet commented 1 year ago

@vivianamarquez by the look of it, the parameter name is model, not transformers_model.
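
If that is right, the training config above would become the following (a sketch assuming TransformersNER reads the checkpoint name from a model key; check the repo's config parsing to confirm):

cfg = {'checkpoint_dir': '.ckpt',
       'dataset': 'dataset/mitre',
       'model': 'xlm-roberta-large',  # assumed key name, per the comment above
       'lr': 5e-6,
       'epochs': 20,
       'max_seq_length': 128}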

Aravpc commented 1 year ago

I am getting an error like

/usr/lib/python3.8/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27
     28 class StreamWriter(Codec, codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 508: ordinal not in range(128)

Can you please help me understand this error?
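
A hedged guess at the cause: the file being read contains non-ASCII bytes (0xe2 is the lead byte of UTF-8 punctuation such as curly quotes), but it is being decoded with the default ASCII codec. Opening the file with an explicit UTF-8 encoding avoids the error; the path below is illustrative, not taken from the repo:

# Read the dataset with an explicit UTF-8 encoding (illustrative path):
with open('dataset/mitre/train.txt', encoding='utf-8') as f:
    data = f.read()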

msyuaa commented 1 year ago

Hi! Can anyone help with this warning "FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning"?
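
The warning itself names both remedies. A minimal sketch (the parameter list is a placeholder; CyNER builds its optimizer internally, so in practice the warnings filter is the easier route):

import warnings
import torch

# Option 1: suppress the FutureWarning; the deprecated AdamW still works for now.
warnings.filterwarnings('ignore', category=FutureWarning)

# Option 2: in a custom training loop, use PyTorch's own AdamW instead.
params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.AdamW(params, lr=5e-6)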

msyuaa commented 1 year ago

do pip install markupsafe==2.0.1