center-for-threat-informed-defense / tram

TRAM is an open-source platform designed to advance research into automating the mapping of cyber threat intelligence reports to MITRE ATT&CK®.
https://ctid.mitre-engenuity.org/our-work/tram/
Apache License 2.0

Need Help: Regarding SciBERT fine-tuning using "fine_tune_multi_label.ipynb" #214

Closed abhishekdhiman25 closed 4 months ago

abhishekdhiman25 commented 5 months ago

Hi Reader, I have installed TRAM using the developer's setup on Windows 10. I want to fine-tune SciBERT on my own data with the "fine_tune_multi_label.ipynb" notebook. My dataset is larger than the original dataset "multi_label.json" and covers a larger number of ATT&CK techniques (for example, 100 techniques). In the 3rd cell I replaced the training JSON name "multi_label.json" with my own JSON dataset, which has the same structure, keys, and values as "multi_label.json".

My question: do I need to replace the list of 50 ATT&CK techniques defined in the 2nd cell with my 100 ATT&CK techniques? When I change that list, training fails in the 8th cell (where the epochs are defined) with:

ValueError: Target size (torch.Size([10, 100])) must be the same as input size (torch.Size([10, 50]))

3rd Cell Code:

```python
from sklearn.preprocessing import MultiLabelBinarizer as MLB

CLASSES = [
    'T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027', 'T1033',
    'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055', 'T1056.001', 'T1057',
    'T1059.003', 'T1068', 'T1070.004', 'T1071.001', 'T1072', 'T1074.001',
    'T1078', 'T1082', 'T1083', 'T1090', 'T1095', 'T1105', 'T1106', 'T1110',
    'T1112', 'T1113', 'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011',
    'T1219', 'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
    'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
    'T1569.002', 'T1570', 'T1573.001', 'T1574.002',
]

mlb = MLB(classes=CLASSES)
mlb.fit([[c] for c in CLASSES])

mlb
```

8th Cell Code:

```python
NUM_EPOCHS = 3

from statistics import mean

from tqdm import tqdm
from torch.optim import AdamW

optim = AdamW(bert.parameters(), lr=2e-5, eps=1e-8)

for epoch in range(NUM_EPOCHS):
    epoch_losses = []
    for x, y in tqdm(_load_data(x_train, y_train, batch_size=10)):
        bert.zero_grad()
        out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int), labels=y)
        epoch_losses.append(out.loss.item())
        out.loss.backward()
        optim.step()
    print(f"epoch {epoch + 1} loss: {mean(epoch_losses)}")
```
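For reference, the width of the target tensor follows the number of classes given to the binarizer. The minimal illustration below (not notebook code, using a placeholder list of technique IDs) shows why the targets become 100 columns wide once CLASSES has 100 entries, while the loaded model head still produces 50 logits:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder list of 100 technique IDs, purely to illustrate the shapes.
my_classes = [f"T{1001 + i}" for i in range(100)]

mlb = MultiLabelBinarizer(classes=my_classes)
mlb.fit([[c] for c in my_classes])

y = mlb.transform([["T1001", "T1005"]])
print(y.shape)  # (1, 100): one column per entry in CLASSES, but the model head outputs 50 logits
```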

swfarnsworth commented 5 months ago

Hello and thank you for showing interest in TRAM.

The 'scibert_multi_label_model' that we developed is the 'allenai/scibert_scivocab_uncased' model (found here) with an additional linear layer. That additional layer has 50 outputs--one for each ATT&CK technique that the model was fine-tuned to identify. The code that produced the model is here.
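In Hugging Face terms, that architecture corresponds roughly to the following sketch (an illustration, not the exact code from the notebook):

```python
# Sketch only: the notebook may construct and load the model differently.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
bert = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=50,                              # one output per ATT&CK technique in CLASSES
    problem_type="multi_label_classification",  # sigmoid/BCE head rather than softmax
)
```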

Key to your issue is that there is no trivial way to extend the final layer to cover the additional classes that you want to learn. Your best bet is to adapt the notebook that produced 'scibert_multi_label_model' and retrain from the original 'allenai/scibert_scivocab_uncased' model. To do this, load your additional training instances into a DataFrame that is structured the same way as the data DataFrame in train_multi_label.ipynb, and then create a new DataFrame named data that is the concatenation of the two, as sketched below. (I recommend doing this in one cell so that you can't inadvertently cause data to refer to the wrong DataFrame.)
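A minimal sketch of that step, assuming your extra instances live in a file named my_training_data.json (a hypothetical name) and that both files load cleanly with pandas; adapt the loading to however the notebook actually builds data:

```python
import pandas as pd

# "my_training_data.json" is a placeholder; structure it like the original training data.
original_data = pd.read_json("multi_label.json")
extra_data = pd.read_json("my_training_data.json")

# Build `data` in a single cell so it cannot silently refer to a stale DataFrame.
data = pd.concat([original_data, extra_data], ignore_index=True)
```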

Keep in mind also that the output of the model for each instance is a one-dimensional array where each index represents a class (an ATT&CK technique, in this case), and the element at that index represents the probability that the instance belongs to that class. It is up to you as the developer to keep track of which index represents which class. The CLASSES constant in fine_tune_multi_label.ipynb does this. If you create a new model, your new CLASSES constant should be whatever mlb.classes_ is (see the sklearn docs for MultiLabelBinarizer).
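For example, a mapping like this recovers technique IDs at prediction time (a sketch, assuming bert, tokenizer, and mlb are set up as in the notebook and that a 0.5 threshold is acceptable):

```python
import torch

# Sketch only: `bert`, `tokenizer`, and `mlb` are assumed to exist as in the notebook.
text = "The malware creates a scheduled task for persistence."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = bert(**inputs).logits   # shape: (1, number of classes)

probs = torch.sigmoid(logits)[0]     # one probability per class, in the same order as mlb.classes_
predicted = [cls for cls, p in zip(mlb.classes_, probs) if p > 0.5]
print(predicted)                     # technique IDs whose probability cleared the threshold
```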

To your question in #213, refer to the docs for saving Transformers models here.

By way of advice, bear in mind that we were not able to get acceptable performance from our multi-label model, given how much training data is needed to learn 50 classes in a multi-label setting. I would expect very poor performance on 100 classes unless you have invested hundreds of hours of annotator time in producing a large training set with high per-class frequency.

mehaase commented 4 months ago

Closing this due to inactivity. Please re-open if your question has not been answered.