center-for-threat-informed-defense / tram

TRAM is an open-source platform designed to advance research into automating the mapping of cyber threat intelligence reports to MITRE ATT&CK®.
https://ctid.mitre-engenuity.org/our-work/tram/
Apache License 2.0

Please Help: Regarding fine tuning #216

Open abhishekdhiman25 opened 4 months ago

abhishekdhiman25 commented 4 months ago

Hi Reader,

I hope you are well. I am trying to understand the fine-tuning process in the `fine_tune_multi_label.ipynb` notebook. A few questions:

1. What is the order of the 50 ATT&CK labels defined in the `CLASSES` variable?
2. Why is it recommended not to change the code of a particular cell?
3. If somebody wants to change the classes to fine-tune the model on other ATT&CK labels, what is the correct method to do so, and in what order should the labels be placed?
4. If somebody wants to increase the number of classes, what is the correct approach?

Thanks for your support in advance

For reference, the `CLASSES` variable:

```python
CLASSES = [
    'T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027', 'T1033',
    'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055', 'T1056.001',
    'T1057', 'T1059.003', 'T1068', 'T1070.004', 'T1071.001', 'T1072',
    'T1074.001', 'T1078', 'T1082', 'T1083', 'T1090', 'T1095', 'T1105',
    'T1106', 'T1110', 'T1112', 'T1113', 'T1140', 'T1190', 'T1204.002',
    'T1210', 'T1218.011', 'T1219', 'T1484.001', 'T1518.001', 'T1543.003',
    'T1547.001', 'T1548.002', 'T1552.001', 'T1557.001', 'T1562.001',
    'T1564.001', 'T1566.001', 'T1569.002', 'T1570', 'T1573.001', 'T1574.002'
]
```

mehaase commented 4 months ago

Hi @abhishekdhiman25,

Q1 - They are in lexical order, but the order is somewhat arbitrary. The order of the classes affects how the labels are vectorized, i.e. turned from strings like "T1003.001" into dense vectors. E.g. the vector `[1, 0, 0, 0, 0, ...]` means that the associated technique is the first item in `CLASSES`: T1003.001.

Q2 - The notebook says not to modify that cell because we have already fine-tuned SciBERT using that vectorization scheme. This notebook is intended for continuing to fine-tune with additional training data for the same set of labels. If you change the order of the labels, then additional fine-tuning will be counterproductive, because the model has to relearn what each position in the label vector represents.

Q3 - If you want to fine-tune SciBERT using different labels, you should look at the `model-development/train_multi_label.ipynb` notebook. That notebook illustrates how to start with an upstream SciBERT checkpoint and fine-tune it on the training data in `data/tram2-data/multi_label.json`.

Q4 - Same as for Q3. You'll want to set up the MITRE Annotation Toolkit for labeling your additional training data. See: https://github.com/center-for-threat-informed-defense/tram/wiki/Data-Annotation
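To make Q1 and Q2 concrete, here is a small sketch of how the position of each label in `CLASSES` determines the meaning of each position in the label vector. This is an illustration, not the notebook's exact code, and it truncates `CLASSES` to its first five entries for brevity:

```python
# Illustration only: first five entries of the notebook's CLASSES list.
CLASSES = ['T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001']

def to_multi_hot(labels, classes):
    """Turn a set of ATT&CK technique IDs into a multi-hot vector.

    Position i of the output corresponds to classes[i], so the order of
    `classes` is baked into every training example the model sees.
    """
    return [1 if c in labels else 0 for c in classes]

vec = to_multi_hot({'T1003.001', 'T1016'}, CLASSES)
# Position 0 is T1003.001 and position 3 is T1016, so vec == [1, 0, 0, 1, 0].
# Reordering CLASSES would silently change what each position means,
# which is why continued fine-tuning requires keeping the order fixed.
```

This is also why changing or extending the label set requires retraining from an upstream checkpoint rather than continuing to fine-tune the released model: the classifier head's output positions are tied to the original ordering.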

I hope this helps!