buihuukhoi / CREME

CREME: A toolchain of automatic dataset collection for machine learning in intrusion detection

Issues about tactics, techniques, sub techniques labeling #7

Open hungdino opened 3 years ago

hungdino commented 3 years ago

Hi Khoi,

I am an undergraduate student, Wei Ting. I have recently been working on improving the labeling in CREME. In CREME_backend_execution/classes/CREME.py, I found that you updated the labels of each scenario 12 days ago (commit: "[fix bugs] update tactic, technique, subtechnique for attack scenario…"). I have a few questions:

  1. The tactic labeling for the End-point DoS part is missing (there is no def process_data_end_point_dos section). If it is not done yet, I would be very willing to help with it!
  2. It seems that tactics, techniques, and sub-techniques map to each other by index; for example, tactic_names[1] goes with technique_names[1] and sub_technique_names[1]. If so, what should we do when multiple techniques are used within a single tactic? Also, I think there could be more than 3 tactics in a single scenario. Can we only label 3 tactics per scenario because CREME collects data in 3 separate stages (initial access, compromise and propagation, and complete mission), as described in the paper?
  3. Does the comment "#only for syslog" mean that all of these labels are used only when processing syslog, or that only the list "labels = [1, 1, 1]" is syslog-only?

Best Regards, Wei Ting

I attach def process_data_mirai as an example here:

    labels = [1, 1, 1]  # only for syslog
    tactic_names = ['Initial Access', 'Command and Control', 'Impact']
    technique_names = ['Valid Accounts', 'Non-Application Layer Protocol', 'Network Denial of Service']
    sub_technique_names = ['Local Accounts', 'Non-Application Layer Protocol', 'Direct Network Flood']
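The index alignment asked about in question 2 can be made explicit with a small sketch. It assumes, as the question does, that position i of each list describes the same stage. `StageLabel` and `build_stage_labels` are hypothetical names for illustration and are not part of CREME.

```python
from typing import List, NamedTuple


class StageLabel(NamedTuple):
    """Labels of one stage (hypothetical helper, not part of CREME)."""
    label: int
    tactic: str
    technique: str
    sub_technique: str


def build_stage_labels(labels: List[int], tactics: List[str],
                       techniques: List[str],
                       sub_techniques: List[str]) -> List[StageLabel]:
    """Zip the index-aligned lists into one record per stage."""
    lengths = {len(labels), len(tactics), len(techniques), len(sub_techniques)}
    if len(lengths) != 1:
        raise ValueError("all label lists must have the same length")
    return [StageLabel(*row)
            for row in zip(labels, tactics, techniques, sub_techniques)]


# Values copied from process_data_mirai as quoted in this issue.
labels = [1, 1, 1]  # only for syslog
tactic_names = ['Initial Access', 'Command and Control', 'Impact']
technique_names = ['Valid Accounts', 'Non-Application Layer Protocol',
                   'Network Denial of Service']
sub_technique_names = ['Local Accounts', 'Non-Application Layer Protocol',
                       'Direct Network Flood']

mirai_stages = build_stage_labels(labels, tactic_names,
                                  technique_names, sub_technique_names)
for stage in mirai_stages:
    print(stage)
```

Grouping the four values per stage makes a length mismatch between the lists fail loudly instead of silently mislabeling a stage.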

buihuukhoi commented 3 years ago

Hi Wei Ting,

  1. I added it in the test branch (d1e3af80e041d88e41b8846b07d7b0a1a8a024f8). It will be merged into the main branch once testing is finished.
  2. That is a limitation of CREME. Yes, you are correct. The original design only supports collecting and labeling data in three separate stages. If we want to label data with more than three tactics/techniques, we can treat them as different stages (each technique is one stage). However, the code would need some improvements. For example, we could use a for loop in the label_filtered_syslog method instead of three hard-coded stages.
  3. The list "labels" is used only for labeling syslog. Accounting and traffic data are always labeled 1 for abnormal data (a design issue from the early days of CREME). See f42b6df011a49e4914a22cdf48c740b2f96e083d.

Thanks.
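The for-loop idea from point 2 could look roughly like the sketch below. It is not CREME's label_filtered_syslog: the function name `label_records`, the record layout, the matching by timestamp window, the "Normal" default, and the example timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stage:
    """One attack stage with its time window and labels (hypothetical)."""
    start: float  # inclusive start timestamp
    end: float    # exclusive end timestamp
    label: int
    tactic: str
    technique: str
    sub_technique: str


def label_records(records: List[Dict], stages: List[Stage]) -> List[Dict]:
    """Label each record by the first stage whose window contains it."""
    labeled = []
    for record in records:
        # Assumed default for records outside every stage window.
        row = dict(record, label=0, tactic="Normal",
                   technique="Normal", sub_technique="Normal")
        for stage in stages:  # any number of stages, not a fixed three
            if stage.start <= record["timestamp"] < stage.end:
                row.update(label=stage.label, tactic=stage.tactic,
                           technique=stage.technique,
                           sub_technique=stage.sub_technique)
                break
        labeled.append(row)
    return labeled


# Made-up time windows; label names taken from the Mirai example.
stages = [
    Stage(0, 10, 1, "Initial Access", "Valid Accounts", "Local Accounts"),
    Stage(10, 20, 1, "Command and Control",
          "Non-Application Layer Protocol", "Non-Application Layer Protocol"),
    Stage(20, 30, 1, "Impact", "Network Denial of Service",
          "Direct Network Flood"),
]
records = [{"timestamp": 5.0}, {"timestamp": 25.0}, {"timestamp": 40.0}]
labeled = label_records(records, stages)
for row in labeled:
    print(row)
```

Adding a fourth technique would then mean appending one more `Stage` to the list rather than editing the labeling function.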

hungdino commented 2 years ago

Hi, Khoi

  1. I found it, thanks for the information!
  2. I understand the limitation. Do you recommend splitting the 3-stage framework into multiple stages? Or is it too troublesome compared to the potential benefit (better labeling)?
  3. I will check this out, thanks.

BTW, I found that ATT&CK now uses the term 'Exploit Public-Facing Application' instead of 'Exploit Public Application'. I adjusted that in a Pull Request and also added some comments for possible further labeling (not suitable for the current 3-stage framework, though).

Thanks!

hungdino commented 2 years ago

@buihuukhoi Hi Khoi, I squashed the commits as you asked and opened another Pull Request for that. Besides this, the professors (mainly Professor Lin, but also Professor Huang) would like to ask your opinion on refining the labeling of attacks, namely whether to:

  1. Change from the 3-stage attack framework to an N-stage one to make data collection more flexible (this requires a lot of revision to the code), or
  2. Modify the current data structure to allow multiple labels in a single stage. This would not change the fact that a single attack stage may involve multiple, not very precise labels. (I think if we go this way, feature selection and model training would become a problem on the ML side.)

If it is more convenient for you to talk about this on Skype, I am available on Thursday at 17:30, Friday at 18:00, or during the weekend. Thanks!

hungdino commented 2 years ago

I listed some ideas about the 2 approaches mentioned above.

Approach 1:
Related code: Reproduction Module (Attack), Data Collection Module, Data Storage, Labeling Module
Pros: Thorough, and makes CREME flexible enough to launch multi-stage attacks.
Cons: Complicated; it touches most parts of CREME, which makes it difficult to adjust.

Approach 2:
Related code: Data Collection Module, Data Storage, Feature Extracting Module, Labeling Module
Pros: Leaves the Reproduction Module (Attack) untouched.
Cons: The Feature Extracting Module would need more adjustment to fit this solution.
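To make the difference between the two approaches concrete, here is a minimal sketch of the data structures involved. It is only an illustration under assumptions: `MultiLabelStage` and `split_into_stages` are hypothetical names, and grouping 'Command and Control' and 'Impact' into one stage is an invented example, not how CREME labels Mirai.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (tactic, technique, sub_technique)
Technique = Tuple[str, str, str]


@dataclass
class MultiLabelStage:
    """Approach 2: one stage carrying several labels (hypothetical)."""
    name: str
    techniques: List[Technique] = field(default_factory=list)

    def tactic_labels(self) -> List[str]:
        """Unique tactics of this stage, in first-seen order."""
        seen: List[str] = []
        for tactic, _, _ in self.techniques:
            if tactic not in seen:
                seen.append(tactic)
        return seen


def split_into_stages(stage: MultiLabelStage) -> List[MultiLabelStage]:
    """Approach 1: turn each technique into its own single-label stage."""
    return [MultiLabelStage(f"{stage.name} #{i + 1}", [technique])
            for i, technique in enumerate(stage.techniques)]


# Illustrative grouping only; label names come from the Mirai example.
stage = MultiLabelStage("complete mission", [
    ("Command and Control", "Non-Application Layer Protocol",
     "Non-Application Layer Protocol"),
    ("Impact", "Network Denial of Service", "Direct Network Flood"),
])

print(stage.tactic_labels())
for single in split_into_stages(stage):
    print(single.name, single.tactic_labels())
```

With Approach 2 every record in the stage receives the whole label set, while Approach 1 yields one precise label per record but needs a separate collection window for each technique.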

buihuukhoi commented 2 years ago

Are you OK with talking on Skype on 17/10 at 10 AM? If yes, please send me your Skype ID by email. Thanks.

hungdino commented 2 years ago

Sorry for the late reply. I have a lecture every Sunday morning; would Saturday morning or Sunday afternoon work instead? My Skype contact: https://join.skype.com/invite/bYIc4BAXgJ5G