13o-bbr-bbq / machine_learning_security

Source code about machine learning and security.

feature vector of states #12

Closed MasanoriYamada closed 6 years ago

MasanoriYamada commented 6 years ago

Thank you for publishing your code.

I have some questions about DeepExploit.py.

  1. I think it is necessary to encode the state features (port, protocol, etc.) as one-hot vectors; is that correct? (Otherwise, I think distances in feature space cannot be handled correctly.)

  2. It seems that reinforcement learning is applied even though the state does not change after performing an action; is this acceptable? (In this problem, there is no state transition as in a Markov decision process.)

13o-bbr-bbq commented 6 years ago

@MasanoriYamada
thanks for your advice.
now, i want to improve the accuracy of exploitation.

  1. the current version of DeepExploit only uses normalization.
    but, i think the accuracy is not as good as i expected. so, i'll try one-hot encoding instead of normalization.

  2. i want to improve the definition of the state after performing an action.
    if DeepExploit succeeds in exploitation, the status of Metasploit's console changes (for example: msf -> meterpreter). so, i want to use the console's status as the state after performing an action.

i'd be pleased if you could give me any other advice.
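The one-hot encoding idea could be sketched as follows (feature names and category lists here are hypothetical, not DeepExploit's actual code):

```python
# Minimal sketch of one-hot encoding for a categorical state feature.
# Unlike a normalized scalar, one-hot vectors make all distinct categories
# equidistant in feature space, so "nearby" port numbers aren't treated
# as similar by accident.

def one_hot(value, categories):
    """Return a one-hot vector marking `value`'s position in `categories`."""
    vec = [0.0] * len(categories)
    vec[categories.index(value)] = 1.0
    return vec

# Hypothetical protocol vocabulary for illustration.
PROTOCOLS = ['http', 'ftp', 'ssh', 'smb']
state = one_hot('ssh', PROTOCOLS)  # [0.0, 0.0, 1.0, 0.0]
```

For high-cardinality features like port numbers, the same idea applies over a restricted vocabulary of ports that actually matter, as discussed below.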

MasanoriYamada commented 6 years ago

@13o-bbr-bbq Thanks for quick reply.

  1. the current version of DeepExploit only uses normalization. but, i think the accuracy is not as good as i expected. so, i'll try one-hot encoding instead of normalization.

OK. Since the space of ports and versions is high-dimensional, you may have to use domain knowledge, i.e. handle some ports specially.

  2. i want to improve the definition of the state after performing an action. if DeepExploit succeeds in exploitation, the status of Metasploit's console changes (for example: msf -> meterpreter). so, i want to use the console's status as the state after performing an action.

OK! I understand. I am not an expert on penetration testing, so please teach me: how many post-exploitation steps do you typically perform in your experience?

(Because the action space is large, I am concerned about the depth of exploration)

13o-bbr-bbq commented 6 years ago

@MasanoriYamada thanks for your reply!

Since the space of ports and versions is high-dimensional, you may have to use domain knowledge.

thanks a lot for your advice.
i have a question: is domain knowledge a technique used in transfer learning?

How many post-exploitation steps do you typically perform in your experience? (Because the action space is large, I am concerned about the depth of exploration)

actually, there are various patterns of post-exploitation (penetrating internal servers via a compromised server, extracting credential information from a compromised server, etc.). but, the purpose of the current DeepExploit is to penetrate internal servers.
therefore, if DeepExploit succeeds in exploiting the first server, it tries to penetrate an internal server via the first server. and, if DeepExploit succeeds in exploiting that internal server (= the second server) via the first server, it tries to penetrate other internal servers via the second server as well. this is repeated for the number of servers.

my assumption is as follows.

start --[exploitation (using RL)]--> first server --[exploitation (using RL)]--> second server --[exploitation (using RL)]--> third server --> ...

but, it is not realistic to repeat penetration infinitely, so in practice i define a maximum number of target servers.

MasanoriYamada commented 6 years ago

i have a question. is domain knowledge a technique used in transfer learning?

No, I mean that domain knowledge here is network and security knowledge.

start --[exploitation (using RL)]--> first server --[exploitation (using RL)]--> second server --[exploitation (using RL)]--> third server --> ...

OK! I understand. If you are penetrating under the assumption that the attack should not be detected, post-exploitation is nice, but it may be good to give a negative reward when an attack is detected, because in this situation the delayed reward is more effective for reinforcement learning.
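That reward scheme could be expressed as follows (the constants and function name are illustrative assumptions, not part of DeepExploit):

```python
# Illustrative reward function: positive reward for successful exploitation,
# a strong penalty when the attack is detected (e.g. by an IDS or anti-virus),
# and a small step penalty to discourage long, noisy episodes.
R_SUCCESS = 100.0    # exploitation succeeded (assumed value)
R_DETECTED = -100.0  # attack was detected (assumed value)
R_FAILURE = -1.0     # exploit attempt failed but went unnoticed (assumed value)

def reward(exploited: bool, detected: bool) -> float:
    if detected:
        return R_DETECTED
    return R_SUCCESS if exploited else R_FAILURE
```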

13o-bbr-bbq commented 6 years ago

No, I mean that domain knowledge here is network and security knowledge.

oh, i misunderstood.
i'll improve the state representation of ports and versions using my security knowledge.

it may be good to give a negative reward when an attack is detected.

you're completely right. in a penetration test, it is bad for the attack to be detected.
so, i'll consider a method of giving a negative reward when an attack is detected (by anti-virus software, etc.).

thanks for your advice!