arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Clustering as a data quality check #29

Open arthurherbout opened 4 years ago

arthurherbout commented 4 years ago

In order to get an idea of the data quality, I think it could be interesting to look into unsupervised learning.

Clustering the files with extracted features might tell us whether the data is too simple and not representative.

arthurherbout commented 4 years ago

Here are some results for the co-clustering approach. The idea is to cluster the files and the features at the same time, so that each cluster of files comes with its own subset of features. In this experiment I created two clusters:

Results of spectral recursive embedding:

First cluster: A and B
number of non-crypto code in A: 3864
number of crypto code in A: 183

selected features for first cluster: ['proxy_line_count', 'proxy_comment_count', 'proxy_multiline_comment_count', 'proxy_long_count', 'proxy_while_loops_count', 'proxy_for_loops_count', 'proxy_include_count', 'proxy_bit_left_shift_count', 'proxy_bit_right_shift_count', 'proxy_bitwise_and_count', 'proxy_complement_count', 'proxy_xor_count', 'proxy_loops_count', 'proxy_bitwise_count']

Second cluster: A and B
number of non-crypto code in A: 5591
number of crypto code in A: 1101

selected features for second cluster: ['proxy_int_count', 'proxy_bitwise_xor_count', 'proxy_hexadecimal_count']
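For anyone who wants to try a similar experiment, here is a minimal sketch of co-clustering files and features. It does not reproduce the spectral recursive embedding run above: it swaps in scikit-learn's `SpectralCoclustering` as a stand-in, and the feature matrix, feature names, and labels are synthetic placeholders (assumptions), not the project's data.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)

# Hypothetical stand-in for the extracted features:
# rows are files, columns are non-negative count features.
feature_names = [f"proxy_feature_{i}" for i in range(8)]  # placeholder names
n_files = 200

# Two synthetic groups of files, each with higher counts on half of the features.
rates = np.ones((n_files, len(feature_names)))
rates[:100, :4] = 10.0
rates[100:, 4:] = 10.0
X = rng.poisson(rates).astype(float) + 1.0  # +1 avoids all-zero rows/columns

# Hypothetical labels (1 = crypto), used only to report cluster composition.
is_crypto = rng.integers(0, 2, size=n_files)

# Stand-in algorithm: scikit-learn's spectral co-clustering with two clusters.
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

def summarize(model, feature_names, is_crypto):
    """List, per co-cluster, its label counts and the features assigned to it."""
    summary = []
    for k in range(model.n_clusters):
        rows = model.row_labels_ == k
        cols = model.column_labels_ == k
        summary.append({
            "cluster": k,
            "n_non_crypto": int(np.sum(is_crypto[rows] == 0)),
            "n_crypto": int(np.sum(is_crypto[rows] == 1)),
            "features": [f for f, c in zip(feature_names, cols) if c],
        })
    return summary

summary = summarize(model, feature_names, is_crypto)
for entry in summary:
    print(entry)
```

With the real feature table, `X` would be the matrix of `proxy_*` counts and `is_crypto` the file labels; the labels are never given to the model and only serve to inspect the clusters afterwards.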

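As we can see from those results, the model cannot properly separate crypto from non-crypto code: both clusters mix the two classes, and non-crypto files are the large majority in each (3864 vs. 183 in the first, 5591 vs. 1101 in the second). In that sense, the data does not look too biased, since the label cannot be recovered trivially from these features.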
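To put numbers on the mix, the share of crypto files in each cluster can be computed from the counts reported above and compared with the overall share. This is only a small arithmetic sketch; the dictionary layout and the helper name `crypto_share` are my own.

```python
# Counts copied from the results reported above.
clusters = {
    "first": {"non_crypto": 3864, "crypto": 183},
    "second": {"non_crypto": 5591, "crypto": 1101},
}

def crypto_share(counts):
    """Fraction of files in a cluster that are labelled crypto."""
    total = counts["non_crypto"] + counts["crypto"]
    return counts["crypto"] / total

shares = {name: crypto_share(c) for name, c in clusters.items()}

# Overall share across both clusters, as a baseline for comparison.
overall = crypto_share({
    "non_crypto": sum(c["non_crypto"] for c in clusters.values()),
    "crypto": sum(c["crypto"] for c in clusters.values()),
})

for name, share in shares.items():
    print(f"{name} cluster: {share:.1%} crypto")
print(f"overall: {overall:.1%} crypto")
```

If the clustering had found the crypto/non-crypto split, one cluster's share would be close to 100% and the other's close to 0%. Here both stay well below 50%.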