Clustering as a data quality check

Here are some results for the co-clustering approach. The idea is to cluster data points and features jointly, so that each cluster of data points is paired with the features that best characterize it. In this experiment I created two clusters:

Each cluster is a pair (A, B): A is the cluster of data points, B is the corresponding cluster of features.
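To make the (A, B) pairing concrete, here is a minimal sketch of a single spectral bipartition step on a data-by-feature matrix. It is not the code used in this experiment: the function name `bipartition`, the toy matrix, and the choice to cut the second singular vectors at zero are all assumptions made for illustration, and the real method may pick its cut points differently.

```python
import numpy as np

def bipartition(X):
    """Split rows (data points) and columns (features) into two co-clusters.

    Hypothetical helper: assumes X is non-negative with no all-zero
    row or column, and cuts the second singular vectors at zero.
    """
    X = np.asarray(X, dtype=float)
    row_sums = X.sum(axis=1)
    col_sums = X.sum(axis=0)
    # Scale the matrix by the inverse square roots of row and column sums.
    Xn = X / np.sqrt(row_sums)[:, None] / np.sqrt(col_sums)[None, :]
    U, _, Vt = np.linalg.svd(Xn, full_matrices=False)
    # The second left/right singular vectors carry the partition.
    u2 = U[:, 1] / np.sqrt(row_sums)
    v2 = Vt[1, :] / np.sqrt(col_sums)
    row_labels = (u2 > 0).astype(int)  # cluster A for each data point
    col_labels = (v2 > 0).astype(int)  # cluster B for each feature
    return row_labels, col_labels

# Toy data: the first four rows load on features 0-2, the last four on 3-4.
X = np.array([[10, 10, 10, 1, 1]] * 4 + [[1, 1, 1, 10, 10]] * 4)
rows, cols = bipartition(X)
print("data point clusters:", rows)
print("feature clusters:   ", cols)
```

Rows and columns that receive the same label form one (A, B) pair. Which pair gets label 0 or 1 depends on the sign returned by the SVD, so only the grouping is meaningful.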

Results of spectral recursive embedding:

First cluster (A and B):
number of non-crypto code in A: 3864
number of crypto code in A: 183

selected features for first cluster: ['proxy_line_count', 'proxy_comment_count', 'proxy_multiline_comment_count', 'proxy_long_count', 'proxy_while_loops_count', 'proxy_for_loops_count', 'proxy_include_count', 'proxy_bit_left_shift_count', 'proxy_bit_right_shift_count', 'proxy_bitwise_and_count', 'proxy_complement_count', 'proxy_xor_count', 'proxy_loops_count', 'proxy_bitwise_count']

Second cluster (A and B):
number of non-crypto code in A: 5591
number of crypto code in A: 1101

selected features for second cluster: ['proxy_int_count', 'proxy_bitwise_xor_count', 'proxy_hexadecimal_count']

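As these results show, the co-clustering does not separate crypto from non-crypto code: crypto code makes up only about 4.5% of the first cluster (183 of 4047) and about 16.5% of the second (1101 of 6692), so both clusters are dominated by non-crypto code. In that sense, the data does not look too biased, since no group of features separates the two classes on its own.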
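The share of crypto code in each cluster can be checked directly from the counts reported above. The snippet below is only a sketch of that arithmetic; the dictionary layout and the `crypto_share` helper are illustrative names, not part of the project.

```python
# Counts copied from the results above.
clusters = {
    "first": {"non_crypto": 3864, "crypto": 183},
    "second": {"non_crypto": 5591, "crypto": 1101},
}

def crypto_share(counts):
    """Fraction of crypto code among all data points in a cluster."""
    return counts["crypto"] / (counts["crypto"] + counts["non_crypto"])

shares = {name: crypto_share(counts) for name, counts in clusters.items()}

# Baseline: share of crypto code over both clusters combined.
total_crypto = sum(c["crypto"] for c in clusters.values())
total = sum(c["crypto"] + c["non_crypto"] for c in clusters.values())
overall = total_crypto / total

for name, share in shares.items():
    print(f"{name} cluster: {share:.1%} crypto")   # first: 4.5%, second: 16.5%
print(f"overall: {overall:.1%} crypto")            # overall: 12.0%
```

A clean separation would put one share near 100% and the other near 0%. Here both clusters stay well below 50% crypto.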
