Code Embedding Results Analysis

arnaudstiegler commented 4 years ago

Study the results of the code embedding model: search for patterns among the false positives and false negatives, and look for potential bias in the data.

arnaudstiegler commented 4 years ago

Around 48 false positives / false negatives:

20 false negatives from others
13 false positives from crypto-competitions
15 false positives from crypto-library
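A minimal sketch of how a per-source breakdown like the one above could be tallied. The record layout `(source, true_label, predicted_label)` and the function name `errors_by_source` are assumptions made for illustration, not the project's actual evaluation code.

```python
from collections import Counter


def errors_by_source(records):
    """Count misclassified files per source folder.

    `records` is an iterable of (source, true_label, predicted_label)
    tuples; this layout is hypothetical.
    """
    counts = Counter()
    for source, true_label, predicted_label in records:
        # A file is an error whenever the prediction disagrees with its label.
        if true_label != predicted_label:
            counts[source] += 1
    return counts


# Tiny illustrative input, not real results.
example = [
    ("others", 0, 1),
    ("others", 0, 0),
    ("crypto-library", 1, 0),
    ("crypto-competitions", 1, 1),
]
breakdown = errors_by_source(example)
```

Summing the three counts reported above (20 + 13 + 15) gives the 48 errors mentioned.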

Pretty much all of the false negatives are hashing/crypto functions that ended up labeled as non-crypto even though they are crypto (files like sha, or files containing bitwise operations). Check out these examples: sha256.c, ffi_fnv.c

For crypto-competitions, the badly labeled files are mostly files that call crypto functions. Check out: determine_decrypt_method.cc

For crypto-library, the badly classified files are mostly edge-case header files: very long files with a few keywords, almost empty files, and files containing only blocks of bytes. Some header files should be removed (ladder_base_namespace.h, spr.h).
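One way the edge cases described here could be flagged automatically is a simple token heuristic. This is only a sketch: the function name `edge_case_reason` and both thresholds (`min_tokens`, `max_hex_ratio`) are assumptions that would need tuning on the actual dataset.

```python
import re

# Hex literals such as 0x1f, the typical content of byte tables in headers.
HEX_LITERAL = re.compile(r"\b0x[0-9a-fA-F]+\b")
# Rough tokenizer: runs of letters, digits and underscores.
TOKEN = re.compile(r"[A-Za-z_0-9]+")


def edge_case_reason(source, min_tokens=20, max_hex_ratio=0.8):
    """Return a short reason if the file looks like an edge case, else None.

    Both thresholds are hypothetical defaults.
    """
    tokens = TOKEN.findall(source)
    if len(tokens) < min_tokens:
        return "almost empty"
    hex_count = len(HEX_LITERAL.findall(source))
    if hex_count / len(tokens) > max_hex_ratio:
        return "mostly byte literals"
    return None


# A header that is just a table of bytes.
byte_table = (
    "static const unsigned char t[] = {"
    + ", ".join("0x%02x" % i for i in range(64))
    + "};"
)
# An almost empty header.
tiny_header = "#pragma once\n"
# Ordinary code with enough tokens and no hex literals.
regular_code = "int add(int a, int b) { int c = a + b; return c; }\n" * 3
```

Files flagged this way could be reviewed by hand before deciding whether to drop them from the training set.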

Overall, the false negatives are almost all genuine crypto files, and the false positives are edge cases.

arnaudstiegler commented 4 years ago

Food for thought: should we also stratify on the type of file (.h vs. .c)?
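A rough sketch of what stratifying on both the label and the file extension could look like, using only the standard library. The `(path, label)` sample layout and the names `stratum` and `stratified_split` are assumptions for illustration, not the project's existing split code.

```python
import random
from collections import defaultdict
from pathlib import Path


def stratum(path, label):
    """Stratification key: the label combined with the file extension."""
    return (label, Path(path).suffix.lower())


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (path, label) pairs so each (label, extension) group
    contributes roughly the same fraction to the test set."""
    groups = defaultdict(list)
    for path, label in samples:
        groups[stratum(path, label)].append((path, label))

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):  # sorted for a reproducible order
        members = groups[key]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical file list: 10 files for each (label, extension) pair.
samples = [
    (f"{label}/file_{i}{ext}", label)
    for label in ("crypto", "others")
    for ext in (".c", ".h")
    for i in range(10)
]
train, test = stratified_split(samples, test_fraction=0.2, seed=0)
```

With this kind of split, header files and source files appear in the same proportions in train and test, which would make it easier to tell whether the header edge cases are driving the errors.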

arthurherbout / crypto_code_detection

Code Embedding Results Analysis #30