Closed arnaudstiegler closed 4 years ago
Around 48 false positives / false negatives:
Nearly all of the false negatives are hashing/crypto functions that ended up labeled as non-crypto even though they are crypto (for example SHA implementations, or files containing bitwise operations). Examples: sha256.c, ffi_fnv.c
For crypto-competitions, the mislabeled files are mostly files that call crypto functions. Example: determine_decrypt_method.cc
For crypto-library, the misclassified files are mostly edge-case header files: very long files with only a few keywords, almost empty files, and files containing only blocks of bytes. Some header files should be removed from the dataset (ladder_base_namespace.h, spr.h)
Overall, almost all of the false negatives are genuine crypto files, while the false positives are edge cases
Food for thought: should we also stratify on the type of file (.h vs .c)?
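To make the stratification idea concrete, here is a minimal sketch of a split that stratifies on both the label and the file extension. The function name `stratified_split`, the `(filename, label)` sample format, and the label strings are assumptions for illustration, not the project's actual pipeline.

```python
import os
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (filename, label) pairs so each (label, extension) stratum
    contributes roughly the same fraction to the test set.

    Hypothetical helper: the sample format is an assumption.
    """
    rng = random.Random(seed)

    # Group samples by (label, lowercase extension), e.g. ("crypto", ".h").
    groups = defaultdict(list)
    for filename, label in samples:
        ext = os.path.splitext(filename)[1].lower()
        groups[(label, ext)].append((filename, label))

    train, test = [], []
    # Iterate strata in sorted order so the result is reproducible.
    for key in sorted(groups):
        items = groups[key]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

With this kind of split, header files and source files would appear in the test set in about the same proportion as in the training set, which might make it easier to see whether the .h edge cases are driving the errors.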
Study the results of the code embedding model, search for patterns within the false positives and false negatives, and find potential bias in the data
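One possible way to look for such patterns is to tally the errors by kind and by file extension. This is only a sketch: the `tally_errors` name, the `(filename, true_label, predicted_label)` record format, and the choice of "crypto" as the positive class are assumptions, not taken from the actual evaluation code.

```python
import os
from collections import Counter


def tally_errors(records, positive="crypto"):
    """Count misclassifications by (error kind, file extension).

    records: iterable of (filename, true_label, predicted_label).
    Hypothetical format; `positive` names the assumed positive class.
    """
    counts = Counter()
    for filename, truth, predicted in records:
        if truth == predicted:
            continue  # correctly classified, nothing to tally
        # A missed positive is a false negative; anything else is a false positive.
        kind = "false_negative" if truth == positive else "false_positive"
        ext = os.path.splitext(filename)[1].lower()
        counts[(kind, ext)] += 1
    return counts
```

Calling `tally_errors(records).most_common()` would then list the largest error groups first, for instance showing whether false positives concentrate in .h files.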