arnaudstiegler opened this issue 4 years ago
Just to make sure the aim is clear: the main point of doing this is to see whether our models can do better than a regex search. If they can't, either the approach or the data (or both) is wrong.
Besides, it raises an interesting question: in which cases does a regex search fail? (those are exactly the cases we are aiming at)
Will do - it probably won't be as "quick" as one might think, given the need to get around their APIs with our wonderful file structure etc. haha, but worth the time!
Wind River's crypto detector is not packaged in any way, so the only way to add it to the repo is to copy-paste the whole thing into our repo... (and I would like to have it in the repo, for the sake of reproducibility at least)
Should I add a `benchmark` folder and put it in there? I would also add a disclaimer in the README that it is mostly not our code - I will have to make a couple of changes to simplify the API so it does only what we need. The tool is under the Apache License 2.0, so I think we can basically do whatever we want with it.
Would love your opinion on this @arnaudstiegler
We decided to run Wind-River as is and keep only the information extracted from its outputs in the repo, rather than tweak their code or incorporate any of it here.
Would it be easy to reuse your code on another set of data (I'm thinking the full wolfssl package)?
Yes :) You would need to run Wind-River on it separately, add the output to the folder in `models`, change the `sources` array to include it as a new source, and from line 95, read the JSON generated from wolfssl instead of `full_data.json`.
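For concreteness, here is a minimal sketch of those changes (the `sources` contents, the output path, and the wolfssl filename below are assumptions for illustration, not the actual names used in the notebook):

```python
import json

# Hypothetical source list: add wolfssl as a new source alongside the existing ones
sources = ["crypto-library", "crypto-competitions", "code-jam", "others", "wolfssl"]

# Around line 95, instead of reading full_data.json, load the JSON produced
# by running Wind-River's crypto-detector on the wolfssl package
# (the filename below is assumed; use whatever the run actually produced)
with open("models/benchmark/wolfssl_data.json") as f:
    data = json.load(f)

# The rest of the notebook (comparing hits against labels, computing
# false positives / false negatives) can then run unchanged on `data`
```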
You'll find the findings and an exploratory analysis of the benchmark's outputs in `models/benchmark/explore_benchmark_results.ipynb`.
Some key findings:
- there are nearly twice as many false positives as false negatives
- false positives include mislabeled files from `others`
- false positives include short headers from `others` that don't implement anything
- false positives include headers from `others` that only declare variables later used in some kind of cryptographic protocol
- false positives include key-authentication programs from `others`
- false positives include OS code from `others`
- false negatives include files from `crypto-competitions` that implement bitwise shifts and operations used for cryptographic purposes
- false negatives include files from `crypto-library` that contain nothing but lists of digits (not even hexadecimal) - essentially headers
- false negatives include files from `crypto-library` that contain algorithms for cryptographic operations rooted in mathematical structures - taken on their own, the functions and operations defined in those files have no reason to be called crypto
- a lot of the matching is done on very generic terms like `crypt` or `cipher` (see the sketch after this list)
- only two `code-jam` files were misclassified, both because of a treacherous variable name
- `crypto-library` files were matched primarily on known crypto-library patterns, plus some protocols and algorithms
- `others` files were matched mostly on generic strings, but also a lot on OpenSSL
- `crypto-competitions` files were overwhelmingly matched on generic clues, then on a variety of algorithms (hardly any protocols)
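To make the point about generic terms concrete, here is a tiny illustration (not the Wind River tool's actual patterns, just a sketch of why substring-level matches on `crypt`/`cipher` over- and under-match):

```python
import re

# Deliberately generic patterns, similar in spirit to the matches seen in the benchmark
generic = re.compile(r"crypt|cipher", re.IGNORECASE)

samples = [
    "void aes_encrypt_block(uint8_t *out, const uint8_t *in);",          # genuine crypto code
    "// a cryptic comment about the Cipher boss in a text-adventure game",  # matches: false positive
    "y = (x << 3) ^ rotl(y, 7);  /* crypto-style bit twiddling */",       # no match: false negative
]

for s in samples:
    print(bool(generic.search(s)), "|", s)
```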
We have our first results with our models, and as we are investigating what they actually learned, having a regex benchmark would be a good way of assessing whether those models actually improve on the performance you could get from a regex.
If someone has the time to run a quick experiment to see what performance we get from WindRiver, that would be awesome!