Closed redouane-dziri closed 4 years ago
We agreed on fetching individual files, as well as writing and running scripts to extract each file's content and output one JSON file per type of source - fields are `file_name` and `file_content`. We will concatenate the JSONs (shuffle?) once they're created.
Regarding filenames, they will be prefixed by the following:

- `crypto-competition_`
- `crypto-library_`
- `code-jam_`
- `others_`
So that we can trace the source of each data point easily - for debugging and evaluation purposes mainly.
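A minimal sketch of that prefixing convention (the mapping and helper name are illustrative, not an agreed API):

```python
# Illustrative mapping from source folder to the agreed filename prefix.
PREFIXES = {
    "crypto-competition": "crypto-competition_",
    "crypto-library": "crypto-library_",
    "code-jam": "code-jam_",
    "others": "others_",
}

def prefixed_name(source: str, file_name: str) -> str:
    """Prepend the source prefix so each data point stays traceable."""
    return PREFIXES[source] + file_name
```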
I have a cool script I would like to run as soon as you give me the files, please! It will look for a measure of software similarity using tokens (MOSS).
@Hadrien-Cornier Great! (let's continue this discussion in the Feature Extraction issue)
Let's shoot for this structure when adding our files to the repo:
```
data
|---- crypto-competition
|     |     crypto-competition_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
|---- crypto-library
|     |     crypto-library_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
|---- code-jam
|     |     code-jam_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
|---- others
|     |     others_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
```
And `prefix_data.json` can look like:

```json
{
    "data_source": prefix,
    "label": x,
    "data": [
        {
            "file_name": "example.c",
            "file_content": "//example string \n const c = 0"
        },
        ...
    ]
}
```
where `x` is `1` for `prefix == crypto-*` and `0` for the rest.
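For illustration, a per-source `data_to_json.py` could be sketched like this (function name and arguments are assumptions based on the structure above, not the final script):

```python
import os

def build_source_json(prefix: str, files_dir: str, label: int) -> dict:
    """Collect every file under files_dir into the agreed per-source schema.

    `prefix`, `files_dir`, and `label` are illustrative parameters; the
    real script may hard-code them per source folder.
    """
    data = []
    for file_name in sorted(os.listdir(files_dir)):
        path = os.path.join(files_dir, file_name)
        # errors="replace" guards against odd encodings in scraped code.
        with open(path, encoding="utf-8", errors="replace") as f:
            data.append({"file_name": file_name, "file_content": f.read()})
    return {"data_source": prefix, "label": label, "data": data}
```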
> Let's shoot for this structure when adding our files to the repo: […]
@redouane-dziri Are we sure we want to have .py scripts in our data folder? That folder is usually reserved for pure data
No definite opinion on this on my end, as long as we all follow the same structure to avoid cleaning it up later :) I liked the idea of having the script that generates the JSON close to its output, but by all means, if you really want to change that, suggest a new folder tree and I will adapt my PR accordingly.
On a slightly different topic, be aware that several extensions count as C++ files. According to Stack Overflow, GNU GCC recognises all of the following as C++ files, and will use C++ compilation regardless of whether you invoke it through `gcc` or `g++`: `.C`, `.cc`, `.cpp`, `.CPP`, `.c++`, `.cp`, or `.cxx`. Note the `.C` - case matters in GCC: `.c` is a C file whereas `.C` is a C++ file (if you let the compiler decide what it is compiling, that is).
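Based on that GCC list, an extension check for our scripts could look like this (names are illustrative):

```python
import os

# Extension set from the GCC behaviour quoted above.
# Note that case matters: ".C" is C++, ".c" is plain C.
CPP_EXTS = {".C", ".cc", ".cpp", ".CPP", ".c++", ".cp", ".cxx"}

def is_cpp_file(file_name: str) -> bool:
    """True if GCC would treat this file as C++ based on its extension."""
    return os.path.splitext(file_name)[1] in CPP_EXTS
```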
I think it would be better to have "file_path" as one of the fields rather than the file name (because of duplicate names etc.).
In my understanding, all file_names are prefixed, right? The prefix appears twice: once in "data_source" and once in "file_name". Maybe I misunderstood.
The most convenient way is to have the file_path as a field (and not just the file name). And you don't really need the prefix within the JSON (it will only be useful when concatenating each of our JSONs), so for better readability, I'd rather not have the prefix.
Once we have all the JSONs, we need to write a script to take them in and build the following JSON with all the data:

```json
{
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    ....
}
```

where `is_header` is `1` for header files (`.h`, `.hh`, `.hpp`, `.h++`) and `0` for the others.
(We decided it would be better to keep WindRiver's implementation as a baseline to compare our tool to, so best not to use it upstream, removing associated issue)
This has been done. The final JSON is enclosed in `[]` and not `{}`, because it makes more sense as a list than as a dictionary (since it has no keys). So the syntax is more like:

```json
[
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    ....
]
```
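A concatenation script matching this list format could be sketched as follows (the function name, header-extension set, and field mapping are assumptions based on the discussion above, not the committed script):

```python
import json
import os

# Header extensions agreed above; ".h++" spelled out as a dotted extension.
HEADER_EXTS = {".h", ".hh", ".hpp", ".h++"}

def merge_jsons(json_paths):
    """Flatten the per-source JSONs into the final list-of-records format."""
    records = []
    for path in json_paths:
        with open(path, encoding="utf-8") as f:
            source_json = json.load(f)
        for item in source_json["data"]:
            ext = os.path.splitext(item["file_name"])[1]
            records.append({
                "file_name": item["file_name"],
                "is_header": int(ext in HEADER_EXTS),
                "source": source_json["data_source"],
                "label": source_json["label"],
                "content": item["file_content"],
            })
    return records
```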
I modified some of the data collection code to be able to run it. Please make sure you don't include your absolute paths in the script; we should all be able to run scripts at the push of a button. Relative paths are not good enough either, as they assume we are in a given working directory, which turned out not to be the case when I ran some scripts. A decent solution is using `git_root`. Also, use `os.path.join` to create path strings: not all OSes use the same separator for folders (`/` vs. `\`), so code is not portable when you include path strings like `a/b/c`.
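A sketch of both points together - resolving the repo root via git and joining path components portably (the helper names are illustrative; `root` would typically come from `git_root()`):

```python
import os
import subprocess

def git_root() -> str:
    """Absolute path of the repo root, wherever the script is run from."""
    return subprocess.check_output(
        ["git", "rev-parse", "--show-toplevel"], text=True
    ).strip()

def data_path(root: str, *parts: str) -> str:
    """Build a path under data/ with os.path.join instead of hard-coded '/'."""
    return os.path.join(root, "data", *parts)
```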
Added a script to split into train/test, as well as the JSON produced in `data`. I stratified the split by data source and chose a 15% proportion of test files. We shouldn't do anything with the test files from now on until some advanced evaluation of our models, so please only work with `train` when you start building models.
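The stratified split can be sketched with the standard library alone (parameter names and the seed are illustrative; the actual script may use a library splitter instead):

```python
import random
from collections import defaultdict

def stratified_split(records, test_frac=0.15, seed=0):
    """Split records into train/test lists, stratified by the 'source' field."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for recs in by_source.values():
        rng.shuffle(recs)
        n_test = round(len(recs) * test_frac)  # 15% of each source
        test.extend(recs[:n_test])
        train.extend(recs[n_test:])
    return train, test
```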
If someone has something to add on dataset generation, or any of the previous points in this discussion, please do so here. Otherwise, I will close this issue.
Opening this to keep track of what data we fetch, how we fetch it and how we label.
Crypto
Cryptographic competitions from which we can fetch some code:
Also some of the most common cryptographic libraries (e.g. OpenSSL, Crypto++, libGCrypt, BOTAN, libMD, GlibC, QT, JAVA SE 7, WinCrypt).

Non-crypto
We also need to grab some non-crypto code (e.g. the Google Code Jam dataset https://github.com/Jur1cek/gcj-dataset, https://codingcompetitions.withgoogle.com/codejam/archive - nice because it is algorithmic in nature, so it might provide a bit of a challenge to distinguish from crypto).
It is also important to get some non-crypto code that is completely different from crypto, and to be as representative as we can of the code we might want to test our tool on.
How to label the non-crypto code
We might fetch some crypto code amongst what is supposed to be non-crypto by mistake. We can use Crypto-Detector to run regex matching on the code fetched this way, to filter out obvious crypto code and move it to the crypto pile. As a matter of fact, we'll use it on the crypto files too, to double-check.
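As a stand-in for Crypto-Detector's rule set, a cheap regex pre-filter might look like this (the patterns below are illustrative only; the real tool's rules are far more complete):

```python
import re

# Illustrative keyword patterns only - NOT Crypto-Detector's actual rules.
CRYPTO_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"\bAES\b", r"\bRSA\b", r"\bSHA-?(1|256|512)\b", r"\bencrypt"]
]

def looks_like_crypto(file_content: str) -> bool:
    """Cheap regex pre-check to flag files that may belong in the crypto pile."""
    return any(p.search(file_content) for p in CRYPTO_PATTERNS)
```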
Labels
One file = one label.
TODO