Closed redouane-dziri closed 4 years ago
We agreed on fetching individual files, as well as writing and running scripts to extract each file's content and output one JSON file per type of source - fields are `file_name` and `file_content`. We will concatenate the JSONs (shuffle?) once they're created.
Regarding filenames, they will be prefixed by the following:

- `crypto-competition_`
- `crypto-library_`
- `code-jam_`
- `others_`
So that we can trace the source of each data point easily - for debugging and evaluation purposes mainly.
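A minimal sketch of that prefixing convention (the mapping and helper name are illustrative, not an agreed API):

```python
# Illustrative mapping from source folder to the agreed filename prefix.
PREFIXES = {
    "crypto-competition": "crypto-competition_",
    "crypto-library": "crypto-library_",
    "code-jam": "code-jam_",
    "others": "others_",
}

def prefixed_name(source: str, file_name: str) -> str:
    """Prepend the source prefix so each data point stays traceable."""
    return PREFIXES[source] + file_name
```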
I have a cool script I would like to run as soon as you give me the files, please! It will look for a measure of software similarity using tokens (MOSS).
@Hadrien-Cornier Great! (let's continue this discussion in the Feature Extraction issue)
Let's shoot for this structure when adding our files to the repo:
```
data
|---- crypto-competition
|     |     crypto-competition_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
|---- crypto-library
|     |     crypto-library_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
|---- code-jam
|     |     code-jam_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
|---- others
|     |     others_data.json
|     |     data_to_json.py
|     |---- files
|     |     |     file_example.cpp
|     |     |     file_example_2.c
|     |     |     ....
```
And `prefix_data.json` can look like:

```json
{
    "data_source": prefix,
    "label": x,
    "data": [
        {
            "file_name": "example.c",
            "file_content": "//example string \n const c = 0"
        },
        ...
    ]
}
```
where `x` is `1` for `prefix == crypto-*` and `0` for the rest.
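For illustration, a per-source `data_to_json.py` could be sketched like this (function name and arguments are assumptions based on the structure above, not the final script):

```python
import os

def build_source_json(prefix: str, files_dir: str, label: int) -> dict:
    """Collect every file under files_dir into the agreed per-source schema.

    `prefix`, `files_dir`, and `label` are illustrative parameters; the
    real script may hard-code them per source folder.
    """
    data = []
    for file_name in sorted(os.listdir(files_dir)):
        path = os.path.join(files_dir, file_name)
        # errors="replace" guards against odd encodings in scraped code.
        with open(path, encoding="utf-8", errors="replace") as f:
            data.append({"file_name": file_name, "file_content": f.read()})
    return {"data_source": prefix, "label": label, "data": data}
```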
> Let's shoot for this structure when adding our files to the repo: […]
@redouane-dziri Are we sure we want to have .py scripts in our data folder? That folder is usually reserved for pure data
No definite opinion on this on my end, as long as we all follow the same structure to avoid cleaning it up later :) I liked the idea of having the script that generates the JSON close to its output, but by all means, if you really want to change that, suggest a new folder tree and I will adapt my PR accordingly.
On a slightly different topic, be aware that several extensions count as C++ files. According to Stack Overflow, GNU GCC recognises all of the following as C++ files, and will use C++ compilation regardless of whether you invoke it through `gcc` or `g++`: `.C`, `.cc`, `.cpp`, `.CPP`, `.c++`, `.cp`, or `.cxx`. Note the `.C` - case matters in GCC: `.c` is a C file whereas `.C` is a C++ file (if you let the compiler decide what it is compiling, that is).
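Based on that GCC list, an extension check for our scripts could look like this (names are illustrative):

```python
import os

# Extension set from the GCC behaviour quoted above.
# Note that case matters: ".C" is C++, ".c" is plain C.
CPP_EXTS = {".C", ".cc", ".cpp", ".CPP", ".c++", ".cp", ".cxx"}

def is_cpp_file(file_name: str) -> bool:
    """True if GCC would treat this file as C++ based on its extension."""
    return os.path.splitext(file_name)[1] in CPP_EXTS
```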
I think it would be better to have "file_path" as one of the fields rather than the file name (because of duplicate names etc.).
In my understanding, all file_names are prefixed, right? The prefix appears twice: once in "data_source" and once in "file_name". Maybe I misunderstood.
The most convenient way is to have the file_path as a field (and not just the file name). And you don't really need the prefix within the JSON (it will only be useful when concatenating each of our JSONs), so for better readability, I'd rather not have the prefix.
Once we have all the JSONs, we need to write a script to take them in and build the following JSON with all the data:

```json
{
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    ....
}
```

where `is_header` is `1` for header files (`.h`, `.hh`, `.hpp`, `.h++`) and `0` for the others.
(We decided it would be better to keep WindRiver's implementation as a baseline to compare our tool to, so best not to use it upstream, removing associated issue)
This has been done. The final JSON is enclosed in `[]` and not `{}`, because it makes more sense as a list than as a dictionary (since it has no keys). So the syntax is more like:

```json
[
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    {
        file_name: ...,
        is_header: ...,
        source: ...,
        label: ...,
        content: ...
    },
    ....
]
```
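A concatenation script matching this list format could be sketched as follows (the function name, header-extension set, and field mapping are assumptions based on the discussion above, not the committed script):

```python
import json
import os

# Header extensions agreed above; ".h++" spelled out as a dotted extension.
HEADER_EXTS = {".h", ".hh", ".hpp", ".h++"}

def merge_jsons(json_paths):
    """Flatten the per-source JSONs into the final list-of-records format."""
    records = []
    for path in json_paths:
        with open(path, encoding="utf-8") as f:
            source_json = json.load(f)
        for item in source_json["data"]:
            ext = os.path.splitext(item["file_name"])[1]
            records.append({
                "file_name": item["file_name"],
                "is_header": int(ext in HEADER_EXTS),
                "source": source_json["data_source"],
                "label": source_json["label"],
                "content": item["file_content"],
            })
    return records
```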
I modified some of the data collection code to be able to run it. Please make sure you don't include your absolute paths in the script; we should all be able to run scripts at the push of a button. Relative paths are not good enough either, as they assume we are in a given working directory, which turned out not to be the case when I ran some scripts. A decent solution is using `git_root`. Also, use `os.path.join` to create path strings: not all OSes use the same separator for folders (`/` vs. `\`), so code is not portable when you include path strings like `a/b/c`.
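A sketch of both points together - resolving the repo root via git and joining path components portably (the helper names are illustrative; `root` would typically come from `git_root()`):

```python
import os
import subprocess

def git_root() -> str:
    """Absolute path of the repo root, wherever the script is run from."""
    return subprocess.check_output(
        ["git", "rev-parse", "--show-toplevel"], text=True
    ).strip()

def data_path(root: str, *parts: str) -> str:
    """Build a path under data/ with os.path.join instead of hard-coded '/'."""
    return os.path.join(root, "data", *parts)
```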
Added a script to split into train/test, as well as the JSON produced in `data`. I stratified the split by data source and chose a 15% proportion of test files. We shouldn't do anything with the test files from now on until some advanced evaluation of our models, so please only work with `train` when you start building models.
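The stratified split can be sketched with the standard library alone (parameter names and the seed are illustrative; the actual script may use a library splitter instead):

```python
import random
from collections import defaultdict

def stratified_split(records, test_frac=0.15, seed=0):
    """Split records into train/test lists, stratified by the 'source' field."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for recs in by_source.values():
        rng.shuffle(recs)
        n_test = round(len(recs) * test_frac)  # 15% of each source
        test.extend(recs[:n_test])
        train.extend(recs[n_test:])
    return train, test
```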
If someone has something to add on dataset generation, or any of the previous points in this discussion, please do so here. Otherwise, I will close this issue.
Opening this to keep track of what data we fetch, how we fetch it and how we label.
Crypto
Cryptographic competitions from which we can fetch some code:
Also some of the most common cryptographic libraries (e.g. OpenSSL, Crypto++, libGCrypt, BOTAN, libMD, GlibC, QT, JAVA SE 7, WinCrypt).

Non-crypto
We also need to grab some non-crypto code (e.g. the Google Code Jam dataset https://github.com/Jur1cek/gcj-dataset, https://codingcompetitions.withgoogle.com/codejam/archive - nice because it is algorithmic in nature, so it might provide a bit of a challenge to distinguish from crypto).
It is also important to get some non-crypto code that is completely different from crypto, and to be as representative as we can of the code we might want to test our tool on.
How to label the non-crypto code
We might fetch some crypto code amongst what is supposed to be non-crypto by mistake. We can use Crypto-Detector to run regex matching on the code fetched this way, to filter out obvious crypto code and move it to the crypto pile. As a matter of fact, we'll use it on the crypto files too, to double-check.
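As a stand-in for Crypto-Detector's rule set, a cheap regex pre-filter might look like this (the patterns below are illustrative only; the real tool's rules are far more complete):

```python
import re

# Illustrative keyword patterns only - NOT Crypto-Detector's actual rules.
CRYPTO_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"\bAES\b", r"\bRSA\b", r"\bSHA-?(1|256|512)\b", r"\bencrypt"]
]

def looks_like_crypto(file_content: str) -> bool:
    """Cheap regex pre-check to flag files that may belong in the crypto pile."""
    return any(p.search(file_content) for p in CRYPTO_PATTERNS)
```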
Labels
One file = one label.
TODO