Merge datasets as part of data preparation

Currently, each experiment / batch of samples has its own records.json file. We should eventually publish the dataset in a consise manner. We should write a script that will merge the jsons into a single file. Care must be taken to:

Include source of the sample into the json (Androzoo / Avast / custom)
Include the label of the sample (benign / malicious)

The process outlined above should be probably a part of preparation API, so that further tools (train, explain) can work with the already processed dataset.

Ideally, unless anyone intends to enrich our dataset with own samples, they will only run cli_train and cli_explain on the already merged dataset (that they can fetch from some online source).

Currently, the datasets can actually be cleaned and merged using part of the preparation API, specifically cleaning with a specified output path. However, the result JSON is not really ready for training and explanation only after cleaning. Just as a reminder, the whole preparation pipeline works like this (I will make sure to update configs and READMEs):

Clean
- removes duplicates missing values etc.
- the optional output is one JSON
Select records
- this pretty much only adjusts the target labels (swap benign for malicious as a target for detection because it makes more sense when using f1-score | group family labels and drop the records with unknown family)
- for each task (detection and labeling) the optional output is two JSON (raw features and target) - this could be possibly altered to be one JSON
Feature engineering
- for each objective, the dataset is split, numerical features are engineered, the features are optionally scaled
- for each objective, the output is currently four CSV files (train/test features/target). However, this could be potentially changed into one JSON file or one .h5 file (just like the output of cli_explain for lower memory storage).
Feature selection
- "bad" (low variance, high correlation, low strength) features are dropped here
- the output is pretty much the same as in feature engineering As a reminder, usually, the whole pipeline is used (starting from input in Clean and ending with output 4 CSVs in Feature selection). However, the API can be also potentially used to store the results of the other steps of the pipeline during computation but also to remove some steps (for example if feature selection is not desired it can be completely dropped)

I think it should be possible to include the source of the sample into the preparation API. Currently, each sample is identified by its hash, so the easiest approach would be to prefix the hash with the source of the sample. The benign/malicious sample is already present when the dataset is merged because in config the input paths have to be specified as either malicious/benign.

So, I am currently not sure how to proceed with this issue. Maybe, I should:

Make the output be always one JSON (for feature engineering or feature selection this can be a .h5 file).
Add the sample source as the prefix of hash?

adamjanovsky / AndroidMalwareCrypto

Merge datasets as part of data preparation #11