Closed adamjanovsky closed 2 years ago
Currently, the datasets can already be cleaned and merged using part of the preparation API, specifically the cleaning step with a specified output path. However, the resulting JSON is not actually ready for training and explanation after cleaning alone.
Just as a reminder, the whole preparation pipeline works in stages: the preparation API cleans and merges the raw datasets, and `cli_train` and `cli_explain` then consume the processed output (I will make sure to update configs and READMEs).

I think it should be possible to include the source of each sample in the preparation API. Currently, each sample is identified by its hash, so the easiest approach would be to prefix the hash with the sample's source. The benign/malicious label is already present when the dataset is merged, because the input paths in the config have to be specified as either malicious or benign.
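The source-prefixed identifier could look like the minimal sketch below; the helper name, the hashing choice, and the source labels are my assumptions, not the project's actual API:

```python
import hashlib

def sample_id(content: bytes, source: str) -> str:
    """Identify a sample by its content hash, prefixed with its source.

    `source` is a hypothetical label, e.g. the name of the
    experiment/batch the sample came from.
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"{source}:{digest}"

# Two identical payloads from different sources get distinct IDs,
# while the hash part still allows deduplication across sources.
a = sample_id(b"payload", "batch-a")
b = sample_id(b"payload", "batch-b")
assert a.split(":")[1] == b.split(":")[1]  # same content hash
assert a != b                              # different source prefix
```

Keeping the hash as a recognizable suffix means existing hash-based lookups can still strip the prefix and work unchanged.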
So, I am currently not sure how to proceed with this issue.
Currently, each experiment / batch of samples has its own `records.json` file. We should eventually publish the dataset in a concise manner, so we should write a script that merges the JSONs into a single file; care must be taken during the merge.

The merging process outlined above should probably be part of the preparation API, so that the downstream tools (train, explain) can work with the already processed dataset.
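A rough sketch of such a merge script follows; the dict-of-records layout keyed by sample hash is an assumption about the `records.json` schema, not its confirmed format:

```python
import json
from pathlib import Path

def merge_records(record_files, out_path):
    """Merge per-experiment records.json files into a single JSON file.

    Assumes each file maps a (source-prefixed) sample hash to its
    record, so a plain dict merge deduplicates samples that appear
    in several batches.
    """
    merged = {}
    for path in record_files:
        records = json.loads(Path(path).read_text())
        for sample_hash, record in records.items():
            # Keep the first occurrence; flag conflicting duplicates.
            if sample_hash in merged and merged[sample_hash] != record:
                print(f"conflict for {sample_hash}, keeping first seen")
            merged.setdefault(sample_hash, record)
    Path(out_path).write_text(json.dumps(merged, indent=2))
    return merged
```

Keying on the sample hash makes the merge idempotent: re-running it over overlapping batches never duplicates a sample.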
Ideally, unless someone intends to enrich our dataset with their own samples, they will only run `cli_train` and `cli_explain` on the already merged dataset (which they can fetch from some online source).