DRSM (De-Randomized Smoothed MalConv)

We redesigned the de-randomized smoothing scheme to make it applicable for byte-based inputs. We used MalConv as our base classifier and hence, we named our model DRSM (De-Randomized Smoothed MalConv). We are also publishing our PACE (Publicly Accessible Collection(s) of Executables) dataset containing diverse benign raw executables.

Please cite our paper if you use our model or data. This paper will appear in ICLR 2024.

@article{saha2023drsm,
  title={Drsm: De-randomized smoothing on malware classifier providing certified robustness},
  author={Saha, S and Wang, W and Kaya, Y and Feizi, S and Dumitras, T},
  journal={arXiv preprint arXiv:2303.13372},
  year={2023}
}

Setup

Using Conda

Simply run the command -

conda env create -n pytorch_env_smksaha --file environment.yml

That's it! You're good to go! :smile:

Training

To train the models, run train_custom_malconv_by_ablation_from_csv.py. Or, simply run the provided scripts, such as train_script1.sh, train_script4.sh, etc.

Here, the number in the script name indicates the n in our DRSM-n models. For example, running train_script4.sh script would train our DRSM-4.

Below are the arguments for training -

--root_dir: path of root directory that contains the dataset

--train_path: path of csv that contains the file paths from train-set

--val_path: path of csv that contains the file paths from validation-set

--test_path: path of csv that contains the file paths from test-set

--dir_path: directory where the trained model will be saved

--ablation_idx: the specific ablation of the model will be trained

--ablations: total number of ablations a model can have

--dataset_size: if you want to limit your train on specific number of samples. Otherwise, it will train on all files from 'train.csv'

--epochs: number of epochs the model will be trained

--batch_size: batch size to train on

--non_neg: True if you want to put non-negative weight constraint on the model

We would recommend using our provided scripts to train the models. It's easier. You just need to provide root_directory, file_paths, and batch_size in that case. After running this, the models will be saved in the secml_malware/data/trained folder.

Evaluation

Standard Accuracy

Run the evaluate_custom_malconv_by_ablation_from_csv.py file. It takes almost the same arguments as training. However, you can simply run the evaluate_script.sh with just a few arguments.

After running this, it will output the standard accuracy with other metrics like false positive, true positive, confusion matrix etc.

Certified Accuracy

Run the evaluate_custom_malconv_cert_acc.py file, or the script cert_acc_script.sh. It takes a list with --perturb_size argument for which it will generate the results.

PACE Dataset

This dataset contains over 15.5K benign raw executables collected from different free websites. See our paper for a specific breakdown.

In the dataset folder, there are .csv files with website names, for example, sourceforge.csv. These csv files contain website URLs to download the executables from along with their MD5 hash.

You can follow/use the provided dataset/download_benign.py for download convenience.

Version	Total Files	Collection Time
v1	15.5K	August, 2022
v2	18.5K	September, 2023

We will be updating the dataset on a regular basis. Please, report the version of dataset you are using.

*For malware dataset, we used the Virusshare website. To be specific, we used the VirusShare_00434 folder for our malware dataset.

ShoumikSaha / DRSM

readme