elastic / ember

Elastic Malware Benchmark for Empowering Researchers

Saved Model .txt for Optimised LGB on EMBER2018 #51

Closed wilsoncwj closed 4 years ago

wilsoncwj commented 4 years ago

Hi, does anyone happen to have the optimized LGB model .txt file? Specifically for the LGB model trained on EMBER2018 features.

I am not able to run train_ember.py with the --optimize flag due to limited hardware memory. If anyone has the saved optimized model and is able to share it, that would be much appreciated!

mrphilroth commented 4 years ago

The optimized model is included in the large data download (available at https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2). The LGB text file, ember_model_2018.txt, is in there with the jsonl files:

$ tar xvf ember_dataset_2018_2.tar.bz2
x ember2018/
x ember2018/train_features_1.jsonl
x ember2018/train_features_0.jsonl
x ember2018/train_features_3.jsonl
x ember2018/test_features.jsonl
x ember2018/ember_model_2018.txt
x ember2018/train_features_5.jsonl
x ember2018/train_features_4.jsonl
x ember2018/train_features_2.jsonl
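
For anyone who just needs predictions, the released booster can be loaded directly with the LightGBM Python API instead of retraining. A minimal sketch (the 2381-column feature width assumes EMBER2018 / feature version 2, and the input below is a placeholder):

import lightgbm as lgb
import numpy as np

# Load the released model straight from its text dump; no training
# (and no large memory footprint) is required.
model = lgb.Booster(model_file="ember2018/ember_model_2018.txt")

# Score a batch of vectorized samples. EMBER2018 feature vectors are
# assumed here to be 2381-dimensional (feature version 2).
X = np.zeros((1, 2381), dtype=np.float32)  # placeholder features
scores = model.predict(X)                  # maliciousness scores in [0, 1]
print(scores)
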
wilsoncwj commented 4 years ago

Thanks for the quick reply! I assumed that the ember_model_2018.txt was the "unoptimized" version. Seems like I've been using the right one all along!

wilsoncwj commented 4 years ago

To clarify, in your research paper EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, it was mentioned that:

From the vectorized features, we trained a gradient-boosted decision tree (GBDT) model using LightGBM with
default parameters (100 trees, 31 leaves per tree), resulting in fewer than 10K tunable parameters [14]. Model training
took 3 hours. Baseline model performance may be much improved with appropriate hyper-parameter optimization,
which is of less interest to us in this work.

Furthermore, in your source code for ember, the optimized version would have been saved via lgbm_model.save_model(os.path.join(args.datadir, "optimised_model.txt")).

Therefore, I assumed that the ember_model_2018.txt was the original "unoptimized" version. Hence the clarification!
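
For reference, the --optimize path in train_ember.py runs a hyper-parameter search before the final fit. The exact search space lives in the ember source; a rough sketch of the general pattern (the grid, placeholder data, and scoring below are illustrative assumptions, not ember's actual code):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Placeholder data; in practice these would be the vectorized EMBER
# features and labels.
X_train = np.random.rand(200, 2381).astype(np.float32)
y_train = np.random.randint(0, 2, size=200)

# Illustrative grid only; ember's actual search space differs.
param_grid = {
    "num_leaves": [512, 1024, 2048],
    "max_depth": [5, 10, 15],
    "colsample_bytree": [0.5, 1.0],  # sklearn-wrapper alias for feature_fraction
}

search = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type="gbdt", objective="binary"),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Persist the winning booster, analogous to the save_model call quoted above.
search.best_estimator_.booster_.save_model("optimised_model.txt")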

mrphilroth commented 4 years ago

Ah, that makes sense. That quote was written and released with EMBER 2017. In that case, we only released the default model, and I no longer have an optimized model to post. We did not release a new paper for the EMBER 2018 release. For EMBER 2018, the released model is already optimized, using this grid search: https://docs.google.com/presentation/d/1A13tsUkgWeujTy9SD-vDFfQp9fnIqbSE_tCihNPlArQ/edit#slide=id.g6318784c2c_0_1131

wilsoncwj commented 4 years ago

I see. So just to confirm: the ember_model_2018.txt included in https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2 is the unoptimized LGB model?

My problem is that I am unable to train the LGB model locally due to Out-Of-Memory (OOM) issues, hence I am asking around for the optimized_model.txt so I can just load it in.

Once again, I am wondering if anyone out there has successfully trained LGB with the --optimize flag, arrived at the following best params, and is able to share the resulting optimized_model.txt?

From the slides shared by Phil:

best_params = {
  "boosting": "gbdt",
  "objective": "binary",
  "num_iterations": 1000,
  "learning_rate": 0.05,
  "num_leaves": 2048,
  "feature_fraction": 0.5,
  "bagging_fraction": 1.0,
  "max_depth": 15,
  "min_data_in_leaf": 50,
}
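
For completeness, a minimal sketch of training and saving a model with these best params, assuming the ember package's read_vectorized_features helper, a local "ember2018" data directory, and enough memory to hold the feature matrix:

import lightgbm as lgb
import ember

best_params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "feature_fraction": 0.5,
    "bagging_fraction": 1.0,
    "max_depth": 15,
    "min_data_in_leaf": 50,
}

# Read the pre-vectorized training features; the data directory path
# is an assumption.
X_train, y_train = ember.read_vectorized_features("ember2018", subset="train")

# Drop unlabeled rows (label == -1) before training, as ember's own
# train script does.
labeled = y_train != -1
model = lgb.train(best_params, lgb.Dataset(X_train[labeled], y_train[labeled]))
model.save_model("optimized_model.txt")
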
wilsoncwj commented 4 years ago

After looking at the ember train script source again, I realize that the default set of parameters is already the optimized set. Compare what Phil had in his CAMLIS 2019 presentation (posted in the comment above) with the original params:

Default train script:
params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "max_depth": 15,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.5,
}

The only difference is the addition of "bagging_fraction": 1.0, which according to the LightGBM documentation is already the default value.
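
A quick sanity check on that claim is to diff the two dicts directly; a trivial sketch:

default_params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "max_depth": 15,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.5,
}

best_params = dict(default_params, bagging_fraction=1.0)

# Keys added or changed in best_params relative to the defaults:
diff = {k: v for k, v in best_params.items() if default_params.get(k) != v}
print(diff)  # {'bagging_fraction': 1.0}, LightGBM's default value anyway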