The optimized model is available in the large data download (available at https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2). The LightGBM model text file named ember_model_2018.txt is in there with the jsonl files:
$ tar xvf ember_dataset_2018_2.tar.bz2
x ember2018/
x ember2018/train_features_1.jsonl
x ember2018/train_features_0.jsonl
x ember2018/train_features_3.jsonl
x ember2018/test_features.jsonl
x ember2018/ember_model_2018.txt
x ember2018/train_features_5.jsonl
x ember2018/train_features_4.jsonl
x ember2018/train_features_2.jsonl
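If it helps anyone else arriving here, a minimal sketch for loading that file directly with the LightGBM Python API (assuming lightgbm is installed and the archive was extracted as above):

import lightgbm as lgb

# Load the pre-trained EMBER 2018 booster shipped in the archive
model = lgb.Booster(model_file="ember2018/ember_model_2018.txt")
print(model.num_trees())  # quick sanity check that the model loaded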
Thanks for the quick reply! I assumed that the ember_model_2018.txt was the "unoptimized" version. Seems like I've been using the right one all along!
To clarify, in your research paper, EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models, it was mentioned that:
From the vectorized features, we trained a gradient-boosted decision tree (GBDT) model using LightGBM with
default parameters (100 trees, 31 leaves per tree), resulting in fewer than 10K tunable parameters [14]. Model training
took 3 hours. Baseline model performance may be much improved with appropriate hyper-parameter optimization,
which is of less interest to us in this work.
Furthermore, in your source code for ember, the optimized version would have been saved via lgbm_model.save_model(os.path.join(args.datadir, "optimised_model.txt")). Therefore, I assumed that ember_model_2018.txt was the original "unoptimized" version. Hence the request for clarification!
Ah. That makes sense. That quote was written and released with EMBER 2017. For that release we only released the default model, and I no longer have an optimized model to post. We did not release a new paper for the EMBER 2018 release. For EMBER 2018, the model is already optimized with this grid search: https://docs.google.com/presentation/d/1A13tsUkgWeujTy9SD-vDFfQp9fnIqbSE_tCihNPlArQ/edit#slide=id.g6318784c2c_0_1131
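For anyone who wants to rerun that kind of search locally, here is a minimal sketch using scikit-learn's GridSearchCV with the LightGBM sklearn wrapper; the grid values below are illustrative placeholders (not the exact grid from the slides), and X_train/y_train are assumed to be the vectorized EMBER features:

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Illustrative search space only; see the linked slides for the real one
param_grid = {
    "num_leaves": [512, 1024, 2048],
    "max_depth": [10, 15],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    lgb.LGBMClassifier(boosting_type="gbdt", objective="binary", n_estimators=1000),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
# search.fit(X_train, y_train)
# print(search.best_params_)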
I see. So just to confirm: the ember_model_2018.txt that is part of https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2 is the unoptimized LGB model?
My problem is that I am unable to train the LGB model locally due to Out-Of-Memory (OOM) issues, hence I am asking around for the optimized_model.txt so I can just load it in.
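Side note, in case the OOM happens while loading the vectorized features rather than inside LightGBM itself: memory-mapping the feature files instead of reading them fully into RAM may help. A sketch, assuming the X_train.dat/y_train.dat files written by ember's vectorization step and the 2,381-dimensional EMBER2018 feature space; adjust paths, dtype, and dimension if your files differ:

import numpy as np

# Memory-map the features so only the pages actually touched are read in.
# File names, float32 dtype, and the 2381-dim size are assumptions based
# on ember's feature_version=2 vectorization output.
ndim = 2381
y_train = np.memmap("ember2018/y_train.dat", dtype=np.float32, mode="r")
nrows = y_train.shape[0]
X_train = np.memmap("ember2018/X_train.dat", dtype=np.float32, mode="r", shape=(nrows, ndim))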
Once again, wondering if anyone out there has successfully trained LGB with the --optimize flag, arrived at the following best params, and is able to share the resulting optimized_model.txt?
From the slides shared by Phil:
best_params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "feature_fraction": 0.5,
    "bagging_fraction": 1.0,
    "max_depth": 15,
    "min_data_in_leaf": 50,
}
After taking another look at the ember source code for the train script, I realized that the default set of parameters is already the optimized set. Compare what Phil had in his CAMLIS 2019 presentation (posted in the comment above) with the default params:
Default train script:
params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "max_depth": 15,
    "min_data_in_leaf": 50,
    "feature_fraction": 0.5
}
The only difference is the addition of "bagging_fraction": 1.0, which according to the LightGBM documentation is already the default value.
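So training with the default script should already reproduce the optimized setup. For completeness, a minimal sketch of that training call, assuming X_train/y_train are already loaded (e.g., memory-mapped as sketched earlier) and that unlabeled rows carry the -1 label as in the EMBER data:

import lightgbm as lgb

# Drop unlabeled rows (label -1) before training, as the train script does;
# note that boolean indexing copies the selected rows into RAM
labeled = y_train != -1
dataset = lgb.Dataset(X_train[labeled], y_train[labeled])
model = lgb.train(params, dataset)  # params as defined above
model.save_model("optimized_model.txt")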
Hi, does anyone happen to have the optimized LGB model .txt file? Specifically, the LGB model trained on the EMBER2018 features. I am not able to run train_ember.py with the --optimize flag due to limited hardware memory. If anyone has the saved optimized model and is able to share it, that would be much appreciated!!