elastic / ember

Elastic Malware Benchmark for Empowering Researchers

Trouble reproducing benchmark LightGBM model #43

Closed lrsamson closed 4 years ago

lrsamson commented 4 years ago

I am attempting to reproduce the benchmark so I can compute additional evaluation metrics. I followed the procedure outlined and obtained a model, but it appears to perform worse than reported in the paper: AUC = 0.98576 (see my reply below for a correction) versus the paper's 0.99911. I figure this is likely due to the difference between versions of the 2017 dataset, since I'm using feature version 2. Can anyone confirm?

I'm primarily interested in the accuracy and F1 score of the original benchmark for comparison purposes, since the benchmark still appears to outperform any DNN in the existing literature (which is very cool). Thanks!

mrphilroth commented 4 years ago

Hi! We didn't write another paper when we updated the dataset to 2018 samples and version 2 features. This notebook summarizes the performance changes: https://github.com/endgameinc/ember/blob/master/resources/ember2018-notebook.ipynb We chose samples and features that are more difficult to classify than those in the 2017 dataset, which is why the model shows worse performance despite being optimized with a small grid search.

The benchmark model is shipped with the dataset. You should be able to modify it and work with it from there. You can train a very similar model by running the code in that notebook. Unfortunately, it is not exactly reproducible, as I talk about here: https://www.youtube.com/watch?v=MsZmnUO5lkY&t=10m25s
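For anyone new to the package, here is a minimal sketch of loading the shipped model and scoring the test split. It assumes the features have already been vectorized (e.g. with ember.create_vectorized_features) and that read_vectorized_features returns (X, y) for a given subset as in the current ember package; the paths are placeholders.

```python
# Minimal sketch (not from this thread): load the shipped benchmark model and
# score the EMBER test split. Paths are placeholders; the read_vectorized_features
# signature is assumed to match the current ember package.
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

import ember

data_dir = "/path/to/ember_2017"                 # directory with vectorized features
model_path = "/path/to/ember_model_2017.txt"     # benchmark model shipped with the dataset

X_test, y_test = ember.read_vectorized_features(data_dir, subset="test")
bst = lgb.Booster(model_file=model_path)

scores = bst.predict(X_test)                     # maliciousness scores in [0, 1]
print("ROC AUC:", roc_auc_score(y_test, scores))
```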

Thanks for working with EMBER!

lrsamson commented 4 years ago

Thanks for the quick reply and for pointing me to the resources directory.

Just to be clear, I have been using feature version 2 of the 2017 dataset, not the 2018 data. In any case, I cleared up some of my confusion: I made the error of feeding rounded predictions (hard labels rather than probabilities) into metrics.roc_auc_score(), so the number in my first post was essentially the accuracy, not the AUC.
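For anyone hitting the same snag, here is a tiny illustration (with made-up numbers, not values from this thread) of the difference between passing scores and passing thresholded labels to roc_auc_score:

```python
# Toy illustration with made-up labels/scores (not the real EMBER outputs):
# roc_auc_score needs the continuous scores; passing thresholded labels collapses
# the ROC curve to a single operating point.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.60, 0.35, 0.80, 0.90])   # model probabilities

auc = roc_auc_score(y_true, scores)                    # correct: rank-based AUC
auc_from_labels = roc_auc_score(y_true, scores > 0.5)  # wrong input: hard labels
acc = accuracy_score(y_true, scores > 0.5)             # on a balanced set this matches the line above

print(auc, auc_from_labels, acc)
```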

After downloading and processing version 1 of the 2017 data, I obtained the same AUC as in the paper and the notebook. I went ahead and computed my desired metrics for both versions of the 2017 data and am sharing them here in case they are of value to anyone else.

Some notes and disclaimers: I used lief 0.9.0 to process the raw data for both versions. Despite the warnings, I obtained exactly the same AUC down to the last decimal for version 1, so the generated features must be nearly identical. For 2017 v1 I used the shipped model; since 2017 v2 doesn't ship with one, I trained a model with ember.train_model() using its default parameters.

2017 Feature Version 1:
AUC: 0.999112327
Accuracy: 0.98607
F1: 0.9860317269317931 (Precision: 0.9887483409081768, Recall: 0.98333)
Confusion Matrix: [[98881 1119] [1667 98333]]

2017 Feature Version 2:
AUC: 0.9990833638
Accuracy: 0.98576
F1: 0.9857253125093979 (Precision: 0.9881323230902185, Recall: 0.98333)
Confusion Matrix: [[98819 1181] [1667 98333]]
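For anyone who wants to reproduce the table above, here is a rough sketch of the evaluation. The exact ember.train_model signature differs a bit between package versions, so treat this as an outline rather than copy-paste code; the path is a placeholder.

```python
# Rough outline of the evaluation above: train with ember's default parameters on
# 2017 v2 and compute AUC/accuracy/F1/precision/recall plus the confusion matrix.
# ember.train_model is assumed to return a trained LightGBM booster.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

import ember

data_dir = "/path/to/ember_2017_v2"              # placeholder path to vectorized data

model = ember.train_model(data_dir)              # default parameters, as used above
X_test, y_test = ember.read_vectorized_features(data_dir, subset="test")

scores = model.predict(X_test)
y_pred = (scores > 0.5).astype(int)              # 0.5 threshold (the "rounded" predictions)

print("AUC:      ", roc_auc_score(y_test, scores))
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```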

mrphilroth commented 4 years ago

This is useful information. Thanks for sharing!

whyisyoung commented 4 years ago

Thanks @lrsamson for sharing the results, they're really helpful! I also tried this myself. The shipped model file ember_model_2017.txt (about 342.7 KB) was trained with 100 trees. If I use the same parameters as in train_ember.py, i.e. 1000 trees (the model file grows to about 96.2 MB), the results come out better than those reported in the paper. With that setup we can compare feature version 1 and version 2, and there is not much difference between them.

Here are my results:

2017 Feature Version 1:

ROC AUC: 0.9997627814

Ember Model Performance at 1% FPR:
Threshold: 0.006
Accuracy: 0.99282
False Positive Rate: 1.000%
False Negative Rate: 0.436%
Detection Rate: 99.564%
Confusion Matrix: [[99000 1000] [436 99564]]

Ember Model Performance at 0.1% FPR:
Threshold: 0.941
Accuracy: 0.993365
False Positive Rate: 0.100%
False Negative Rate: 1.227%
Detection Rate: 98.773%
Confusion Matrix: [[99900 100] [1227 98773]]

2017 Feature Version 2:

ROC AUC: 0.9997735828

Ember Model Performance at 1% FPR:
Threshold: 0.006
Accuracy: 0.99288
False Positive Rate: 0.999%
False Negative Rate: 0.425%
Detection Rate: 99.575%
Confusion Matrix: [[99001 999] [425 99575]]

Ember Model Performance at 0.1% FPR:
Threshold: 0.957
Accuracy: 0.993315
False Positive Rate: 0.100%
False Negative Rate: 1.237%
Detection Rate: 98.763%
Confusion Matrix: [[99900 100] [1237 98763]]
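In case it helps anyone reproduce these operating points, here is a sketch of how a threshold at a fixed false positive rate can be derived from the scores. It mirrors the style of the evaluation in the ember notebooks rather than copying it exactly, and the y_test/scores below are synthetic stand-ins so the snippet runs on its own.

```python
# Sketch: pick a score threshold whose false positive rate is at most the target
# and report the metrics at that operating point. The labels/scores below are
# synthetic stand-ins; in practice they come from the real model evaluation.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def report_at_fpr(y_true, scores, target_fpr):
    """Report performance at a threshold whose FPR does not exceed target_fpr."""
    benign_scores = np.sort(scores[y_true == 0])
    cutoff_index = int(np.ceil(len(benign_scores) * (1.0 - target_fpr))) - 1
    threshold = benign_scores[cutoff_index]
    y_pred = (scores > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"Threshold: {threshold:.3f}")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.6f}")
    print(f"False Positive Rate: {fp / (fp + tn):.3%}")
    print(f"False Negative Rate: {fn / (fn + tp):.3%}")
    print(f"Detection Rate: {tp / (tp + fn):.3%}")
    print(f"Confusion matrix: [[{tn} {fp}] [{fn} {tp}]]")

# Synthetic stand-ins; replace with the real y_test and model scores.
rng = np.random.default_rng(0)
y_test = np.repeat([0, 1], 100000)
scores = np.concatenate([rng.beta(1, 8, 100000), rng.beta(8, 1, 100000)])

report_at_fpr(y_test, scores, 0.01)    # ~1% FPR operating point
report_at_fpr(y_test, scores, 0.001)   # ~0.1% FPR operating point
```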