Hi! We didn't write another paper when we updated the dataset to 2018 samples and version 2 features. This notebook summarizes the performance changes: https://github.com/endgameinc/ember/blob/master/resources/ember2018-notebook.ipynb We chose samples and features that are more difficult to classify than those in the 2017 dataset, which is why the model shows worse performance despite being optimized with a small grid search.
The benchmark model is shipped with the dataset. You should be able to modify it and work with it from there. You can train a very similar model by running the code in that notebook. Unfortunately, it is not exactly reproducible, as I talk about here: https://www.youtube.com/watch?v=MsZmnUO5lkY&t=10m25s
Thanks for working with EMBER!
Thanks for the quick reply and for pointing me to the resources directory.
Just to be clear, I have been using feature version 2 of the 2017 dataset, not the 2018 data. Regardless, I cleared up some of my confusion: I made the error of feeding rounded predictions (hard labels, not probabilities) into metrics.roc_auc_score(), so I essentially computed accuracy rather than AUC in my first post.
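For reference, here is a minimal sketch of the difference between the two calls (scikit-learn, with made-up y_true/y_prob arrays purely for illustration):

```python
import numpy as np
from sklearn import metrics

# Made-up labels and model scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20])

# Correct: pass the raw probabilities so the ROC curve can sweep thresholds.
print(metrics.roc_auc_score(y_true, y_prob))

# My mistake: rounding to hard 0/1 labels first, which throws away the ranking
# information that AUC is supposed to measure.
y_pred = np.round(y_prob).astype(int)
print(metrics.roc_auc_score(y_true, y_pred))
```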
After downloading and processing version 1 of the 2017 data, I obtained the same AUC as in the paper and the notebook. I went ahead and computed my desired metrics for both versions of the 2017 data and am sharing them here in case they are of value to anyone else.
Some notes and disclaimers: I used lief 0.9.0 to process the raw data for both versions. Despite the warnings, I obtained exactly the same AUC down to the last decimal place for version 1, so I figure the generated features must be nearly identical. I used the model shipped with 2017 v1, but since 2017 v2 doesn't ship with one, I trained one with ember.train_model() using default parameters.
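Roughly what that looked like on my end, as a sketch: the path is a placeholder, and the exact ember function signatures may vary slightly between releases of the package.

```python
import ember
from sklearn import metrics

DATA_DIR = "/path/to/ember_2017_2"  # placeholder path to the extracted dataset

# Vectorize the raw JSONL features (skip if the .dat files already exist).
ember.create_vectorized_features(DATA_DIR, feature_version=2)

# Train the LightGBM benchmark model with default parameters.
model = ember.train_model(DATA_DIR, feature_version=2)

# Score the test split with the raw probabilities and compute AUC.
X_test, y_test = ember.read_vectorized_features(DATA_DIR, subset="test", feature_version=2)
y_prob = model.predict(X_test)
print("AUC:", metrics.roc_auc_score(y_test, y_prob))
```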
2017 Feature Version 1:
AUC: 0.999112327
Accuracy: 0.98607
F1: 0.9860317269317931 (Precision: 0.9887483409081768, Recall: 0.98333)
Confusion Matrix: [[98881 1119] [1667 98333]]
2017 Feature Version 2:
AUC: 0.9990833638
Accuracy: 0.98576
F1: 0.9857253125093979 (Precision: 0.9881323230902185, Recall: 0.98333)
Confusion Matrix: [[98819 1181] [1667 98333]]
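For completeness, the secondary metrics above come from hard predictions; here is a sketch of how they can be derived (assuming a 0.5 cutoff and the y_test/y_prob arrays from the training step above):

```python
from sklearn import metrics

y_pred = (y_prob > 0.5).astype(int)  # hard labels at a 0.5 cutoff

print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
print("F1:       ", metrics.f1_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:   ", metrics.recall_score(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
```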
This is useful information. Thanks for sharing!
Thanks @lrsamson for sharing the results. They're really helpful! I also tried this myself. The shipped model file ember_model_2017.txt (about 342.7 KB) was trained with 100 trees. If I instead use the same parameters as train_ember.py, i.e., 1000 trees (the model file is then ~96.2 MB), the results come out better than those claimed in the paper. In this case we can compare feature version 1 and feature version 2, and there is not much difference between them.
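For anyone reproducing this, passing a larger parameter set looks roughly like the sketch below; num_iterations is the relevant setting, and the other values are illustrative assumptions that should be checked against train_ember.py rather than copied as-is.

```python
import ember

DATA_DIR = "/path/to/ember_2017_2"  # placeholder path

# LightGBM parameters for the 1000-tree run; only num_iterations is taken from
# the discussion above, the rest are illustrative and may differ from train_ember.py.
params = {
    "boosting": "gbdt",
    "objective": "binary",
    "num_iterations": 1000,
    "learning_rate": 0.05,
    "num_leaves": 2048,
    "feature_fraction": 0.5,
    "bagging_fraction": 0.5,
}

model = ember.train_model(DATA_DIR, params, feature_version=2)
model.save_model("ember_model_1000_trees.txt")  # hypothetical output filename
```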
Here are my results:
ROC AUC: 0.9997627814
Ember Model Performance at 1% FPR:
Threshold: 0.006
Accuracy: 0.99282
False Positive Rate: 1.000%
False Negative Rate: 0.436%
Detection Rate: 99.564%
Confusion matrix: [[99000 1000] [436 99564]]
Ember Model Performance at 0.1% FPR:
Threshold: 0.941
Accuracy: 0.993365
False Positive Rate: 0.100%
False Negative Rate: 1.227%
Detection Rate: 98.773%
Confusion matrix: [[99900 100] [1227 98773]]
ROC AUC: 0.9997735828
Ember Model Performance at 1% FPR:
Threshold: 0.006
Accuracy: 0.99288
False Positive Rate: 0.999%
False Negative Rate: 0.425%
Detection Rate: 99.575%
Confusion matrix: [[99001 999] [425 99575]]
Ember Model Performance at 0.1% FPR:
Threshold: 0.957
Accuracy: 0.993315
False Positive Rate: 0.100%
False Negative Rate: 1.237%
Detection Rate: 98.763%
Confusion matrix: [[99900 100] [1237 98763]]
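The fixed-FPR rows above can be reproduced by reading the threshold off the ROC curve; here is a small sketch of that kind of report (placeholder names, not the exact script that produced the numbers above):

```python
import numpy as np
from sklearn import metrics

def report_at_fpr(y_true, y_prob, target_fpr):
    """Report accuracy, FPR, FNR, detection rate and the confusion matrix at the
    ROC operating point closest to target_fpr."""
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)
    idx = int(np.argmin(np.abs(fpr - target_fpr)))  # closest operating point
    threshold = thresholds[idx]
    y_pred = (y_prob >= threshold).astype(int)
    cm = metrics.confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    print(f"Threshold: {threshold:.3f}")
    print(f"Accuracy: {(tp + tn) / cm.sum():.6f}")
    print(f"False Positive Rate: {fp / (fp + tn):.3%}")
    print(f"False Negative Rate: {fn / (fn + tp):.3%}")
    print(f"Detection Rate: {tp / (tp + fn):.3%}")
    print("Confusion matrix:\n", cm)

# Usage: report_at_fpr(y_test, y_prob, 0.01) and report_at_fpr(y_test, y_prob, 0.001)
```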
I am attempting to reproduce the benchmark in order to compute additional evaluation metrics. I followed the procedure outlined and obtained a model, but it appears to have lower performance than reported in the paper (AUC = 0.98576, see my reply for a correction, while the paper reports 0.99911). I figure it's likely due to the difference between the versions of the 2017 dataset, as I'm using version 2. Can anyone confirm? I'm primarily interested in the accuracy and F1 score for the original benchmark, for comparison purposes, as the benchmark appears to still outperform any DNN in the existing literature (which is very cool). Thanks!