Closed dmacko232 closed 3 years ago
@dmacko232 for normalization, we should decide whether to use the number of lines of code or the number of classes.
Could you compute two correlations: (number of crypto API records, number of classes) and (number of crypto API records, total LoC in the sample)? For normalization, we should choose the feature with the higher correlation.
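A minimal sketch of that comparison, in case it helps. It assumes Pearson correlation and per-sample counts stored under the hypothetical keys `crypto_records`, `classes`, and `loc`; neither the metric nor the data layout is settled in this thread, and the numbers are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def choose_normalizer(samples, candidates=("classes", "loc")):
    """Return the candidate most correlated with the crypto API record count.

    `samples` is a list of dicts; the key names are assumptions of this sketch.
    """
    records = [s["crypto_records"] for s in samples]
    correlations = {
        name: pearson(records, [s[name] for s in samples]) for name in candidates
    }
    best = max(correlations, key=correlations.get)
    return best, correlations


# Made-up example data, one dict per sample.
example_samples = [
    {"crypto_records": 1, "classes": 10, "loc": 100},
    {"crypto_records": 2, "classes": 20, "loc": 50},
    {"crypto_records": 3, "classes": 30, "loc": 400},
    {"crypto_records": 4, "classes": 40, "loc": 120},
]

best_feature, all_correlations = choose_normalizer(example_samples)
print(best_feature, all_correlations)
```

If the relationship is monotonic but not linear, a rank correlation such as Spearman may be a better fit; that would be a separate choice to discuss.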
In this branch, saving of the prepare pipeline was changed to use an .h5 file. Because of this change, several other CLI tools were changed as well. Additionally, feature normalization by class count was introduced into feature engineering. There are also fixes for various issues encountered along the way.
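For readers of the thread, normalization by class count can be sketched roughly as follows. This is an illustration and not the code from the branch: the function name, the `skip` parameter, and the feature names are hypothetical.

```python
def normalize_by_class_count(features, class_count, skip=()):
    """Divide each count feature by the number of classes in the sample.

    `features` maps feature name to raw count. Names listed in `skip`
    are copied through unchanged. All names here are hypothetical.
    """
    if class_count <= 0:
        raise ValueError("class_count must be positive")
    return {
        name: value if name in skip else value / class_count
        for name, value in features.items()
    }


# Made-up sample: 10 and 5 crypto API records across 5 classes.
normalized = normalize_by_class_count({"aes_calls": 10, "rsa_calls": 5}, 5)
print(normalized)
```

Samples with a class count of zero need an explicit policy (skip, or fall back to another denominator); the sketch simply raises an error.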