byu-dml / metalearn

BYU's python library of useable tools for metalearning
MIT License

Summarization functions #194

Open emrysshevek opened 5 years ago

emrysshevek commented 5 years ago

Part of the goal of this package is to encode a dataset as a one-dimensional vector of consistent size. To do that, we apply the profile_distribution function to any metafeature that returns a sequence of values (e.g. the means of the numeric features), which flattens it to a consistent shape.

Currently, profile_distribution computes a fixed set of summarization functions every time, whether or not they are all needed. It would be nice to refactor it into a more flexible summarization function that allows only a subset of the summary measures to be computed, and possibly accepts custom summary functions passed in by the caller.
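As a rough illustration of the idea, here is a minimal sketch of what a more flexible summarizer could look like. The names summarize, DEFAULT_SUMMARIES, names, and custom are hypothetical and are not part of the current metalearn API; the default set of summaries shown is also only an assumption for the example, not what profile_distribution actually computes.

```python
import statistics
from typing import Callable, Dict, Iterable, Optional, Sequence

SummaryFunc = Callable[[Sequence[float]], float]

# Hypothetical default registry; NOT the set profile_distribution uses.
DEFAULT_SUMMARIES: Dict[str, SummaryFunc] = {
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "sum": sum,
}


def summarize(
    values: Iterable[float],
    names: Optional[Sequence[str]] = None,
    custom: Optional[Dict[str, SummaryFunc]] = None,
) -> Dict[str, float]:
    """Summarize a sequence of values with a chosen subset of functions.

    names  -- which summaries to compute (None means all available ones)
    custom -- extra user-supplied summaries, keyed by name
    """
    values = list(values)
    # Custom functions extend (and may override) the defaults.
    available = {**DEFAULT_SUMMARIES, **(custom or {})}
    selected = list(available) if names is None else list(names)

    unknown = [name for name in selected if name not in available]
    if unknown:
        raise KeyError(f"Unknown summary functions: {unknown}")

    # An empty sequence has no meaningful summary; return NaN so the
    # output keeps the same keys (and therefore the same shape).
    if not values:
        return {name: float("nan") for name in selected}

    return {name: float(available[name](values)) for name in selected}
```

For example, summarize(column_means, names=["mean", "range"], custom={"range": lambda v: max(v) - min(v)}) would compute only those two summaries, so the length of the resulting vector depends on the requested names rather than on a hard-coded list.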

This would probably mean rethinking both the naming scheme for our metafeatures and the structure of the computation, so that an arbitrary number of summaries can be computed on a given metafeature. We could stay close to our current convention of including the summary as a prefix to the metafeature name (e.g. MeanMeansOfNumericFeatures, SumMeansOfNumericFeatures), or we could move closer to the D3M convention of including the summary as an extension (e.g. MeansOfNumericFeatures.mean). The second option would also indicate several chained operations more naturally and clearly (e.g. NumericFeatures.entropy.mean).
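To make the two naming options concrete, here is a small sketch that builds both kinds of names from a base metafeature and an ordered list of summaries. The helpers prefix_name and suffix_name are hypothetical, and the way the prefix style orders chained summaries (last applied operation first) is an assumption of this example, since the current scheme does not define chaining.

```python
from typing import Sequence


def prefix_name(metafeature: str, summaries: Sequence[str]) -> str:
    """Prefix style: summaries are capitalized and prepended.

    Assumption: with chained summaries, the last one applied appears first.
    """
    prefix = "".join(s.capitalize() for s in reversed(summaries))
    return prefix + metafeature


def suffix_name(metafeature: str, summaries: Sequence[str]) -> str:
    """Extension style: summaries are appended in application order."""
    return ".".join([metafeature, *summaries])


# Single summary, matching the examples in this issue.
print(prefix_name("MeansOfNumericFeatures", ["mean"]))
print(suffix_name("MeansOfNumericFeatures", ["mean"]))

# Chained summaries: entropy per feature, then the mean of those values.
print(suffix_name("NumericFeatures", ["entropy", "mean"]))
```

In the extension style the name reads left to right in the order the operations are applied, which is what makes chains such as NumericFeatures.entropy.mean easy to parse back into a list of operations with a plain split on ".".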