bentsherman / tesseract

A tool for creating resource prediction models for scientific workflows
MIT License

Sparsity metric for performance dataset #11

Closed: bentsherman closed this issue 4 years ago

bentsherman commented 4 years ago

Since training data must be acquired by running the target application many times, it will be important to minimize the number of training samples required to achieve good accuracy. I think a good way to measure this is the number of samples, or better the "sparsity" of the training set, which is the number of samples normalized by the size of the search space. For some applications this metric will be harder to define, because it depends on what you consider to be "sensible values" for each command-line parameter. It may be best to use a log scale for this metric, since the search space is the product of the ranges of the parameters and so grows exponentially with the number of parameters.
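
As a rough illustration of the metric described above, here is a minimal Python sketch. It assumes every parameter has been discretized to a finite number of "sensible values", so the search-space size is the product of those counts. The function name `log_sparsity` and the example counts are hypothetical and not part of tesseract.

```python
import math


def log_sparsity(n_samples, values_per_param):
    """Return log10(n_samples / search_space_size).

    values_per_param: number of "sensible values" for each parameter
    (an assumption: every parameter is treated as discrete and finite).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if any(v <= 0 for v in values_per_param):
        raise ValueError("each parameter needs at least one value")

    # Sum logs instead of multiplying counts, so a large search space
    # never has to be materialized as one huge number.
    log_space = sum(math.log10(v) for v in values_per_param)
    return math.log10(n_samples) - log_space


# Hypothetical example: 100 samples, three parameters with 10 values each.
example = log_sparsity(100, [10, 10, 10])
```

With 100 samples over a search space of 10 x 10 x 10 = 1000 points, the ratio is 0.1, so the log-scale value is -1. Values closer to 0 mean denser coverage of the search space.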

bentsherman commented 4 years ago

After thinking about it more, I don't think we really need this. Continuous variables like input data size immediately make the search space infinite, so the normalization isn't well defined. We should just try to minimize the absolute number of samples. The plan is to only use historical jobs anyway, so we're not really concerned about additional computational cost.