CompOpt4Apps / IanHydroframeWork


Evaluate predictive power of tool on big sinusoidal #23

Open ian-bertolacci opened 4 years ago

ian-bertolacci commented 4 years ago

3 #6

After doing data collection for #3, test the predictive power of the tool on those configurations with that data.

ian-bertolacci commented 4 years ago

From priorities list: "[evaluation] program extended to also do an estimate of the running time on ocelote and an evaluation of how well that prediction works." From meetings: "The prediction doesn't need to be great, we just need to show the process and how to evaluate it."

Steps:

  1. Choose 40-50 random nx,ny,nz parameters for sinusoidal that have NOT been used as part of the modeling effort. Use a (1,1,1) process topology and T=1 (this is what the model was developed against, and there is currently no model that involves timesteps). A selection sketch follows this list.
  2. Run those tests on ocelote.
  3. Statistically analyse observations vs expectations.
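A minimal sketch of how the test parameter sets could be drawn, assuming the configurations used during modeling are available as a set of (nx, ny, nz) tuples. The name `modeling_configurations` and the size ranges are placeholders, not values from this issue:

```python
import random

def choose_test_configurations( modeling_configurations, count=50, seed=0,
                                nx_range=(10, 500), ny_range=(10, 500), nz_range=(10, 500) ):
  rng = random.Random( seed )  # fixed seed so the parameter set is reproducible
  chosen = set()
  while len( chosen ) < count:
    config = ( rng.randint( *nx_range ),
               rng.randint( *ny_range ),
               rng.randint( *nz_range ) )
    # Reject any configuration that was already used during the modeling effort
    if config not in modeling_configurations:
      chosen.add( config )
  return sorted( chosen )
```

Fixing the seed keeps the chosen parameter set reproducible, so it can be committed alongside the raw data deliverable below.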

Deliverables:

  1. Parameter set
  2. Raw data from Ocelote (This is hopefully being standardized)
  3. Statistical analysis including:
    • Basic statistics of the error between the groups (provides a low-resolution picture of the predictive power of the model/tool at a global scale)
    • mean error, stddev of error
    • mean absolute error, stddev of absolute error
    • Hypothesis testing using Student's t-test to decide, at some significance level p (probably .05 or .01), whether the error between the observed and estimated performance values (runtime, footprint) differs significantly from zero (see the sketch after this list)
    • Population: set([ abs( observed[NX,NY,NZ] - model( NX, NY, NZ ) ) for (NX,NY,NZ) in test_configurations ]) (excuse the python notation)
    • Null Hypothesis: the mean of the population is 0 (i.e. the mean difference is 0, i.e. model and observed are in the same distribution, i.e. the model accurately estimated the performance values that were observed)
    • Questions I expect someone will ask me, or that I have asked myself and will forget:
      • "Why is the population the absolute error between the two?" In this statistical test, we are essentially testing if the distributions of the two functions (observe and model) are different. However, we cannot simply take the mean and stddev of values returned by these functions and compare them in this away, as it does not consider shape of the curves. For example, imagine the functions f(x) = x and g(x) = -x; for any collection of x, the mean and stddev of is the same for both, but the error between the two (f(x) - g(x) = x - -x = x + x = 2x) is significant, and is dependent in x. Instead, we can consider the absolute error between observations and their corresponding predictions as a population to perform these statistical tests on. With this, we can then decide, not if the means of the two functions are different, but if the mean of the error is statistically different than zero.
      • "Why cant the population just be the real error between the two?" Because the there can be positive or negative difference in error. Parameters there are parameters in a t-test (h0 mean, observed mean) which are sign-dependent. We need to normalize so that positive and negative error do not cancel each other out. For example, using the same f and g functions, the real error for x=-100 is 200, and for x=100 is -200, so the mean error is zero. Additionally we don't care the direction of the error, only its magnitude. It's possible that squared error would also work, but would kinda confusing to think about, cannot be directly converted into % difference without square-rooting, which is just taking the absolute value.
      • "Aren't there other tests?" Probably? I was pointed to Hausdorff Distance, Frechet distance, Kolmogorov-Smirnov test, and Lp Norms. But my limited wikipedia-ing of these distance measurements and tests doesn't really help me in a hypothesis testing framework which allows us to decide, with confidence, if the difference between the model and the observations is significantly different.
      • "Ian, this sounds really complicated; do we really need to do this?" Absolutely. In-fact, we need to be doing this for all of our performance analysis. If we want to claim that the model can accurately predict parflow performance values, then we need to have evidence. Having a statistical test that others can evaluate (in terms of 'good-ness') such as a Student's t test is convincing evidence that supports our claim of accuracy.
      • "Why not an ANOVA test?" So an ANOVA test can be used to hypothesis test the accuracy of the model (see issue #25). What we are doing here is developing a test that decides if the observed performance values of a run configuration and the projected performance values of the same same run configuration are from two different distributions (i.e. if they are model is accurate). Further, because we only have on treatment in this setup (the absolute error between the model and observed data) an ANOVA doesn't make sense here.
ian-bertolacci commented 4 years ago

@mstrout thoughts?