Closed: cmichelenstrofer closed this PR 8 months ago.
@ssolson I cleaned this up quite a bit and made a Matlab minimal example from the code I got from Seongho. He unfortunately left, which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this.
The failing tests don't seem to be related to this PR.
Carlos, thank you.
The errors are unrelated to this PR and are fixed in #241 which should be reviewed and approved soon.
I took a bit of a look at this today. Running the tests, I see that I get the same answer as you have specified, but I was wondering: how big is the difference from the Matlab version?
I started looking into the generalized pareto distribution in scipy vs matlab via the docs but I was ultimately hindered by not being able to run the Matlab version without getting Matlab installed on my machine. It seems like the idea would be to step-by-step compare the python vs Matlab output to see where the difference arises.
@akeeste are you set up to run Matlab, and do you have time to take a look into this further?
@ssolson I'm set up for this. I will review this and debug next week.
@ssolson @akeeste Thanks! The difference I'm getting is this:
Matlab: 99.53%, 1.041 m
Python: 98.94%, 1.028 m
@akeeste LMK if you want to go over it together.
@cmichelenstrofer I made some progress today and am noting where I'm at for reference. I noticed a couple of things, but I'm still trying to figure out how much they affect the end result, given the input Hs used in your working example. I'll refer to Seongho's original code as the 'MATLAB version' and your refactored code as the 'Python version'.

- `stats.scoreatpercentile` gives slightly different values than MATLAB's `prctile` function. The differences are very small but may become more significant as the threshold range is refined. Also, the SciPy docs recommend moving to `numpy.percentile`.
- `rate_per_year < 2` is checked for every threshold and refinement iteration in Python, but only checked once per refinement iteration in MATLAB. So all 6 initial thresholds are used in MATLAB, but in the Python version only the first 3 thresholds pass the check for 2+ peaks/year.
- The `_peaks_over_threshold` method gives different output for thresholds of e.g. 0.99 and 0.991, but the distribution fit returns identical `distribution_parameters`.

I'll continue debugging next week.
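The 2-peaks-per-year screening difference can be sketched as follows. Everything here is illustrative: `passes_rate_check`, the threshold grid, and the synthetic data are hypothetical, not the actual MHKiT implementation.

```python
import numpy as np

# Hypothetical sketch of the rate-per-year screening discussed above;
# names and data are illustrative, not MHKiT's actual code.
rng = np.random.default_rng(0)
years = 10.0
peaks = rng.weibull(1.5, 200)  # synthetic storm peaks spanning `years` years

def passes_rate_check(peaks, threshold, years):
    """Keep a threshold only if it leaves at least 2 exceedances per year."""
    return np.sum(peaks > threshold) / years >= 2

# Candidate thresholds at increasing percentiles of the peaks.
thresholds = np.percentile(peaks, [90, 92, 94, 96, 98, 99])

# Python version: every candidate threshold is screened each iteration.
kept = [t for t in thresholds if passes_rate_check(peaks, t, years)]
```

In the MATLAB version the check only runs once per refinement iteration, so all six initial thresholds survive even when some of them leave fewer than 2 peaks/year.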
I'm still digging into the settings of various functions, but I can confirm that the percentile, Generalized Pareto distribution fitting, and probability plotting functions all give slightly different results between their respective MATLAB and SciPy versions. I'll see if there are optional settings we can use to converge the results.
Updates:

- MATLAB's `prctile` function (both 'exact/linear' and 'approximate/T-digest' methods) gives identical results to NumPy's `np.percentile` with the Hazen method.
- The Hazen method requires NumPy >= 1.22.0. For older versions (1.9 < NumPy < 1.22.0), the 'higher' linear method is very close, but not identical, to MATLAB's percentile.
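A quick check of the percentile methods on an arbitrary sample (the data here are made up for illustration):

```python
import numpy as np

data = np.arange(1.0, 11.0)  # arbitrary sample: 1.0, 2.0, ..., 10.0

# NumPy >= 1.22: the 'hazen' method uses midpoint plotting positions
# (k - 0.5)/n, matching MATLAB's prctile.
p90_hazen = np.percentile(data, 90, method="hazen")   # -> 9.5

# NumPy's default 'linear' method interpolates differently.
p90_linear = np.percentile(data, 90)                  # -> 9.1
```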
MATLAB and SciPy are giving different distribution parameters when the excess is fit to a Generalized Pareto Distribution.
Both MATLAB's `GeneralizedParetoDistribution` and SciPy's `stats.genpareto` use an identical definition. MATLAB's `fit` and SciPy's `fit` both estimate the distribution parameters using Maximum Likelihood Estimation and assume that the location/threshold is zero. A zero threshold is correct given that both versions of the code already use the excess (peaks - threshold) to estimate the tail distribution.
MATLAB's fitting does not take any other options. SciPy takes initial guesses for shape, scale, etc., but the outcome seems unpredictable. I'm not sure what else can be done to make the distribution fitting converge. There are not many points in the tail: in this example, even the 90th percentile threshold leaves only 9 points, and this number will only decrease as the percentile is iteratively increased.
Example with excess defined using a 90th percentile threshold:
excess = [0.016299, 0.025899, 0.006499, 0.024699, 0.005099, 0.017399, 0.004299, 0.013899, 0.000199]
shape, scale, threshold:
MATLAB: -1.21434, 0.0314501, 0.0
SciPy: -1.62408, 0.0420621, 0.0
SciPy (using MATLAB results as initial guess): -1.26645, 0.032800, 0.0
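For reference, a minimal SciPy reproduction of this fit. The fixed location (`floc=0`) and the MATLAB-seeded initial guess follow the discussion above; the exact fitted values may differ slightly between SciPy versions, so none are asserted here.

```python
import numpy as np
from scipy import stats

# The excess values from the 90th-percentile example above.
excess = np.array([0.016299, 0.025899, 0.006499, 0.024699, 0.005099,
                   0.017399, 0.004299, 0.013899, 0.000199])

# Fix the location at zero since the data are already excesses over the
# threshold; only shape and scale are estimated by MLE.
shape, loc, scale = stats.genpareto.fit(excess, floc=0)

# An initial guess for the shape goes in as the first positional argument
# and `scale=` as a keyword; here we seed with the MATLAB result.
shape2, loc2, scale2 = stats.genpareto.fit(excess, -1.21434, floc=0,
                                           scale=0.0314501)
```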
My stats background begins to fail here. I'm not confident in how the correlations are each defined, but when given the same distribution parameters the results are nearly identical.
Example, given the resultant MATLAB distribution above:
MATLAB correlation: 0.978299509175096
SciPy correlation: 0.97872960601213 (0.044% error)
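One way to compute such a probability-plot correlation in SciPy is `stats.probplot`, which correlates the ordered data against the theoretical quantiles of a given distribution. This is a sketch of the idea, not necessarily the exact computation either code base performs:

```python
import numpy as np
from scipy import stats

# Excess values from the 90th-percentile example above.
excess = np.array([0.016299, 0.025899, 0.006499, 0.024699, 0.005099,
                   0.017399, 0.004299, 0.013899, 0.000199])

# Freeze a Generalized Pareto with the MATLAB-fitted parameters.
dist = stats.genpareto(-1.21434, loc=0.0, scale=0.0314501)

# probplot returns the (theoretical, ordered) quantile pairs and the
# least-squares fit through them; `r` is the plot's correlation coefficient.
(osm, osr), (slope, intercept, r) = stats.probplot(excess, dist=dist)
```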
@akeeste this is great! Thanks for looking into this in so much detail! I think the results are basically identical; the very tiny difference comes from the techniques used to fit the distribution. Do you think it is ready to merge?
@cmichelenstrofer Sounds good. Can you first remove the MATLAB working example and update the test result to 99.13% and 1.032 m? After that we can merge.
Seems like I could commit the suggestion directly to the branch. I mistyped last time; the percentile is still returned as a fraction. I updated the test to compare to 0.9913. The test should pass now.
Overall, I'm happy with this PR as is. I have a few suggestions, none required, but I think they would round out the PR nicely.
- Combine the `scipy` imports onto a single line.
- Lastly, I noticed the use of a nested function. Historically, we've kept these functions separate. While nesting does highlight that the inner function is only needed by the parent function, it can also make the code appear more cluttered. I'm not opposed to the current structure, but I wanted to start a brief discussion on this topic. What are your thoughts?
To wrap up this PR, I'll leave the submodule-level docstring and any other optional developments for future PRs. All tests have passed. Merging this PR.
Replaces #214.
MHKiT already has a peaks-over-threshold method, which takes the threshold as one of its inputs. The method here automatically calculates the "best" threshold. It is from a paper by Vince and Seongho:
Neary, V. S., S. Ahn, B. E. Seng, M. N. Allahdadi, T. Wang, Z. Yang and R. He (2020). "Characterization of Extreme Wave Conditions for Wave Energy Converter Design and Project Risk Assessment." J. Mar. Sci. Eng. 8(4), 289; https://doi.org/10.3390/jmse8040289.
They had it in Matlab and I converted it to Python here.
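For readers unfamiliar with the approach, here is a generic peaks-over-threshold sketch on synthetic data. This is illustrative only (it is not MHKiT's API, and the data and the 99th-percentile threshold choice are arbitrary): pick a threshold, keep the exceedances, and fit a Generalized Pareto distribution to the excesses.

```python
import numpy as np
from scipy import stats

# Synthetic "significant wave height" record (arbitrary distribution).
rng = np.random.default_rng(42)
hs = rng.rayleigh(1.0, 5000)

# Choose a high threshold; the thread above found NumPy's 'hazen'
# percentile method matches MATLAB's prctile, but the default is used
# here for portability across NumPy versions.
threshold = np.percentile(hs, 99)

# Excesses over the threshold are what the tail distribution is fit to.
excess = hs[hs > threshold] - threshold

# Fit a Generalized Pareto distribution with the location fixed at zero.
shape, loc, scale = stats.genpareto.fit(excess, floc=0)
```

The "best threshold" method of this PR automates the choice of `threshold` instead of leaving it to the user.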