Closed: cmichelenstrofer closed this PR 8 months ago.
@ssolson I cleaned this up quite a bit and made a Matlab minimal example from the code I got from Seongho. He unfortunately left, which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this.
The failing tests don't seem to be related to this PR.
Carlos, thank you.
The errors are unrelated to this PR and are fixed in #241 which should be reviewed and approved soon.
I took a bit of a look at this today. Running the tests, I see that I get the same answer as you have specified, but I was wondering: how big is the difference from the Matlab version?
I started looking into the generalized pareto distribution in scipy vs matlab via the docs but I was ultimately hindered by not being able to run the Matlab version without getting Matlab installed on my machine. It seems like the idea would be to step-by-step compare the python vs Matlab output to see where the difference arises.
@akeeste are you set up to run Matlab, and do you have time to take a look into this further?
@ssolson I'm set up for this. I will review this and debug next week.
@ssolson @akeeste Thanks! The difference I'm getting is this:
Matlab: 99.53%, 1.041 m
Python: 98.94%, 1.028 m
@akeeste LMK if you want to go over it together.
@cmichelenstrofer I made some progress today and am noting where I'm at for reference. I noticed a couple of things, but I'm still trying to figure out how much they affect the end result, given the input Hs used in your working example. I'll refer to Seongho's original code as the 'MATLAB version' and your refactored code as the 'Python version'.

- `stats.scoreatpercentile` gives slightly different values than MATLAB's `prctile` function. The differences are very small but may become more significant as the threshold range is refined. Also, the SciPy docs recommend moving to `numpy.percentile`.
- `rate_per_year < 2` is checked for every threshold and refinement iteration in Python, but only checked once per refinement iteration in MATLAB. So all 6 initial thresholds are used in MATLAB, but in the Python version only the first 3 thresholds pass the check for 2+ peaks/year.
- The `_peaks_over_threshold` method gives different output for thresholds of e.g. 0.99 and 0.991, but the distribution fit returns identical `distribution_parameters`.

I'll continue debugging next week.
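The 2-peaks-per-year screening difference can be sketched as follows. Everything here is illustrative: `passes_rate_check`, the threshold grid, and the synthetic data are hypothetical, not the actual MHKiT implementation.

```python
import numpy as np

# Hypothetical sketch of the rate-per-year screening discussed above;
# names and data are illustrative, not MHKiT's actual code.
rng = np.random.default_rng(0)
years = 10.0
peaks = rng.weibull(1.5, 200)  # synthetic storm peaks spanning `years` years

def passes_rate_check(peaks, threshold, years):
    """Keep a threshold only if it leaves at least 2 exceedances per year."""
    return np.sum(peaks > threshold) / years >= 2

# Candidate thresholds at increasing percentiles of the peaks.
thresholds = np.percentile(peaks, [90, 92, 94, 96, 98, 99])

# Python version: every candidate threshold is screened each iteration.
kept = [t for t in thresholds if passes_rate_check(peaks, t, years)]
```

In the MATLAB version the check only runs once per refinement iteration, so all six initial thresholds survive even when some of them leave fewer than 2 peaks/year.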
I'm still digging into the settings of various functions, but I can confirm that the percentile, Generalized Pareto distribution fitting, and probability plotting functions all give slightly different results between their respective MATLAB and SciPy versions. I'll see if there are optional settings we can use to converge the results.
Updates:

- MATLAB's `prctile` function (both 'exact/linear' and 'approximate/T-digest' methods) gives identical results to NumPy's `np.percentile` with the Hazen method.
- The Hazen method requires NumPy >= 1.22.0. For older versions (1.9 < NumPy < 1.22.0), the 'higher' linear method is very close, but not identical, to MATLAB's percentile.
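A quick check of the percentile methods on an arbitrary sample (the data here are made up for illustration):

```python
import numpy as np

data = np.arange(1.0, 11.0)  # arbitrary sample: 1.0, 2.0, ..., 10.0

# NumPy >= 1.22: the 'hazen' method uses midpoint plotting positions
# (k - 0.5)/n, matching MATLAB's prctile.
p90_hazen = np.percentile(data, 90, method="hazen")   # -> 9.5

# NumPy's default 'linear' method interpolates differently.
p90_linear = np.percentile(data, 90)                  # -> 9.1
```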
MATLAB and SciPy are giving different distribution parameters when the excess is fit to a Generalized Pareto Distribution.
Both MATLAB's `GeneralizedParetoDistribution` and SciPy's `stats.genpareto` use an identical definition. MATLAB's `fit` and SciPy's `fit` both estimate the distribution parameters using Maximum Likelihood Estimation and assume that the location/threshold is zero. A zero threshold is correct given that both versions of the code already use the excess (peaks - threshold) to estimate the tail distribution.
MATLAB's fitting does not take any other options. SciPy takes initial guesses for shape, scale, etc., but the outcome seems unpredictable. I'm not sure what else can be done to make the distribution fitting converge. There are not many points in the tail: in this example, even the 90th percentile threshold leaves only 9 points, and this number will only decrease as the percentile is iteratively increased.
Example with excess defined using a 90th percentile threshold:
excess = [0.016299, 0.025899, 0.006499, 0.024699, 0.005099, 0.017399, 0.004299, 0.013899, 0.000199]
shape, scale, threshold:
MATLAB: -1.21434, 0.0314501, 0.0
SciPy: -1.62408, 0.0420621, 0.0
SciPy (using MATLAB results as initial guess): -1.26645, 0.032800, 0.0
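For reference, a minimal SciPy reproduction of this fit. The fixed location (`floc=0`) and the MATLAB-seeded initial guess follow the discussion above; the exact fitted values may differ slightly between SciPy versions, so none are asserted here.

```python
import numpy as np
from scipy import stats

# The excess values from the 90th-percentile example above.
excess = np.array([0.016299, 0.025899, 0.006499, 0.024699, 0.005099,
                   0.017399, 0.004299, 0.013899, 0.000199])

# Fix the location at zero since the data are already excesses over the
# threshold; only shape and scale are estimated by MLE.
shape, loc, scale = stats.genpareto.fit(excess, floc=0)

# An initial guess for the shape goes in as the first positional argument
# and `scale=` as a keyword; here we seed with the MATLAB result.
shape2, loc2, scale2 = stats.genpareto.fit(excess, -1.21434, floc=0,
                                           scale=0.0314501)
```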
My stats background begins to fail here. I'm not confident in how the correlations are each defined, but when given the same distribution parameters the results are nearly identical.
Example, given the resultant MATLAB distribution above:
MATLAB correlation: 0.978299509175096
SciPy correlation: 0.97872960601213 (0.044% error)
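One way to compute such a probability-plot correlation in SciPy is `stats.probplot`, which correlates the ordered data against the theoretical quantiles of a given distribution. This is a sketch of the idea, not necessarily the exact computation either code base performs:

```python
import numpy as np
from scipy import stats

# Excess values from the 90th-percentile example above.
excess = np.array([0.016299, 0.025899, 0.006499, 0.024699, 0.005099,
                   0.017399, 0.004299, 0.013899, 0.000199])

# Freeze a Generalized Pareto with the MATLAB-fitted parameters.
dist = stats.genpareto(-1.21434, loc=0.0, scale=0.0314501)

# probplot returns the (theoretical, ordered) quantile pairs and the
# least-squares fit through them; `r` is the plot's correlation coefficient.
(osm, osr), (slope, intercept, r) = stats.probplot(excess, dist=dist)
```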
@akeeste this is great! Thanks for looking into this in so much detail! I think the results are basically identical; the very tiny difference comes from the techniques used to fit the distribution. Do you think it is ready to merge?
@cmichelenstrofer Sounds good. Can you first remove the MATLAB working example and update the test result to 99.13% and 1.032 m? After that we can merge.
Seems like I could commit the suggestion directly to the branch. I mistyped last time; the percentile is still returned as a fraction. I updated the test to compare to 0.9913. The test should pass now.
Overall, I'm happy with this PR as is. I have a few suggestions, none required, but I think they would round out the PR nicely.
- Combine the `scipy` imports onto a single line.
- Lastly, I noticed the use of a nested function. Historically, we've kept these functions separate. While nesting does highlight that the inner function is only needed by the parent function, it can also make the code appear more cluttered. I'm not opposed to the current structure, but I wanted to start a brief discussion on this topic. What are your thoughts?
To wrap up this PR, I'll leave the submodule-level docstring and any other optional developments for future PRs. All tests have passed. Merging this PR.
Replaces #214.
MHKiT already has a peaks-over-threshold method, which takes the threshold as one of its inputs. The method here automatically calculates the "best" threshold. It is from a paper by Vince and Seongho:
Neary, V. S., S. Ahn, B. E. Seng, M. N. Allahdadi, T. Wang, Z. Yang and R. He (2020). "Characterization of Extreme Wave Conditions for Wave Energy Converter Design and Project Risk Assessment." J. Mar. Sci. Eng. 8(4), 289; https://doi.org/10.3390/jmse8040289.
They had it in Matlab and I converted it to Python here.
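For readers unfamiliar with the approach, here is a generic peaks-over-threshold sketch on synthetic data. This is illustrative only (it is not MHKiT's API, and the data and the 99th-percentile threshold choice are arbitrary): pick a threshold, keep the exceedances, and fit a Generalized Pareto distribution to the excesses.

```python
import numpy as np
from scipy import stats

# Synthetic "significant wave height" record (arbitrary distribution).
rng = np.random.default_rng(42)
hs = rng.rayleigh(1.0, 5000)

# Choose a high threshold; the thread above found NumPy's 'hazen'
# percentile method matches MATLAB's prctile, but the default is used
# here for portability across NumPy versions.
threshold = np.percentile(hs, 99)

# Excesses over the threshold are what the tail distribution is fit to.
excess = hs[hs > threshold] - threshold

# Fit a Generalized Pareto distribution with the location fixed at zero.
shape, loc, scale = stats.genpareto.fit(excess, floc=0)
```

The "best threshold" method of this PR automates the choice of `threshold` instead of leaving it to the user.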