MHKiT-Software / MHKiT-Python

MHKiT-Python provides the marine renewable energy (MRE) community tools for data processing, visualization, quality control, resource assessment, and device performance.
https://mhkit-software.github.io/MHKiT/
BSD 3-Clause "New" or "Revised" License

PCA Contours With Tp Data #249

Open Graham-EGI opened 12 months ago

Graham-EGI commented 12 months ago

Describe the bug:

PCA environmental contours return odd results when Tp is used as the period value (rather than Te).

To Reproduce:

This is a slight modification of the code in the example notebook examples.environmental_contours_example.ipynb: it calculates Tp from the spectral data returned for the NDBC swden parameter.

Not all buoys exhibit this behavior; the example below uses one that does.

from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd

parameter = 'swden'
buoy_number = '46054' 
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]

filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data={}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list=[]

# Iterate over each year and save the result in the initialized dictionary
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0, Te, or Tp
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)

# Return period (years) of interest
period = 50  

# Remove Tp outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]

# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values  
Te  = Hm0_Te_clean.Te.values 
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states 
dt = (Hm0_Te_clean.index[2]-Hm0_Te_clean.index[1]).seconds  

# Get the contour values

#! TP IS THE ONE WITH ODD BEHAVIOR
copula_tp = contours.environmental_contours(Hm0, Tp, dt, period, 'PCA', return_PCA=True)
Hm0_contour_tp = copula_tp['PCA_x1']
Te_contour_tp = copula_tp['PCA_x2']

#! TE Looks good
copula_te = contours.environmental_contours(Hm0, Te, dt, period, 'PCA', return_PCA=True)
Hm0_contour_te = copula_te['PCA_x1']
Te_contour_te = copula_te['PCA_x2']

fig,ax=plt.subplots(figsize=(8,6))

ax=graphics.plot_environmental_contour(Tp, Hm0, 
                                      Te_contour_tp, Hm0_contour_tp, 
                                      data_label='NDBC 46054', 
                                      contour_label='50 Year Contour',
                                      x_label = 'Spectral Peak Period, $Tp$ [s]',
                                      y_label = 'Sig. wave height, $Hm0$ [m]', 
                                      ax=ax)
plt.show()

fig,ax=plt.subplots(figsize=(8,6))

ax=graphics.plot_environmental_contour(Te, Hm0, 
                                      Te_contour_te, Hm0_contour_te, 
                                      data_label='NDBC 46054', 
                                      contour_label='50 Year Contour',
                                      x_label = 'Energy Period, $Te$ [s]',
                                      y_label = 'Sig. wave height, $Hm0$ [m]', 
                                      ax=ax)

plt.show()

Expected behavior:

PCA contours should have similar shapes whether Tp or Te is used as the period value.

Screenshots:

Bad contour with Tp as period value:

image

Good contour with Te as period value:

image

Additional context:

There are quite a few NDBC swden datasets that produce this contour shape when using Tp calculated from MHKiT. It appears to be related to the banding and the density of the data:

image

Created with the following (where the Tp and Hm0 variables are from the example above):


from scipy.stats import gaussian_kde
import numpy as np

xy = np.vstack([Tp,Hm0])
z = gaussian_kde(xy)(xy)
idx = z.argsort()
x, y, z = Tp[idx], Hm0[idx], z[idx]

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50)
plt.show()
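
The banding follows from how a peak period is derived from a discrete spectrum: Tp can only take values of 1/f for frequencies f on the reporting grid, so the scatter collapses onto vertical bands. A minimal sketch of this effect, using synthetic spectra and a hypothetical frequency grid (not the actual NDBC bins):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical discrete frequency grid [Hz]; NDBC spectra are reported
# on a fixed set of frequency bins along these lines
f = np.linspace(0.03, 0.40, 38)

# Tp = 1 / f[argmax(S)], so it can only land on the 1/f grid
tps = []
for _ in range(1000):
    S = rng.gamma(2.0, size=f.size)  # stand-in spectral densities
    tps.append(1.0 / f[np.argmax(S)])

# No matter how many spectra are processed, Tp takes at most len(f) values
print(len(np.unique(np.round(tps, 6))))
```

Te, by contrast, is a spectrally weighted average, so it varies continuously and shows no such bands.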
ssolson commented 12 months ago

Thanks, @Graham-EGI, for bringing this up. On first look my guess is similar to yours: the binning of the Tp values is causing the behavior you are seeing.

@ryancoe can you comment on whether this is a known behavior for the modified PCA approach we take here?

ryancoe commented 11 months ago

Referring back to Aubrey's paper on the PCA contours, I suspect it may have to do with the binning that is done to fit the second principal component/dimension (see Section 3.2), which looks like it is controlled by an optional argument. Try playing around with this:

https://github.com/MHKiT-Software/MHKiT-Python/blob/e5916004ff0413f32f9559790582c95193acdd24/mhkit/wave/contours.py#L100

Eckert-Gallup, Aubrey C., et al. "Application of principal component analysis (PCA) and improved joint probability distributions to the inverse first-order reliability method (I-FORM) for predicting extreme sea states." Ocean Engineering 112 (2016): 307-319.

Graham-EGI commented 11 months ago

I wanted to take a quick look at whether this was something that could be tuned on my end when calling contours.environmental_contours(). Here are the results:

image

Code

```python
from mhkit.wave import resource, contours
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np

parameter = "swden"
buoy_number = "46054"
ndbc_available_data = ndbc.available_data(parameter, buoy_number)
years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]
filenames = years_of_interest["filename"]
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the result in the initialized dictionary
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0, Te, or Tp
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)

# Return period (years) of interest
period = 50

# Remove Tp outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]

# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states
dt = (Hm0_Te_clean.index[2] - Hm0_Te_clean.index[1]).seconds

# Get the contour values
#! TP IS THE ONE WITH ODD BEHAVIOR
bin_sizes = {i: None for i in range(0, 1100, 100)}
failed = []
for k in bin_sizes.keys():
    try:
        copula_tp = contours.environmental_contours(
            Hm0, Tp, dt, period, "PCA", return_PCA=True, PCA_bin_size=k
        )
    except Exception:
        failed.append(k)
        continue
    Hm0_contour_tp = copula_tp["PCA_x1"]
    Te_contour_tp = copula_tp["PCA_x2"]
    bin_sizes[k] = (Te_contour_tp, Hm0_contour_tp)
print(failed)

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(18, 6))
ax[0].scatter(Tp, Hm0, color="blue", alpha=0.5)
for k, v in bin_sizes.items():
    if v is not None:
        ax[0].plot(v[0], v[1], label=f"Bin Size = {k}")
ax[0].legend()
ax[0].grid(True)
ax[0].set_xlabel("Tp (s)")
ax[0].set_ylabel("Hm0 (m)")
h = ax[1].hist2d(Tp, Hm0, (100, 100), cmap=plt.cm.jet, cmin=1)
ax[1].set_xlabel("Tp (s)")
ax[1].set_ylabel("Hm0 (m)")
plt.colorbar(h[3])
plt.show()
```
ssolson commented 11 months ago

Nice, Graham. The y-shape improves for larger bin sizes. My guess is that the next issue may be that the PCA assumes equal bin sizes. Perhaps you could try setting independent bin sizes here: https://github.com/MHKiT-Software/MHKiT-Python/blob/e5916004ff0413f32f9559790582c95193acdd24/mhkit/wave/contours.py#L461C5-L461C27

That said, looking at it more, your analysis did not show any change along the x-axis for varying bin sizes. You could potentially get better results by adjusting the fit just below the linked line. Very interested in any findings you have.

ryancoe commented 11 months ago

Yeah... honestly, I'm not sure the bin size is really the fundamental problem here. I think we need to look a bit more closely before we go down a rabbit hole on this. I know they're quite busy with other things these days, but perhaps @aubreyeckert or even @nevmart have some idea of what's going on here?

Graham-EGI commented 11 months ago

Got it, thanks guys!

mattEhall commented 11 months ago

Hey @ryancoe and all. Just chiming in that we're hitting this bug as well. In the last week we've been looking into using this tool in one of our projects. It would be awesome to get to the bottom of it.

ssolson commented 10 months ago

@jtgrasb in #261 detailed the same issue.

In my opinion this seems to be more of a shortcoming of the provided methods than a bug, but I do not claim to know the details of how these methods should work.

Is this a bug or a shortcoming of the provided methods?

ssolson commented 7 months ago

@mattEhall @jtgrasb @mbruggs (in place of Graham) @ryancoe

I have been poking at this issue today and yesterday. I have not identified exactly where in the method the problem occurs, but I wanted to gauge the level of interest in this issue.

One of the first things I checked was the histogram of the data. As shown in the first two plots below, it is clear that Hm0 and Te are not perfectly Gaussian. Each shows skewness (both fail the Shapiro-Wilk test included in the code below). However, the contour method works with these datasets.

Looking at the third figure, the Tp Q-Q plot does not seem particularly worse and shows less skewness, but the histogram has far more than a single peak. Since this is the data causing the issue, my current thought is that the multiple peaks in this non-Gaussian distribution lead to the poor contours observed in the original issue.
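
One way to make "far more than a single peak" concrete is to count the modes of a histogram with scipy.signal.find_peaks. A sketch using synthetic banded data (the 1/f grid and sample counts are illustrative stand-ins, not the buoy 46054 values):

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(2)
# A banded period variable: values snap onto a coarse 1/f grid
grid = 1.0 / np.linspace(0.05, 0.25, 12)
Tp_banded = rng.choice(grid, size=5000)

# Count isolated spikes in the histogram -- roughly one per occupied grid value
counts, _ = np.histogram(Tp_banded, bins=100)
peaks, _ = find_peaks(counts, height=1)
print(len(peaks))  # many modes, not one
```

A unimodal variable like Te would yield one or two peaks here; banded Tp yields one spike per occupied grid value.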

The reason this is important is that the PCA method assumes the data, when transformed into the PCA space, can be modeled with a Gaussian distribution. This is evident in the use of stats.invgauss.ppf as the inverse cumulative distribution function (CDF) to calculate the component 1 values. If Tp does not follow a Gaussian distribution, the results will not reflect the actual distribution of the data. A linear relationship is then assumed between the two components to convert the second component into the PCA space. If the relationship between Tp and the first component is not linear, this step will not capture the true dependency, leading to incorrect contours.

https://github.com/MHKiT-Software/MHKiT-Python/blob/2ccc286a65685e5e2f5ab68a467209917f0161d9/mhkit/wave/contours.py#L330
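
The assumption described above can be checked directly: project the (Hm0, period) pairs onto their principal components and run a normality test on the first component. A minimal sketch on synthetic data (the distributions are stand-ins for the NDBC values, and this is a plain covariance PCA, not MHKiT's modified approach):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the sea-state variables
Hm0_s = rng.lognormal(mean=0.5, sigma=0.4, size=5000)
T_s = 4.0 + 1.5 * Hm0_s + rng.normal(scale=0.5, size=5000)

# Plain covariance PCA: principal axes of the centered data
X = np.column_stack([Hm0_s, T_s])
Xc = X - X.mean(axis=0)
_, vecs = np.linalg.eigh(np.cov(Xc.T))
comp1 = Xc @ vecs[:, -1]  # projection onto the dominant axis

# Shapiro-Wilk normality test on a subsample of component 1
stat, p = stats.shapiro(comp1[:500])
print(f"Shapiro-Wilk p-value for component 1: {p:.3g}")
```

Running the same projection on banded Tp data versus continuous Te data would show how far each departs from the Gaussian assumption in the transformed space.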

As stated above, I have started following the changes through the method when using Te vs. Tp, but I have not identified exactly where things break down yet.

I am happy to keep poking at this if there is still interest. Let me know if you have thoughts on this initial look at the data, or whether you think it is a good use of time to identify exactly where this binned Tp data causes issues versus simply using Te.

image

Code

```python
from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

parameter = 'swden'
buoy_number = '46054'
ndbc_available_data = ndbc.available_data(parameter, buoy_number)
years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]
filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []

# Iterate over each year and save the result in the initialized dictionary
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0, Te, or Tp
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)

# Return period (years) of interest
period = 50

# Remove Tp outliers
# 30 removes the Hm0 outliers as well
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]

# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values
# Delta time of sea-states
dt = (Hm0_Te_clean.index[2] - Hm0_Te_clean.index[1]).seconds


# Create a histogram and Q-Q plot for the given data
def plot_histogram_qq(data, title):
    sns.set(style="whitegrid")
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    # Histogram with kernel density estimate
    sns.histplot(data, kde=True, ax=ax1)
    ax1.set_title(f'Histogram of {title}')
    # Q-Q plot
    stats.probplot(data, dist="norm", plot=ax2)
    ax2.set_title(f'Q-Q Plot of {title}')
    plt.tight_layout()


plot_histogram_qq(Hm0, 'Hm0')
plot_histogram_qq(Te, 'Te')
plot_histogram_qq(Tp, 'Tp')


# Perform the Shapiro-Wilk test to check for normality
def perform_shapiro_wilk_test(data, name):
    stat, p = stats.shapiro(data)
    alpha = 0.05
    if p > alpha:
        return f"{name} looks Gaussian (fail to reject H0) with p-value = {p}"
    else:
        return f"{name} does not look Gaussian (reject H0) with p-value = {p}"


print(perform_shapiro_wilk_test(Hm0, 'Hm0'))
print(perform_shapiro_wilk_test(Te, 'Te'))
print(perform_shapiro_wilk_test(Tp, 'Tp'))

plt.show()
```
mattEhall commented 7 months ago

Hi @ssolson, thanks for your message and the informative observations. This isn't a topic I know a lot about or have the chance to get into in more detail, so I don't follow everything, but I see what you mean about the skewness and the data possibly not fitting the method's assumptions well. For what it's worth, I am still very interested in using this functionality. We haven't found another way to create these contours.

jtgrasb commented 7 months ago

Hi @ssolson, thanks for the update. I am also still interested in this functionality. It's not a pressing issue for the project I was using it for, since we were able to use the KDE method with a different dataset and get okay results, but it would be nice to have a workflow for such data in the future.