MHKiT-Software / MHKiT-Python

MHKiT-Python provides the marine renewable energy (MRE) community tools for data processing, visualization, quality control, resource assessment, and device performance.
https://mhkit-software.github.io/MHKiT/
BSD 3-Clause "New" or "Revised" License

Sea states clustering GM vs KM #254

Open dtgaebe opened 11 months ago

dtgaebe commented 11 months ago

I was working with this example: PacWave_resource_characterization_example, but tried using sklearn.cluster.KMeans instead of sklearn.mixture.GaussianMixture to cluster the data. Note that KMeans requires normalizing the data prior to clustering.
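As a side note, a minimal sketch of that normalization step, using sklearn's `MinMaxScaler` as one option (the quick-and-dirty code below divides by the maximum of each variable instead, which differs slightly when the minimum is not zero):

```python
# Minimal sketch: scale Hm0 and Te to [0, 1] before KMeans, since KMeans is
# distance-based and the two variables have different ranges; GaussianMixture
# can be fit on the raw (Te, Hm0) pairs directly.
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_norm = scaler.fit_transform(data_clean[["Hm0", "Te"]])  # data_clean as in the example

km = KMeans(n_clusters=8, random_state=1).fit(X_norm)
centers = scaler.inverse_transform(km.cluster_centers_)  # centers back in physical units
```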

I found that KMeans needs fewer clusters than GM to converge to the total amount of energy in the data. This is illustrated here as the ratio between the energy of the representative sea states and the total energy, as in the example (the formula is spelled out after the table):

| Nr. clusters | ratio GM | ratio KM |
|---|---|---|
| 4 | 0.9099 | 0.9331 |
| 6 | 0.9126 | 0.9679 |
| 8 | 0.9370 | 0.9859 |
| 12 | 0.9657 | 1.0006 |
| 16 | 0.9822 | 1.0078 |
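For reference, the ratio in the table follows the example and is computed (see the code below) as

$$\mathrm{ratio}_N = \frac{N_{\mathrm{hours}} \sum_{i=1}^{N} w_i J_i}{J_{\mathrm{total}}}$$

where $J_i$ is the energy flux of a JONSWAP spectrum built from cluster $i$'s $(H_{m0}, T_e)$, $w_i$ is the cluster weight, $N_{\mathrm{hours}}$ is the number of hourly sea states, and $J_{\mathrm{total}}$ is the total energy of the measured data (`Total` in the example).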

I also sorted the clusters by weight and visualized the weight with the color scale in the plots (see the attached figure). The results of the two clustering algorithms differ, but I don't know whether we should be considering other criteria besides representing the total amount of energy.

The quick and dirty code to extend the PacWave_resource_characterization_example to reproduce the figure and table is:

```python

# Fit both clustering models for each number of clusters
Ns = [4, 6, 8, 12, 16]
X = np.vstack((data_clean.Te.values, data_clean.Hm0.values)).T

## Normalize the data for the KMeans comparison
Hm0_max = data_clean.Hm0.max()
Te_max = data_clean.Te.max()
data_norm = pd.concat([data.Hm0 / Hm0_max, data.Te / Te_max], axis=1).dropna().sort_index()

from sklearn.cluster import KMeans
import string

fig, axs = plt.subplots(len(Ns), 2, figsize=(16, 24), sharex=True)

results = {}
results_km = {}
for N in Ns:
    gmm = GaussianMixture(n_components=N).fit(X)

    # Save centers and weights
    result = pd.DataFrame(gmm.means_, columns=['Te', 'Hm0'])
    result['weights'] = gmm.weights_

    result['Tp'] = result.Te / 0.858  # convert Te to Tp assuming a standard JONSWAP spectrum

    labels = gmm.predict(X)
    # Sort clusters by weight and label them A, B, C, ... in descending order
    result.sort_values('weights', inplace=True, ascending=False)
    result['labels'] = list(string.ascii_uppercase[0:N])
    # Map each sample's cluster label to its cluster's row position after sorting
    order = result.index
    idx = [int(np.where(order == i)[0][0]) for i in np.arange(N)]
    idx = [idx[lbl] for lbl in labels]
    result.reset_index(drop=True, inplace=True)
    results[N] = result

    i = Ns.index(N)
    gm_sc = axs[i, 0].scatter(data_clean.Te.values, data_clean.Hm0.values,
                              c=result['weights'][idx], s=0.5, cmap='viridis')
    axs[i, 0].plot(result.Te, result.Hm0, 'm+')
    axs[i, 0].title.set_text(f'{N} GM Clusters')
    cbar = fig.colorbar(gm_sc, ax=axs[i, 0], ticks=result['weights'], label='Weight [ ]')
    cbar.ax.set_yticklabels([result['labels'][j] + ': ' + str(round(result['weights'][j], 3)) for j in range(N)])
    for x, y, lbl in zip(result['Te'], result.Hm0, result['labels']):
        axs[i, 0].text(x + 0.1, y + 0.1, lbl)
    plt.setp(axs[i, 0], ylabel='Sig. wave height, $Hm0$ [m]')

    ## KMeans comparison (fit on the normalized data)
    km = KMeans(n_clusters=N, random_state=1).fit(data_norm[['Hm0', 'Te']])
    # Un-normalize the cluster centers
    km.cluster_centers_[:, 0] = km.cluster_centers_[:, 0] * Hm0_max
    km.cluster_centers_[:, 1] = km.cluster_centers_[:, 1] * Te_max
    result_km = pd.DataFrame(km.cluster_centers_, columns=['Hm0', 'Te'])
    result_km['weights'] = [(km.labels_ == j).sum() / len(km.labels_) for j in range(N)]
    result_km['Tp'] = result_km.Te / 0.858

    # Sort clusters by weight, as above
    result_km.sort_values('weights', inplace=True, ascending=False)
    result_km['labels'] = list(string.ascii_uppercase[0:N])
    order_km = result_km.index
    idx_km = [int(np.where(order_km == j)[0][0]) for j in np.arange(N)]
    idx_km = [idx_km[lbl] for lbl in km.labels_]
    result_km.reset_index(drop=True, inplace=True)
    results_km[N] = result_km

    km_sc = axs[i, 1].scatter(data_clean.Te, data_clean.Hm0, c=result_km['weights'][idx_km],
                              s=0.5, cmap='viridis', rasterized=True, vmin=0)
    axs[i, 1].scatter(result_km['Te'], result_km['Hm0'], s=40, marker='x', color='w')
    axs[i, 1].title.set_text(f'{N} KMeans Clusters')
    cbar = fig.colorbar(km_sc, ax=axs[i, 1], ticks=result_km['weights'], label='Weight [ ]')
    cbar.ax.set_yticklabels([result_km['labels'][j] + ': ' + str(round(result_km['weights'][j], 3)) for j in range(N)])
    for x, y, lbl in zip(result_km['Te'], result_km.Hm0, result_km['labels']):
        axs[i, 1].text(x + 0.1, y + 0.1, lbl)

plt.setp(axs[len(Ns) - 1, 0], xlabel='Energy Period, $T_e$ [s]')
plt.setp(axs[len(Ns) - 1, 1], xlabel='Energy Period, $T_e$ [s]')

w = ndbc_data[year].columns.values
f = w / (2 * np.pi)  # assuming the spectral columns are angular frequency [rad/s]

# Energy flux of a JONSWAP spectrum for each representative sea state
for N in results:
    result = results[N]
    J = []
    for i in range(len(result)):
        S = resource.jonswap_spectrum(f, result.Tp[i], result.Hm0[i])
        J.append(resource.energy_flux(S, h=399.).values[0][0])

    result['J'] = J
    results[N] = result

for N in results_km:
    result_km = results_km[N]
    J = []
    for i in range(len(result_km)):
        S = resource.jonswap_spectrum(f, result_km.Tp[i], result_km.Hm0[i])
        J.append(resource.energy_flux(S, h=399.).values[0][0])

    result_km['J'] = J
    results_km[N] = result_km

# Ratio of the weighted cluster energy to the total energy in the data ('Total' from the example)
ratios = {}
for N in results:
    J_hr = results[N].J * len(data_clean)
    total_weighted_J = (J_hr * results[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios[N] = np.round(normalized_weighted_J, 4)

ratios_km = {}
for N in results_km:
    J_hr = results_km[N].J * len(data_clean)
    total_weighted_J = (J_hr * results_km[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios_km[N] = np.round(normalized_weighted_J, 4)

(pd.Series(ratios), pd.Series(ratios_km))
```

dtgaebe commented 11 months ago

@ssolson, @ryancoe, @cmichelenstrofer, @jtgrasb just tagging you because we have probably all talked about this at some point.

ryancoe commented 11 months ago

In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.
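A hedged sketch of what such a sweep might look like; `mock_performance` is a purely illustrative stand-in for an actual device/performance model (e.g., a power matrix or a WecOptTool evaluation), not a real API, and `data_clean` is the cleaned DataFrame from the example:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


def mock_performance(Hm0, Te, weight):
    # Illustrative placeholder: swap in a power matrix, WecOptTool run, etc.
    # to get a real quantity of interest such as AEP.
    return float(np.sum(weight * Hm0**2 * Te))


X = np.column_stack((data_clean.Hm0.values, data_clean.Te.values))
records = []
for n in [4, 6, 8, 12, 16]:
    gmm = GaussianMixture(n_components=n).fit(X)
    records.append({"method": "GMM", "n_clusters": n,
                    "QoI": mock_performance(gmm.means_[:, 0], gmm.means_[:, 1], gmm.weights_)})

    km = KMeans(n_clusters=n, random_state=1).fit(X / X.max(axis=0))
    centers = km.cluster_centers_ * X.max(axis=0)  # back to physical units
    w = np.bincount(km.labels_, minlength=n) / len(km.labels_)
    records.append({"method": "KMeans", "n_clusters": n,
                    "QoI": mock_performance(centers[:, 0], centers[:, 1], w)})

qoi_sensitivity = pd.DataFrame(records).pivot(index="n_clusters", columns="method", values="QoI")
```

How flat `qoi_sensitivity` is across the number of clusters, and how close the two columns are, would then indicate how sensitive the chosen quantity of interest is to the clustering choices.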

ssolson commented 10 months ago

> In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.

@ryancoe the study I wrote up stopped at the sea states. I did forward you some preliminary results I found using WecOptTool, although I don't think they will be of much help.

The useful part of the paper I wrote is captured in https://github.com/MHKiT-Software/MHKiT-Python/blob/master/examples/PacWave_resource_characterization_example.ipynb

ssolson commented 10 months ago

@dtgaebe that is a cool idea to use an idealized spectrum and then compare the clusters to the original sea state.

Based on what you show here, it makes sense to, at a minimum, extend the PacWave example to include your ratio comparison, and potentially to consolidate a couple of the steps into a function for MHKiT (a rough sketch of what that could look like is below).
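A rough, hypothetical sketch of what such a helper could look like; the name `cluster_sea_states` and its signature are placeholders, not an existing MHKiT API:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


def cluster_sea_states(Hm0, Te, n_clusters, method="gmm", random_state=None):
    # Hypothetical helper: return representative (Hm0, Te) sea states and their
    # weights, sorted by weight, using either a Gaussian mixture or KMeans.
    X = np.column_stack((Hm0, Te))
    if method == "gmm":
        model = GaussianMixture(n_components=n_clusters, random_state=random_state).fit(X)
        centers, weights = model.means_, model.weights_
    elif method == "kmeans":
        # KMeans is distance-based, so normalize each variable first
        scale = X.max(axis=0)
        model = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X / scale)
        centers = model.cluster_centers_ * scale
        weights = np.array([(model.labels_ == i).mean() for i in range(n_clusters)])
    else:
        raise ValueError("method must be 'gmm' or 'kmeans'")

    result = pd.DataFrame(centers, columns=["Hm0", "Te"])
    result["weight"] = weights
    return result.sort_values("weight", ascending=False).reset_index(drop=True)
```

Usage would then be something like `cluster_sea_states(data_clean.Hm0, data_clean.Te, 8, method="kmeans")`.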

I'm interested to hear your thoughts on next steps for you and perhaps where MHKiT could help with the data processing.

ryancoe commented 10 months ago

Note also that @cmichelenstrofer is working on something that intersects with this, where he's considering expanding the parameterization of sea states based on machine learning.

dtgaebe commented 10 months ago

@ssolson I think that cool idea came from you, or whoever wrote the PacWave resource assessment example, because the ratio is already in there. I just used the same methodology for the KMeans-clustered data.

Personally, I find it very useful to sort and visualize the clusters by weight, independent of the algorithm, so this might be a nice addition to MHKiT (a minimal version of that sorting step is sketched below).
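A minimal sketch of that sort-by-weight step, distilled from the code above; it assumes a per-cluster `result` DataFrame with a `weights` column and a per-sample `labels` array from the clustering:

```python
import string
import numpy as np

# `result` has one row per cluster with a 'weights' column; `labels` holds the
# cluster index assigned to each sample by the clustering algorithm.
order = np.argsort(result["weights"].values)[::-1]       # cluster indices, heaviest first
result = result.iloc[order].reset_index(drop=True)
result["labels"] = list(string.ascii_uppercase[: len(result)])

# Rank of each original cluster index after sorting, used to color samples by weight
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
sample_weight = result["weights"].values[rank[labels]]   # weight of each sample's cluster
```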

I would think that developers would use sea state clusters for gain scheduling. So, as Ryan suggested, it would be interesting how the different clusters impact performance and predicted performance and how sensitive the results are to number of clusters and algorithm. We did a study with sensitivity analysis to bulk spectral bulk parameters here: OCEAN2021. We could potentially do something similar here.