MHKiT-Software / MHKiT-Python

MHKiT-Python provides the marine renewable energy (MRE) community tools for data processing, visualization, quality control, resource assessment, and device performance.
https://mhkit-software.github.io/MHKiT/
BSD 3-Clause "New" or "Revised" License

v0.8.0 bug fixes and doc string fixes #327

Closed. akeeste closed this 2 weeks ago

akeeste commented 1 month ago

This PR addresses several bugs and docstring issues in v0.8.0. I think it's critical that we make these fixes before our workshop next Thursday.

I tested all of our Python examples on a clean install of MHKiT with Python 3.11. Several examples have errors.

It also incorporates several minor documentation fixes.

akeeste commented 1 month ago

@ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.

akeeste commented 1 month ago

@ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.

Fortunately I found a work around that allows us to get our documentation updated quickly before OREC. So we don't need these changes pulled in immediately. But long term we should still look at how our development workflows on each repo tie into doc updates.

akeeste commented 1 month ago

@ssolson I identified issues within several Python examples and am working on getting these fixed ASAP. I reworked this PR to focus on bugs, not just minor documentation items. I'd like to make this a bugfix release as soon as possible.

The notebook vs script runs may have inadvertently been an environment issue (3.9 vs 3.11). Checking on that now.

akeeste commented 1 month ago

Update on WPTO_hindcast_example.ipynb: In #220 there was a slight change to the variable naming: `variable_{i}` --> `variable_{gid}`. GitHub was not comparing files well in that PR and we probably missed that change. Unless there's a good reason to keep `gid` appended onto the name, we should revert back to appending `i`. The old tests reflect that. It's unknown why they passed in that PR but broke later because of that renaming.
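
For illustration, the two conventions produce column names like this (the variable name and gids below are hypothetical, not taken from the hindcast output):

variable = "energy_period"
gids = [443677, 443678]

# old style: energy_period_0, energy_period_1
cols_by_index = [f"{variable}_{i}" for i, _ in enumerate(gids)]
# post-#220 style: energy_period_443677, energy_period_443678
cols_by_gid = [f"{variable}_{gid}" for gid in gids]
print(cols_by_index)
print(cols_by_gid)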

akeeste commented 1 month ago

Update on qc_example.ipynb: In #276, `index_to_datetime` was accidentally removed from `mhkit.utils.time_utils`. It has been re-added, fixing the qc example.
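
For reference, the conversion that helper performs presumably boils down to building a DatetimeIndex from a numeric index; a minimal pandas-only sketch (not the MHKiT signature) looks like:

import pandas as pd

# Hypothetical numeric index in seconds since the Unix epoch
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]},
                  index=[1700000000, 1700000060, 1700000120])

# index_to_datetime wraps this style of conversion; shown here with plain pandas
df.index = pd.to_datetime(df.index, unit="s")
print(df.index)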

ssolson commented 1 month ago

@ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.

The pypi.yml release currently works exactly the way you request it to work here.

  • on push we create a release to test.pypi.org
  • on release we create a release on pypi.org

ssolson commented 1 month ago

I have cleared the cache and re-run multiple times.

ssolson commented 1 month ago

PacWave_resource_characterization_example.ipynb - Folium map error

This is not an MHKiT error

You just need to pip install folium

folium is only used to add context to the example and is not a requirement of MHKiT
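
For anyone hitting the same error, a minimal sketch of the kind of context map folium adds (the buoy location below is approximate and assumed for illustration, not pulled from the notebook):

import folium

# Approximate location of NDBC buoy 46050 off Newport, OR (assumed for illustration)
buoy_lat, buoy_lon = 44.66, -124.5

m = folium.Map(location=[buoy_lat, buoy_lon], zoom_start=8)
folium.Marker([buoy_lat, buoy_lon], popup="NDBC 46050").add_to(m)
m.save("pacwave_context_map.html")  # open in a browser to view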

ssolson commented 1 month ago

PacWave_resource_characterization_example.ipynb - Folium map error

This is not an MHKiT error

You just need to pip install folium

folium is only used to add context to the example and is not a requirement of MHKiT

Apologies and never mind.

I see, the issue is that the basemap was showing up gray. I have now fixed this in https://github.com/akeeste/MHKiT-Python/pull/4

ssolson commented 1 month ago

I take responsibility for not catching this, but I did not realize that the wave module's xarray conversion made the functions so slow that they are now unusable.

E.g. in the PacWave resource characterization, this block now takes more than 30 minutes... unless you convert Tp back to pandas, in which case it only takes 9 minutes.

@akeeste I know the wave module is still in a bit of transition, but do you expect the run times to be significantly reduced once the transition is complete? I am questioning our approach of always converting pandas to xarray.

# Initialize empty lists to store the results from each year
Hm0_list = []
Te_list = []
J_list = []
Tp_list = []
Tz_list = []

# Iterate over each year and save the result in the initialized lists
for year in ndbc_data:
    data_raw = ndbc_data[year]
    year_data = data_raw[data_raw != 999.0].dropna()
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    J_list.append(resource.energy_flux(year_data.T, h=399.0))

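    # Workaround: compute Tp directly with pandas; the xarray-based
    # resource.peak_period is currently much slower (see note above)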
    # Tp_list.append(resource.peak_period(year_data.T))
    fp = year_data.T.idxmax(axis=0)
    Tp = 1/fp
    Tp = pd.DataFrame(Tp, index=year_data.T.columns, columns=["Tp"])
    Tp_list.append(Tp)

    Tz_list.append(resource.average_zero_crossing_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
J = pd.concat(J_list, axis=0)
Tz = pd.concat(Tz_list, axis=0)
data = pd.concat([Hm0, Te, Tp, J, Tz], axis=1)

# Calculate wave steepness
data["Sm"] = data.Hm0 / (9.81 / (2 * np.pi) * data.Tz**2)

# Drop any NaNs created from the calculation of Hm0 or Te
data.dropna(inplace=True)
# Sort the DateTime index
data.sort_index(inplace=True)
akeeste commented 1 month ago

@ssolson is it possible we can change our automated pypi release to be triggered by Github releases instead of commits to master? The current automation prevents us from making minor documentation updates after a release. We typically don't get to this in time and the documentation is built on master which we only see after a release.

The pypi.yml release currently works exactly the way you request it to work here.

  • on push we create a release to test.pypi.org
  • on release we create a release on pypi.org

@ssolson Thanks for clarifying. I missed that when skimming the pypi.yml workflow.

akeeste commented 1 month ago

@ssolson thanks for addressing the CDIP and D3D examples in akeeste#4.

I was also able to get the directional_waves example to run once the cache was cleared. Maybe there's an issue using pickle to read data into Python 3.11 when it was cached in 3.9? That's the situation I had locally. Not something we need to address, but something to be aware of.

On the note of computational speed-- In my experience the delay is not coming from performing calculations on xarray data vs on pandas data, but rather from the actual conversion between pandas<-->xarray. Some methods are faster than others. I noticed that using xr.Dataset(data) is significantly faster than ds = data.to_xarray(). From digging into it briefly, the to_xarray method calls a large number of checks on the data being converted from pandas, which becomes expensive when there are lots of variables and timesteps--as is the case when we pull wave spectra over long periods of time.

I implemented a few changes to our type handling to address this, but perhaps there's still an instance that uses to_xarray() on pandas data. Have you noticed a specific part of the PacWave example that is slowing down?
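
To make the conversion cost concrete, here is a small self-contained timing sketch; the data shape is made up to mimic a year of spectra (frequency as the index, one column per timestamp), and the direct xr.Dataset construction below is just a stand-in for the faster path, not MHKiT's exact implementation:

import time
import numpy as np
import pandas as pd
import xarray as xr

# Made-up stand-in for a year of spectra: 64 frequency bins x 5000 timestamps
freq = np.linspace(0.03, 0.5, 64)
df = pd.DataFrame(
    np.random.rand(64, 5000),
    index=pd.Index(freq, name="Frequency"),
    columns=[f"t{i}" for i in range(5000)],
)

t0 = time.perf_counter()
ds_slow = df.to_xarray()  # pandas' conversion path; per-column overhead adds up
t1 = time.perf_counter()
ds_fast = xr.Dataset(
    {col: ("Frequency", df[col].values) for col in df.columns},
    coords={"Frequency": freq},
)  # direct construction with known dims, far less overhead
t2 = time.perf_counter()

print(f"df.to_xarray():    {t1 - t0:.2f} s")
print(f"direct xr.Dataset: {t2 - t1:.2f} s")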

ssolson commented 1 month ago

@ssolson thanks for addressing the CDIP and D3D examples in akeeste#4.

I was also able to get the directional_waves example to run once the cache was cleared. Maybe there's an issue using pickle to read data into Python 3.11 when it was cached in 3.9? That's the situation I had locally. Not something we need to address, but something to be aware of.

On the note of computational speed-- In my experience the delay is not coming from performing calculations on xarray data vs on pandas data, but rather from the actual conversion between pandas<-->xarray. Some methods are faster than others. I noticed that using xr.Dataset(data) is significantly faster than ds = data.to_xarray(). From digging into it briefly, the to_xarray method calls a large number of checks on the data being converted from pandas, which becomes expensive when there are lots of variables and timesteps--as is the case when we pull wave spectra over long periods of time.

I implemented a few changes to our type handling to address this, but perhaps there's still an instance that uses to_xarray() on pandas data. Have you noticed a specific part of the PacWave example that is slowing down?

Apologies my previous message was not clear.

Yes there are issues with the computation (not just the conversion) on the converted xarrays. Specifically with the calculation of Tp.

This morning I created the following script and found Tp to be the primary bottleneck. I commented out the built-in MHKiT Tp and used the MHKiT v0.7.0 pandas version inline, and found a significant speed increase (~60% faster).

Time results using built-in Tp

Time results using previous pandas Tp

Look at the results for the Tp calculation vs the others:

import mhkit
from mhkit.wave import resource, performance, graphics, contours
from mhkit.wave.io import ndbc
import matplotlib.pyplot as plt
import numpy as np
import time
import xarray as xr
import pandas as pd

# Get buoy metadata
buoy_number = "46050"
buoy_metadata = ndbc.get_buoy_metadata(buoy_number)

# Spectral wave density for buoy 46050
parameter = "swden"

# Request list of available files
ndbc_available_data = ndbc.available_data(parameter, buoy_number)

# Pass file names to NDBC and request the data
filenames = ndbc_available_data["filename"]
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

# Initialize empty lists to store the results from each year
Hm0_list = []
Te_list = []
J_list = []
Tp_list = []
Tz_list = []

# Dictionaries to store the time taken by each function
timing_data = {
    "Hm0": [],
    "Te": [],
    "J": [],
    "Tp": [],
    "Tz": []
}

# List to store the total time for each year
yearly_times = []

for year in list(ndbc_data.keys())[:5]:
    data_raw = ndbc_data[year]
    year_data = data_raw[data_raw != 999.0].dropna()
    year_data_xr = xr.Dataset(year_data.T)  # xarray version of this year's spectra (not used below)

    year_start_time = time.time()  # Start time for the year

    start_time = time.time()
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    timing_data["Hm0"].append(time.time() - start_time)

    start_time = time.time()
    Te_list.append(resource.energy_period(year_data.T))
    timing_data["Te"].append(time.time() - start_time)

    start_time = time.time()
    J_list.append(resource.energy_flux(year_data.T, h=399.0))
    timing_data["J"].append(time.time() - start_time)

    start_time = time.time()
    # Tp_list.append(resource.peak_period(year_data.T))
    fp = year_data.T.idxmax(axis=0)
    Tp = 1/fp
    Tp = pd.DataFrame(Tp, index=year_data.T.columns, columns=["Tp"])
    timing_data["Tp"].append(time.time() - start_time)

    start_time = time.time()
    Tz_list.append(resource.average_zero_crossing_period(year_data.T))
    timing_data["Tz"].append(time.time() - start_time)

    yearly_times.append(time.time() - year_start_time) 

# Print the timing data for each function
total_time = 0
for func_name, times in timing_data.items():
    avg_time = np.mean(times)
    sum_time = np.sum(times)
    print(f"Average time for {func_name}: {avg_time:.4f} seconds")
    # print(f"Total time for {func_name}: {sum_time:.4f} seconds")
    total_time += sum_time

# Calculate and print the average time to calculate metrics in a year
avg_yearly_time = np.mean(yearly_times)
print(f"Average time to calculate metrics in a year: {avg_yearly_time:.4f} seconds")

# Print the total time for all metrics
print(f"Total time for all metrics: {total_time:.4f} seconds")
akeeste commented 1 month ago

@ssolson Okay, understood. Seems odd, as the methodology is essentially the same now, just applied to an xarray Dataset (and with type handling). I wonder if the delay is mostly in the type conversion or in xarray's idxmax function.

How does the expense compare if the spectra are already an xarray Dataset (type handling eliminated) and we calculate Tp with:

   fp = S.idxmax(dim=frequency_dimension)  # Hz
   Tp = 1 / fp
ssolson commented 1 month ago

@ssolson Okay, understood. Seems odd, as the methodology is essentially the same now, just applied to an xarray Dataset (and with type handling). I wonder if the delay is mostly in the type conversion or in xarray's idxmax function.

How does the expense compare if the spectra are already an xarray Dataset (type handling eliminated) and we calculate Tp with:

   fp = S.idxmax(dim=frequency_dimension)  # Hz
   Tp = 1 / fp

That is what I found. Finding the argmax is a very expensive operation on the xarray.

I spent some time trying to fix this but I realized I was getting outside the scope of this PR pretty quickly.

So I wanted to circle back with you and see if time gains from a full implementation of xarray are to be expected or if we should reconsider our approach?

akeeste commented 1 month ago

I don't expect a significant improvement in computational expense in the future just by using xarray. Let's discuss approaches at our next monthly meeting.

ssolson commented 1 month ago

@akeeste can you merge the fixes in https://github.com/akeeste/MHKiT-Python/pull/5?

jmcvey3 commented 4 weeks ago

If I can jump in, part of this problem might be because y'all are using Datasets for single variables rather than DataArrays. The line fp = S.idxmax(dim=frequency_dimension) will run a lot faster on a DataArray using fp = S["Frequency"].where(S==S.max("Frequency"), drop=True).

If you want to keep the Datasets, that line gets more complicated but is still a touch faster: fp = S["0"]["Frequency"].where(S["0"]==S["0"].max("Frequency"), drop=True)
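
For concreteness, here's a self-contained sketch on a single synthetic spectrum (the frequency grid and values are made up); both lines locate the spectral peak, the second without idxmax:

import numpy as np
import xarray as xr

# Synthetic 1-D spectrum with a made-up frequency grid peaking near 0.1 Hz
freq = np.linspace(0.03, 0.5, 64)
S = xr.DataArray(np.exp(-((freq - 0.1) ** 2) / 0.001),
                 dims="Frequency", coords={"Frequency": freq})

fp_idxmax = S.idxmax(dim="Frequency")                                 # current approach
fp_where = S["Frequency"].where(S == S.max("Frequency"), drop=True)   # suggested approach

Tp = 1.0 / fp_idxmax  # peak period from the peak frequency
print(float(fp_idxmax), float(fp_where.squeeze()), float(Tp))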

ssolson commented 3 weeks ago

If I can jump in, part of this problem might be because y'all are using Datasets for single variables rather than DataArrays. The line fp = S.idxmax(dim=frequency_dimension) will run a lot faster on a DataArray using fp = S["Frequency"].where(S==S.max("Frequency"), drop=True).

If you want to keep the Datasets, that line gets more complicated but is still a touch faster: fp = S["0"]["Frequency"].where(S["0"]==S["0"].max("Frequency"), drop=True)

@jmcvey3 thank you for jumping in. I have not tried this yet but looks promising.

From our meeting yesterday @akeeste is going to address the outstanding issues to pass tests and get these bug fixes through.

Because the proposed solution changes the general approach, we are going to address this in a follow-on PR focused on the larger xarray strategy.

akeeste commented 2 weeks ago

There are still a couple of tests failing. I'm not sure why the mooring test is failing, since I only updated the docstring (for better formatting in the built docs). I'll continue looking into this and narrow in on the issue in the metocean example.

akeeste commented 2 weeks ago

Part of matplotlib that is called in the mooring animation, unrelated to this PR, is deprecated in matplotlib 3.9+, causing mooring_test to fail here. I'm guessing that if we re-ran tests on dev, it would fail there too. @hivanov-nrel any insight on how we can fix this?

hivanov-nrel commented 2 weeks ago

Hi Adam, it looks like the issue is on line 89 and line 129. The inputs need to be sequences, so the fix is to put brackets around them like so:

line.set_data([x_data], [y_data])

Do you mind implementing this on your PR or should I do this fix separately?
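
For context, a minimal standalone reproduction of the new behavior (variable names are illustrative):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([], [])

# Scalar values updated each animation frame (illustrative)
x_data, y_data = 1.0, 2.0

# matplotlib 3.9+ requires sequences here; passing bare scalars now raises
line.set_data([x_data], [y_data])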

akeeste commented 2 weeks ago

Hi Adam, it looks like the issue is on line 89 and line 129. The inputs need to be sequences, so the fix is to put brackets around them like so:

line.set_data([x_data], [y_data])

Do you mind implementing this on your PR or should I do this fix separately?

Thanks for taking a quick look @hivanov-nrel. That fixed mooring_test; I'll add it to this PR.