Closed uwagura closed 2 months ago
Sorry for the bad news: I am getting sliiightly different results: an RMSE of 0.70 relative to the regional climatology instead of 0.71. The actual number is 0.7015044134823573, so it's not just a borderline rounding difference.
Seems like the results are sensitive to what grid the regional climatologies are combined onto. I can reproduce the RMSE = 0.70 by changing the script I used in the paper to use the lat/lon values in this repo's config.yaml
Yeah, I had noticed this previously but forgot to mention it in this PR. The grid in the config.yaml is based on the grid in the original version of the sst_eval.py script. Do we want to keep using this single grid for all scripts, or should we define different grids for different scripts?
@uwagura, I don’t have a strong preference, but it seems more straightforward to use the same single grid in the example for demonstration purposes. This approach should work unless there are specific accuracy concerns with observational data in certain areas that we’re aware of and want to avoid. @andrew-c-ross, do you have any additional thoughts?
Ideally the grid being used for interpolation here should match the grid used by the regional climatologies. The idea was not really to interpolate them to a grid, just to use the interpolation as an easy way of combining different subsets of a larger domain onto the larger domain. For example, for the NWA (and elsewhere?), the regional climatologies are all on a 1/10 degree grid with center points at x.05 (i.e., lat = 0.05, 0.15, 0.25, etc.). I think it would be best if the grid being interpolated to had the same center points.
To stick with the config files while also adapting to the regional climatology grid, I think it should be possible to do something like setting np.floor(config['lat']['south'] / .05) * .05
as the start of the lat grid and np.ceil(config['lat']['north'] / .05) * .05
as the end of the lat grid, and the same for lon. Assuming the other regions are also 1/10 degree with the same centerpoints.
Thanks @andrew-c-ross, I agree that the interpolated grid should have the same centerpoints, and I like the idea you just proposed. @uwagura, do you think you can make one more change to address @andrew-c-ross's point?
Won't this result in the same output as what I currently have? I.e., np.floor(0 / .05) * .05 = 0.0 and np.ceil(60 / .05) * .05 = 60.0, so np.arange(0.0, 60.0, 0.1) = 0.0, 0.1, 0.2, and so on.
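As a quick check of that arithmetic (using the same 0 and 60 endpoints, purely as an example):

```python
import numpy as np

# Proposed floor/ceil-by-0.05 rounding applied to example bounds of 0 and 60
start = np.floor(0 / .05) * .05   # 0.0
stop = np.ceil(60 / .05) * .05    # 60.0
grid = np.arange(start, stop, 0.1)
# grid = 0.0, 0.1, 0.2, ... -- the same grid as before, with centers still on x.0 rather than x.05
```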
Oh, good point. I guess it should be something closer to round(config['lat']['south'], 1) - .05 and round(config['lat']['north'], 1) + .05, i.e., rounding to the nearest 0.1 and then extending by 0.05.
Perhaps something like:
```python
import numpy as np

# config holds the lat/lon bounds read from config.yaml
# Grid creation with center points at x.05 and a 0.1 degree interval
regional_grid = {
    'lat': np.arange(np.floor(config['lat']['south'] * 10) / 10 - 0.05,
                     np.ceil(config['lat']['north'] * 10) / 10 + 0.1, 0.1),
    'lon': np.arange(np.floor(config['lon']['west'] * 10) / 10 - 0.05,
                     np.ceil(config['lon']['east'] * 10) / 10 + 0.1, 0.1)
}
```
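As a quick illustration of what that snippet produces, here is a sketch with made-up bounds of 0 and 60 (borrowed from the earlier example, not the actual config values):

```python
import numpy as np

# Hypothetical bounds for illustration only; the real values come from config.yaml
config = {'lat': {'south': 0.0, 'north': 60.0}}

lat = np.arange(np.floor(config['lat']['south'] * 10) / 10 - 0.05,
                np.ceil(config['lat']['north'] * 10) / 10 + 0.1, 0.1)
# lat runs from -0.05 up to roughly 60.05 in 0.1 steps, so the cell centers land on
# the x.05 points used by the regional climatologies
```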
BTW, I double-checked the NEP regional climatology data, and they are also at 1/10 degree with the x.05 center points. So this should be a reasonable assumption, at least for 1/10 degree regional climatology data.
@yichengt900, I've made the change you suggested. The RMSE between the model and the regional climatologies changed ~slightly~ for the NEP with the center points grid, but I don't think this is a big issue.
@uwagura, thank you for the updates and for testing the script on NEP as well. I have two quick questions. Below is the SSS plot from Liz's manuscript:
The metrics in her plot differ from ours. I'm wondering if this discrepancy is primarily due to the domain we selected (as it seems we've excluded most of the Bering Sea?), or if it's because our combined function applies a weighting method for the overlapping areas, while Liz's approach appears to use the most recent climatology (i.e., above 50°N) directly. This isn’t a major issue or blocker, but I’m curious about which factor contributes more to the differences.
@yichengt900, it seems to be more sensitive to the domain? I tried to mimic Liz's domain as best as I could (i.e., I extended my coordinates to include the Bering Sea and excluded everything north of Alaska), and the bias, MEDAE, and correlation moved slightly closer to the values that she has:
When I remove the weights but keep the same extent, there doesn't seem to be any change
Likewise, when I include the Bering Sea, removing the weights has no effect:
Do you know if Liz is using the same NCEI climatologies I used?
@uwagura, thank you for the thorough testing. I believe Liz is using the same data; the major difference seems to be in the overlap area (i.e., above 50°N), where she directly applied the NNP climatology (Version 2). It's good to confirm that the domain is a significant factor, aligning with Andrew's main point. I think we can revisit the weighting in the future, and I believe this PR is ready to go! Thanks again for your excellent work!
This pull request adds config options to the config.yaml file to run sss_eval.py across domains. It also introduces helper functions to open and regrid regional climatology files from the National Centers for Environmental Information (NCEI), and modifies the previously introduced process_glorys and process_oisst scripts to select variables other than thetao from the GLORYS and OISST datasets. The new generalized version can reproduce its previous output for the NWA domain (i.e., the "new" output mentioned in issue #87), and also works in the NEP domain. Since the NCEI does not provide regional climatologies for the Hawaiian Islands and does not provide decadal data for the Arctic, the new script has not been tested in those domains.
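For context, a minimal sketch of how the domain bounds referenced in this thread are read from config.yaml (the exact set of options added in this PR may be broader):

```python
import yaml

# Load the domain bounds discussed above from config.yaml; key names follow the
# config['lat']['south'] / config['lon']['west'] style used in the comments.
with open('config.yaml') as f:
    config = yaml.safe_load(f)

south, north = config['lat']['south'], config['lat']['north']
west, east = config['lon']['west'], config['lon']['east']
```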
A note about the NWA output: the previous version of this script opened the regional climatology files for the Northwest Atlantic using the compat='override' argument to xarray.open_mfdataset. https://github.com/NOAA-GFDL/CEFI-regional-MOM6/blob/ad3b53b6e889dc1649216e5533f14dfa9af936ff/diagnostics/physics/sss_eval.py#L40 This argument caused xarray to set the data from 2005-2017 to NaN when merging the datasets, meaning that we were effectively just plotting 1995-2004 previously. This newer version instead uses xarray.open_mfdataset([files], combine='nested', concat_dim='time'), which should produce the same output as calling xarray.open_mfdataset without any options, but without throwing away any data. These arguments have the added benefit of allowing the combine_regional_climatology function to work with files that have identical dimensions (such as the regional climatologies for the NEP domain), something which raises an error when using the xarray default combine='by_coords'.
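For reference, a minimal sketch of the change described above; the file names below are placeholders for the NCEI regional climatology paths, not the actual files:

```python
import xarray as xr

# Placeholder file list; the real script builds this from the NCEI climatology paths
files = ['regional_climatology_1995-2004.nc', 'regional_climatology_2005-2017.nc']

# Previous approach: compat='override' quietly dropped the later period when merging
# ds = xr.open_mfdataset(files, compat='override')

# New approach: concatenate the per-period files along the time dimension instead
ds = xr.open_mfdataset(files, combine='nested', concat_dim='time')
```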
One last note about the NEP plot: the regional climatologies used in this plot cover slightly different periods since they are actually two domains stitched together (a Northern North Pacific region covering 1995-2014, and a Northeast Pacific region covering 1995-2012), meaning this plot is sort of comparing apples to oranges. The combine_regional_climatology function lets you do this since it calculates the period covered by each region from the file name so that it can apply different weights to each region file. As with the last pull request, I wouldn't take the actual plot / skill metrics too seriously, but I would like to hear people's thoughts on whether or not combine_regional_climatology should allow you to input files covering different time periods.
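To make that last point concrete, here is a rough sketch of the period-from-filename idea as I understand it from the description; the filename pattern, the regex, and the proportional-to-years weighting are assumptions for illustration, not necessarily what combine_regional_climatology actually does:

```python
import re

def period_years(filename):
    """Hypothetical helper: pull the covered period (e.g. '1995-2014') out of a
    climatology file name and return the number of years it spans."""
    match = re.search(r'(\d{4})-(\d{4})', filename)
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1

# e.g. a Northern North Pacific file covering 1995-2014 spans 20 years, while a
# Northeast Pacific file covering 1995-2012 spans 18; where the regions overlap,
# each file's contribution could be weighted in proportion to the years it covers.
files = ['nnp_1995-2014.nc', 'nep_1995-2012.nc']  # hypothetical file names
years = {f: period_years(f) for f in files}
weights = {f: y / sum(years.values()) for f, y in years.items()}
```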