config means/std incorrect for training data

andregraubner / ClimateNet

Climate Analytics using Deep Neural Networks in Python.

https://www.nersc.gov/research-and-development/data-analytics/big-data-center/climatenet/

MIT License

59 stars, 25 forks

config means/std incorrect for training data #9

Closed jtruesdal closed 2 years ago

jtruesdal commented 2 years ago

@andregraubner The mean and std values in the configuration file look to be incorrect. I am assuming these are the actual means and stds calculated from the training data and used for normalization. For instance, the mean for PSL is 1619.3, which is odd because it is calculated from a field that is in Pa (94000-105000). As long as the data used for inference is also normalized using these wonky values, the model is able to pick out features correctly, but as soon as means calculated from the inference data are used, the trained model is unable to detect ARs and TCs.
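To illustrate why swapping the statistics breaks inference, here is a minimal sketch. It assumes the model applies plain z-score normalization, (x - mean) / std, with constants from the config, which this thread implies but does not show. The `normalize` helper and the sample PSL values are hypothetical, and the same std is used in both cases purely to isolate the effect of the mean.

```python
import numpy as np

def normalize(field, mean, std):
    """Z-score normalization: (x - mean) / std."""
    return (np.asarray(field, dtype=np.float64) - mean) / std

# PSL mean quoted from the config in this thread, and the recomputed
# training-data mean reported further down the thread.
CONFIG_PSL_MEAN = 1619.3
RECOMPUTED_PSL_MEAN = 100814.07031
# Assumption: one std for both cases, so only the mean differs.
PSL_STD = 1454.36969

# Synthetic PSL values in Pa, for illustration only.
psl = np.array([98000.0, 101325.0, 103000.0])

seen_in_training = normalize(psl, CONFIG_PSL_MEAN, PSL_STD)
seen_after_swap = normalize(psl, RECOMPUTED_PSL_MEAN, PSL_STD)

# Constant shift, in std units, between what the network was trained on
# and what it receives once the statistics are swapped.
offset = seen_in_training - seen_after_swap
print(offset)
```

Under these assumptions the inputs move by tens of standard deviations, so a network trained with one set of constants has to be fed data normalized with the same constants, or be retrained.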

andregraubner commented 2 years ago

Thanks for bringing this to our attention! You seem to be absolutely correct; we'll look into possible reasons for this and get back to you. Is the dataset you want to run the model on very different statistically from the CAM5.1 data our model was trained on? You can try using the model with the wonky normalization for now and see if there are any performance issues.

jtruesdal commented 2 years ago

Hi Andre. I work for NCAR, and our group is actually collaborating with the ClimateNet team, although that started a few years ago while Prabhat was still there. We produced the CAM5.1 data that is being used for training. The CAM6 data we are using should have the same statistics as the data in your repository. Although these bad normalizing values do produce valid-looking masks, the normalizing data looks to be far enough off (at least for PSL) to affect the accuracy. I just wanted to make sure I wasn't missing something. We will recalculate the means/stds, fix the config file, and retrain; I trust you'll do the same. I think the new interface looks great, by the way. Nice job!

TeaganKing commented 2 years ago

Hi @andregraubner, I've been working with @jtruesdal and happened to recently calculate some of the mean and standard deviation values for the training dataset, so I thought I'd share that here in case it is helpful for cross-checking.

TMQ mean: 19.21849, TMQ std: 15.73182
U850 mean: 1.55302, U850 std: 8.27790
V850 mean: 0.25413, V850 std: 6.21594
PSL mean: 100814.07031, PSL std: 1454.36969
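For anyone cross-checking, an unweighted mean and std over all space/time values can be sketched as below. This is not the code used to produce the numbers above; the `unweighted_stats` helper and the synthetic array are hypothetical, and the array is assumed to be laid out as (time, lat, lon).

```python
import numpy as np

def unweighted_stats(data):
    """Mean and population std over every value of one variable.

    Casting to float64 first avoids precision loss when the source
    data is stored as float32 and the array is large.
    """
    data = np.asarray(data, dtype=np.float64)
    return float(data.mean()), float(data.std())

# Synthetic stand-in for one variable with shape (time, lat, lon).
rng = np.random.default_rng(0)
tmq = rng.gamma(shape=2.0, scale=10.0, size=(5, 16, 32)).astype(np.float32)

mean, std = unweighted_stats(tmq)
print(f"mean: {mean:.5f} std: {std:.5f}")
```

Note that `np.std` defaults to the population form (`ddof=0`); with this many samples the difference from the sample form is negligible.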

andregraubner commented 2 years ago

Thank you very much. We will soon provide an additional pre-trained model using these values. I'll post an update here then and close the issue accordingly. Please reach out if anything else pops up!

andregraubner commented 2 years ago

Thank you again for bringing this to our attention. We have updated the pre-trained model accordingly and verified that the results reported in the paper still hold. Please don't hesitate to reach out if anything else comes up.

katiedagon commented 1 year ago

@andregraubner I'm revisiting this issue as I have two questions about best practices for calculating these means and standard deviations for the cgnet config file. I'd be interested to hear your and others' thoughts.

Should the mean calculation utilize a weighted mean across space (lat/lon)? The numbers @TeaganKing shared above are for an unweighted mean across space/time for the training data.
Should the standard deviation include any weighting? Currently we are calculating standard deviation across all space/time values for a single variable.

For reference, here are the mean values if you include weighting by cos(lat) over space. They do differ from the values above, but not very significantly:

TMQ mean: 24.92724
U850 mean: 1.03567
V850 mean: 0.20848
PSL mean: 101095.0352
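One way to apply cos(lat) weighting to both the mean and the standard deviation is sketched below. This is an illustration, not necessarily what the linked notebook does: the `weighted_stats` helper is hypothetical, the data is assumed to be shaped (time, lat, lon) on a regular lat/lon grid, and the weighted std is taken as the square root of the weighted mean squared deviation.

```python
import numpy as np

def weighted_stats(data, lat_deg):
    """cos(lat)-weighted mean and std over all space/time values.

    data    : array shaped (time, lat, lon)
    lat_deg : 1-D array of latitudes in degrees, length data.shape[1]
    """
    data = np.asarray(data, dtype=np.float64)
    w = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=np.float64)))
    # Broadcast the per-latitude weights to the full (time, lat, lon) shape.
    weights = np.broadcast_to(w[None, :, None], data.shape)
    mean = np.average(data, weights=weights)
    var = np.average((data - mean) ** 2, weights=weights)
    return float(mean), float(np.sqrt(var))

# Tiny synthetic example: values depend only on latitude.
lat = np.array([-60.0, 0.0, 60.0])
field = np.zeros((2, 3, 4))
field[:, 1, :] = 10.0  # nonzero only at the equator

print("unweighted mean:", field.mean())
print("weighted mean/std:", weighted_stats(field, lat))
```

Because grid cells shrink toward the poles on a regular lat/lon grid, the weighting gives more influence to low latitudes, which is consistent with the larger weighted TMQ mean reported above.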

I have a notebook here if you want to take a closer look at the calculations: https://github.com/katiedagon/ML-extremes/blob/main/notebooks/get_averages_and_standard_devs.ipynb