NVlabs / FourCastNet

Initial public release of code, data, and model weights for FourCastNet

The way to calculate `time_means` in the script get_stats.py is wrong #3

Open veya2ztn opened 2 years ago

veya2ztn commented 2 years ago

Please see: https://github.com/NVlabs/FourCastNet/blob/master/data_process/get_stats.py

```python
time_means = np.zeros((1,21,721, 1440))     # pre-allocated to zeros

for ii, year in enumerate(years):
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/'+ str(year) + '.h5', 'r') as f:
        rnd_idx = np.random.randint(0, 1460-500)
        global_means += np.mean(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis = (0,2,3))
        global_stds += np.var(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis = (0,2,3))

global_means = global_means/len(years)
global_stds = np.sqrt(global_stds/len(years))
time_means = time_means/len(years)          # never accumulated in the loop, so this stays zero
```

Following this script, `time_means` stays constant zero. What is the correct definition of this value?

BTW, may I ask how you computed the time_means_daily.h5 file? From its size (127 GB) I can only guess it is a $(1460, 21, 720, 1440)$ tensor.
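A minimal sketch of one plausible fix, assuming `time_means` is meant to be the per-grid-point average over the sampled time steps (matching the pre-allocated shape `(1, 21, 721, 1440)`); the path and year range are placeholders, not the authors' actual settings:

```python
import h5py
import numpy as np

years = range(1979, 2016)                  # placeholder year range, not the authors' setting
time_means = np.zeros((1, 21, 721, 1440))

for ii, year in enumerate(years):
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/' + str(year) + '.h5', 'r') as f:
        rnd_idx = np.random.randint(0, 1460 - 500)
        # average over the time axis only, keeping channel and spatial dims,
        # so each year contributes a (1, 21, 721, 1440) per-pixel mean
        time_means += np.mean(f['fields'][rnd_idx:rnd_idx + 500], keepdims=True, axis=0)

time_means = time_means / len(years)       # per-pixel climatology instead of all zeros
```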

YueZhou-oh commented 1 year ago

hey, do the training and test .h5 files, e.g. train/2015.h5, have a similar data shape (4D data)?

phrasenmaeher commented 1 year ago

I am also wondering about that; did you find any solution so far? In their paper they write:

> we use a time-averaged climatology in this work, motivated by [Rasp et al., 2020]

which is defined just above Eq. A1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2020MS002405, so that seems to be the correct way 🤷🏼
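For reference, a time-averaged climatology in that sense could be computed roughly as in the sketch below. This is only my reading of the definition; it assumes each training file holds a `fields` dataset of shape (time, channel, lat, lon) and is not the authors' code:

```python
import h5py
import numpy as np

def time_averaged_climatology(files, chunk=100):
    # Mean over all time steps of all files, per variable and grid point.
    total, count = None, 0
    for path in files:
        with h5py.File(path, 'r') as f:
            fields = f['fields']                             # (time, channel, lat, lon)
            for start in range(0, fields.shape[0], chunk):
                block = fields[start:start + chunk]          # read a chunk to limit memory
                total = block.sum(axis=0) if total is None else total + block.sum(axis=0)
                count += block.shape[0]
    return total / count                                     # (channel, lat, lon)
```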

phrasenmaeher commented 1 year ago

Digging further into this, I found this description in the appendix:

> long-term-mean-subtracted value of predicted (/true) variable v at the location denoted by the grid co-ordinates (m, n) at the forecast time-step l. The long-term mean of a variable is simply the mean value of that variable over a large number of historical samples in the training dataset. The long-term mean-subtracted variables $\tilde{X}_{\text{pred/true}}$ represent the anomalies of those variables that are not captured by the long term mean values

which reads as: we subtract from our variables their long-term mean -- which we do during data loading, and that mean is correctly computed over a long term (in get_stats.py).

-- Edit: however, the variables are also scaled by their std_dev, so it's not only the mean that is removed.
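To make the distinction concrete, here is a hedged sketch of the two operations as I understand them; the function names, shapes, and latitude weights are assumptions for illustration, not the repo's actual API:

```python
import numpy as np

def normalize(x, global_means, global_stds):
    # Normalization applied at data-loading time: remove the per-channel mean and
    # scale by the per-channel std (both computed in get_stats.py).
    # x: (time, channel, lat, lon); global_means / global_stds: (1, channel, 1, 1)
    return (x - global_means) / global_stds

def anomaly(x, time_means):
    # Anomaly used for the ACC metric: subtract the long-term per-pixel mean,
    # with no rescaling by the std. time_means: (1, channel, lat, lon)
    return x - time_means

def weighted_acc(pred, true, time_means, lat_weights):
    # Anomaly correlation on long-term-mean-subtracted fields (my reading of the
    # appendix quote above); lat_weights stands in for the grid weights L(m, n)
    # and is assumed to be precomputed and broadcastable over (lat, lon).
    p = anomaly(pred, time_means)
    t = anomaly(true, time_means)
    num = np.sum(lat_weights * p * t)
    den = np.sqrt(np.sum(lat_weights * p**2) * np.sum(lat_weights * t**2))
    return num / den
```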