In `examples/weather/dataset_download/start_mirror.py`, the `global_means.npy` and `global_stds.npy` files (used later for normalization) are computed over the entire dataset rather than over the training set only, which leaks statistics from the validation and test years into the normalization.
Current implementation
```python
if cfg.compute_mean_std:
    stats_path = os.path.join(cfg.hdf5_store_path, "stats")
    print(f"Saving global mean and std at {stats_path}")
    if not os.path.exists(stats_path):
        os.makedirs(stats_path)
    era5_mean = np.array(
        era5_xarray.mean(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
    )
    era5_std = np.array(
        era5_xarray.std(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
    )
    print(f"Finished saving global mean and std at {stats_path}")
```
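For context, here is a minimal sketch of how stats saved in this `(1, C, 1, 1)` layout are typically consumed downstream. The loading code is not part of this issue, so the paths and arrays below are made up for illustration; the point is that the shape broadcasts against batches of shape `(N, C, H, W)`.

```python
import os
import tempfile

import numpy as np

# Stand-in stats files (hypothetical values, 3 channels), written the same
# way the script does: per-channel stats reshaped to (1, C, 1, 1).
stats_path = tempfile.mkdtemp()
np.save(os.path.join(stats_path, "global_means.npy"),
        np.array([1.0, 2.0, 3.0]).reshape(1, -1, 1, 1))
np.save(os.path.join(stats_path, "global_stds.npy"),
        np.array([0.5, 1.0, 2.0]).reshape(1, -1, 1, 1))

means = np.load(os.path.join(stats_path, "global_means.npy"))
stds = np.load(os.path.join(stats_path, "global_stds.npy"))

batch = np.ones((4, 3, 8, 16))       # dummy (N, C, H, W) batch
normalized = (batch - means) / stds  # broadcasts over N, H, W
print(normalized.shape)
```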
Proposed modification
```python
if cfg.compute_mean_std:
    # Compute stats only on training data
    train_era5_xarray = era5_xarray.sel(
        time=era5_xarray.time.dt.year.isin(train_years)
    )
    stats_path = os.path.join(cfg.hdf5_store_path, "stats")
    print(f"Saving global mean and std at {stats_path}")
    if not os.path.exists(stats_path):
        os.makedirs(stats_path)
    era5_mean = np.array(
        train_era5_xarray.mean(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
    )
    era5_std = np.array(
        train_era5_xarray.std(dim=("time", "latitude", "longitude")).values
    )
    np.save(
        os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
    )
    print(f"Finished saving global mean and std at {stats_path}")
```
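The effect of the proposed year-based selection can be illustrated with plain NumPy standing in for the xarray `.sel(time=...dt.year.isin(train_years))` call. The data, years, and `train_years` list below are toy values, not taken from the script; the sketch shows that stats computed on the training subset normalize that subset to roughly zero mean and unit standard deviation, without touching held-out years.

```python
import numpy as np

# Toy stand-in for the ERA5 array: shape (time, channel, latitude, longitude),
# six "years" with one timestep each.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(6, 3, 4, 8))
years = np.array([2015, 2016, 2017, 2018, 2019, 2020])
train_years = [2015, 2016, 2017, 2018]  # hold out 2019-2020

# Select training timesteps only, mirroring the xarray .sel(...isin(...)) call.
train_mask = np.isin(years, train_years)
train_data = data[train_mask]

# Per-channel stats over (time, latitude, longitude), reshaped to (1, C, 1, 1)
# so they broadcast against (N, C, H, W) batches, as in the script.
train_mean = train_data.mean(axis=(0, 2, 3)).reshape(1, -1, 1, 1)
train_std = train_data.std(axis=(0, 2, 3)).reshape(1, -1, 1, 1)

# Normalizing the training split with its own stats gives ~zero mean, unit std.
normalized = (train_data - train_mean) / train_std
print(normalized.mean(), normalized.std())
```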
Version
0.6.0
On which installation method(s) does this occur?
Docker
Minimum reproducible example
No response
Relevant log output
No response
Environment details