NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: Compute dataset statistics on training data #606

Open albertocarpentieri opened 3 months ago

albertocarpentieri commented 3 months ago

Version

0.6.0

On which installation method(s) does this occur?

Docker

Describe the issue

In examples/weather/dataset_download/start_mirror.py, the global_means and global_stds files (used later for normalization) are computed over the entire dataset rather than over the training set only, so statistics from held-out data leak into the normalization.

Current implementation

    if cfg.compute_mean_std:
        stats_path = os.path.join(cfg.hdf5_store_path, "stats")
        print(f"Saving global mean and std at {stats_path}")
        if not os.path.exists(stats_path):
            os.makedirs(stats_path)
        era5_mean = np.array(
            era5_xarray.mean(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
        )
        era5_std = np.array(
            era5_xarray.std(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
        )
        print(f"Finished saving global mean and std at {stats_path}")

Proposed modification

    if cfg.compute_mean_std:
        # Compute stats only on training data.
        # train_years is assumed to be the list of training years defined earlier in the script.
        train_era5_xarray = era5_xarray.sel(
            time=era5_xarray.time.dt.year.isin(train_years)
        )
        stats_path = os.path.join(cfg.hdf5_store_path, "stats")
        print(f"Saving global mean and std at {stats_path}")
        if not os.path.exists(stats_path):
            os.makedirs(stats_path)
        era5_mean = np.array(
            train_era5_xarray.mean(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_means.npy"), era5_mean.reshape(1, -1, 1, 1)
        )
        era5_std = np.array(
            train_era5_xarray.std(dim=("time", "latitude", "longitude")).values
        )
        np.save(
            os.path.join(stats_path, "global_stds.npy"), era5_std.reshape(1, -1, 1, 1)
        )
        print(f"Finished saving global mean and std at {stats_path}")
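
To illustrate the effect of the proposed change without needing the ERA5 data, here is a minimal NumPy-only sketch. It is not the code from start_mirror.py: the function name channel_stats, the synthetic array, and the year split are all made up for illustration, and np.isin on a plain year array stands in for the xarray selection above. It computes per-channel statistics from the training years only and reshapes them to (1, C, 1, 1) as in the snippets above.

```python
import numpy as np


def channel_stats(data, years, train_years):
    """Per-channel mean and std computed from training samples only.

    data:        array of shape (time, channel, lat, lon)
    years:       array of shape (time,) giving the year of each sample
    train_years: iterable of years that belong to the training split
    (All names here are hypothetical, chosen for this sketch.)
    """
    train_mask = np.isin(years, list(train_years))
    if not train_mask.any():
        raise ValueError("no samples fall inside train_years")
    train = data[train_mask]
    # Reduce over time, latitude and longitude; keep one value per channel.
    mean = train.mean(axis=(0, 2, 3)).reshape(1, -1, 1, 1)
    std = train.std(axis=(0, 2, 3)).reshape(1, -1, 1, 1)
    return mean, std


# Synthetic stand-in for the dataset: 4 years, 10 samples per year, 3 channels.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2000, 2004), 10)
data = rng.normal(size=(40, 3, 4, 8))
data[years == 2003] += 5.0  # shift the held-out year so the leak is visible

train_years = [2000, 2001, 2002]
train_mean, train_std = channel_stats(data, years, train_years)

# Statistics over the full dataset, as in the current implementation.
full_mean = data.mean(axis=(0, 2, 3)).reshape(1, -1, 1, 1)

print("train-only mean:", train_mean.ravel())
print("full-data mean: ", full_mean.ravel())
```

In this toy setup the full-data mean is pulled toward the shifted held-out year, while the train-only mean is not. How large the difference is on the real data would have to be measured.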

Minimum reproducible example

No response

Relevant log output

No response

Environment details

Modulus Docker container version 24.04

mnabian commented 4 days ago

@loliverhennigh is the proposed modification acceptable to you?