🐛[BUG]: Coord ValueError when running run.ensemble() with SFNO

hauke-dttl commented 2 weeks ago

Version

0.1.0

On which installation method(s) does this occur?

Pip

Describe the issue

[Screenshot: SFNO_value_error]

Steps leading to ValueError

Importing makani and earth2studio, both at version 0.1.0.
Loading SFNO with model_sfno = SFNO.load_model(SFNO.load_default_package()), which by default loads the checkpoint "ngc://models/nvidia/modulus/sfno_73ch_small@0.1.0".

Running run.ensemble():

io = ensemble(
    [start_date],
    60,
    2,
    model_sfno,
    CDS(),
    ZarrBackend(file_name=output_file),
    Zero(),
    batch_size=4,
    output_coords={
        "lat": np.arange(0.0, 50.0, 0.25),
        "lon": np.arange(250.0, 345.0, 0.25),
        "variable": np.array(["msl", "u10m", "v10m", "t2m"]),
    },
)

Notes

Running with FCN imported as model_fcn = FCN.load_model(FCN.load_default_package()) executes successfully.
Checking the input and output coords after import shows that "lead_time" is at index position 1 for both models.
During handshake_dim for SFNO, however, it is at index 2 while it is expected at position 1 (cf. the ValueError above).

I suspect that there is either

an issue with coords handling, or
a mismatch between dependencies/checkpoints (makani, earth2studio).
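
The index check described in the notes above can be sketched in plain Python. This is only an illustration of a position check on an ordered coords mapping, not the earth2studio implementation: `dim_index` and `check_dim` are hypothetical helpers, and the dimension names and orderings in the two example mappings are made up to reproduce the "index 1 versus index 2" situation.

```python
from collections import OrderedDict


def dim_index(coords, dim):
    """Return the position of `dim` among the keys of an ordered coords mapping."""
    return list(coords).index(dim)


def check_dim(coords, dim, expected_index):
    """Hypothetical stand-in for a handshake check: raise if `dim` is misplaced."""
    actual = dim_index(coords, dim)
    if actual != expected_index:
        raise ValueError(
            f"Expected dimension {dim!r} at index {expected_index}, found it at {actual}"
        )


# Illustrative orderings only (assumptions, not taken from the library).
as_loaded = OrderedDict(batch=[0], lead_time=[0], variable=["t2m"], lat=[0.0], lon=[0.0])
in_workflow = OrderedDict(
    ensemble=[0], time=[0], lead_time=[0], variable=["t2m"], lat=[0.0], lon=[0.0]
)

print(dim_index(as_loaded, "lead_time"))    # 1
print(dim_index(in_workflow, "lead_time"))  # 2
check_dim(as_loaded, "lead_time", 1)        # passes silently
```

Printing the index for both models right before the failing call makes it easier to see whether an extra leading dimension was added by the workflow.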

NickGeneva commented 2 weeks ago

Hi @hauke-dttl

Thanks for the report. Indeed, this appears to be a bug in SFNO in 0.1.0. It does not seem to support batching, which unfortunately is required for running the ensemble workflow.

Try the 0.2.0 rc install or the main branch (see installing from source for details). I was able to run the following script without problems on the main branch. We are currently preparing version 0.2.0 with the fix.

import numpy as np
from datetime import datetime

from earth2studio.models.px import SFNO
from earth2studio.data import ARCO
from earth2studio.io import ZarrBackend
from earth2studio.perturbation import Zero
from earth2studio.run import ensemble

model_sfno = SFNO.load_model(SFNO.load_default_package())
output_file="output.zarr"
start_date = datetime(2022, 1, 1, 12)

io = ensemble(
    [start_date],
    4,
    2,
    model_sfno,
    ARCO(),
    ZarrBackend(file_name=output_file),
    Zero(),
    batch_size=4,
    output_coords= {
        "lat": np.arange(0.0, 50.0, 0.25),
        "lon": np.arange(250.0, 345.0, 0.25),
        "variable": np.array(["msl", "u10m", "v10m","t2m"])
    },
)

print(io.root.tree())

Produces:

....
Fetching ARCO for 2022-01-01 12:00:00: 100%|████████████████████████████████████████████████████████| 73/73 [02:52<00:00,  2.36s/it]
2024-07-08 19:07:45.666 | SUCCESS  | earth2studio.run:ensemble:315 - Fetched data from ARCO
2024-07-08 19:07:45.872 | INFO     | earth2studio.run:ensemble:337 - Starting 2 Member Ensemble Inference with             1 number of batches.
Total Ensemble Batches: 100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.76s/it]
2024-07-08 19:07:49.634 | SUCCESS  | earth2studio.run:ensemble:382 - Inference complete                                             
/
 ├── ensemble (2,) int64
 ├── lat (200,) float64
 ├── lead_time (5,) timedelta64[h]
 ├── lon (380,) float64
 ├── msl (2, 1, 5, 200, 380) float32
 ├── t2m (2, 1, 5, 200, 380) float32
 ├── time (1,) datetime64[ns]
 ├── u10m (2, 1, 5, 200, 380) float32
 └── v10m (2, 1, 5, 200, 380) float32
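
The array shapes in the tree above follow directly from the arguments passed to ensemble(). A small worked check, assuming the dimension order is (ensemble, time, lead_time, lat, lon) as the printed shapes suggest; the variable names here are chosen for illustration only:

```python
import numpy as np

# Same coordinate arrays as in the script above.
lat = np.arange(0.0, 50.0, 0.25)
lon = np.arange(250.0, 345.0, 0.25)

nsteps = 4       # second positional argument to ensemble()
nensemble = 2    # third positional argument to ensemble()
n_times = 1      # one start date in [start_date]

# The log shows lead_time (5,): the initial state plus nsteps forecast steps.
n_lead = nsteps + 1

expected_shape = (nensemble, n_times, n_lead, lat.size, lon.size)
print(expected_shape)  # (2, 1, 5, 200, 380)
```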

Note that the coords API changes to functions in the upcoming version. To get the model's coords when using 0.2.0 or a source install, use:

model_sfno.input_coords()
model_sfno.output_coords(model_sfno.input_coords())
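
For scripts that need to run against both the 0.1.0 pip release and 0.2.0 or source, one option is a small compatibility shim. This is a sketch that assumes 0.1.0 exposes the coords as plain attributes, which the thread implies but does not show. `get_input_coords`, `get_output_coords`, and the two stand-in model classes are hypothetical and exist only to make the example runnable without earth2studio.

```python
from collections import OrderedDict


def get_input_coords(model):
    """Return input coords whether `input_coords` is an attribute or a method."""
    attr = model.input_coords
    return attr() if callable(attr) else attr


def get_output_coords(model):
    """Return output coords; the function form takes the input coords as argument."""
    attr = model.output_coords
    return attr(get_input_coords(model)) if callable(attr) else attr


class AttributeStyleModel:
    """Stand-in for a model that stores coords as attributes (assumed 0.1.0 style)."""
    input_coords = OrderedDict(lead_time=[0], variable=["t2m"])
    output_coords = OrderedDict(lead_time=[6], variable=["t2m"])


class FunctionStyleModel:
    """Stand-in for a model that exposes coords as functions (0.2.0 style)."""

    def input_coords(self):
        return OrderedDict(lead_time=[0], variable=["t2m"])

    def output_coords(self, input_coords):
        out = OrderedDict(input_coords)
        out["lead_time"] = [6]
        return out


for model in (AttributeStyleModel(), FunctionStyleModel()):
    print(list(get_input_coords(model)), get_output_coords(model)["lead_time"])
```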

pavel-dltt commented 2 weeks ago

Hi @NickGeneva, this may be related to this issue, so I'm asking here.

Will the perturbation class be defined as Perturbation or PerturbationMethod in the latest version? I'm asking because it is currently one of the confusing points when running from pip rather than from the latest source.

NickGeneva commented 2 weeks ago

Yes, that was another API change. Apologies for the breaking changes; we are ironing out some bigger design issues after our first release as we integrate the package into other systems. Breaking changes should be much less frequent after 0.2.0, or will come with a transition period for people.

For the current list of changes, you can check the change log. We'll have a clear summary of API breaks with the 0.2.0 release.

Make sure the version of the docs you are referencing matches your installed version. The examples on the main/0.2.0 version of the docs include these updates, and many of them are only implemented for 0.2.0-rc and later.

https://github.com/NVIDIA/earth2studio/blob/main/CHANGELOG.md#changed-1
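
A quick way to see which version is actually installed, and so which docs version to read, is to query the package metadata from the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, and the distribution names "earth2studio" and "makani" are assumed to match the names the packages were installed under.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to your environment.
for name in ("earth2studio", "makani"):
    print(name, installed_version(name) or "not installed")
```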

hauke-dttl commented 1 week ago

Hi @NickGeneva,

first, thank you for your super quick and engaging support! Building from source seems to work, since the coords issue is no longer raised. We are working on an AWS workstation with an A10G (24 GB) and are now facing a CUDA memory error. The way I read the run.ensemble() code, the full data is retrieved from the DataSource and then copied to the device. Since there is no multi-GPU support, I would need an EC2 instance with ~48 GB of memory, which the research team referenced in the paper, right?

NickGeneva commented 1 week ago

Hi @hauke-dttl

SFNO has a pretty large memory footprint and requires more GPU memory to run than some of the lighter models in the package. This is partially due to needing 73 variables, but also because the model itself is large. I believe other users have hit a similar memory issue with SFNO on a 24 GB A10, even with a batch size of 1, but please confirm that you are using a batch size of 1. You could potentially look at using other models such as DLWP or AFNO.

For reference, the examples are run on a 32 GB A100, so I would suggest a card with at least 32 GB of GPU memory.
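
As a rough back-of-the-envelope check on how much of that footprint is the input data itself, the size of one initial-condition tensor can be estimated. This sketch assumes a global 0.25 degree grid of 721 x 1440 points and float32 storage, neither of which is stated in this thread; only the 73 variables come from the discussion above. `tensor_gib` is a hypothetical helper.

```python
def tensor_gib(shape, bytes_per_element=4):
    """Size in GiB of a dense tensor with the given shape (float32 by default)."""
    n_elements = 1
    for size in shape:
        n_elements *= size
    return n_elements * bytes_per_element / 1024**3


# Assumed grid: 721 x 1440 (0.25 degree global); 73 variables as mentioned above.
per_member = tensor_gib((73, 721, 1440))
batch_of_four = tensor_gib((4, 73, 721, 1440))

print(f"one member: {per_member:.2f} GiB, batch of 4: {batch_of_four:.2f} GiB")
```

Under these assumptions a single state is well below 1 GiB, which is consistent with the point that the model itself, rather than the fetched data, accounts for much of the memory use.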

NickGeneva commented 1 week ago

Closing this issue since it seems the original coordinate issue was resolved. Feel free to reopen or open a new issue.

pavel-dltt commented 1 week ago

Hi @NickGeneva, I can't reopen the issue, but I would if I could.

Running SFNO with the Zarr backend, we see zeros at every step after the initial conditions.

Visually, it looks like this:

Checking the content gives:

We checked things up to ensemble() and are wondering what the problem could be. If you checked the content of the run you shared before, can you confirm it was filled with non-zero values?
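
A check of this kind can be sketched with NumPy on an array laid out as (ensemble, time, lead_time, lat, lon), the order seen in the tree output earlier in the thread. `all_zero_steps` is a hypothetical helper and the demo array is synthetic; with real output, the array would first be read from the Zarr store (for example with `zarr.open`, assuming the store path and variable name from the script above).

```python
import numpy as np


def all_zero_steps(field, lead_axis=2):
    """Return the indices along `lead_axis` whose slices contain only zeros."""
    other_axes = tuple(ax for ax in range(field.ndim) if ax != lead_axis)
    nonzero_counts = np.count_nonzero(field, axis=other_axes)
    return np.flatnonzero(nonzero_counts == 0).tolist()


# Synthetic array mimicking the reported symptom: only the initial step has data.
demo = np.zeros((2, 1, 5, 4, 6), dtype=np.float32)
demo[:, :, 0] = 1.0

print(all_zero_steps(demo))  # [1, 2, 3, 4]
```

An empty list would mean every lead time contains at least one non-zero value.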

hauke-dttl commented 6 days ago

@NickGeneva To keep you up to date: this issue was found on two different machines where we built a conda environment and installed earth2studio and makani from source. When running the container nvcr.io/nvidia/modulus/modulus:24.04 and installing earth2studio and makani from source, the bug of zero values after the first step did not reproduce.

NickGeneva commented 6 days ago

Interesting, @hauke-dttl, thanks for following up on that. We'll need to keep an eye on conda installs in the future. Appreciate the info, glad it's fixed.
