Closed: hauke-dttl closed this issue 6 days ago
Hi @hauke-dttl
Thanks for the report. This does appear to be a bug in SFNO in 0.1.0: it does not seem to support batching, which unfortunately is required for running the ensemble workflow.
Try the 0.2.0 rc install or the main branch (see installing from source for details). I was able to run the following script without problems on the main branch. We are currently preparing the 0.2.0 release with the fix.
from datetime import datetime

import numpy as np

from earth2studio.data import ARCO
from earth2studio.io import ZarrBackend
from earth2studio.models.px import SFNO
from earth2studio.perturbation import Zero
from earth2studio.run import ensemble

model_sfno = SFNO.load_model(SFNO.load_default_package())
output_file = "output.zarr"
start_date = datetime(2022, 1, 1, 12)

io = ensemble(
    [start_date],
    4,  # number of forecast steps
    2,  # number of ensemble members
    model_sfno,
    ARCO(),
    ZarrBackend(file_name=output_file),
    Zero(),
    batch_size=4,
    # Only write a regional subset of four surface variables
    output_coords={
        "lat": np.arange(0.0, 50.0, 0.25),
        "lon": np.arange(250.0, 345.0, 0.25),
        "variable": np.array(["msl", "u10m", "v10m", "t2m"]),
    },
)
print(io.root.tree())
Produces:
....
Fetching ARCO for 2022-01-01 12:00:00: 100%|████████████████████████████████████████████████████████| 73/73 [02:52<00:00, 2.36s/it]
2024-07-08 19:07:45.666 | SUCCESS | earth2studio.run:ensemble:315 - Fetched data from ARCO
2024-07-08 19:07:45.872 | INFO | earth2studio.run:ensemble:337 - Starting 2 Member Ensemble Inference with 1 number of batches.
Total Ensemble Batches: 100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.76s/it]
2024-07-08 19:07:49.634 | SUCCESS | earth2studio.run:ensemble:382 - Inference complete
/
├── ensemble (2,) int64
├── lat (200,) float64
├── lead_time (5,) timedelta64[h]
├── lon (380,) float64
├── msl (2, 1, 5, 200, 380) float32
├── t2m (2, 1, 5, 200, 380) float32
├── time (1,) datetime64[ns]
├── u10m (2, 1, 5, 200, 380) float32
└── v10m (2, 1, 5, 200, 380) float32
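As a quick sanity check, the array shapes in the tree above can be reproduced from the arguments passed to ensemble(). This is only a sketch: the batch-count rule (batch size capped at the ensemble size, then rounded up) is an assumption chosen to match the "1 number of batches" log line, not something confirmed from the library source.

```python
import math

import numpy as np

# Values passed to ensemble() in the script above
nsteps, nensemble, batch_size = 4, 2, 4

lat = np.arange(0.0, 50.0, 0.25)
lon = np.arange(250.0, 345.0, 0.25)

# lead_time holds the initial condition plus one entry per forecast step
n_lead = nsteps + 1

# Assumption: the batch size is capped at the ensemble size before splitting
n_batches = math.ceil(nensemble / min(batch_size, nensemble))

# (ensemble, time, lead_time, lat, lon), with a single start time
expected_shape = (nensemble, 1, n_lead, lat.size, lon.size)
print(expected_shape, n_batches)
```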
Note that the API for coords has changed to functions in this upcoming version. To get the model's coords when using 0.2.0 or an install from source, use:
model_sfno.input_coords()
model_sfno.output_coords(model_sfno.input_coords())
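For scripts that need to run against both versions, one option is a small helper that accepts either form. This is a sketch, assuming the 0.1.0 coords were plain (non-callable) attributes; get_coords and the two stand-in model classes are hypothetical names for illustration and are not part of earth2studio.

```python
def get_coords(coords_attr, *args):
    """Return coords whether coords_attr is a plain value or a method.

    Hypothetical helper: calls coords_attr(*args) when it is callable
    (the function-style API), otherwise returns it unchanged.
    """
    return coords_attr(*args) if callable(coords_attr) else coords_attr


class OldStyleModel:
    """Stand-in for a model exposing coords as an attribute (assumed 0.1.0 style)."""

    input_coords = {"variable": ["t2m"]}


class NewStyleModel:
    """Stand-in for a model exposing coords as a function (0.2.0 style)."""

    def input_coords(self):
        return {"variable": ["t2m"]}


old = get_coords(OldStyleModel().input_coords)
new = get_coords(NewStyleModel().input_coords)
print(old == new)
```

With a real model, the same helper would be used as get_coords(model_sfno.input_coords) and get_coords(model_sfno.output_coords, in_coords), where in_coords is the result of the first call.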
Hi @NickGeneva, this may be related to this issue, so I'm asking here.
Will the perturbation class be named Perturbation or PerturbationMethod in the latest version? I'm asking because this is currently another source of confusion when running from pip rather than the latest sources.
Yes, that was another API change. Apologies for the breaking changes; we are ironing out some bigger design issues after our first release as we integrate the package into other systems. They should be much less frequent after 0.2.0, with a transition period for people.
For the current list of changes, you can check the change log. We'll have a clear API break summary upon the 0.2.0 release.
Make sure the version of the docs you are referencing matches your installed version. The examples in the main/0.2.0 version of the docs include these updates. Many of the examples are only implemented for 0.2.0-rc and later.
https://github.com/NVIDIA/earth2studio/blob/main/CHANGELOG.md#changed-1
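To see which version is actually installed before picking a docs version, the standard library can report it. A minimal sketch; installed_version is a hypothetical helper name, and it assumes the package was installed under the distribution name "earth2studio".

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumption: the distribution is registered as "earth2studio"
print("earth2studio:", installed_version("earth2studio"))
```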
Hi @NickGeneva,
First, thank you for your super quick and engaging support! Building from source seems to work, since the coords issue is no longer raised. We are working on an AWS workstation with an A10G (24 GB) and are now facing a CUDA memory error. The way I read the run.ensemble() code, you retrieve the full data from the DataSource, which is then copied to the device. Since there is no multi-GPU support, I would need an EC2 instance with ~48 GB of memory, which the research team referenced in the paper, right?
Hi @hauke-dttl
SFNO has a fairly large memory footprint and needs more GPU memory than some of the lighter models in the package. This is partly because it needs 73 variables, but also because the model itself is large. I believe other users have hit a similar memory issue with SFNO on a 24 GB A10, even with a batch size of 1, but please confirm you are using a batch size of 1. You could potentially look at other models such as DLWP or AFNO.
For reference, the examples are run on a 32 GB A100, so I would suggest a card with at least 32 GB of GPU memory.
Closing this issue since it seems the original coordinate issue was resolved. Feel free to reopen or open a new issue.
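For a rough sense of how much of that footprint the input data accounts for, the size of one float32 input state can be estimated by hand. This is back-of-the-envelope arithmetic only: the 721 x 1440 grid is an assumption (a 0.25 degree global grid), and the estimate does not cover model weights or intermediate activations.

```python
# 73 variables are from the thread; the 721 x 1440 grid is an assumption
n_var, n_lat, n_lon = 73, 721, 1440
bytes_per_value = 4  # float32


def state_gib(batch):
    """Size in GiB of `batch` input states; excludes weights and activations."""
    return batch * n_var * n_lat * n_lon * bytes_per_value / 1024**3


for batch in (1, 2, 4):
    print(f"batch {batch}: {state_gib(batch):.2f} GiB")
```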
Hi @NickGeneva, I can't reopen the issue, but would if I could.
Running SFNO with the Zarr backend, we see zeros at every step after the initial conditions.
Visually looks like this:
Checking the content gives:
We checked everything up to ensemble() and are wondering what the problem could be. If you checked the content of the run you shared before, can you confirm it was filled with non-zero values?
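A check along these lines can be sketched with NumPy. The helper name all_zero_leads is hypothetical, and the axis order (ensemble, time, lead_time, lat, lon) is taken from the tree output earlier in the thread; the demo array is synthetic and only mimics the symptom.

```python
import numpy as np


def all_zero_leads(field, lead_axis=2):
    """Return the indices along lead_axis whose slices contain only zeros."""
    other_axes = tuple(ax for ax in range(field.ndim) if ax != lead_axis)
    # True for each lead index where no element differs from zero
    is_zero = ~np.any(field != 0, axis=other_axes)
    return np.flatnonzero(is_zero).tolist()


# Synthetic data: only the initial condition (lead index 0) is populated
demo = np.zeros((2, 1, 5, 4, 8), dtype=np.float32)
demo[:, :, 0] = 1.0
print(all_zero_leads(demo))
```

Applied to the real output, something like all_zero_leads(io.root["t2m"][:]) should return an empty list for a healthy run, assuming io.root is the Zarr group printed by the script.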
@NickGeneva To keep you up to date: this issue was found on two different machines where we built a conda environment and installed earth2studio and makani from source. When running the container nvcr.io/nvidia/modulus/modulus:24.04 and installing earth2studio and makani from source, the bug of zero values after the first step did not reproduce.
Interesting @hauke-dttl, thanks for following up on that. We'll need to keep an eye on conda installs in the future. Appreciate the info, glad it's fixed.
Version
0.1.0
On which installation method(s) does this occur?
Pip
Describe the issue
Steps leading to ValueError
model_sfno = SFNO.load_model(SFNO.load_default_package())
which loads the following checkpoint by default: "ngc://models/nvidia/modulus/sfno_73ch_small@0.1.0".
Notes
model_fcn = FCN.load_model(FCN.load_default_package())
executes successfully.
handshake_dim
of SFNO it is at index 2 while expected at position 1 (compare the ValueError above). I suspect that there is either