NVIDIA / earth2mip

Earth-2 Model Intercomparison Project (MIP) is a python framework that enables climate researchers and scientists to inter-compare AI models for weather and climate.
https://nvidia.github.io/earth2mip/
Apache License 2.0
187 stars 41 forks

🐛[BUG]: unrecognized input to lagged ensembles #143

Closed yairchn closed 9 months ago

yairchn commented 9 months ago

Version

source - main

On which installation method(s) does this occur?

Pip

Describe the issue

Following the instructions in lagged ensembles main, running:

torchrun --nproc_per_node 2 --nnodes 1 -m earth2mip.lagged_ensembles --model sfno_73ch --inits 10 --leads 5 --lags 4

produces the following error:

usage: Run a lagged ensemble scoring

    Can be run against either a fcn model (--model), a forecast directory as
    output by earth2mip.time_collection (--forecast_dir), persistence forecast
    (--persistence), or deterministic IFS (--ifs).

    Saves data as csv files (1 per rank).

    Examples:

        torchrun --nproc_per_node 2 --nnodes 1 -m earth2mip.lagged_ensembles --model sfno_73ch --inits 10 --leads 5 --lags 4

__main__.py: error: unrecognized arguments: --inits 10
usage: Run a lagged ensemble scoring

    Can be run against either a fcn model (--model), a forecast directory as
    output by earth2mip.time_collection (--forecast_dir), persistence forecast
    (--persistence), or deterministic IFS (--ifs).

    Saves data as csv files (1 per rank).

    Examples:

        torchrun --nproc_per_node 2 --nnodes 1 -m earth2mip.lagged_ensembles --model sfno_73ch --inits 10 --leads 5 --lags 4

__main__.py: error: unrecognized arguments: --inits 10
[2023-12-06 14:26:24,496] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 922229) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

It seems that `--inits` is no longer an argument in `parse_args`. I think @nbren12 might have decided to make it a fixed number rather than a user-chosen input.

Environment details

Running on a Selene interactive session with `gitlab-master.nvidia.com/earth-2/fcn-mip:latest`

nbren12 commented 9 months ago

Try running `--help`. These args are more flexible now: https://github.com/NVIDIA/earth2mip/blob/a17fd31ae15b83a052c57c88eb30a153d2995415/earth2mip/_cli_utils.py#L25

nbren12 commented 9 months ago

Closing since the --inits flag is replaced by --start-time and --end-time.
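
For anyone hitting the same message, here is a minimal sketch of why the old command fails. It uses plain `argparse` and is not the actual earth2mip parser: the program name `lagged_ensembles_sketch` and the helpers `build_parser`, `unrecognized` and `exit_code` are made up for illustration, and only the flag names come from this thread. The value format accepted by `--start-time` and `--end-time` is not stated here, so the sketch treats them as plain strings; check `--help` for the real definitions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real CLI. Only the flag names
    # (--model, --start-time, --end-time, --leads, --lags) come from the thread.
    parser = argparse.ArgumentParser(prog="lagged_ensembles_sketch")
    parser.add_argument("--model")
    parser.add_argument("--start-time")  # assumption: kept as a plain string
    parser.add_argument("--end-time")    # assumption: kept as a plain string
    parser.add_argument("--leads", type=int)
    parser.add_argument("--lags", type=int)
    return parser


def unrecognized(argv: list[str]) -> list[str]:
    """Return the arguments the parser does not know, without exiting."""
    _, extras = build_parser().parse_known_args(argv)
    return extras


def exit_code(argv: list[str]) -> int:
    """Return the exit status parse_args would produce (0 if parsing succeeds)."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        return int(exc.code)
    return 0


# The arguments from the failing command, minus the torchrun part.
old_style = ["--model", "sfno_73ch", "--inits", "10", "--leads", "5", "--lags", "4"]

print(unrecognized(old_style))  # ['--inits', '10']
print(exit_code(old_style))     # 2
```

The exit status 2 is what `argparse` uses for usage errors, which matches the `exitcode: 2` in the torchrun log above. The `ChildFailedError` is torchrun reporting that the worker processes exited with that status, not a separate failure.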