NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: training regression breaks #529

Closed yairchn closed 3 months ago

yairchn commented 4 months ago

Version

0.6.0

On which installation method(s) does this occur?

Pip

Describe the issue

When trying to train a regression model in CorrDiff, I get:

TypeError: got an unexpected keyword argument 'checkpoint_level'
srun: error: eos0224: task 6: Exited with exit code 1

Minimum reproducible example

Train a regression model with the default config, after specifying the required path in the config.

I am using this commit:

commit ea02af2aeb6c2d498c8734b42932ac794ce20351 (HEAD -> main, origin/main, origin/HEAD)
Author: Mohammad Amin Nabian <m.a.nabiyan@gmail.com>
Date:   Thu May 23 10:25:02 2024 -0700

    Support history > 0 in the ERA5 HDF5 datapipe (#518)

    * add history

    * address review comments

    ---------

    Co-authored-by: root <root@eos0414.eos.clusters.nvidia.com>
    Co-authored-by: root <root@eos0546.eos.clusters.nvidia.com>
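
For readers without access to the cluster, the failure mode can be illustrated in isolation. In the traceback, the `TypeError` is raised from `inspect`'s `bind_partial`, called in `modulus/models/module.py`. The sketch below is a minimal stand-in, not Modulus code: `HypotheticalUNet` and `try_bind` are made-up names, and the constructor arguments are assumptions chosen only so that `checkpoint_level` is absent from the signature.

```python
import inspect


class HypotheticalUNet:
    """Stand-in model (not a Modulus class); its constructor does not declare `checkpoint_level`."""

    def __init__(self, img_resolution, in_channels, out_channels):
        self.img_resolution = img_resolution
        self.in_channels = in_channels
        self.out_channels = out_channels


def try_bind(cls, **kwargs):
    """Partially bind kwargs against cls.__init__; return the error text, or None on success."""
    sig = inspect.signature(cls.__init__)
    try:
        # The leading None stands in for `self`.
        sig.bind_partial(None, **kwargs)
    except TypeError as exc:
        return str(exc)
    return None


# Declared keyword arguments bind without error, even when some are missing,
# because bind_partial tolerates omitted parameters.
accepted = try_bind(HypotheticalUNet, img_resolution=64, in_channels=4)

# An undeclared keyword argument is rejected with a TypeError.
rejected = try_bind(HypotheticalUNet, img_resolution=64, checkpoint_level=0)

print(accepted)
print(rejected)
```

This only shows that a keyword argument missing from the target signature is enough to produce this kind of error; it does not identify which Modulus class is missing the parameter.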

### Relevant log output

```shell
Traceback (most recent call last):
  File "/home/modulus/examples/generative/corrdiff/train.py", line 351, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/modulus/examples/generative/corrdiff/train.py", line 343, in main
    training_loop.training_loop(
  File "/home/modulus/examples/generative/corrdiff/training/training_loop.py", line 166, in training_loop
    net = construct_class_by_name(**merged_args)  # subclass of torch.nn.Module
  File "/usr/local/lib/python3.10/dist-packages/modulus/utils/generative/utils.py", line 306, in construct_class_by_name
    return call_func_by_name(*args, func_name=class_name, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modulus/utils/generative/utils.py", line 296, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modulus/models/diffusion/unet.py", line 111, in __init__
    self.model = model_class(
  File "/usr/local/lib/python3.10/dist-packages/modulus/models/module.py", line 65, in __new__
    bound_args = sig.bind_partial(
  File "/usr/lib/python3.10/inspect.py", line 3193, in bind_partial
    return self._bind(args, kwargs, partial=True)
  File "/usr/lib/python3.10/inspect.py", line 3175, in _bind
    raise TypeError(
TypeError: got an unexpected keyword argument 'checkpoint_level'
srun: error: eos0224: task 6: Exited with exit code 1
```

### Environment details

```shell
--container-image=/lustre/fsw/coreai_climate_earth2/mnabian/cont_modulus.sqsh
```

mnabian commented 4 months ago

@daviddpruitt is working on a fix for this.

yairchn commented 4 months ago

> @daviddpruitt is working on a fix for this.

@mnabian is there an older commit before this change you would recommend using until this is fixed?
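
As a possible stopgap while waiting for a fix, keyword arguments that the target constructor does not declare could be dropped before the call. The following is only a sketch under that assumption: `filter_supported_kwargs` and `HypotheticalModel` are hypothetical names, and this is not the change made in any Modulus PR. Whether silently dropping an argument is safe depends on what that argument controls, so the dropped names should be logged and reviewed.

```python
import inspect


def filter_supported_kwargs(func, kwargs):
    """Split kwargs into (supported, dropped) according to func's signature."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all means every keyword argument is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    # Only parameters that can be passed by keyword count as supported.
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    supported, dropped = {}, {}
    for name, value in kwargs.items():
        if name in params and params[name].kind in keyword_kinds:
            supported[name] = value
        else:
            dropped[name] = value
    return supported, dropped


class HypotheticalModel:
    """Stand-in for a model class that lacks `checkpoint_level` (an assumption)."""

    def __init__(self, in_channels, out_channels):
        self.in_channels = in_channels
        self.out_channels = out_channels


config = {"in_channels": 4, "out_channels": 2, "checkpoint_level": 0}
supported, dropped = filter_supported_kwargs(HypotheticalModel, config)
if dropped:
    print(f"Dropping unsupported arguments: {sorted(dropped)}")
model = HypotheticalModel(**supported)
```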

daviddpruitt commented 3 months ago

Fixed with https://github.com/NVIDIA/modulus/pull/550