NVIDIA / modulus-sym

Framework providing pythonic APIs, algorithms and utilities to be used with Modulus core to physics inform model training as well as higher level abstraction for domain experts
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: Modulus hangs on FNO training #152

Open gioviciconte opened 4 months ago

gioviciconte commented 4 months ago

Version

1.4.0

On which installation method(s) does this occur?

Pip

Describe the issue

I adapted the FNO Darcy example to train an FNO on a shockTube example. Modulus hangs after the .solve() method is called. The only output I see is:

python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[14:13:14] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[14:13:14] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[14:13:17] - attempting to restore from: outputs/shockTube_FNO_lazy
[14:13:17] - optimizer checkpoint not found
[14:13:17] - model fno.0.pth not found

and nothing else, with no errors. The case is attached: shockTube_FNO.zip
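Since the log stops without a traceback, one way to narrow down where the run stalls is Python's standard `faulthandler` module, which can dump the stack of every thread if a call has not returned after a timeout. The following is only a minimal sketch: the helper name `run_with_hang_dump` and its parameters are hypothetical and not part of Modulus.

```python
import faulthandler
import sys


def run_with_hang_dump(fn, timeout_s=60.0, out=None):
    """Call fn(). While it is still running, dump the tracebacks of all
    threads to `out` every `timeout_s` seconds, so a silent hang shows
    which line the process is stuck on.

    Hypothetical helper for illustration; `out` must be a real file
    object (it needs a file descriptor). Defaults to sys.stderr.
    """
    if out is None:
        out = sys.stderr
    # Arm a watchdog thread that writes the stack dump after the timeout.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

In the adapted script this would wrap the call that hangs, for example `run_with_hang_dump(slv.solve)`, assuming `slv` is the name that holds the solver instance. The dumped stack then shows whether the process is waiting in data loading, in the model, or elsewhere.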

Minimum reproducible example

"The case is attached in the issue"

Relevant log output

python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[14:13:14] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[14:13:14] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[14:13:17] - attempting to restore from: outputs/shockTube_FNO_lazy
[14:13:17] - optimizer checkpoint not found
[14:13:17] - model fno.0.pth not found

Environment details

No response

Other/Misc.

No response