NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

CorrDiff: Remove extra call to training_loop #450

Closed simonbyrne closed 2 months ago

simonbyrne commented 2 months ago

Modulus Pull Request

Description

When I ran the training example I got the following error at the end:

Traceback (most recent call last):
  File "/code/modulus/examples/generative/corrdiff/train.py", line 335, in main
    training_loop.training_loop(
TypeError: training_loop() missing 3 required positional arguments: 'dataset_iterator', 'validation_dataset', and 'validation_dataset_iterator'

The problem appears to be that training_loop is called twice, with one call nested inside the other. Python evaluates the inner call first, so training runs to completion, and only then does the outer call fail with the TypeError above. That is probably why this wasn't noticed earlier.
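For illustration, here is a minimal sketch of how a nested call behaves this way. The function body and the argument values are hypothetical stand-ins, not the actual code in train.py. Only the four positional parameter names are taken from the traceback, with the first one (dataset) assumed.

```python
# Hypothetical stand-in for the real training loop. Only the parameter
# names come from the traceback; "dataset" is an assumed first parameter.
completed_runs = []


def training_loop(dataset, dataset_iterator, validation_dataset,
                  validation_dataset_iterator):
    """Pretend to train, then record that the run finished."""
    completed_runs.append(dataset)
    # Implicitly returns None, like a loop that only has side effects.


def run_nested():
    """Call training_loop nested inside itself and report the failure."""
    try:
        # Python evaluates arguments before making the outer call, so the
        # inner call runs to completion first. Its return value (None) is
        # then passed as the only argument to the outer call, which fails.
        training_loop(
            training_loop("train", "train_iter", "valid", "valid_iter")
        )
    except TypeError as exc:
        return str(exc)
    return None


error_message = run_nested()
print(completed_runs)   # the inner call finished before the error
print(error_message)    # the outer call complains about missing arguments
```

In this sketch the inner call has already done all of its work by the time the TypeError is raised, which would match an error that shows up only at the very end of a training run.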

simonbyrne commented 2 months ago

@mnabian any thoughts/comments?

mnabian commented 2 months ago

This is very surprising to me! It was added here: https://github.com/NVIDIA/modulus/commit/1658a43ead87b7466c510be6fd638919bd842bb1#diff-6561b5f6860b33223b323345d5b61da97b7a1b5877d1aa53cc3a63af12399e08L305

@jleinonen, was this change intentional?

mnabian commented 2 months ago

/blossom-ci

jleinonen commented 2 months ago

Yeah, I noticed this recently too. I don't know how this redundant call got there. I guess it was never caught since it's the last line of the script, so the error only shows up after everything else has already finished...

Anyway, LGTM!