MFlowCode / MFC

Exascale simulation of multiphase/physics fluid dynamics
https://mflowcode.github.io
MIT License

Improve error reporting for case files. #437

Closed AiredaleDev closed 1 month ago

AiredaleDev commented 1 month ago

Description

If a case file's parameter is assigned a value of the wrong type, MFC will now correctly report the mismatching field. Previously, it would point to the next parameter, even if that parameter typechecked.

Fixes #424
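For illustration, a minimal hypothetical case file with one mistyped parameter (the parameter names and values below are made up for this example, not taken from a real case in the repo; a case file roughly amounts to printing its parameter dictionary as JSON):

#!/usr/bin/env python3
# Hypothetical broken case file (made-up values): per the schema,
# 'run_time_info' must be 'T' or 'F', so the new check should point at
# 'run_time_info' itself rather than at whichever parameter follows it.
import json

print(json.dumps({
    "run_time_info": "bad param",  # wrong value: schema expects 'T' or 'F'
    "m": 49, "n": 0, "p": 0,       # plausible grid-size parameters
}))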


sbryngelson commented 1 month ago

This seems right to me, but the rightful person to review this is @henryleberre... nice job @AiredaleDev

AiredaleDev commented 1 month ago

Adding the check in the constructor significantly extends the pre-build/pre-run time for the test suite, enough for the user to think something is wrong (validating json is apparently quite expensive). For performance reasons, I may have to remove the check in the constructor, is there a way we can run fewer checks?
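As an aside, one generic jsonschema pattern that might help (I have not verified it applies to our toolchain) is building the validator object once and reusing it, since jsonschema.validate() rebuilds and rechecks the validator on every call:

# Sketch only; SCHEMA is a stand-in, not the toolchain's real case schema.
import jsonschema

SCHEMA = {"properties": {"run_time_info": {"enum": ["T", "F"]}}}

VALIDATOR = jsonschema.Draft7Validator(SCHEMA)   # compiled once, reused

def check(params: dict) -> None:
    VALIDATOR.validate(params)  # raises ValidationError naming the bad field

check({"run_time_info": "T"})       # passes
# check({"run_time_info": "oops"})  # would raise ValidationError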

AiredaleDev commented 1 month ago

This seems right to me, but the rightful person to review this is @henryleberre... nice job @AiredaleDev

Thanks! Henry and I reviewed it last night, there's only the one outstanding thing I just discovered before it should be ready to merge.

sbryngelson commented 1 month ago

Adding the check in the constructor significantly extends the pre-build time for the test suite, enough for the user to think something is wrong (validating json is apparently quite expensive). For performance reasons, I may have to remove the check in the constructor, is there a way we can run fewer checks?

Can you be a bit more precise... how long are we talking?

and I guess you mean it expands the time between initiating ./mfc.sh build and it actually starting to build? or ./mfc.sh run and it actually starting to run?

AiredaleDev commented 1 month ago

Adding the check in the constructor significantly extends the pre-build time for the test suite, enough for the user to think something is wrong (validating json is apparently quite expensive). For performance reasons, I may have to remove the check in the constructor, is there a way we can run fewer checks?

Can you be a bit more precise... how long are we talking?

and I guess you mean it expands the time between initiating ./mfc.sh build and it actually starting to build? or ./mfc.sh run and it actually starting to run?

Roughly a minute or two, though it's only for ./mfc.sh test since it generates a lot of Cases. ./mfc.sh build and ./mfc.sh run are fine because they validate at most one case file.

sbryngelson commented 1 month ago

Adding the check in the constructor significantly extends the pre-build time for the test suite, enough for the user to think something is wrong (validating json is apparently quite expensive). For performance reasons, I may have to remove the check in the constructor, is there a way we can run fewer checks?

Can you be a bit more precise... how long are we talking? and I guess you mean it expands the time between initiating ./mfc.sh build and it actually starting to build? or ./mfc.sh run and it actually starting to run?

Roughly a minute or two, though it's only for ./mfc.sh test since it generates a lot of Cases. ./mfc.sh build and ./mfc.sh run are fine because they validate at most one case file.

That is a lot. Can we switch to type checking at every test initialization rather than all at once in the beginning? That would probably fix it... distributing 1 minute over 150 tests makes it negligible. Need to be careful with the multithreading going on, through ./mfc.sh test -j X
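Something like this, sketched with placeholder classes rather than the toolchain's real Task/TestCase types: each worker validates its own case right before running it, so the cost is spread across the -j X threads instead of paid upfront.

# Placeholder sketch; none of these names are the toolchain's actual code.
from concurrent.futures import ThreadPoolExecutor

class FakeTest:
    def __init__(self, idx: int):
        self.idx = idx
    def validate(self) -> None:
        pass  # the real per-case schema check would go here
    def run(self) -> str:
        return f"test {self.idx}: ok"

def run_one(test: FakeTest) -> str:
    test.validate()  # per-test check happens inside the worker thread
    return test.run()

tests = [FakeTest(i) for i in range(150)]
with ThreadPoolExecutor(max_workers=8) as pool:  # 8 stands in for -j X
    results = list(pool.map(run_one, tests))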

AiredaleDev commented 1 month ago

Adding the check in the constructor significantly extends the pre-build time for the test suite, enough for the user to think something is wrong (validating json is apparently quite expensive). For performance reasons, I may have to remove the check in the constructor, is there a way we can run fewer checks?

Can you be a bit more precise... how long are we talking? and I guess you mean it expands the time between initiating ./mfc.sh build and it actually starting to build? or ./mfc.sh run and it actually starting to run?

Roughly a minute or two, though it's only for ./mfc.sh test since it generates a lot of Cases. ./mfc.sh build and ./mfc.sh run are fine because they validate at most one case file.

That is a lot. Can we switch to type checking at every test initialization rather than all at once in the beginning? That would probably fix it... distributing 1 minute over 150 tests makes it negligible. Need to be careful with the multithreading going on, through ./mfc.sh test -j X

We appear to be already doing that in addition to the initial check. Removing the initial check means we'll have to go back to waiting until after MFC is built to validate the case inputs, but I think that's fine since the alternative is sequentially validating all 150 inputs. Each test has a unique file that it reads iirc, so multithreading should behave fine.

I also just discovered we could validate only the changed parameters upfront, but the test runner is already going to run the check used during a normal run and the one it uses for generating its inputs, so it would ultimately do redundant work.

sbryngelson commented 1 month ago

Adding the check in the constructor significantly extends the pre-build time for the test suite, enough for the user to think something is wrong (validating json is apparently quite expensive). For performance reasons, I may have to remove the check in the constructor, is there a way we can run fewer checks?

Can you be a bit more precise... how long are we talking? and I guess you mean it expands the time between initiating ./mfc.sh build and it actually starting to build? or ./mfc.sh run and it actually starting to run?

Roughly a minute or two, though it's only for ./mfc.sh test since it generates a lot of Cases. ./mfc.sh build and ./mfc.sh run are fine because they validate at most one case file.

That is a lot. Can we switch to type checking at every test initialization rather than all at once in the beginning? That would probably fix it... distributing 1 minute over 150 tests makes it negligible. Need to be careful with the multithreading going on, through ./mfc.sh test -j X

We appear to be already doing that in addition to the initial check. Removing the initial check means we'll have to go back to waiting until after MFC is built to validate the case inputs, but I think that's fine since the alternative is sequentially validating all 150 inputs. Each test has a unique file that it reads iirc, so multithreading should behave fine.

I also just discovered we could validate only the changed parameters upfront, but the test runner is already going to run the check used during a normal run and the one it uses for generating its inputs, so it would ultimately do redundant work.

I don't know what you mean by it already does that, can you be more specific?

MFC is only built once before the test suite runs.

I think trying to get the toolchain to only check for changed inputs is more complex than appropriate for this.

AiredaleDev commented 1 month ago

I don't know what you mean by it already does that, can you be more specific?

The logic that validates an input was being called every time the toolchain constructs a Case. Before anything is run in parallel, the toolchain will construct many TestCases sequentially (which are children of Case). Then, the test runner would construct an MFCInputFile before running, which is also a child of Case.

I think trying to get the toolchain to only check for changed inputs is more complex than appropriate for this.

Agreed. If I were to go about it, I'd have to move check to two unintuitive places to perform a full check in the case of run and only a partial, upfront check in the case of test (call validate on the mods parameter of TestCase's constructor). Moving the check into MFCInputFile's constructor saves us the upfront minute, so I'll call it done.
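Roughly, the structure involved, as a sketch (the class names come from the toolchain; the bodies and constructor signatures are placeholders):

# Sketch of the classes described above; bodies are assumptions.
class Case:
    def __init__(self, params: dict):
        self.params = params
        # The experimental check here ran once per constructed Case, i.e.
        # ~150 times sequentially before `./mfc.sh test` scheduled anything.

class TestCase(Case):      # constructed sequentially by the test runner
    pass

class MFCInputFile(Case):  # constructed by `./mfc.sh run` for one case file
    def __init__(self, params: dict):
        super().__init__(params)
        # Moving the check here means only the single case being run pays
        # the validation cost.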

sbryngelson commented 1 month ago

If I understand correctly, you have it set up so that it builds all of the test cases and then checks all of the parameters for all the tests before actually starting to run the tests, whereas the existing (pre-PR) implementation didn't check the parameters of the test cases at all? Or did the pre-PR implementation only test some of them somehow? I'm now more confused about the nature of the issue #424

AiredaleDev commented 1 month ago

If I understand correctly, you have it set up so that it builds all of the test cases and then checks all of the parameters for all the tests before actually starting to run the tests, whereas the existing (pre-PR) implementation didn't check the parameters of the test cases at all? Or did the pre-PR implementation only test some of them somehow? I'm now more confused about the nature of the issue #424

Okay, I'm not explaining it right. The parameters were initially validated in the Fortran codebase, but the solution was quite hacky and would often point to benign parameters. In this PR, each run now checks all of its parameters using the schema, which provides clear error reporting, and then launches MFC if everything is okay -- this is true for both normal runs and the test suite.

I unintentionally made a change that would do as you described, which introduced the minute-long delay in the test suite. I fixed this delay with my most recent commits.
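In other words, a self-contained sketch of the per-run flow (stand-in names; load() and the final print stand in for the toolchain's input loading and the actual launch of MFC):

# Sketch only; SCHEMA and every function here are stand-ins.
import jsonschema

SCHEMA = {"properties": {"run_time_info": {"enum": ["T", "F"]}}}

def load(params: dict) -> dict:
    return params                        # stand-in for parsing the case file

def run(params: dict) -> None:
    case = load(params)
    jsonschema.validate(case, SCHEMA)    # per-run check; raises on a bad field
    print("launching MFC with", case)    # stand-in for the actual launch

run({"run_time_info": "T"})              # validates, then "launches"
# run({"run_time_info": "bad param"})    # would raise ValidationError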

sbryngelson commented 1 month ago

ah - got it! I understand now.

Can you make sure it works by introducing some variables in an example case with the wrong type and making sure it creates a problem?

henryleberre commented 1 month ago

@AiredaleDev @sbryngelson I think the best approach here is to define a method under Case called validate(). Code that makes use of Case objects should have the burden of calling it as a sanity check.
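A minimal sketch of what I mean (stand-in schema; only the Case and validate() names reflect the actual proposal):

# Sketch of the opt-in validate() design; SCHEMA is a stand-in.
import jsonschema

SCHEMA = {"properties": {"run_time_info": {"enum": ["T", "F"]}}}

class Case:
    def __init__(self, params: dict):
        self.params = params   # construction stays cheap: no validation here

    def validate(self) -> None:
        # Callers opt in to the (relatively expensive) schema check.
        jsonschema.validate(self.params, SCHEMA)

Case({"run_time_info": "T"}).validate()            # passes silently
# Case({"run_time_info": "bad param"}).validate()  # would raise ValidationError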

henryleberre commented 1 month ago

Also @AiredaleDev, I am to blame for the inclusion of (most of) the unused variables you found (chem_wrt, spec_pp, ...). These were unintentionally added when I upstreamed some changes from my branch that adds chemistry.

sbryngelson commented 1 month ago

@AiredaleDev @sbryngelson I think the best approach here is to define a method under Case called validate(). Code that makes use of Case objects should have the burden of calling it as a sanity check.

This makes sense to me

AiredaleDev commented 1 month ago

Code that makes use of Case objects should have the burden of calling it as a sanity check.

@henryleberre Normally I would agree, but if I apply this sanity check to all code that constructs a Case (i.e. also a TestCase) then this line in the test runner forces the user to wait almost one whole minute before it begins. Apparently jsonschema is a bit slow.

The test runner will eventually call run anyway, which ends up reading and validating each case dictionary in parallel before launching the simulation (i.e. each concurrent run of MFC checks its own parameters then runs if they are correct.) I discovered this by adding a print statement next to the validation code, and if you'd like to check this yourself, add a print statement to the newly added validate method.

# In toolchain/mfc/test/test.py
def test():
    # pylint: disable=global-statement, global-variable-not-assigned
    global nFAIL

    # This is the line I have to work around.
    # It runs before anything in parallel happens, so all 150 cases are checked sequentially.
    cases = [ _.to_case() for _ in list_cases() ]

# In toolchain/mfc/test/case.py
    def to_case(self) -> TestCase:
        dictionary = self.mods.copy()
        if self.path:
            dictionary.update(input.load(self.path, self.args).params)

            for key, value in dictionary.items():
                if not isinstance(value, str):
                    continue

                for path in [value, os.path.join(os.path.dirname(self.path), value)]:
                    path = os.path.abspath(path)
                    if os.path.exists(path):
                        dictionary[key] = path
                        break
        # This code "should" check the dictionary, but if we check the dictionary here
        # we wait ~45 seconds before the test runner starts.
        return TestCase(self.trace, dictionary, self.ppn, self.rebuild)

# In toolchain/mfc/test/test.py
def test():
    # ...

    sched.sched(sched.Task(..., func=handle_func, ...))

# Called by handle_func
def _handle_func():
    # ...
    case.run(...)  # case : TestCase

class TestCase(case.Case):
    # ...
    def run(...):
        # ...

        # We just end up calling `./mfc.sh run` for each test anyway.
        command = [
            mfc_script, "run", filepath, *rebuild, *tasks, *case_optimization,
            *jobs, "-t", *target_names, *gpus_select, *ARG("--")
        ]

        return common.system(command, print_cmd=False, text=True, capture_output=True)

Regarding @sbryngelson 's request, I added a case to examples that intentionally triggers the error reporting.

sbryngelson commented 1 month ago

hi @AiredaleDev no need to actually include broken code in version control, just 'proof' or 'verification' that if you do something like this your code identifies it.

sbryngelson commented 1 month ago

not sure I quite follow @AiredaleDev -- I know test is just a wrapper around run at some level, so I think @henryleberre is saying that everything that creates a case, which would be any run call, including those invoked through test, gets its case file validated. I don't think test should have any special machinery for this but rather invoke it at the run level?

henryleberre commented 1 month ago

Code that makes use of Case objects should have the burden of calling it as a sanity check.

@henryleberre Normally I would agree, but if I apply this sanity check to all code that constructs a Case (i.e. also a TestCase) then this line in the test runner forces the user to wait almost one whole minute before it begins. Apparently jsonschema is a bit slow.

The test runner will eventually call run anyway, which ends up reading and validating each case dictionary in parallel before launching the simulation (i.e. each concurrent run of MFC checks its own parameters then runs if they are correct.) I discovered this by adding a print statement next to the validation code, and if you'd like to check this yourself, add a print statement to the newly added validate method.

# In toolchain/mfc/test/test.py
def test():
    # pylint: disable=global-statement, global-variable-not-assigned
    global nFAIL

    # This is the line I have to work around.
    # It runs before anything in parallel happens, so all 150 cases are checked sequentially.
    cases = [ _.to_case() for _ in list_cases() ]

# In toolchain/mfc/test/case.py
    def to_case(self) -> TestCase:
        dictionary = self.mods.copy()
        if self.path:
            dictionary.update(input.load(self.path, self.args).params)

            for key, value in dictionary.items():
                if not isinstance(value, str):
                    continue

                for path in [value, os.path.join(os.path.dirname(self.path), value)]:
                    path = os.path.abspath(path)
                    if os.path.exists(path):
                        dictionary[key] = path
                        break
        # This code "should" check the dictionary, but if we check the dictionary here
        # we wait ~45 seconds before the test runner starts.
        return TestCase(self.trace, dictionary, self.ppn, self.rebuild)

# In toolchain/mfc/test/test.py
def test():
    # ...

    sched.sched(sched.Task(..., func=handle_func, ...))

# Called by handle_func
def _handle_func():
    # ...
    case.run(...)  # case : TestCase

class TestCase(case.Case):
    # ...
    def run(...):
        # ...

        # We just end up calling `./mfc.sh run` for each test anyway.
        command = [
            mfc_script, "run", filepath, *rebuild, *tasks, *case_optimization,
            *jobs, "-t", *target_names, *gpus_select, *ARG("--")
        ]

        return common.system(command, print_cmd=False, text=True, capture_output=True)

Regarding @sbryngelson 's request, I added a case to examples that intentionally triggers the error reporting.

You're right. Sorry if I was unclear. I believe we agree on this. When I said that:

Code that makes use of Case objects should have the burden of calling it as a sanity check.

I did not intend to imply that code that makes use of Case objects should always call validate. This decoupling of validation & construction is intended to facilitate this. You only run validate when it is deemed reasonable / necessary. I agree we shouldn't call it in ./mfc.sh test and just let the spawned ./mfc.sh run processes do the validation.

AiredaleDev commented 1 month ago

not sure I quite follow @AiredaleDev -- I know test is just a wrapper around run at some level, so I think @henryleberre is saying that everything that creates a case, which would be any run call, including those invoked through test, gets its case file validated. I don't think test should have any special machinery for this but rather invoke it at the run level?

This is how it behaves currently.

I did not intend to imply that code that makes use of Case objects should always call validate. This decoupling of validation & construction is intended to facilitate this. You only run validate when it is deemed reasonable / necessary. I agree we shouldn't call it in ./mfc.sh test and just let the spawned ./mfc.sh run processes do the validation.

Ah, I see! I'll go in and address that real quick.

sbryngelson commented 1 month ago

@henryleberre can approve and merge this PR when it's ready

AiredaleDev commented 1 month ago

hi @AiredaleDev no need to actually include broken code in version control, just 'proof' or 'verification' that if you do something like this your code identifies it.

Okay, will remove the broken case and show you the logs:

mfc: OK > (venv) Entered the Python 3.10.12 virtual environment (>= 3.8).

      .=++*:          -+*+=.        | cameron@pop-os [Linux]
     :+   -*-        ==   =* .      | ----------------------
   :*+      ==      ++    .+-       | --jobs 1
  :*##-.....:*+   .#%+++=--+=:::.   | --mpi
  -=-++-======#=--**+++==+*++=::-:. | --no-gpu
 .:++=----------====+*= ==..:%..... | --no-debug
  .:-=++++===--==+=-+=   +.  :=     | --targets pre_process, simulation, and post_process
  +#=::::::::=%=. -+:    =+   *:    | ----------------------------------------------------------
 .*=-=*=..    :=+*+:      -...--    | $ ./mfc.sh (build, run, test, clean, count, packer) --help

 Acquiring /home/cameron/Programs/Work/MFC/examples/2D_ibm/case.py...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/cameron/Programs/Work/MFC/toolchain/main.py:65 in <module>                                 │
│                                                                                                  │
│   62 │   │                                                                                       │
│   63 │   │   __print_greeting()                                                                  │
│   64 │   │   __checks()                                                                          │
│ ❱ 65 │   │   __run()                                                                             │
│   66 │                                                                                           │
│   67 │   except MFCException as exc:                                                             │
│   68 │   │   cons.reset()                                                                        │
│                                                                                                  │
│ /home/cameron/Programs/Work/MFC/toolchain/main.py:50 in __run                                    │
│                                                                                                  │
│   47                                                                                             │
│   48                                                                                             │
│   49 def __run():                                                                                │
│ ❱ 50 │   {"test":   test.test,     "run":        run.run,          "build":      build.build,    │
│   51 │    "clean":  build.clean,   "bench":      bench.bench,      "count":      count.count,    │
│   52 │    "packer": packer.packer, "count_diff": count.count_diff, "bench_diff": bench.diff      │
│   53 │   }[ARG("command")]()                                                                     │
│                                                                                                  │
│ /home/cameron/Programs/Work/MFC/toolchain/mfc/run/run.py:133 in run                              │
│                                                                                                  │
│   130                                                                                            │
│   131 def run(targets = None, case = None):                                                      │
│   132 │   targets = get_targets(list(REQUIRED_TARGETS) + (targets or ARG("targets")))            │
│ ❱ 133 │   case    = case or input.load(ARG("input"), ARG("arguments"))                           │
│   134 │                                                                                          │
│   135 │   build(targets)                                                                         │
│   136                                                                                            │
│                                                                                                  │
│ /home/cameron/Programs/Work/MFC/toolchain/mfc/run/input.py:99 in load                            │
│                                                                                                  │
│    96 │   │   raise common.MFCException(f"Input file {filename} did not produce valid JSON. It   │
│    97 │                                                                                          │
│    98 │   input_file = MFCInputFile(filename, dirpath, dictionary)                               │
│ ❱  99 │   input_file.validate_params()                                                           │
│   100 │   return input_file                                                                      │
│   101                                                                                            │
│   102                                                                                            │
│                                                                                                  │
│ /home/cameron/Programs/Work/MFC/toolchain/mfc/case.py:72 in validate_params                      │
│                                                                                                  │
│    69 │   │   '''Typechecks parameters read from case file. If a parameter                       │
│    70 │   │   is assigned a value of the wrong type, this method throws an exception             │
│    71 │   │   highlighting the violating parameter and specifying what it expects.'''            │
│ ❱  72 │   │   jsonschema.validate(self.params, case_dicts.SCHEMA)                                │
│    73 │                                                                                          │
│    74 │   def __get_ndims(self) -> int:                                                          │
│    75 │   │   return 1 + min(int(self.params.get("n", 0)), 1) + min(int(self.params.get("p", 0   │
│                                                                                                  │
│ /home/cameron/Programs/Work/MFC/build/venv/lib/python3.10/site-packages/jsonschema/validators.py │
│ :1332 in validate                                                                                │
│                                                                                                  │
│   1329 │   validator = cls(schema, *args, **kwargs)                                              │
│   1330 │   error = exceptions.best_match(validator.iter_errors(instance))                        │
│   1331 │   if error is not None:                                                                 │
│ ❱ 1332 │   │   raise error                                                                       │
│   1333                                                                                           │
│   1334                                                                                           │
│   1335 def validator_for(                                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: 'bad param' is not one of ['T', 'F']

Failed validating 'enum' in schema['properties']['run_time_info']:
    {'enum': ['T', 'F']}

On instance['run_time_info']:
    'bad param'

ERROR: An unexpected exception occurred: 'bad param' is not one of ['T', 'F']

Failed validating 'enum' in schema['properties']['run_time_info']:
    {'enum': ['T', 'F']}

On instance['run_time_info']:
    'bad param'

mfc: ERROR > mfc.py finished with a 143 exit code.
mfc: (venv) Exiting the Python virtual environment.