Parameter sweep `run_model` is hard to debug

Robbybp commented 7 months ago

In ParameterSweepBase, we wrap run_model in a try/except with no specified exception. This seems a bit dangerous. For errors in solvers and other "black-box" subroutines, we probably want to record them. For other errors, e.g. typos, we probably want the sweep to fail immediately. Otherwise we can run an entire parameter sweep just to see something like ValueError: too many values to unpack for every item. The risk of odd errors increases as we write more complicated run_model methods. Right now I have scaling, re-initialization, solve, and validation steps in a run_model function that I'm using. My question is: Should we define an exception type that users must raise in order to be caught by the sweep runner? The downside is that users will have to know what types of exceptions to expect from their black-box subroutines, and will have to catch these and re-raise our exception type. I assume the current design was chosen intentionally, but am curious to hear the rationale for it.

andrewlee94 commented 7 months ago

@Robbybp I don't know that there was too much rationale behind the current design; it mostly echoes designs from previous iterations but I am not sure that there has been much thought into different ways it can go wrong. Having a terminate-immediately Exception type sounds like a good idea however; usage would primarily fall upon the user but that is probably how it has to be (i.e. it is up to them how much they want to instrument their method).

That being said, we probably should look at catching some of the obvious ones ourselves if possible. The ValueError example above strikes me as one we might be able to handle.

To consider other alternatives, there is the opposite approach in that we fail outright on any Exception and insist that users put in try/excepts as necessary (which they would need to do in the above anyway). However, that could be just as frustrating if you have a long run and late in the cycle you run into the one Exception you did not expect.

So, in short I think trying to find something in the middle might be he best approach. I.e. have an Exception users can use to indicate fail immediately (that they can use as they wish), add some more protections where we can to catch obvious issues like callback signatures with the wrong number of arguments, and otherwise just catch the Exception and log it in the results.

Adding the terminate immediately Exception is easy, but we should think about what cases we have where we want to fail immeidately (or better yet, fail before executing anything).

andrewlee94 commented 7 months ago

On re-reading the original message, we should also log any Exceptions we catch in the results for posterity.

Robbybp commented 7 months ago

To consider other alternatives, there is the opposite approach in that we fail outright on any Exception and insist that users put in try/excepts as necessary (which they would need to do in the above anyway). However, that could be just as frustrating if you have a long run and late in the cycle you run into the one Exception you did not expect.

This would have been my suggestion, but I see that it could be frustrating to get an unhandled exception near the end of a run. Maybe we could inspect the exception in our except block and re-raise if it is TypeError, ValueError, KeyError, IndexError, AttributeError, or any other error that we do not expect a solver to raise. I think this is basically what you're suggesting?

Robbybp commented 7 months ago

On re-reading the original message, we should also log any Exceptions we catch in the results for posterity.

We do this already, right?

andrewlee94 commented 7 months ago

@Robbybp My question then is what exceptions would we expect a solver to raise, or perhaps more specifically can we neatly categorize things into exceptions to catch and log versus those to re-raise? My initial feeling is that it would be better to have a "white-list" of exceptions to catch and log rather than a black-list of exceptions to terminate on - it will fail more aggressively, but would generally indicate a user-error that should be addressed. We could then clearly document the exception types that will be logged for the user.

Another thing however would be to think about how this might work in a parallel environment where we would ideally want a critical failure in one thread to kill all other threads as well (I think this is probably part of the reason for the liberal catch and log approach at the moment).

Robbybp commented 7 months ago

My initial feeling is that it would be better to have a "white-list" of exceptions to catch and log rather than a black-list of exceptions to terminate on

I think in this case I would advocate for a ParameterSweepRuntimeError or something that we require users raise if they want the exception to be caught. This way the user is in charge of knowing what exceptions the solver could raise. I'm fine with this approach, especially as I think this is the easiest thing for us to implement and maintain. I think re-raising specific non-runtime/arithmetic errors is more user-friendly (they don't have to implement their own try/except), but less safe (what if a solver raises ValueError internally). I think overall I prefer the former approach as well.

Another thing however would be to think about how this might work in a parallel environment where we would ideally want a critical failure in one thread to kill all other threads as well

I think this is out-of-scope for our current implementation. The parallel implementation can override the default, although I think I see the advantages of the current implementation in the parallel setting.

IDAES / idaes-pse

Parameter sweep `run_model` is hard to debug #1356