Training freeze when raise a RuntimeError in EnergyPlus

hermmanhender commented 1 year ago

Hi, I used this repo as base for my own development.

I found that the statement raise RuntimeError(f"EnergyPlus failed with {self.energyplus_runner.sim_results['exit_code']}") on line 359 of run.py (in the step() method) freezes the training run when an error occurs.

I solved this problem by replacing that line with raise Exception("Faulty episode") and adding a failure configuration to the Tune setup used to run the experiment:

from ray import air, tune

# algorithm_name and algo_config are defined elsewhere in the experiment script.
tune.Tuner(
    algorithm_name,
    run_config=air.RunConfig(
        stop={'episodes_total': 250},
        failure_config=air.FailureConfig(
            # Tries to recover a run up to this many times.
            max_failures=10,
        ),
    ),
    param_space=algo_config.to_dict(),
).fit()

This was helpful for me.
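For readers who want the exit-code check in one place, here is a minimal sketch of the idea. The names FaultyEpisodeError and raise_if_failed are hypothetical and do not come from the repo; the only thing taken from the issue is that the simulation results carry an 'exit_code' entry. The sketch assumes that a non-zero exit code means the episode should be aborted, so that Tune's failure handling can restart the trial.

```python
class FaultyEpisodeError(Exception):
    """Hypothetical exception marking an EnergyPlus run that exited abnormally."""


def raise_if_failed(sim_results: dict) -> None:
    """Raise FaultyEpisodeError when the simulation reported a non-zero exit code.

    Assumes sim_results holds an 'exit_code' entry, as in the issue above.
    A missing entry is treated as "no failure reported".
    """
    exit_code = sim_results.get("exit_code", 0)
    if exit_code != 0:
        raise FaultyEpisodeError(f"EnergyPlus failed with exit code {exit_code}")


# A successful run passes silently; a failed run raises.
raise_if_failed({"exit_code": 0})
try:
    raise_if_failed({"exit_code": 1})
except FaultyEpisodeError as err:
    print(err)
```

Calling such a helper from step() would keep the error message informative while still letting the exception propagate to Tune.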

antoine-galataud commented 1 year ago

Hi @hermmanhender, thank you for sharing your experience on this. That's indeed very useful when some EnergyPlus runs are expected to fail. I'm curious to know why they fail in your case. Is it specific to your work?

hermmanhender commented 1 year ago

You're welcome. It seemed important to me because, in my experience, when you run EnergyPlus repeatedly in several threads there is always a chance of errors. In my case, I am hitting a very peculiar error that I still can't solve, which I have posted on UnmetHours (here).
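One way a threaded setup can hang instead of failing is a blocking wait on a queue whose producer thread has already died. The sketch below illustrates a possible guard under that assumption. Runner, SimulationFailed and next_observation are hypothetical stand-ins, not the repo's classes: the consumer waits with a timeout and, when nothing arrives, checks whether the simulation thread has already recorded an exit code.

```python
import queue
import threading


class SimulationFailed(Exception):
    """Hypothetical error raised when the background simulation has stopped."""


class Runner:
    """Stand-in for a simulation running in its own thread (not the repo's runner)."""

    def __init__(self, n_steps, fail_at=None):
        self.obs_queue = queue.Queue(maxsize=1)
        self.exit_code = None  # set once the simulation thread ends
        self._n_steps = n_steps
        self._fail_at = fail_at
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        for t in range(self._n_steps):
            if t == self._fail_at:
                self.exit_code = 1  # simulated crash: no more observations
                return
            self.obs_queue.put(float(t))
        self.exit_code = 0


def next_observation(runner, timeout=0.1):
    """Wait for the next observation, but never block forever.

    On each timeout, check whether the simulation has ended; if so, raise
    instead of waiting for an observation that will never arrive.
    """
    while True:
        try:
            return runner.obs_queue.get(timeout=timeout)
        except queue.Empty:
            if runner.exit_code is not None:
                raise SimulationFailed(
                    f"simulation ended with exit code {runner.exit_code}"
                )


runner = Runner(n_steps=5, fail_at=2)
runner.start()
print(next_observation(runner))
print(next_observation(runner))
try:
    next_observation(runner)
except SimulationFailed as err:
    print(err)
```

With a guard like this, the failure surfaces as an exception in the environment, which the failure configuration in Tune can then act on.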

In addition, I am having problems when I run a full year with some algorithms from the RLlib library. Changing the configuration may solve it (I still have to experiment), because shorter run periods work well: I have tried up to three months without faults.

airboxlab / rllib-energyplus

Training freeze when raise a RuntimeError in EnergyPlus #16