Checkpointing - Githubissues

CharlyEmpereurmot commented 5 years ago

Hello Marco,

I was thinking about another useful feature that you might want to implement: checkpointing. A user might realize too late that they did not allow enough evaluation steps, a bug in the evaluation function can surface, the machine can lose power, the hard drive can fill up, etc. Sometimes days of optimization are thrown away.

It would be nice to be able to load a file to continue the optimization process from a given point (i.e., a checkpoint). This file could be generated after each particle evaluation or after each swarm iteration, according to the user's choice, and checkpointing could be disabled by default.

Would it be straightforward to implement? What do you think? Cheers!

aresio commented 5 years ago

That would be tricky to implement because you also need to store the information about the previous state of the swarm, in order to properly apply the fuzzy rules. Nevertheless, I guess it is much better to recover a long optimization at the cost of a tiny error in the fuzzy reasoning with respect to losing the whole process and starting again from scratch. I put this idea in the set of priority features, thank you for the suggestion!

aresio commented 5 years ago

Hi, I am done with the implementation. By using the argument "save_checkpoint" in the solve_with_fstpso method you can specify a snapshot file. For instance:

... FP.solve_with_fstpso(save_checkpoint = "checkpoint.obj") ...

If something goes wrong with the optimization (e.g., the computer crashes), you can recover by using the argument "restart_from_checkpoint":

... FP.solve_with_fstpso(restart_from_checkpoint = "checkpoint.obj") ...

The optimization will restart from the last valid checkpoint. I hope this solution suits your needs!
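For anyone wiring the two arguments into a script, here is a minimal sketch. The helper name checkpoint_kwargs is hypothetical and not part of FST-PSO; only the two keyword names come from the examples above. It assumes you want to resume only when the checkpoint file is actually present on disk.

```python
import os


def checkpoint_kwargs(save_to, resume_from=None):
    """Build keyword arguments for solve_with_fstpso (hypothetical helper).

    Always asks for a checkpoint to be written to save_to; adds
    restart_from_checkpoint only if resume_from is an existing file.
    """
    kwargs = {"save_checkpoint": save_to}
    if resume_from is not None and os.path.isfile(resume_from):
        kwargs["restart_from_checkpoint"] = resume_from
    return kwargs


# Intended use (FP is a configured FuzzyPSO instance, not created here):
# result = FP.solve_with_fstpso(**checkpoint_kwargs("checkpoint.obj"))
```

On a first run the dictionary only contains save_checkpoint, so the same script can be used to start and to recover.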

I am pushing the new version to PyPI ASAP.

aresio commented 5 years ago

New version pushed to PyPI, please let me know if the new functionality works correctly!

CharlyEmpereurmot commented 5 years ago

Thank you for going back to this point! It will be useful to many I'm sure.

Previously I had copied your scripts into my package directory. Now I have updated via pip and I no longer use copies of the scripts, but your package directly, as I should. I am now doing this:

# content of my_script.py
from fstpso import FuzzyPSO 
FP = FuzzyPSO()
# search_space_boundaries and eval_function are defined and working
FP.set_search_space(search_space_boundaries)
FP.set_fitness(fitness=eval_function, arguments=None, skip_test=True)
result = FP.solve_with_fstpso(max_iter=1, initial_guess_list=initial_guess_list, max_iter_without_new_global_best=1)  # for test

I have this error:

Traceback (most recent call last):
  File "./my_script.py", line 4, in <module>
    from fstpso import FuzzyPSO 
  File "/usr/local/lib/python3.6/dist-packages/fstpso/__init__.py", line 1, in <module>
    from .fstpso import FuzzyPSO
  File "/usr/local/lib/python3.6/dist-packages/fstpso/fstpso.py", line 14, in <module>
    from fstpso_checkpoints import Checkpoint
ModuleNotFoundError: No module named 'fstpso_checkpoints'

aresio commented 5 years ago

Issue with relative imports, I'm sorry, my fault.

Please try again now (version 1.7.1). I tested it locally and it seems to work correctly.
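To illustrate the kind of import problem in the traceback above, here is a self-contained sketch that builds a throwaway package and tries both import styles. The names demo_pkg, core and demo_checkpoints are made up for the demonstration; this is not the actual FST-PSO source or its fix, only the general Python 3 behavior where a sibling module inside a package must be imported relatively (or through the package name).

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root, import_line):
    """Create a throwaway package 'demo_pkg' whose core module runs import_line."""
    pkg = os.path.join(root, "demo_pkg")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("from .core import Checkpoint\n")
    with open(os.path.join(pkg, "core.py"), "w") as f:
        f.write(import_line + "\n")
    with open(os.path.join(pkg, "demo_checkpoints.py"), "w") as f:
        f.write("class Checkpoint:\n    pass\n")


def try_import(import_line):
    """Return None if 'import demo_pkg' succeeds, else the exception class name."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_package(root, import_line)
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_pkg")
            return None
        except ImportError as exc:
            return type(exc).__name__
        finally:
            # Undo the path change and forget the demo modules.
            sys.path.remove(root)
            for name in [n for n in sys.modules if n.startswith("demo_pkg")]:
                del sys.modules[name]


# Sibling imported as if it were a top-level module: fails once installed.
absolute_result = try_import("from demo_checkpoints import Checkpoint")
# Sibling imported relative to the package: works.
relative_result = try_import("from .demo_checkpoints import Checkpoint")
print(absolute_result, relative_result)
```

The absolute form only works when the package directory itself happens to be on sys.path, which is typically the case when running the scripts from a local copy but not after a pip install.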

CharlyEmpereurmot commented 5 years ago

Thank you very much for implementing this, I will try very soon!

aresio commented 5 years ago

Please download the new version of FST-PSO, there was a bug in the file name handling.

CharlyEmpereurmot commented 4 years ago

Hey Marco,

I tested using version 1.7.9 and everything seems fine!

Thank you, this is useful :+1:

CharlyEmpereurmot commented 3 years ago

Hi Marco!

Can you please confirm that checkpointing also works fine when using a parallel evaluation function? I am about to implement something that really needs this combination to work. I will post here again ASAP to let you know what my conclusions were.

I assume there is nothing wrong with doing this:

FP.set_parallel_fitness(fitness=eval_function_parallel_swarm, arguments=[pool, slots_states], skip_test=True)
FP.solve_with_fstpso(restart_from_checkpoint = "checkpoint_1.obj", save_checkpoint = "checkpoint_2.obj")

Right?

Also, I believe generating a checkpoint should be the default behavior. I would suggest that if no filename is provided, it would be nice to name the file after a timestamp such as: checkpoint_fstpso_04-02-2021_17h34m05s.obj

Edit: I could actually do a PR for this
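A possible way to build such a default name, as a small sketch. The function default_checkpoint_name is hypothetical (it is not in FST-PSO), and the format assumes the example above reads as day-month-year followed by hours, minutes and seconds.

```python
from datetime import datetime


def default_checkpoint_name(now=None):
    """Return a timestamped checkpoint file name (hypothetical helper).

    Assumed layout: checkpoint_fstpso_DD-MM-YYYY_HHhMMmSSs.obj
    """
    if now is None:
        now = datetime.now()
    return now.strftime("checkpoint_fstpso_%d-%m-%Y_%Hh%Mm%Ss.obj")


print(default_checkpoint_name())
```

Passing the time in explicitly keeps the helper easy to test; a year-first layout would sort better in a directory listing, but the sketch sticks to the format suggested above.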

Thank you very much :)

aresio commented 3 years ago

Hi Charly,

I do not think I ever tested this combination, but in principle it should work...

Concerning the checkpoint, writing to the hard drive affects performance, which is why I always try to avoid it. Of course, losing the whole optimization is even worse. I will make it the default, with a warning message and an option to disable it.

Thank you!

M

CharlyEmpereurmot commented 3 years ago

Thank you for your quick answer.

So the checkpoint is written at the end of each swarm iteration? If this is the case, it is perfectly suited to my needs. Otherwise, the combination in question apparently works fine :+1:

There is just the estimated worst fitness that behaves strangely for some reason, yielding a very big number when I'm running in parallel and feeding a list of initial guesses in the same format as if I were running in serial:

Estimated worst fitness: 17976931348623157081452742373170435679807056752584499659891 [...]

Meanwhile, the search space boundaries seem fine and my initial guesses are well within them. I remember you said this is used internally to calibrate some ranges for the fuzzy reasoner, right? So I'm a bit worried about this, which I had never seen before. It's possible I have some typos though; I will update soon!

aresio commented 3 years ago

Hi,

I just realized that I introduced a bug in the estimation of worst fitness in the case of parallel fitness evaluation (it basically uses sys.float_info.max). I will patch it by today.
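As a quick check, the digits reported as "Estimated worst fitness" match the integer expansion of the largest finite double. The sketch below shows this with the standard library only; the helper is_sentinel_fitness is a hypothetical name for a guard one could apply to a reported estimate, not something FST-PSO provides.

```python
import math
import sys

# Largest finite double-precision value and its full integer expansion.
largest_float = sys.float_info.max
integer_digits = str(int(largest_float))

print(largest_float)
print(integer_digits[:59], "[...]", len(integer_digits), "digits in total")


def is_sentinel_fitness(value):
    """Return True if value looks like a placeholder rather than a real fitness."""
    return math.isinf(value) or abs(value) >= sys.float_info.max
```

A value like this is a placeholder for "no estimate available", so seeing it in the log points to the estimation step being skipped rather than to a problem in the evaluation function.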

Sorry about that,

Marco

aresio commented 3 years ago

Done! I also fixed a couple of additional issues (e.g., optional arguments were mandatory in the parallel fitness evaluation). Thank you and please let me know if this new version works properly.
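For reference, a parallel fitness function with optional arguments could look roughly like the sketch below. It assumes, without having checked the FST-PSO source, that the function receives the list of particle positions (plus the arguments list when one was given to set_parallel_fitness) and returns one fitness value per particle in the same order. The sphere function and the meaning of arguments[0] as a worker count are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def sphere(position):
    """Toy fitness for one particle: sum of squared coordinates."""
    return sum(x * x for x in position)


def eval_function_parallel_swarm(particles, arguments=None):
    """Evaluate the whole swarm at once and return fitnesses in particle order.

    arguments is optional; in this sketch arguments[0], if present, is the
    number of worker threads (an assumption made for the example).
    """
    max_workers = arguments[0] if arguments else None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so result i belongs to particle i.
        return list(pool.map(sphere, particles))


print(eval_function_parallel_swarm([[1.0, 2.0], [3.0, 4.0]]))
```

Giving arguments a default of None lets the same function be registered with or without an arguments list.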

CharlyEmpereurmot commented 3 years ago

Hi!

Everything works just fine :) I have also noticed that the checkpoint is indeed written at the end of each swarm iteration, which is exactly what I need.

I am about to use FST-PSO with the whole swarm parallelized across nodes on a supercomputer, with as many nodes as particles in the swarm. Each particle takes approx. 2h of computation and I have a hard limit of 24h before the master job is killed. Therefore, I absolutely need the checkpoints just to be able to run a complete optimization normally.
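One way to chain jobs under such a time limit is to number the checkpoint files and let each job resume from the newest one while writing the next. The sketch below only handles the file naming; next_checkpoint_pair and the checkpoint_N.obj scheme are hypothetical conventions, not FST-PSO features.

```python
import os
import re


def next_checkpoint_pair(directory, stem="checkpoint"):
    """Return (newest existing checkpoint or None, path for the next one).

    Files are expected to be named <stem>_<N>.obj with N a positive integer.
    """
    pattern = re.compile(r"^%s_(\d+)\.obj$" % re.escape(stem))
    found = {}
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            found[int(match.group(1))] = name
    latest = max(found) if found else 0
    previous = os.path.join(directory, found[latest]) if found else None
    following = os.path.join(directory, "%s_%d.obj" % (stem, latest + 1))
    return previous, following


# Intended use in each resubmitted job (FP is a configured FuzzyPSO instance):
# previous, following = next_checkpoint_pair(".")
# kwargs = {"save_checkpoint": following}
# if previous is not None:
#     kwargs["restart_from_checkpoint"] = previous
# FP.solve_with_fstpso(**kwargs)
```

Keeping the older files around also leaves a fallback if the newest checkpoint was being written when the job was killed.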

On 18-06-2019 you wrote:

Nevertheless, I guess it is much better to recover a long optimization at the cost of a tiny error in the fuzzy reasoning with respect to losing the whole process and starting again from scratch.

How different would the results of 2 optimizations be, using the same parameters/constraints/everything, with checkpoints VERSUS without checkpoints? Since the process is not deterministic, I believe we simply cannot compare and get answers. But what would you expect? Assuming we set all the necessary seeds and have reproducible behavior, what would be the difference between the run that uses checkpoints and the one that does not? We can assume it's negligible, right?

Tons of cheers

aresio commented 3 years ago

Hi Charly,

I guess we need to perform a statistical comparison. Run a bunch of optimizations (say, 100) and compare the performance (e.g., the distribution of the fitness values of the best individual found, with and without checkpoints).
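Such a comparison could be sketched with a simple permutation test on the best fitness values of the two groups of runs. This is only one possible choice of test (the comment above does not name one), it uses the standard library only, and the function name and the choice of the median as statistic are assumptions made for the example.

```python
import random
import statistics


def permutation_test_median(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of medians.

    sample_a / sample_b: best fitness values from runs with / without
    checkpoints. Returns an estimated p-value; small values suggest the
    two distributions differ.
    """
    rng = random.Random(seed)
    observed = abs(statistics.median(sample_a) - statistics.median(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign runs to the two groups at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

With 100 runs per group, the two lists would simply hold the 100 best fitness values obtained in each setting.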

This is actually an analysis that I have never seen in my life. The results could be surprising though (e.g., does FST-PSO reduce to a common PSO in this case? PSO often performs worse than FST-PSO, etc.), so it is probably worth investigating.

aresio / fst-pso

Checkpointing #8