MLopez-Ibanez / irace

Iterated Racing for Automatic Algorithm Configuration
https://mlopez-ibanez.github.io/irace/
GNU General Public License v2.0

Execute multiple runs of irace in parallel within R #59

Closed: Saethox closed this issue 1 year ago

Saethox commented 1 year ago

When executing irace from the shell, you can use parallel-irace to parallelize multiple runs of irace in addition to parallelizing executions of the targetRunner within a single run, right? Would it be possible to provide an irace_parallel(runs) R function that does the same, where runs is a list of tuples of scenario and parameters? I'm not familiar with how parallelization works in R, so I don't know how much work something like this might be. Personally, I would already be happy with executing multiple runs on multiple threads; I have no need for MPI, Slurm, etc.

If this is something that should not be part of this repository, I would also be happy with some pointers to try and implement it myself.

MLopez-Ibanez commented 1 year ago

The parallel-irace script does a bit more than execute multiple runs in parallel. It also takes care of the random seed, creating the exec-dirs, etc.

I would be happy to review and merge a function irace_parallel(runs) or maybe multiple_runs_irace(scenario, parameters, nruns=2, parallel=TRUE). You could start with the parallelization provided by the parallel package (https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf). We already use it within irace (https://github.com/MLopez-Ibanez/irace/blob/7513fef57f845c195d15b469d1e5ccd9dd706598/R/race-wrapper.R#L561-L599).

It would be great if most of the code in parallel-irace could be moved to R.

Saethox commented 1 year ago

If I understand this correctly, on platforms other than Windows the irace runs should just be parallelizable with parallel::mclapply, which, according to the documentation, allows nested calls by default (mc.allow.recursive = TRUE).

Apparently, nested parallel::parLapply also works, although possibly with considerable overhead (https://stackoverflow.com/questions/50938117/r-parallel-clusters-inside-a-cluster).
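As a minimal sketch of the nested mclapply idea above: the outer call parallelizes over runs and the inner call stands in for parallel targetRunner executions within one run. The function run_one and its arithmetic are hypothetical placeholders, not irace code; a real version would call irace inside run_one.

```r
library(parallel)

# Hypothetical stand-in for one irace run (assumption, not irace code).
run_one <- function(run_id, inner_cores = 1L) {
  # Inner level: stands in for parallel targetRunner executions in one run.
  costs <- mclapply(1:4, function(i) run_id * 10 + i, mc.cores = inner_cores)
  sum(unlist(costs))
}

# mclapply relies on forking, so on Windows mc.cores must stay at 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Outer level: one element per run. mc.allow.recursive = TRUE is already
# the default; it is spelled out here only to make the nesting explicit.
results <- mclapply(1:3, run_one, inner_cores = n_cores,
                    mc.cores = n_cores, mc.allow.recursive = TRUE)
```

Note that the total number of processes is roughly the product of the outer and inner core counts, so both values need to be chosen with the machine size in mind.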

MLopez-Ibanez commented 1 year ago

I would start even simpler. Create first a function that does sequentially what you want to do in parallel. I would be happy to review and merge that function even without the parallel part. Then make that function work in parallel.

For the parallel part, I would suggest trying parallel::parLapply and parallel::mclapply, as irace itself does, for a first implementation. I don't even know whether there are unknown issues when running irace this way, so you may want to get a working version that does what you want first, then make it faster.

If you are not happy with the performance, you could try the future package, but I haven't investigated it myself, so I don't know how easy it is to use.

There is also a third option: https://callr.r-lib.org/#multiple-background-r-processes-and-poll
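The sequential-first approach suggested here could look roughly like the sketch below: a function that loops over runs with lapply, plus an optional switch to parallel::parLapply. The names multiple_runs and run_one are hypothetical and are not the function that was later merged; run_one only draws random numbers where a real version would call irace.

```r
library(parallel)

# Hypothetical stand-in for a single irace run (assumption, not irace code).
run_one <- function(seed) {
  set.seed(seed)
  list(seed = seed, best = min(runif(5)))
}

# Hypothetical helper: sequential by default, parallel on request.
multiple_runs <- function(seeds, parallel = FALSE, ncores = 2L) {
  if (!parallel) {
    return(lapply(seeds, run_one))
  }
  cl <- makeCluster(ncores)
  # Always shut the workers down, even if a run fails.
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, seeds, run_one)
}

seq_res <- multiple_runs(c(1, 2, 3))
par_res <- multiple_runs(c(1, 2, 3), parallel = TRUE)
```

Because each run sets its own seed, the sequential and parallel results can be compared directly, which gives a simple check that the parallel path does not change behavior.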

MLopez-Ibanez commented 1 year ago

I have merged it, and I also made a few minor changes after merging. In particular, I have moved the function to its own file, since I expect it to keep growing when you implement the parallel version.

I also do not think the handling of random seeds is completely correct, since gen_random_seeds(10) returns a list of 70 values, which does not seem right. But comparing the parallel versus sequential variants will probably shed more light on this aspect.

Looking forward to the next part!

Saethox commented 1 year ago

> I also do not think the handling of random seeds is completely correct, since gen_random_seeds(10) returns a list of 70 values, which does not seem right. But comparing the parallel versus sequential variants will probably shed more light on this aspect.

Yeah, it looks like I'm concatenating the lists incorrectly: the seed generated by nextRNGStream is a list of seven values. I should also set the random seed directly and reset scenario$seed to NA, because irace expects a single positive integer as the seed, not seven integers.
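To illustrate how ten seeds can turn into 70 values: with the L'Ecuyer-CMRG generator, .Random.seed and the result of parallel::nextRNGStream are integer vectors of length seven, so appending them with c() flattens everything into one long vector. The sketch below is only an assumed reconstruction of that kind of mistake, not the actual gen_random_seeds code.

```r
library(parallel)

old_kind <- RNGkind()[1]        # remember the current generator
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
first_stream <- .Random.seed    # integer vector of length 7

# Flattening: c() merges each 7-value stream into one vector of 70 values.
flat <- integer(0)
s <- first_stream
for (i in seq_len(10)) {
  s <- nextRNGStream(s)
  flat <- c(flat, s)
}

# Keeping streams separate: one list element per run, 10 elements in total.
streams <- vector("list", 10)
s <- first_stream
for (i in seq_len(10)) {
  s <- nextRNGStream(s)
  streams[[i]] <- s
}

RNGkind(old_kind)               # restore the previous generator
```

Here length(flat) is 70 while length(streams) is 10, with each element of streams holding the seven integers of one stream.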

Saethox commented 1 year ago

How should we approach the parallel execution? The execute.experiments function is currently hard-coded to the scenario options and the global .irace$target.runner, but a parallel implementation of multi_irace would probably share 90% of the code.

Do you see any pitfalls with adapting execute.experiments to allow executing arbitrary functions sequentially, with multi-threading, with MPI, on other clusters, etc.? Then we could use it inside both irace and multi_irace. Depending on how hard it is to get the cluster code to work, I would also be fine with just extracting the multi-threading part of execute.experiments.

MLopez-Ibanez commented 1 year ago

Please duplicate/copy the parts that you need and do not modify execute.experiments. Once everything is working, if some duplication remains, we can look at creating common functions.

MLopez-Ibanez commented 1 year ago

> Yeah, it looks like I'm concatenating the lists incorrectly: the seed generated by nextRNGStream is a list of seven values. I should also set the random seed directly and reset scenario$seed to NA, because irace expects a single positive integer as the seed, not seven integers.

Would that allow someone to repeat individual runs knowing the seed?

I think it may be easier to generate as many integers as seeds are needed (randomly, or sequentially starting from global_seed), then set scenario$seed for each run and not change the RNGkind at all. This way each run of irace will work exactly as if the user had specified the seed that is recorded. It is true that the runs are then no longer technically independent, because their RNG streams are related through their seeds, but I doubt that this effect can be measured at all in the context of irace. In fact, when running multiple runs of irace, I often set the seed to 42+run_i.
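A minimal sketch of this integer-seed scheme follows. The helper make_run_seeds is hypothetical, the scenario is reduced to a plain list for illustration, and run_i is assumed to start at 1; each per-run scenario simply records its own seed so that a single run can be repeated later.

```r
# Hypothetical helper: one plain integer seed per run.
make_run_seeds <- function(global_seed, nruns, sequential = TRUE) {
  if (sequential) {
    # Same idea as 42 + run_i, with run_i assumed to start at 1.
    return(global_seed + seq_len(nruns))
  }
  # Random variant: draws the seeds from the RNG seeded with global_seed.
  # This changes the caller's RNG state.
  set.seed(global_seed)
  sample.int(.Machine$integer.max, nruns)
}

# Simplified stand-in for an irace scenario (assumption: a plain list).
scenario <- list(maxExperiments = 1000, seed = NA)

seeds <- make_run_seeds(42, 3)
# One scenario copy per run, each carrying the seed needed to repeat it.
scenarios <- lapply(seeds, function(s) modifyList(scenario, list(seed = s)))
```

With global_seed = 42 and three runs, the sequential variant yields the seeds 43, 44 and 45, and the RNG kind is never touched.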