Gecode / gecode

Generic Constraint Development Environment
https://www.gecode.org

On Restart Meta Search #174

Closed: Dekker1 closed this 1 year ago

Dekker1 commented 1 year ago

This PR contains the Gecode implementation of the on_restart meta-search feature, which is being added to MiniZinc as an experimental feature. The idea is to let MiniZinc users describe certain kinds of meta-search algorithms in their models; these are then translated to the solvers' normal propagation/solving mechanisms, apart from a few new introspection functions that are evaluated on every restart.

These features are further discussed in the following publications.

The implementation in this PR has been designed to be self-contained and to have minimal impact on performance unless the new features are used.

zayenz commented 1 year ago

Interesting work @Dekker1! I haven't gone through it in detail yet (will do so later), though I've skimmed it briefly. A couple of comments:

One thing that stood out to me is the use of std::vector for storing the information in an OnRestartData. Apart from a few usages in FlatZincSpace and related infrastructure, that would be the first use of std::vector as storage, rather than just as an argument type, inside Gecode proper. While I'm not personally opposed to std::vector, I thought I'd bring it up; it is at least worth discussing.

Although the propagators are quite straightforward, we probably want some basic tests in the test infrastructure as well, just to ensure that they are exercised fully.

Dekker1 commented 1 year ago

Regarding std::vector: it seemed the most straightforward way to manage this such that the copy constructors and the like are well defined (rather than using new std::pair<int, bool>[n] or something similar). Of course, since the sizes don't change, it could all be allocated in a single object with some size calculations, but that seems needlessly complex when the performance requirements don't warrant it. Let me know if a different solution is preferred; I'm happy to change it.
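
To illustrate the trade-off (a sketch only; the struct and member names here are hypothetical, not the PR's actual OnRestartData code):

#include <utility>
#include <vector>

// With std::vector, the compiler-generated copy constructor, copy
// assignment, and destructor are all correct by default.
struct VectorStorage {
  std::vector<std::pair<int,bool>> last_val; // hypothetical member name
};

// With a raw array, the special member functions must be written
// (and kept correct) by hand.
struct RawStorage {
  std::pair<int,bool>* last_val;
  int n;
  explicit RawStorage(int n0)
    : last_val(new std::pair<int,bool>[n0]), n(n0) {}
  RawStorage(const RawStorage& o)
    : last_val(new std::pair<int,bool>[o.n]), n(o.n) {
    for (int i = 0; i < n; i++)
      last_val[i] = o.last_val[i];
  }
  RawStorage& operator=(const RawStorage&) = delete; // omitted in this sketch
  ~RawStorage() { delete[] last_val; }
};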

Regarding the testing: I do have some test cases written within the MiniZinc testing framework that test the overall functionality. I don't have much experience with the Gecode test framework, and since the propagators only record information, I don't really know how to write useful tests for them. Any pointers would be appreciated.

zayenz commented 1 year ago

I just tried pulling the PR (gh pr checkout 174) and compiling on a Mac M1 Max with the following compiler:

$ clang --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

and it gives a lot of errors and warnings (the same system compiles release/6.3.0 without warnings). See relevant compilation output in this Gist: https://gist.github.com/zayenz/d0ab09c1b8fd8aa84649d618e76036f7

The sequence of commands used was:

make clean
./configure
make depend
time make -j10

Note that make depend changed the dependencies (for example, adding the files complete and lastval to parser.tab$(OBJSUFFIX)), and that change needs to be added to the PR as well.

Dekker1 commented 1 year ago

I have been building using CMake, so I forgot about Makefile.dep and didn't see the same warnings. I've amended the commit to resolve the warnings and added the changes to Makefile.dep.

zayenz commented 1 year ago

The parser.hh file did not include <array>, which led to the compilation errors (some systems seem to include <array> implicitly via <vector>, but not all). I've pushed a fix for that to the branch.
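
For illustration (not the exact patch), the portable pattern is to include the header explicitly wherever std::array is used, rather than relying on another standard header to pull it in:

#include <array>   // needed explicitly; <vector> is not guaranteed to provide it
#include <vector>

// Hypothetical usage: compiles on all conforming implementations.
std::array<int, 2> counters{0, 0};
std::vector<int> values;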

zayenz commented 1 year ago

Regarding std::vector, it seemed the most straightforward way to manage this in a way that the copy constructors and such would be defined. [...] Let me know if a different solution is preferred, I'm happy to change it.

As mentioned, I'm not really against it; it is just something that needs a bit of thinking about. Most data structures in Gecode are Space- or Region-allocated, and the remaining data structures are allocated via the heap support. One reason for this is that when GECODE_PEAKHEAP is enabled, the peak heap usage is recorded.

As a comparison, the GPI (Global Propagator Information, defined in gpi.hpp) allocates global memory using arrays of Block, which inherits from HeapAllocated.
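
For illustration, a minimal sketch of that heap-support pattern (assuming Gecode's global heap object and HeapAllocated base from gecode/support.hh; the class and member names are hypothetical):

#include <gecode/support.hh>

using namespace Gecode;

// Fixed-size storage allocated through Gecode's heap support, so that
// GECODE_PEAKHEAP accounting also covers this allocation.
class LastValStore : public HeapAllocated {
protected:
  int n;         // number of entries, never changes
  int* last_val; // payload
public:
  LastValStore(int n0) : n(n0), last_val(heap.alloc<int>(n0)) {}
  ~LastValStore(void) { heap.free<int>(last_val, n); }
};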

I realize that the enclosing FlatZincSpace class also has some std::vector members, so this is not as new in the flatzinc module as I first thought.

Regarding the testing: I do have some test cases written within the MiniZinc testing framework that test the overall functionality. I don't have much experience with the Gecode test framework, and since the propagators only record information, I don't really know how to write useful tests for them. Any pointers would be appreciated.

If you have good MiniZinc tests, then these could be added as FlatZinc tests in the test/flatzinc/ directory following the pattern from any of the files.

As for more direct testing, I'm somewhat unsure. Somewhat related testing is done in test/afc.cpp, which does a bunch of copies and clones to test the AFC infrastructure, while test/search.cpp tests some search operations. Neither is an exact match, though.

For the current changes, I would say some FlatZinc-tests are needed at least.

An additional general issue that I'm thinking about: how is parallel access to the shared handle and the information it contains handled? I have not gone through the changes in detail, so I might be missing something about how it is (intended to be) used, but it seems to me that a parallel search may write to the data in parallel.

Dekker1 commented 1 year ago

With your hints I found that there already is an ArgArrayBase<T> that has all the functionality of std::vector that I need, and it should be allocated in line with the rest of Gecode.

I will at least add the MiniZinc/FlatZinc tests that I've been able to come up with.

Regarding running in parallel: that is something we haven't really considered yet in the implementation. Thinking about it now, it seems that most of it will already work correctly: the solution values are managed by the restart-based search engine (which presumably will be thread-specific), and most other values are constants. The value of mark_complete should probably be shared between threads, since logically, when you say the search is complete in your MiniZinc model, you wouldn't expect that to be local to one thread. I don't think there is any read/write conflict, but we could use a mutex or atomics to ensure this cannot go wrong. The only values that should probably not be shared are the last_val values; I will have to look at creating separate copies for each thread. If I remember correctly, the master function should be called once per thread, right?
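
For illustration, a shared atomic flag would be one way to handle the mark_complete case safely (a sketch under the assumption that a single flag object outlives all workers; the names are hypothetical):

#include <atomic>

// Shared by all search workers; once any worker marks the search
// complete, every worker observes it without a data race.
struct RestartStatus {
  std::atomic<bool> complete{false};
  void mark_complete() { complete.store(true); }
  bool is_complete() const { return complete.load(); }
};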

Dekker1 commented 1 year ago

Looking further into the parallelisation: it seems that when using the RBS meta-engine with the DFS engine (as FlatZincSpace does in run when restarts are present), the different restarts do not run in parallel; only the DFS within RBS is parallel. I think in this case the last_val values are also correct, since they are only supposed to be clearly defined when set through propagation. If they are set through search, any combination is valid.

We could still consider using synchronisation measures to avoid race conditions, but I don't think the races would result in any invalid values.

zayenz commented 1 year ago

With your hints I found that there already is an ArgArrayBase that has all the functionality of std::vector that I need, and it should be allocated in line with the rest of Gecode.

Unfortunately, while it has a lot of useful functionality, it is not intended for storage; the argument arrays are intended for short-lived data only (e.g., for small arrays they use an in-object allocated array). Sorry for the extra work I put you through here, but I think we should revert the last change you made (I hope you have local history, as you have squashed the additional changes; otherwise I can restore from my local copy).

There is no general std::vector-like structure in Gecode; where arrays are needed, they are allocated as C-style arrays with a pointer and a size.

I think, especially given that std::vector is already used in the flatzinc module, that we can leave them there. If the handle and related propagators were to be moved into a more general module, then we can revisit that. @guidotack, does that sound good to you as well?

zayenz commented 1 year ago

For parallelization, we should ensure that it works correctly regardless of which search engine combination and portfolio is used for the flatzinc runner, as that should be upgradable without being limited by the on-restart functionality.

There are several cases to consider.

As long as everything is sequential or has no restarts, then there should be no issues.

For a portfolio-based system, I'm thinking that each asset could perhaps configure its own restart_data copy, so that it has its own last_val values. This is best left for whenever portfolio-based search is added to the flatzinc runner.

This leaves parallel standard searches such as DFS and BAB. I'm not too concerned with the safety of storing the information into the OnRestartData last_val_X arrays (as far as I know, concurrent updates of values in a std::vector that is never resized are ok, but I might be wrong here). I am, however, wondering a bit about the semantics. Consider a restart-based DFS with two threads. Assume for simplicity that the two threads are mostly synchronized, diving towards two failures that will trigger a restart. Both threads will store last_val values, and these will be mixed between two (partially or fully) unrelated paths. While this is not an error (I'm assuming no guarantees are made with respect to what the last_val values are), I also do not see whether it will mean anything at all.
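
To pin down the std::vector point (a generic C++ sketch, not the PR's code): concurrent writes to distinct elements of a vector that is never resized are well defined, but unsynchronized writes to the same element are a data race:

#include <thread>
#include <vector>

int main() {
  std::vector<int> last_val(2, 0); // never resized after this point
  // Each thread writes only its own element: well defined.
  std::thread a([&]{ last_val[0] = 1; });
  std::thread b([&]{ last_val[1] = 2; });
  a.join();
  b.join();
  // Two threads writing last_val[0] concurrently, however, would be a
  // data race unless the element were atomic or guarded by a mutex.
  return 0;
}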

zayenz commented 1 year ago

Regarding the parallel search issue, consider also the case where restarts are done after solutions. In this case, with multiple threads, it is quite likely that most of the last_val values come from workers' aborted searches and not from the actual solution found.

I'm guessing that this might be quite a large problem in getting the desired behavior of the restart annotation to work correctly. As Gecode uses a work-stealing method for parallel search, there is no clear division among the threads of which part of the search belongs to which thread. The alternative would be to use a large portfolio of individual single-threaded searches instead, although that is quite a different setting with different uses.

Perhaps it is not so important that the last_val values come from related parts of the search tree; that is something you would know better, as you have used it.

Dekker1 commented 1 year ago

I think it could give some unexpected behaviour (which I'll make sure to note in the last_val(x) documentation), but not for the purposes for which last_val was introduced.

When you create a meta-search heuristic, there is some basic state you need to keep track of, for example "what was the last neighbourhood method that I used", "how many times has this neighbourhood been (un)successful", or "which (indexed) variable am I currently optimizing". Initially the idea was to use sol(x), but we quickly found the big problem that you cannot update the state unless the solver sees something as a solution. You can of course play around with what the solver should accept as a solution, but we found the idea of tracking selected last_val values much more practical.

To make it concrete, the uses of last_val(x) have so far always been in the context of defining x. For example: x = last_val(x) + 1 to simply count the number of restarts, x = last_val(x) + (status() in {UNKNOWN, UNSAT}) to count the number of times a restart failed to find a solution, or x = (last_val(x) + 1) mod len to select a method in round-robin style.

Although I can imagine people finding other uses for last_val(x), I don't think it is unreasonable that, when parallel search is enabled, last_val(x) might have been set to a (very) different value than the sol(x) found within the same restart. It would always be possible to check whether a solution has been found and use sol(x) in that case, and last_val(x) otherwise.

Dekker1 commented 1 year ago

I've added simple FlatZinc test cases for the added functionality. I've had to make a slight change to the FlatZincTestCase code to allow me to pass in an additional command-line flag to set the restart flags correctly.

zayenz commented 1 year ago

Looks good to me.

If you (as a designer of how the annotations are supposed to work) are content with how they will work in conjunction with parallel search, then that is good enough for me. I had originally thought of last_val as mostly used for solution-based, phase-saving-style purposes, in which case the mixture could be quite unintuitive (i.e., almost all restarts failing immediately).

On the translation side, if it is possible to annotate on-restart-related constraints and variables with a marker on a conservative, best-effort basis, I think that would be useful as a solver author, considering free-search variants where a solver might want to drop such parts or change them in some way.

I will do a final pass through in a couple of days (no time right now) and then if nothing else has come up I will merge.

zayenz commented 1 year ago

I did a pass through again and tested locally, and everything looks good to me.

Nice job, and I'm looking forward to testing it out in future MiniZinc versions.