ceandrade / brkga_mp_ipr_cpp

The Multi-Parent Biased Random-Key Genetic Algorithm with Implicit Path Relinking - C++ version

Fix the shorter method call for BRKGA_MP_IPR::pathRelinking #2

Closed: afkummer closed this 2 years ago

afkummer commented 2 years ago

Hi @ceandrade, I was testing your library and I think I found a typo in the "shortcut" method for IPR. This problem only came up now due to how the C++ compiler handles template specializations.

Feel free to deny this pull request if I misunderstood the problem! Best, Alberto Kummer

ceandrade commented 2 years ago

Nice catch. I will be happy to merge it. But before doing so, I would like to understand how you caught this typo. I usually compile this library using GCC (g++ (MacPorts gcc10 10.3.0_0) 10.3.0) and Clang (clang version 12.0.1) with very thorough checking flags like:

USER_FLAGS = -ggdb3 -fexceptions -fno-omit-frame-pointer \
    -fno-optimize-sibling-calls -fno-inline
USER_FLAGS += -Wall -Wextra -Wcast-align -Wcast-qual -Wdisabled-optimization \
    -Wformat=2 -Winit-self -Wmissing-format-attribute -Wshadow \
    -Wpointer-arith -Wredundant-decls -Wstrict-aliasing=2 \
    -Wfloat-equal -Weffc++

So, I'm surprised I haven't seen it. Am I missing something?

Thanks!

Carlos

afkummer commented 2 years ago

Well, I think this issue is related to how the C++ compiler handles the instantiation of template classes. As far as I could analyze, it seems the compiler still checks for syntactic problems, but catches only some very simple kinds of semantic errors.

In summary, if you have a template class with overloaded methods, this issue may happen, and it only shows up when there is an explicit call to the problematic method. Unfortunately, I do not think compilers can detect this kind of problem, even if we set more rigorous compilation flags like the ones you mentioned.

This gist contains an MWE that illustrates the issue: https://gist.github.com/afkummer/c287e22281bc801950b7781f446f82b8 You can also try the code online here: https://godbolt.org/z/de3TnqdW3
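
In essence (a minimal sketch with hypothetical names, in the spirit of the linked MWE): a member function of a class template is only instantiated when it is actually called, and calls whose arguments depend on the template parameter are only resolved at instantiation. So a broken, never-called overload sails through compilation:

#include <iostream>

template <typename T>
struct Widget {
    // Correct overload: takes two arguments.
    void run(T a, T b) { std::cout << (a + b) << "\n"; }

    // Buggy "shortcut" overload: it forwards with the wrong arity.
    // Since the arguments depend on T, overload resolution is deferred
    // to instantiation, and this member is only instantiated on use.
    void run(T a) { run(a, a, a); }
};

int main() {
    Widget<int> w;
    w.run(1, 2);   // compiles and runs fine, even with -Wall -Wextra
    // w.run(1);   // uncommenting this line exposes the hidden error
    return 0;
}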

Nice weekend! Alberto

ceandrade commented 2 years ago

So, I would like to commit only the changes you made to the pathRelinking() call signature. Could you revert the changes to the rand01() calls, so that I can merge the signature fix and make you an official contributor?

And I guess it is better that we make a separate pull request for each change, so we can track the changes.

I'm working on an updated version, so I want to make sure we merge the changes safely.

ceandrade commented 2 years ago

About the PR call, this is the reason we always have to write unit tests. Note that for both the Julia and Python versions, since I wrote them from scratch, I included extensive unit tests to make sure such things don't happen. The C++ version is much older and came from the original BRKGA implementation, which had no unit tests. I have been flirting with the idea of writing unit tests, but I have no time for that right now. Maybe in the future.

afkummer commented 2 years ago

Sorry, Carlos. I thought that pull requests applied to individual commits, so I inadvertently made a few other changes before opening this pull request! I did some cherry-picking in the repository, and now this PR only includes the changes I mentioned in my first message.

A small tease about those other changes: I noticed that the meta-heuristic calls std::shuffle in each mating operation. It turns out this call has linear complexity in the chromosome size, which can be quite expensive over the entire execution of the algorithm. So I experimented with OpenMP and, after some changes, managed to run the mating in parallel. So far, my tests with the included TSP example show the following results (Intel i7-7700HQ, Linux):

Sequential processing

* Command line: `./main_complete -c config.conf -s 1 -r G -a 9999 -t 60 -i ../instances/rd400.dat -n 1`
* Generations: 250
* Cost: 18943

Parallel processing (parallel decoding only)

* Command line: `./main_complete -c config.conf -s 1 -r G -a 9999 -t 60 -i ../instances/rd400.dat -n 8`
* Generations: 311
* Cost: 18878

Parallel processing (parallel decoding & mating)

* Command line: `./main_complete -c config.conf -s 1 -r G -a 9999 -t 60 -i ../instances/rd400.dat -n 8`
* Generations: 1226
* Cost: **18805**

It seems that the last approach (parallel decoding and mating) significantly increases the number of generations the algorithm evolves within a span of 60 seconds. Would you mind letting me know if you think it is worth including these changes in your repository? I also have some spare time to run more thorough tests of this modified code. We can open another pull request or maybe an issue report to discuss a few results.

Best, Alberto

ceandrade commented 2 years ago

That's cool, @afkummer. I think we may have issues with reproducibility using parallel mating. Note that if we pull data from the RNG in parallel, we may lose the sequence of states and, therefore, may not be able to reproduce the results even using the same seed, since we don't know how the threads are scheduled in a given run. I think you can open an issue and we can discuss your solution.
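
For instance, this toy code (not from the library) shows the hazard of a single generator shared across threads:

#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(1234);         // one generator shared by all threads
    std::vector<unsigned> draws(8);

    // Which position of the random stream each iteration consumes depends
    // on thread scheduling (and the concurrent rng() calls are a data
    // race), so two runs with the same seed can print different numbers.
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i)
        draws[i] = static_cast<unsigned>(rng() % 100);

    for (int i = 0; i < 8; ++i)
        std::printf("iteration %d drew %u\n", i, draws[i]);
    return 0;
}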

Thanks,

Carlos

ceandrade commented 2 years ago

@afkummer I have looked into your parallel mating code, and it looks interesting. I have a few suggestions to improve the performance even further, using iota() on the per-thread shuffled_individuals (lines 2121 and 2122 of brkga_mp_ipr_par.hpp). I have a few experiments showing the performance gain.
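
To illustrate the idea (a simplified sketch with hypothetical names, not the actual code from the file): each thread keeps a preallocated index buffer, refills it cheaply with iota(), and shuffles only that buffer with its local generator:

#include <algorithm>   // std::shuffle
#include <numeric>     // std::iota
#include <random>
#include <vector>

// Hypothetical helper: reuse a per-thread buffer instead of building
// a fresh index vector for every mating.
void prepare_parents(std::vector<unsigned>& shuffled_individuals,
                     std::mt19937& local_rng, unsigned population_size) {
    shuffled_individuals.resize(population_size);
    std::iota(shuffled_individuals.begin(), shuffled_individuals.end(), 0u);
    std::shuffle(shuffled_individuals.begin(), shuffled_individuals.end(),
                 local_rng);
    // ... pick the parents from the front of shuffled_individuals ...
}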

I also performed some tests with OpenMP to guarantee we can reproduce all experiments if needed. This is extremely important in both the scientific and corporate worlds (we have several examples of why we don't use quicksort in production, for instance).

Anyway, there are a few things to discuss, and I think we can do it item by item. I'm looking forward to you opening an issue to discuss all these things. I would like you to implement these changes (instead of me) so that we can merge them into the main repo and have you as a collaborator.

afkummer commented 2 years ago

Sure, we can discuss these problems in more detail, @ceandrade! And sorry for the delay in my answer: I was preparing some data for the appointment I had today with my thesis advisor. I am just finishing my Ph.D., and we are wrapping up the review of a manuscript we submitted earlier this year.

Looking forward to collaborating with you!

ceandrade commented 2 years ago

Don't worry. I had some free time last week, so I went on this streak. We can talk any time that works for you. Good luck with your paper.

ceandrade commented 2 years ago

@afkummer, I know you are busy, but do you mind committing your brkga_mp_ipr/brkga_mp_ipr_par.hpp and sending me a pull request?

I have reviewed that code and, indeed, polished it and incorporated it into the main brkga_mp_ipr.hpp. However, before I commit the full changes, I want to have your pull request, so that people can track your contributions.

I think these changes should bump the current version to v1.2.0. I also intend to create a changelog (long overdue) and add you as an official collaborator. Please let me know of any papers and citations you would like included, too.

Sorry to press, but I have BRKGA-MP-IPR 2.0 almost ready to release to the public. And your changes are most welcome, because they really impact the overall performance.

afkummer commented 2 years ago

Hi @ceandrade!

No worries. I have been developing a cheap mechanism to ensure the reproducibility of the runs with the parallel mating algorithm.

A few days ago, I noticed that this could be achieved by ensuring that each mating uses the same sequence of random numbers. Then I came up with this solution: I generate and store one seed for each "mating operation" and then create a local std::mt19937 generator from each of these seeds. To generate the seeds, I use the main generator BRKGA_MP_IPR::rng.

Changes here: https://github.com/afkummer/brkga_mp_ipr_cpp/blob/4680b1ddcec8b211ced29e07fc0aa284f20b297d/brkga_mp_ipr/brkga_mp_ipr_par.hpp#L2091
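
In essence (a simplified sketch with hypothetical names, not the exact code from my branch): the seeds are drawn sequentially from the main generator, so its state evolves deterministically, and each mating then draws only from its own local generator:

#include <random>
#include <vector>

// Hypothetical sketch of one generation with reproducible parallel mating.
void evolve_generation(std::mt19937& main_rng, int num_matings) {
    // Sequential draws keep the main RNG state reproducible.
    std::vector<std::mt19937::result_type> seeds(num_matings);
    for (auto& seed : seeds)
        seed = main_rng();

    // Each mating uses its own deterministic local generator.
    #pragma omp parallel for
    for (int m = 0; m < num_matings; ++m) {
        std::mt19937 local_rng(seeds[m]);
        // ... perform mating m, drawing only from local_rng ...
    }
}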

I just pushed a cleanup commit to my fork. Until now, I have kept my WIP in a separate file to ease the tracking of changes. Anyway, I have one more question before opening the pull request: my changes live in the branch WIP/parallel_mating. Should I open a PR from this branch? Or is it OK if I merge it into master and then open the PR?

I have also made some benchmarks, but I think you are already convinced of the benefits of the parallel mating approach. :-)

ceandrade commented 2 years ago

Hi @afkummer, very nice. So, I liked your previous idea. I have incorporated parts of your code into the branch experimental/parallel_mating. I made some enhancements to increase the performance by preallocating some data structures per thread. Please check it.

I did some tests, and it looks like it preserves reproducibility (and performance-wise, it is amazing for fast decoders). Indeed, it requires that we use the proper OpenMP scheduling clause. The following code can help us see that:

#include <chrono>
#include <iostream>
#include <omp.h>
#include <random>
#include <thread>
#include <vector>

using namespace std;
const unsigned NUM_THREADS = 6;

int main() {
    std::mt19937 rng(270000001);
    std::uniform_int_distribution<int> dist(100, 500);
    rng.discard(1000);

    unsigned num_chunks;
    unsigned num_iterations;
    cin >> num_chunks >> num_iterations;

    cout
    << "\nNum. chunks: " << num_chunks
    << "\nNum. iterations: " << num_iterations
    << endl;

    // One list per chunk, recording which thread handled it on each iteration.
    vector<vector<int>> threads_per_chunk(num_chunks);
    for(auto& v : threads_per_chunk)
        v.reserve(num_iterations);

    for(unsigned i = 0; i < num_iterations; ++i) {
        #ifdef _OPENMP
            #pragma omp parallel for num_threads(NUM_THREADS) schedule(static, 1)
        #endif
        for(unsigned j = 0; j < num_chunks; ++j) {
            threads_per_chunk[j].push_back(omp_get_thread_num());
            std::this_thread::sleep_for(std::chrono::milliseconds((dist(rng))));
        }
    }

    for(unsigned j = 0; j < num_chunks; ++j) {
        cout << "\nChunk " << j << ": ";
        for(unsigned i = 0; i < num_iterations; ++i) {
            cout << threads_per_chunk[j][i] << " ";
        }
    }

    return 0;
}

Compiling this code using aggressive optimization and warning flags:

$ g++ -std=c++17 -O3 -pthread -fopenmp -fomit-frame-pointer -funroll-loops -ftracer -fpeel-loops \
    -fprefetch-loop-arrays -Wall -Wextra -Wcast-align -Wcast-qual -Wdisabled-optimization -Wformat=2 \
    -Winit-self -Wmissing-format-attribute -Wshadow -Wpointer-arith -Wredundant-decls \
    -Wstrict-aliasing=2 -Wfloat-equal -Weffc++ -Wunsafe-loop-optimizations openmp_test.cpp -o test

and testing, we have that:

$ ./test
10
5

Num. chunks: 10
Num. iterations: 5

Chunk 0: 0 0 0 0 0
Chunk 1: 1 1 1 1 1
Chunk 2: 2 2 2 2 2
Chunk 3: 3 3 3 3 3
Chunk 4: 4 4 4 4 4
Chunk 5: 5 5 5 5 5
Chunk 6: 0 0 0 0 0
Chunk 7: 1 1 1 1 1
Chunk 8: 2 2 2 2 2
Chunk 9: 3 3 3 3 3

In this code, each chunk is handled by one thread per iteration, which simply appends its thread ID to the chunk's list. To maintain reproducibility, we must guarantee that, on each iteration, each chunk is handled by the same thread as in the previous iterations.

We can achieve that using schedule(static, 1) in the OpenMP pragma. This clause instructs OpenMP to bind each thread to a fixed portion/chunk of the loop. In this case, we set the chunk size to 1, so the threads take the inner loop iterations in a fixed sequence. Note that we could use schedule(static, num_chunks / NUM_THREADS) to bind larger loop portions to the threads. However, we don't need to do that, since our mating loop is complex but stable enough to absorb the cost of the thread context switching. So, my tests show that such a strategy (your original strategy) appears to maintain reproducibility. I have verified that in several test cases, including on a heavily loaded machine (which implies many thread context switches).

Now, your second version looks safer. However, it may have a performance issue: you need to create and warm up an RNG on each iteration. Mersenne Twister PRNGs are notorious for their memory demand and slow generation compared to other PRNGs, so we may incur a lot of overhead.
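
A quick way to gauge that overhead (a toy benchmark, not from the library): compare constructing a fresh std::mt19937 per "mating" against reusing a single warm engine:

#include <chrono>
#include <cstdio>
#include <random>

int main() {
    const int n = 100000;
    unsigned sink = 0;

    // Construct and seed a ~2.5 KB std::mt19937 for every "mating".
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        std::mt19937 fresh(static_cast<unsigned>(i));
        sink ^= fresh();
    }
    auto t1 = std::chrono::steady_clock::now();

    // Reuse one warm engine instead.
    std::mt19937 reused(0);
    for (int i = 0; i < n; ++i)
        sink ^= reused();
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("fresh: %.2f ms, reused: %.2f ms (sink=%u)\n",
                ms(t1 - t0).count(), ms(t2 - t1).count(), sink);
    return 0;
}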

I think we should test both options further, to 1) guarantee reproducibility and 2) see which one is faster, considering both fast and slow decoders.

So, please make a pull request with your suggestions/changes against experimental/parallel_mating. If you need help to do so, take a look here. Let's continue this discussion there.