CMA-ES / libcmaes

libcmaes is a multithreaded C++11 library with Python bindings for high-performance blackbox stochastic optimization using CMA-ES (Covariance Matrix Adaptation Evolution Strategy).

Function visualization generates unreadable output file #115

Closed fergu closed 9 years ago

fergu commented 9 years ago

Hello,

I am using libcmaes in parallel on a cluster at my university which is running CentOS (6, I believe). I am running on a total of 6 cores on 1 node. The job is scheduled and run through PBS.

I have enabled libcmaes to write the results of the optimization to an output file as described on the documentation page "Visualizing Optimization Results and Convergence". The file is generated and populated.

The problem comes when using cma_multiplt.py to visualize the contents of this file. I get errors such as:

Traceback (most recent call last):
  File "cma_multiplt.py", line 31, in <module>
    dat = loadtxt(sys.argv[1],dtype=float)
  File "/usr/lib/pymodules/python2.7/numpy/lib/npyio.py", line 827, in loadtxt
    items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: invalid literal for float(): 0.01771583.21

If one looks in the file, these oddly formatted numbers show up in a number of places and are not easy to decipher. This makes visualizing the convergence data impossible.

I believe this is a bug that may be associated with parallel execution, as I have successfully viewed convergence data for serial runs. These numbers look almost as though two threads tried to write to the file at the same time.

I will mention, though I am not sure if it is relevant, that the job is still running at the time that I attempt to visualize the results (These jobs take several days, so viewing progress is helpful). I simply copy the file to my local machine and run the python script there - so I am not visualizing from the same file that is being actively written to.

beniz commented 9 years ago

Hello,

Can you try to customize the default plotting function with a mutex, by any chance?

If not, I will be able to do it shortly and give you a branch to pull from.

This would help determine whether the parallelization is causing the problem here.

And yes you can plot before the optimization has completed.


fergu commented 9 years ago

Hello,

I can attempt this. I will just need a little time to locate the relevant source file (or the proper way to do this, if you mean customizing my own code).


fergu commented 9 years ago

Hello again,

It looks like I might be digging in a little deeper than I thought. I would appreciate a pointer to the right file to start in; otherwise you may be better placed to make this modification correctly.

Thanks

beniz commented 9 years ago

My apologies, I thought the docs described how to modify the default plotting function from user space, but it is not there, so I'll take care of this and make it available to you shortly.


beniz commented 9 years ago

The easiest way I see to test the first hypothesis is to modify the default plotting function. You can find an example of this in examples/sample-code-pffunc.cc. The main lines are:

PlotFunc<CMAParameters<>,CMASolutions> plotf = [](const CMAParameters<> &cmaparams, const CMASolutions &cmasols, std::ofstream &fplotstream)
{
  fplotstream << "kappa=" << cmasols.max_eigenv() / cmasols.min_eigenv() << std::endl; // storing covariance matrix condition number to file.        
  return 0;
};

which defines the custom plotting function that writes data to file. It is then passed as a parameter to the main (high level) optimization function:

CMAParameters<> cmaparams(x0,sigma);
cmaparams.set_fplot("pffunc.dat"); // DON'T MISS: mandatory output file name.
CMASolutions cmasols = cmaes<>(rosenbrock,cmaparams,CMAStrategy<CovarianceUpdate>::_defaultPFunc,nullptr,cmasols,plotf);

A mutex can be added to the function as follows:

#include <mutex>
std::mutex plmtx; // file-scope mutex, visible to the lambda without a capture
PlotFunc<CMAParameters<>,CMASolutions> plotf = [](const CMAParameters<> &cmaparams, const CMASolutions &cmasols, std::ofstream &fplotstream)
{
  std::lock_guard<std::mutex> lock(plmtx); // serialize concurrent writes to the plot file
  fplotstream << "kappa=" << cmasols.max_eigenv() / cmasols.min_eigenv() << std::endl; // storing covariance matrix condition number to file.
  return 0;
};

If you would like to use the default plotting function along with the mutex above, it can be found in src/cmastrategy.cc; look for:

 eostrat<TGenoPheno>::_pffunc = [](const CMAParameters<TGenoPheno> &cmaparams, const CMASolutions &cmasols, std::ofstream &fplotstream)
...

Bringing this scheme into the lib internals is a bit more complicated, so I thought you would be back on track faster if the trick above fixes the issue. Also, given that I might not be able to exactly replicate the behavior of the cluster you are using, it would be better if you could try the above and report back. Hope it is not too painful :/

fergu commented 9 years ago

I am going to attempt overriding the default plotting function as described in your first two examples. It seems like an easier approach to at least answer the question at hand without potentially introducing unexpected bugs by changing the lib itself.

The run will probably need ~12 hours to generate enough iterations to confidently say that the bug is fixed or not. I will report back then!

beniz commented 9 years ago

without potentially introducing unexpected bugs by changing the lib itself.

It is unclear how your cluster might split the job so that parallel writing occurs; if you can share a description of the tools, that could be helpful.

The run will probably need ~12 hours to generate enough iterations to confidently say that the bug is fixed or not.

OK, we may want to find a simpler testbed. I'll get back to you if I think of one, or even better, if I can reproduce the issue.

fergu commented 9 years ago

As far as I can tell (according to the info given by PBS), the code was run on 6 cores on a single node. The job was then called with mpiexec (OpenMPI). Beyond that I'm not sure I can say much more. If you have something specific you are wondering about, I can ask the administrator and see what he knows.

As far as a simpler testbed goes, I may try running the simple 'sphere' example given on the wiki page in parallel and see if the same result occurs.
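
For reference, a minimal sketch of that sphere testbed (adapted from the project's README/wiki sample; the plot file name and dimensions are my own choices and should be checked against the installed version):

#include "cmaes.h"
#include <vector>
using namespace libcmaes;

// The 'sphere' objective from the wiki: sum of squares, minimum at the origin.
FitFunc fsphere = [](const double *x, const int N)
{
  double val = 0.0;
  for (int i = 0; i < N; i++)
    val += x[i] * x[i];
  return val;
};

int main()
{
  const int dim = 10;
  std::vector<double> x0(dim, 1.0); // starting point
  const double sigma = 0.1;         // initial step size

  CMAParameters<> cmaparams(x0, sigma);
  cmaparams.set_fplot("sphere.dat"); // convergence data for cma_multiplt.py
  CMASolutions cmasols = cmaes<>(fsphere, cmaparams);
  return cmasols.run_status();
}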

fergu commented 9 years ago

I had a thought on this that perhaps you can comment on.

My code that uses libcmaes is called like so

mpiexec -n 6 ./mycode

To my understanding, what this does is execute mycode with 6 available cores (which mycode then occupies). Is there any chance that the routine used to run this in parallel tries to run 6 instances of mycode instead?

I ask because I noticed that the output file from this last run has an fval that looks like two different fvals written on top of each other.

E.g., if I expect an answer of ~123.45, I have an fval that looks like 123.23.45.

(Edit: in reality the two interleaved numbers are not identical, i.e. something like 123.4523.46.)

Which makes me wonder if for some reason the actual evaluation is being carried out more than once.

beniz commented 9 years ago

Is there any chance that the routine used to run this in parallel instead tries to run 6 instances of mycode instead?

Yes, this would be my leading hypothesis here, because otherwise I have a hard time understanding how the call to the plot function could be made in parallel. The other hypothesis would be something with the cluster's filesystem.

I will review related MPI internals tomorrow morning and let you know.

fergu commented 9 years ago

Just reporting back: your assessment that MPI is calling the process multiple times is correct.

I wrote a simple C++ program:

#include <iostream>

using namespace std;

int main(int argc,char *argv[])
{
    cout << "THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG\n";
    return 0;
}   

and then scheduled it via PBS. The program is called via

mpiexec -n 6 ./test.out

The PBS job log shows

Started at Wed Feb 11 13:55:25 EST 2015
THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG
THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG
THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG
THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG
THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG
THIS IS A TEST. IF THIS IS CALLED MORE THAN ONCE, THIS SHOULD APPEAR MORE THAN ONCE IN THE LOG

This confirms that the code is launched 6 times, rather than being run once with 6 cores made available to it.

I need to read up a bit on parallel execution and thread-safety to see if I can remove the lock on my fitness function. I'm not referencing a variable outside of the function, so I believe it is safe - I just need to be sure.

So it sounds like this is on my end. This bug can likely be closed.
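
If I end up keeping the mpiexec launch, one option (just a sketch, not tested on the cluster) would be to guard the optimizer behind an MPI rank check so that only one of the launched copies runs libcmaes and writes the plot file; run_optimization() below is a hypothetical placeholder for my existing driver code:

#include <mpi.h>
#include <iostream>

// Hypothetical placeholder: stands in for the existing libcmaes driver code
// (set up CMAParameters<>, call cmaes<>(), write the plot file, etc.).
void run_optimization()
{
}

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0)
  {
    // Only one of the copies started by 'mpiexec -n 6' runs the optimizer
    // and writes the plot file, so no two processes write to it concurrently.
    run_optimization();
  }
  // The other ranks do nothing here; they could instead evaluate candidates
  // dispatched by rank 0 if the objective function were distributed over MPI.

  std::cout << "rank " << rank << " of " << size << " finished" << std::endl;
  MPI_Finalize();
  return 0;
}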

fergu commented 9 years ago

Just an update

I removed the mutex on my progress function (it turns out it was overkill caused by a misunderstanding on my part), and moved it to the only part that might be (sort of) not thread-safe.

I then changed the section in the PBS script to call mycode as

./mycode

and I am now seeing that my code is called just once, and has forked to (right now) 5 processes.

I have not overridden the plotting function for this. I will report back in a few hours once a few iterations have completed.

Thanks for your help!

beniz commented 9 years ago

A natural way to parallelize the optimization with CMA-ES is to parallelize the objective function calls. The lib supports this with threads; see https://github.com/beniz/libcmaes/wiki/Multithreading

However, it is possible to go beyond this simple scheme. The first thing to check before digging into this further is whether the calls to your objective function dominate the computational cost of the optimization. The dimension of your problem and the cost of a single call to your objective function should help assess this.
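
For illustration, a minimal sketch of the threaded-evaluation setup (assuming the set_mt_feval switch described on the Multithreading wiki page; the objective and problem size here are stand-ins, so check the exact accessor names against your installed version):

#include "cmaes.h"
#include <iostream>
#include <vector>
using namespace libcmaes;

// Stand-in objective: it must be thread-safe, since candidates from one
// generation are evaluated concurrently when multithreaded evaluation is on.
FitFunc costly_obj = [](const double *x, const int N)
{
  double val = 0.0;
  for (int i = 0; i < N; i++)
    val += x[i] * x[i]; // replace with the real, expensive simulation
  return val;
};

int main()
{
  const int dim = 20;
  std::vector<double> x0(dim, 1.0);
  const double sigma = 0.1;

  CMAParameters<> cmaparams(x0, sigma);
  cmaparams.set_mt_feval(true);      // evaluate candidates in parallel threads
  cmaparams.set_fplot("mt_run.dat"); // convergence data for cma_multiplt.py

  CMASolutions cmasols = cmaes<>(costly_obj, cmaparams);
  std::cout << "best f-value: " << cmasols.best_candidate().get_fvalue() << std::endl;
  return cmasols.run_status();
}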