JasperSnoek / spearmint

Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural Information Processing Systems, 2012
http://people.seas.harvard.edu/~jsnoek/software.html

Trying to Understand the Learning Progression #21

Open · Quanticles opened this issue 10 years ago

Quanticles commented 10 years ago

Hi,

I'm trying to understand what the learning process is doing - it doesn't seem to be working for me. I'm learning two parameters with GPEIOptChooser on a neural network. The parameters are a global learning slowdown factor and the number of epochs to run.

I thought this should be an easy test where spearmint would dial in the best parameters quickly, but it seems to be struggling.

Questions:

  1. I had an error which caused one result to come back as "1.0" for the error. Does this break spearmint, i.e., should I filter these out when I'm running?
  2. Why does GPEIOptChooser choose only 4 elements from the grid?
  3. The best result is for Job 20054 - why doesn't GPEIOptChooser search around there more?
  4. Is there anything else that I can do to improve the quality of these results assuming that running a job takes a very long time?

Thanks, Dave

batch_error job_id learning_slowdown_factor epochs
0.1862 0 0.966050879 200
0.1273 1 0.994260074 600
0.131 3 0.9956919593 400
0.1315 2 0.9935449037 800
0.157 20002 1 200
0.1537 20003 0.981515085 200
0.1365 20004 0.9882011634 582
0.1329 20000 1 1000
0.1284 20001 0.9931160484 1000
0.1354 20006 0.9903122127 487
0.1413 20007 1 683
0.1417 20005 1 1000
0.1263 20008 0.9968256736 1000
0.1308 20009 0.9916010919 819
0.1281 20010 0.9931160484 1000
0.1249 20012 0.9974221541 792
0.1273 20011 0.9973518724 1000
0.1253 20013 0.9949636424 1000
0.1414 20014 1 464
0.13 20015 0.9965389618 619
0.1261 20016 0.9962885154 828
1 20019 0.9984114634 1000
0.1506 20020 0.9878039695 200
0.1296 20017 0.9985251447 892
0.1283 20018 0.9957674257 1000
0.1567 20022 1 200
0.1348 20021 0.9894007513 527
0.1823 20023 0.966050879 200
0.169 20025 1 200
0.1299 20026 0.9931053903 568
0.1362 20027 0.9893965015 648
0.1451 20028 0.9954126479 200
0.1821 20029 0.966050879 200
0.1296 20024 0.9931160484 1000
0.1339 20030 0.9907583517 744
0.1312 20032 0.9919301873 499
0.1322 20031 0.9899663619 685
0.166 20034 0.9765412915 200
0.1508 20036 0.9939019601 200
0.147 20037 0.9905185445 200
0.1327 20033 0.9931160484 1000
0.1288 20035 0.9937639185 1000
0.1636 20040 0.9780630581 200
0.1762 20041 0.9705297816 200
0.1264 20038 0.9989655154 1000
0.1503 20043 0.9979644837 200
0.1388 20039 0.9995222416 1000
0.179 20045 0.9682962955 200
0.1278 20042 0.9934396689 1000
0.1471 20046 0.9859738178 200
0.1707 20047 0.9735643934 200
0.1588 20049 0.9839755669 200
0.1297 20044 0.9941364551 1000
0.1443 20051 0.9915891514 200
0.1301 20048 0.9976431404 1000
0.1246 20050 0.9963487573 1000
0.1292 20052 0.9983211645 712
0.1268 20053 0.9945645556 1000
0.1222 20054 0.9954587131 1000
0.1425 20057 0.996722093 200
0.1707 20058 0.9751610399 200
0.1377 20055 0.9997676992 1000
0.1316 20056 0.9992591138 1000
0.1255 20059 0.9960245532 1000
0.1308 20060 0.9987979256 1000
0.1265 20061 0.9971301167 1000
0.1509 20062 0.985206682 450
0.1304 20065 0.9969074907 439
0.1286 20063 0.9952750527 548
0.1257 20064 0.9952124967 1000
0.131 20067 0.9910679167 680
0.1256 20066 0.9943426942 1000
0.176 20069 0.9718216344 200
0.1316 20068 0.9935220176 681
0.1611 20072 0.9799846017 200
0.1312 20070 0.9959859779 481
0.1393 20074 0.9995349033 423
0.1328 20073 0.9908170037 694
0.138 20075 0.994021321 300
0.1249 20071 0.9965804829 1000
0.1542 20076 0.9827179706 200
0.1271 20077 0.9943019096 645
0.1321 20078 0.9990973507 602
0.1354 20079 0.9983849074 789
0.1347 20081 0.9944653156 525
0.1285 20080 0.9972013322 821
0.1326 20082 0.9908891399 638
0.133 20083 0.9986596423 640
0.1309 20084 0.9927719836 682
0.1317 20085 0.9932655745 1000
0.1286 20086 0.9976215116 916
0.1285 20087 0.9947586907 1000
0.1796 20090 0.9693746361 200
0.1268 20088 0.9952392072 684
0.1535 20092 0.9849817771 200
0.1507 20093 0.9869039046 200
0.1258 20091 0.9924512388 559
0.1262 20089 0.9969174738 801
0.1251 20094 0.9921062573 683
0.1313 20095 0.9986064723 797
0.1249 20096 0.9962938728 741
0.1463 20098 0.9927039257 200
0.1292 20097 0.9941416492 792
0.124 20100 0.9939535187 1000
0.1262 20099 0.9961868275 1000
0.1283 20101 0.9969304773 674
0.1263 20102 0.9949710881 814
0.1359 20103 0.9994790378 672
0.7486 20104 0.995609112 1000
0.7557 20107 0.9931749066 813
0.1346 20106 0.9989398132 650
0.1332 20108 0.9932080704 318
0.1261 20105 0.9935968037 1000
0.1406 20110 1 498
0.1471 20109 1 522
0.1467 20111 1 483
0.1415 20112 1 522
0.1532 20114 1 469
0.1463 20113 1 522
JasperSnoek commented 10 years ago

Hi Dave, this seemingly really simple example actually exhibits some properties that are difficult to model using standard Gaussian processes and regression in general. In particular, you have discontinuities (the 1's), non-stationarity and heteroscedasticity. I'll answer the discontinuity part below.

Non-stationarity is when the rate of change of the function changes with the inputs. That is, small changes early in epochs will cause really major changes in the objective whereas the same sized changes in epochs later in learning will cause almost no change in objective at all. A simple way to fix this is to optimize in e.g. 'log-space'. So you project your inputs to the log-domain before passing them to the optimizer. For epochs, this will effectively stretch out the low end of the input space and compress the high end so that the same magnitude change in either end will result in the same relative change in the objective. We have developed a way to automatically learn what the transformation should be for each input dimension (http://people.seas.harvard.edu/~jsnoek/bayesopt-warping.pdf) and will be incorporating this code into spearmint very soon. In my experience it has made a really tremendous difference.
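As a concrete illustration, here is a minimal sketch of the manual log-space trick applied to the epochs parameter. It assumes the main(job_id, params) wrapper convention used in the spearmint examples; the variable name 'log_epochs' and the train_and_eval helper are made up for this example.

    import numpy as np

    def main(job_id, params):
        # Spearmint searches log_epochs on [log(200), log(1000)], which
        # stretches out the low end of the epoch range relative to the high end.
        epochs = int(round(np.exp(params['log_epochs'][0])))
        slowdown = params['learning_slowdown_factor'][0]
        # train_and_eval is a stand-in for the actual training run; it
        # returns the batch error that spearmint minimizes.
        return train_and_eval(slowdown, epochs)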

Heteroscedasticity is when the amount of noise actually changes with respect to the inputs. When training a machine learning model, the amount of noise is generally far higher early in training and much lower later on. The standard Gaussian process regression model assumes a single noise term. Thus, when it sees noise early in training (resulting, e.g., from the random initialization of parameters), it will (must) treat that as the noise level throughout. So if the noise due to random initialization is +/- 20%, you can imagine that the model sees no statistically significant difference in the relatively small improvements at the end of learning (e.g. improvements of 1-5%). This can definitely throw a wrench in the gears. A simple solution that might work is also to optimize the log of the objective.
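A minimal sketch of the log-of-the-objective variant, under the same assumptions (train_and_eval is again a made-up stand-in for the training job):

    import numpy as np

    def main(job_id, params):
        err = train_and_eval(params)   # batch error in (0, 1]
        # Returning log(err) makes roughly multiplicative noise look more
        # homoscedastic to the GP; exponentiate any reported value to
        # recover the original error.
        return float(np.log(err))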

The rest of your questions are answered inline.

On Mon, Jan 27, 2014 at 10:57 AM, Quanticles notifications@github.com wrote:


1. I had an error which caused one result to come back as "1.0" for the error. Does this break spearmint, i.e., should I filter these out when I'm running?

The Gaussian process assumes that the function you are modeling is smooth (i.e. continuous). When the training diverges and you get NaNs or arbitrarily bad results, discontinuities are introduced in the function. The regression will try to interpolate from the smooth function results to a sudden 'bad' result. You can imagine that this messes with the model - it either sets the noise really high and assumes this is a noisy result or it becomes extremely wavy to be able to account for the major change in the function. We have much more principled ways to deal with this now. Essentially we treat those bad jobs as 'constraint violations' and incorporate an extra model to model the constraint space. An early version of this is implemented in the GPConstrainedEIChooser. It basically treats any value that you return as a NaN or Inf as a constraint violation. Newer, more sophisticated versions are on the way.
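For example, a wrapper might flag diverged runs like this (a sketch only; the divergence check and the train_and_eval helper are illustrative, not part of spearmint):

    import numpy as np

    def main(job_id, params):
        err = train_and_eval(params)
        # Report blow-ups (NaN/Inf, or the degenerate 1.0 results above) as
        # NaN so GPConstrainedEIChooser treats them as constraint violations
        # rather than as ordinary noisy observations.
        if not np.isfinite(err) or err >= 1.0:
            return float('nan')
        return err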

2. Why does GPEIOptChooser choose only 4 elements from the grid?

This is actually because it starts to optimize points outside of the grid. I.e. it optimizes the grid points and moves them around to get better results. So this is the expected behavior.

3. The best result is for Job 20054 - why doesn't GPEIOptChooser search around there more?

This is a good question. My guess is probably the heteroscedasticity problem. Take a look at the output that spearmint produces and take note of what it reports as the 'noise'. This should be informative about what's going on.

4. Is there anything else that I can do to improve the quality of these results assuming that running a job takes a very long time?

Yes. Use log space in the meantime :-) Maybe also require a minimum number of epochs so that the noise is reasonable. Use the constrained chooser to deal with diverging results. Much better code is on the way. Hope that helps. Best,

Jasper


Quanticles commented 10 years ago

Thanks for the highly detailed response, this was really helpful. I changed to using the log of the output and preventing the discontinuities and it's making much more sense now.

mechaman commented 9 years ago

Just to clarify... When you say "... project your inputs to the log-domain before passing them to the optimizer.", you mean project the output from our cost function or the inputs we provide in the config file to the "wrapper" (cost function)?

JasperSnoek commented 9 years ago

Change the input bounds of your problem to be the log of the original bounds. Then when spearmint returns a value, use the exp() of that. Hope that helps!

Jasper


Quanticles commented 9 years ago

For example, if you want to initialize weights between 0.01 and 1, but want that space to be sampled in the log domain, you can do it like this

In config.pb:

variable { name: "weights_init_pow10" type: FLOAT size: 1 min: -2.0 max: 0.0 }

Inside your Python (or whatever) function:

    import numpy as np

    for x in np.nditer(params['weights_init_pow10'], op_flags=['readwrite']):
        x[...] = 10 ** x

So spearmint is working with a variable on the range [-2.0,0.0], but your own python function converts it to [0.01,1] before passing it to the algorithm that you're optimizing.


mechaman commented 9 years ago

This is great. Thank you JasperSnoek and Quanticles for the prompt and concise reply :)

mechaman commented 9 years ago

In your paper "Input Warping for Bayesian Optimization of Non-stationary Functions" you mention warping the number of hidden units. How did you go about projecting a sequence of INTEGERS say from 0-9 units to log space? Hope I am not making this thread too long...

JasperSnoek commented 9 years ago

Hey Julien, no problem. In that paper we treated integers as continuous numbers within the Bayesian optimization and then rounded them off when they were returned to the user.

Jasper
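A minimal sketch of that round-off approach (the variable name 'n_hidden_units' and the train_and_eval helper are illustrative; spearmint itself just sees a FLOAT variable):

    def main(job_id, params):
        # The optimizer treats n_hidden_units as continuous; round it off
        # before building the model. For log-spacing, declare the bounds as
        # logs and use int(round(np.exp(...))) instead.
        n_hidden = int(round(params['n_hidden_units'][0]))
        return train_and_eval(n_hidden)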
