Trying to Understand the Learning Progression

Quanticles commented 10 years ago

Hi,

I'm trying to understand what the learning process is doing - it doesn't seem to be working for me. I'm learning two parameters with GPEIOptChooser on a neural network. The parameters are a global learning slowdown factor and the number of epochs to run.

I thought this should be an easy test where spearmint would dial in the best parameters quickly, but it seems to be struggling.

Questions:

1. I had an error which caused one result to come back as "1.0" for the error. Does this break spearmint, i.e., should I filter these out when I'm running?
2. Why does GPEIOptChooser choose only 4 elements from the grid?
3. The best result is for Job 20054 - why doesn't GPEIOptChooser search around there more?
4. Is there anything else that I can do to improve the quality of these results assuming that running a job takes a very long time?

Thanks, Dave

batch_error	job_id	learning_slowdown_factor	epochs
0.1862	0	0.966050879	200
0.1273	1	0.994260074	600
0.131	3	0.9956919593	400
0.1315	2	0.9935449037	800
0.157	20002	1	200
0.1537	20003	0.981515085	200
0.1365	20004	0.9882011634	582
0.1329	20000	1	1000
0.1284	20001	0.9931160484	1000
0.1354	20006	0.9903122127	487
0.1413	20007	1	683
0.1417	20005	1	1000
0.1263	20008	0.9968256736	1000
0.1308	20009	0.9916010919	819
0.1281	20010	0.9931160484	1000
0.1249	20012	0.9974221541	792
0.1273	20011	0.9973518724	1000
0.1253	20013	0.9949636424	1000
0.1414	20014	1	464
0.13	20015	0.9965389618	619
0.1261	20016	0.9962885154	828
1	20019	0.9984114634	1000
0.1506	20020	0.9878039695	200
0.1296	20017	0.9985251447	892
0.1283	20018	0.9957674257	1000
0.1567	20022	1	200
0.1348	20021	0.9894007513	527
0.1823	20023	0.966050879	200
0.169	20025	1	200
0.1299	20026	0.9931053903	568
0.1362	20027	0.9893965015	648
0.1451	20028	0.9954126479	200
0.1821	20029	0.966050879	200
0.1296	20024	0.9931160484	1000
0.1339	20030	0.9907583517	744
0.1312	20032	0.9919301873	499
0.1322	20031	0.9899663619	685
0.166	20034	0.9765412915	200
0.1508	20036	0.9939019601	200
0.147	20037	0.9905185445	200
0.1327	20033	0.9931160484	1000
0.1288	20035	0.9937639185	1000
0.1636	20040	0.9780630581	200
0.1762	20041	0.9705297816	200
0.1264	20038	0.9989655154	1000
0.1503	20043	0.9979644837	200
0.1388	20039	0.9995222416	1000
0.179	20045	0.9682962955	200
0.1278	20042	0.9934396689	1000
0.1471	20046	0.9859738178	200
0.1707	20047	0.9735643934	200
0.1588	20049	0.9839755669	200
0.1297	20044	0.9941364551	1000
0.1443	20051	0.9915891514	200
0.1301	20048	0.9976431404	1000
0.1246	20050	0.9963487573	1000
0.1292	20052	0.9983211645	712
0.1268	20053	0.9945645556	1000
0.1222	20054	0.9954587131	1000
0.1425	20057	0.996722093	200
0.1707	20058	0.9751610399	200
0.1377	20055	0.9997676992	1000
0.1316	20056	0.9992591138	1000
0.1255	20059	0.9960245532	1000
0.1308	20060	0.9987979256	1000
0.1265	20061	0.9971301167	1000
0.1509	20062	0.985206682	450
0.1304	20065	0.9969074907	439
0.1286	20063	0.9952750527	548
0.1257	20064	0.9952124967	1000
0.131	20067	0.9910679167	680
0.1256	20066	0.9943426942	1000
0.176	20069	0.9718216344	200
0.1316	20068	0.9935220176	681
0.1611	20072	0.9799846017	200
0.1312	20070	0.9959859779	481
0.1393	20074	0.9995349033	423
0.1328	20073	0.9908170037	694
0.138	20075	0.994021321	300
0.1249	20071	0.9965804829	1000
0.1542	20076	0.9827179706	200
0.1271	20077	0.9943019096	645
0.1321	20078	0.9990973507	602
0.1354	20079	0.9983849074	789
0.1347	20081	0.9944653156	525
0.1285	20080	0.9972013322	821
0.1326	20082	0.9908891399	638
0.133	20083	0.9986596423	640
0.1309	20084	0.9927719836	682
0.1317	20085	0.9932655745	1000
0.1286	20086	0.9976215116	916
0.1285	20087	0.9947586907	1000
0.1796	20090	0.9693746361	200
0.1268	20088	0.9952392072	684
0.1535	20092	0.9849817771	200
0.1507	20093	0.9869039046	200
0.1258	20091	0.9924512388	559
0.1262	20089	0.9969174738	801
0.1251	20094	0.9921062573	683
0.1313	20095	0.9986064723	797
0.1249	20096	0.9962938728	741
0.1463	20098	0.9927039257	200
0.1292	20097	0.9941416492	792
0.124	20100	0.9939535187	1000
0.1262	20099	0.9961868275	1000
0.1283	20101	0.9969304773	674
0.1263	20102	0.9949710881	814
0.1359	20103	0.9994790378	672
0.7486	20104	0.995609112	1000
0.7557	20107	0.9931749066	813
0.1346	20106	0.9989398132	650
0.1332	20108	0.9932080704	318
0.1261	20105	0.9935968037	1000
0.1406	20110	1	498
0.1471	20109	1	522
0.1467	20111	1	483
0.1415	20112	1	522
0.1532	20114	1	469
0.1463	20113	1	522

JasperSnoek commented 10 years ago

Hi Dave, this seemingly really simple example actually exhibits some properties that are difficult to model using standard Gaussian processes and regression in general. In particular, you have discontinuities (the 1's), non-stationarity and heteroscedasticity. I'll answer the discontinuity part below.

Non-stationarity is when the rate of change of the function changes with the inputs. That is, small changes early in epochs will cause really major changes in the objective whereas the same sized changes in epochs later in learning will cause almost no change in objective at all. A simple way to fix this is to optimize in e.g. 'log-space'. So you project your inputs to the log-domain before passing them to the optimizer. For epochs, this will effectively stretch out the low end of the input space and compress the high end so that the same magnitude change in either end will result in the same relative change in the objective. We have developed a way to automatically learn what the transformation should be for each input dimension (http://people.seas.harvard.edu/~jsnoek/bayesopt-warping.pdf) and will be incorporating this code into spearmint very soon. In my experience it has made a really tremendous difference.
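As a rough illustration of the log-space idea for the epochs parameter, here is a minimal sketch. It assumes the 200 to 1000 epoch range seen in the table in the first post; the names `LOG_EPOCH_MIN`, `LOG_EPOCH_MAX` and `epochs_from_log` are made up for this example and are not part of spearmint.

```python
import math

# Assumed search range for epochs, taken from the table in the first post.
EPOCH_MIN, EPOCH_MAX = 200, 1000

# Bounds to hand to the optimizer instead of the raw epoch counts.
LOG_EPOCH_MIN = math.log(EPOCH_MIN)
LOG_EPOCH_MAX = math.log(EPOCH_MAX)


def epochs_from_log(log_epochs):
    """Map the optimizer's log-space value back to an integer epoch count."""
    return int(round(math.exp(log_epochs)))


# Five evenly spaced points in log space. Equal steps in log space are
# equal ratios in epochs, so the low end is stretched and the high end
# is compressed.
step = (LOG_EPOCH_MAX - LOG_EPOCH_MIN) / 4
grid = [epochs_from_log(LOG_EPOCH_MIN + i * step) for i in range(5)]
print(grid)
```

The optimizer would only ever see values between `LOG_EPOCH_MIN` and `LOG_EPOCH_MAX`; the conversion back to epochs happens inside your own function.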

Heteroscedasticity is when the amount of noise actually changes with respect to the inputs. Certainly with training a machine learning model, generally the amount of noise is far higher early in the training and much lower later. The standard Gaussian process regression model assumes a single noise term. Thus when it sees noise early in training (resulting e.g. from random initialization of parameters) it will (must) treat that as the noise throughout. So if the noise due to random initialization is +- 20%, you can imagine that the model sees no statistically significant difference in the relatively small improvements at the end of learning (e.g. improvements of 1-5%). This can definitely throw a wrench in the gears. A simple solution that might work is also to optimize the log of the objective.
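Optimizing the log of the objective could look something like the sketch below. The function name `log_objective` is hypothetical; the idea is only that your own wrapper returns the log of the error rather than the error itself. Since the log is monotonic, the best parameters stay the same.

```python
import math


def log_objective(batch_error):
    """Return the log of a strictly positive error for the optimizer to minimize."""
    if batch_error <= 0:
        raise ValueError("log is only defined for positive errors")
    return math.log(batch_error)


# Two results from the table in the first post (jobs 0 and 20054).
# On the log scale a given relative improvement is the same size step
# wherever it occurs, instead of shrinking as the error gets small.
print(log_objective(0.1862), log_objective(0.1222))
```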

The rest of your questions are answered inline.

On Mon, Jan 27, 2014 at 10:57 AM, Quanticles notifications@github.com wrote:

1. I had an error which caused one result to come back as "1.0" for the error. Does this break spearmint, i.e., should I filter these out when I'm running?

The Gaussian process assumes that the function you are modeling is smooth (i.e. continuous). When the training diverges and you get NaNs or arbitrarily bad results, discontinuities are introduced in the function. The regression will try to interpolate from the smooth function results to a sudden 'bad' result. You can imagine that this messes with the model - it either sets the noise really high and assumes this is a noisy result or it becomes extremely wavy to be able to account for the major change in the function. We have much more principled ways to deal with this now. Essentially we treat those bad jobs as 'constraint violations' and incorporate an extra model to model the constraint space. An early version of this is implemented in the GPConstrainedEIChooser. It basically treats any value that you return as a NaN or Inf as a constraint violation. Newer, more sophisticated versions are on the way.
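A minimal sketch of reporting failed runs as NaN, so that a chooser which treats NaN or Inf as a constraint violation can handle them. The function name `report_result` and the cutoff `max_valid_error` are assumptions made for this example; pick whatever test identifies a diverged run in your setup.

```python
import math


def report_result(batch_error, max_valid_error=0.5):
    """Turn failed or diverged runs into NaN instead of a large finite error.

    max_valid_error is a hypothetical cutoff chosen for this example.
    """
    if batch_error is None or not math.isfinite(batch_error):
        return float("nan")
    if batch_error >= max_valid_error:
        return float("nan")
    return batch_error


# Jobs 20019 and 20104 from the table would be flagged; job 20054 passes through.
print(report_result(1.0), report_result(0.7486), report_result(0.1222))
```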

2. Why does GPEIOptChooser choose only 4 elements from the grid?

This is actually because it starts to optimize points outside of the grid. I.e. it optimizes the grid points and moves them around to get better results. So this is the expected behavior.

3. The best result is for Job 20054 - why doesn't GPEIOptChooser search around there more?

This is a good question. My guess is probably the heteroscedasticity problem. Take a look at the output that spearmint produces and take note of what it reports as the 'noise'. This should be informative about what's going on.

4. Is there anything else that I can do to improve the quality of these results assuming that running a job takes a very long time?

Yes. Log space in the meantime :-) Maybe let the epochs run out a minimum number of steps such that the noise is reasonable. Use the constrained chooser to deal with diverging results. Much better code is on the way. Hope that helps. Best,

Jasper

Quanticles commented 10 years ago

Thanks for the highly detailed response, this was really helpful. I changed to using the log of the output and preventing the discontinuities and it's making much more sense now.

mechaman commented 9 years ago

Just to clarify... When you say "... project your inputs to the log-domain before passing them to the optimizer.", you mean project the output from our cost function or the inputs we provide in the config file to the "wrapper" (cost function)?

JasperSnoek commented 9 years ago

Change the input bounds of your problem to be the log of the original bounds. Then when spearmint returns a value, use the exp() of that. Hope that helps!

Jasper

Quanticles commented 9 years ago

For example, if you want to initialize weights between 0.01 and 1, but want that space to be sampled in the log domain, you can do it like this

In config.db:

    variable {
      name: "weights_init_pow10"
      type: FLOAT
      size: 1
      min: -2.0
      max: 0.0
    }

Inside your python/whatever function:

    import numpy as np

    # Convert each exponent in place: spearmint's value x becomes 10**x.
    for x in np.nditer(params['weights_init_pow10'], op_flags=['readwrite']):
        x[...] = 10 ** x

So spearmint is working with a variable on the range [-2.0,0.0], but your own python function converts it to [0.01,1] before passing it to the algorithm that you're optimizing.

mechaman commented 9 years ago

This is great. Thank you JasperSnoek and Quanticles for the prompt and concise reply :)

mechaman commented 9 years ago

In your paper "Input Warping for Bayesian Optimization of Non-stationary Functions" you mention warping the number of hidden units. How did you go about projecting a sequence of INTEGERS say from 0-9 units to log space? Hope I am not making this thread too long...

JasperSnoek commented 9 years ago

Hey Julien, no problem. In that paper we treated integers as continuous numbers within the Bayesian optimization and then rounded them off when they were returned to the user.
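The rounding step might be sketched like this. The helper name `to_integer_param` is hypothetical, and the clipping to the bounds is an extra safety measure added for this example rather than something described in the paper.

```python
def to_integer_param(value, lower, upper):
    """Round a continuous value from the optimizer and clip it to [lower, upper]."""
    return max(lower, min(upper, int(round(value))))


# The optimizer proposes a float; the model being tuned receives an integer.
hidden_units = to_integer_param(6.7, lower=0, upper=9)
print(hidden_units)  # 7
```

If the integer is also searched in log space, the lower bound has to be positive, because the log of 0 is undefined. A 0-9 range would need to be shifted first, for example to 1-10.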

Jasper

JasperSnoek / spearmint

Trying to Understand the Learning Progression #21