facebook / Ax

Adaptive Experimentation Platform
https://ax.dev

[Question] Number of trials and batches in offline optimization + tracking issue to learn of experiment outcomes #1562

Closed: Eric-verret closed this issue 7 months ago

Eric-verret commented 1 year ago

Hello Everyone,

I'm working on the development/optimization of a new fire retardant paint using the Ax API, and I would like to know if you have any advice on the number of trials per batch. My total number of observations beyond the initial points is 9, and it is simpler for us to formulate and evaluate performance in the lab in batches of at least 2 trials.

mpolson64 commented 1 year ago

Hi Eric, thank you for reaching out. The number of trials Ax needs to converge on an optimal parameterization can vary depending on the specifics of the experiment. In general our default method GPEI can optimize up to a dozen parameters in about 30 sequential trials, but the best way to validate this on your specific problem is checking whether the optimization has plateaued and checking the cross-validation plot to ensure the model fit is good. You can read more about this here https://ax.dev/tutorials/gpei_hartmann_service.html and here https://ax.dev/tutorials/visualizations.html . If you have any more specific details (how many parameters you are optimizing over, how long an offline trial takes, how long your experiment can take, any of these or other plots, etc.) we may be able to provide more specific advice for your particular experiment. cc @Balandat
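For reference, a minimal sketch of pulling up that cross-validation plot with the Service API (this assumes an existing `ax_client` with some completed trials):

```python
from ax.modelbridge.cross_validation import cross_validate
from ax.plot.diagnostic import interact_cross_validation
from ax.utils.notebook.plotting import render

# use the model most recently fit by the client's generation strategy
model = ax_client.generation_strategy.model
cv_results = cross_validate(model)
# predicted vs. observed values; points close to the diagonal indicate a good fit
render(interact_cross_validation(cv_results))
```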

Eric-verret commented 1 year ago

Hi! Thank you for your reply. About the problem: we want to optimize 7 "fire objectives" according to the chemical composition of our paint applied on timber.

Evaluation of a paint formulation

For each objective we test the paint 3 times to obtain the mean and SEM. We have a relatively high SEM because we are working with timber, which is an anisotropic material.

Very expensive objective evaluation

The evaluation of the objectives is very expensive in terms of time (3 months to evaluate 20 trials on 3 different fire benches).

Number of parallel trial and max point beyond initial point

It is easier for the lab to evaluate at least 3 trials at the same time.

Maximum number of points beyond the initial points: fewer than 15, because evaluation is very expensive.

eytan commented 1 year ago

Hi Eric, this sounds like a very interesting application!

If you have 7 objectives, you may wish to come up with some linear scalarization of these objectives, or use preference exploration to learn an objective function after you collect data from your initial batch. There is a tutorial (along with a link to the paper) for how to do this in BoTorch here: https://botorch.org/tutorials/bope. @ItsMrLin might have pointers for how to do this in Ax if you are interested (assuming the code is public...).
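If it helps, here is a rough sketch of what a hand-rolled linear scalarization could look like; the weights and metric names below are purely hypothetical placeholders:

```python
# hypothetical metric names and weights; a negative weight means smaller is better
weights = {"thr": -1.0, "phrr": -0.5, "tti": 2.0}

def scalarize(raw_outcomes: dict) -> float:
    """Collapse the measured fire objectives into a single weighted score."""
    return sum(weights[name] * value for name, value in raw_outcomes.items())

# the scalarized score would then be reported as the single objective, e.g.:
# ax_client.complete_trial(trial_index, raw_data={"fire_score": (scalarize(measured), None)})
```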

When you say "With 3 runs for each point to know the SEM", are you saying you compute the SEM off of 3 observations? This would be far too few observations to compute a SEM and you may be better off inferring the noise. IIRC @SebastianAment has been exploring what good approaches are when you only have a very limited number of replications per design.

@sgbaird might have some helpful suggestions for how to most efficiently encode your design space (see e.g., #727 ).

Eric-verret commented 1 year ago

Here is the first version of the code based on our data: https://github.com/Eric-verret/Optim_intu/blob/main/intu_opm.ipynb

Balandat commented 1 year ago

Yes this is indeed a super interesting application.

I second Eytan's point about 3 observations being too few to compute meaningful estimates of the standard error. So if it's the same cost to just try a different composition as it is to try the same composition 3 times, then I would recommend trying more compositions and letting the model infer the noise level.

Also, 7 objectives is quite a lot for hypervolume type methods. But it could work in the setting where there are very few trials and we can spend a good amount of time on computing and optimizing the acquisition function. cc @sdaulton for thoughts / tricks on this one.

I think the main challenge you'll have is interpreting the results in a 7-dimensional outcome space. As Eytan said, if you have some domain expertise (and access to people with domain expertise) who feel like they can compare the relative value of two compositions (each with 7 outcomes) in a reasonable fashion, this could be a prime use case for preference learning.

Eric-verret commented 1 year ago

Hi Alex !

Thank you for your reply! Ok, maybe we will let the model infer the noise level. We can simplify the problem to 5 or 6 outcomes if that is simpler. Yes, we have domain expertise in the lab.

And to come back to the number of parallel trials: evaluating a paint formulation takes time because the labs are located in different places in France, so it is easier for us to run as many parallel trials as possible. If the total evaluation budget beyond the initial points is 15, what is the best strategy: 3 batches of 5 parallel trials, 10 parallel + 2 parallel, or one single batch of 15 parallel trials?

PS: if you want to know more about our lab: https://umet.univ-lille.fr/Polymeres/index.php?lang=en

Balandat commented 1 year ago

If the total evaluation budget beyond the initial points is 15, what is the best strategy: 3 batches of 5 parallel trials, 10 parallel + 2 parallel, or one single batch of 15 parallel trials?

This is hard to answer in general; it really depends on how you value wall time vs. optimization performance and how amenable the problem is to parallelization. The most efficient use of information would be to do all trials fully sequentially. At the other extreme, if you have one single large parallel batch you basically can't learn anything from the results. In between there are tradeoffs; for some problems you might be ok going with larger batch sizes, for other problems the performance can degrade quite a bit. Since there are so few trials I would be careful not to use too large a batch size - I would advise starting from how many batches you can afford to do (time-wise) and then running the maximum batch size you can do.
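To make that concrete, a rough sketch of such a batched loop with the Service API (assuming an existing `ax_client` and an `evaluate_in_lab` function of your own):

```python
N_BATCHES = 5    # how many rounds you can afford time-wise
BATCH_SIZE = 3   # formulations prepared and tested in parallel per round

for _ in range(N_BATCHES):
    # ask Ax for the next batch of candidate formulations
    trials, _ = ax_client.get_next_trials(max_trials=BATCH_SIZE)
    # run them all in the lab, then report results before generating the next batch
    for trial_index, parameters in trials.items():
        ax_client.complete_trial(
            trial_index=trial_index,
            raw_data=evaluate_in_lab(parameters),  # {metric_name: (mean, sem)}
        )
```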

I don't know how you're collecting the results, but you could also use asynchronous optimization, where you generate a new candidate whenever you get some information back - this would really only make sense if the results from the batches come back at different times (e.g. b/c some lab is slower than another one).
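As a very rough sketch of what that asynchronous loop could look like (the `poll_lab_results` helper below is hypothetical, standing in for however you collect results from the benches):

```python
import time

TOTAL_BUDGET = 15
completed = 0
# kick off as many trials as the benches can handle at once
running, _ = ax_client.get_next_trials(max_trials=3)

while completed < TOTAL_BUDGET:
    for trial_index, raw_data in poll_lab_results():  # hypothetical polling helper
        ax_client.complete_trial(trial_index=trial_index, raw_data=raw_data)
        running.pop(trial_index, None)
        completed += 1
        if completed + len(running) < TOTAL_BUDGET:
            # immediately generate a replacement candidate using the new information
            new_trials, _ = ax_client.get_next_trials(max_trials=1)
            running.update(new_trials)
    time.sleep(3600)  # wait before polling the (slow) benches again
```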

sgbaird commented 1 year ago

@eytan, thanks for the cc. I'm back from a long vacation and getting back into the swing of things.

@sgbaird might have some helpful suggestions for how to most efficiently encode your design space (see e.g., #727 ).

I suggest using the linear equality --> linear inequality reparameterization mentioned in https://github.com/facebook/Ax/issues/727#issuecomment-975644304 for ease of implementation. I compared this to the case of not implementing this reparameterization for one application (DOI: 10.1016/j.commatsci.2023.112134 or see personalized share link or the preprint). In general, implementing the linear inequality constraint improved performance. I expect that explicitly enforcing the linear equality constraint would enhance performance and improve model interpretability (i.e., the last feature also gets a feature importance), but it requires explicit use of BoTorch.
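For concreteness, a minimal sketch of that reparameterization with the Service API; the parameter names, bounds, and metric below are hypothetical:

```python
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="paint_mixture",
    parameters=[
        {"name": "A", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "B", "type": "range", "bounds": [0.0, 1.0]},
        {"name": "C", "type": "range", "bounds": [0.0, 1.0]},
        # "D" is dropped from the search space and recovered below
    ],
    # inequality constraint standing in for the equality A + B + C + D == 1
    parameter_constraints=["A + B + C <= 1.0"],
    objectives={"thr": ObjectiveProperties(minimize=True)},  # hypothetical metric
)

def to_full_composition(params: dict) -> dict:
    """Recover the hidden component before formulating the paint."""
    return {**params, "D": 1.0 - params["A"] - params["B"] - params["C"]}
```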

Also, 7 objectives is quite a lot for hypervolume type methods. But it could work in the setting where there are very few trials and we can spend a good amount of time on computing and optimizing the acquisition function. cc @sdaulton for thoughts / tricks on this one.

I think the main challenge you'll have is interpreting the results in a 7-dimensional outcome space. As Eytan said, if you have some domain expertise (and access to people with domain expertise) who feel like they can compare the relative value of two compositions (each with 7 outcomes) in a reasonable fashion, this could be a prime use case for preference learning.

I think objective thresholds are very important here. I suggest picking outcome constraints based on domain knowledge. These can be chosen by asking the following question for each of your objectives:

For objective A, if all other objectives had amazing values, what is the worst allowable/viable value for objective A from an application standpoint?

Phrased conversely, what value of objective A would make the material inviable in spite of great performance for the other objectives?

Then give yourself something like a 10% tolerance on this outcome. For example, if you're minimizing objective A, and the maximum allowable value is 1.0, then set the outcome constraint to something like $y_A \le 1.0/0.9 \approx 1.11$.

Pulling from the Ax multi-objective optimization tutorial:

The reference point should be set to be slightly worse (10% is reasonable) than the worst value of each objective that a decision maker would tolerate.
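In the Service API this could look roughly like the following; the metric names and threshold values here are hypothetical:

```python
from ax.service.utils.instantiation import ObjectiveProperties

objectives = {
    # objective A is minimized; values above ~1.11 make the material inviable
    "objective_A": ObjectiveProperties(minimize=True, threshold=1.11),
    # objective B is maximized; values below 5.0 make the material inviable
    "objective_B": ObjectiveProperties(minimize=False, threshold=5.0),
}
# then: ax_client.create_experiment(name=..., parameters=..., objectives=objectives)
```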

See also:

You might also consider reformulating this as a constraint satisfaction problem (constraint active search) https://github.com/facebook/Ax/issues/930#issuecomment-1276857874, but I'll defer to the devs (cc @eytan) for whether this seems like a good fit.

I don't know how you're collecting the results, but you could also use asynchronous optimization, where you generate a new candidate whenever you get some information back - this would really only make sense if the results from the batches come back at different times (e.g. b/c some lab is slower than another one).

Seconded, but again depends on your setup. Do you mind providing some estimates of the total time and cost of running experiments with different batch sizes? The costs can be relative (e.g., 0.0 is low-cost, 1.0 is high-cost) and should incorporate the cost of the user's time. The total time refers to how long it takes to go from start to finish of the batch experiment.

Aside: asynchronous + multi-objective is non-trivial to implement https://github.com/facebook/Ax/issues/896. Do you have a workflow figure for the synthesis and characterization equipment?

Eric-verret commented 1 year ago

This is hard to answer in general; it really depends on how you value wall time vs. optimization performance and how amenable the problem is to parallelization. The most efficient use of information would be to do all trials fully sequentially. At the other extreme, if you have one single large parallel batch you basically can't learn anything from the results. In between there are tradeoffs; for some problems you might be ok going with larger batch sizes, for other problems the performance can degrade quite a bit. Since there are so few trials I would be careful not to use too large a batch size - I would advise starting from how many batches you can afford to do (time-wise) and then running the maximum batch size you can do.

Ok, maybe 5 batches of 3 parallel tests would be a good trade-off.

Seconded, but again depends on your setup. Do you mind providing some estimates of the total time and cost of running experiments with different batch sizes? The costs can be relative (e.g., 0.0 is low-cost, 1.0 is high-cost) and should incorporate the cost of the user's time. The total time refers to how long it takes to go from start to finish of the batch experiment.

I evaluate the flammability of the materials with 3 different fire benches, and one of them is located elsewhere in France, so I need to send the samples to that lab. A bigger batch size can help us save a lot of time because one of the tests is much slower to evaluate.

When I run my code several times on the same initial data, the algorithm does not always give the same value for the "next experiment". Is this normal? And how can this be explained?

Balandat commented 1 year ago

When I run my code several times on the same initial data, the algorithm does not always give the same value for the "next experiment". Is this normal? And how can this be explained?

Yes, this is to be expected - both fitting the models and optimizing the acquisition functions involve solving non-convex optimization problems, so it's possible to get stuck in local minima. We do a good amount of work under the hood with random restarts etc. to make that less likely, but it can still happen. This doesn't necessarily need to be a bad thing. The non-determinism comes from the randomized initialization that we use for multi-start gradient descent to solve these problems. It is possible to pass a random_seed to the AxClient that should result in deterministic initializations and thus deterministic optimization behavior.
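For example (a minimal sketch):

```python
from ax.service.ax_client import AxClient

# fixing the seed makes the multi-start initialization (and hence the suggested
# candidates) reproducible across runs
ax_client = AxClient(random_seed=12345)
```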

Eric-verret commented 1 year ago

Noise inference

After a discussion with @sgbaird, I was wondering whether it is better to let the system infer the noise or to run the code with all available data (3 repetitions for each formulation)?

Mixture design constraint

I am facing a mixture design problem, so A + B + C + D = 1. For the parameter constraints I used:

`composition_constraint_1 = separator.join(lst_name_input[:]) + " >= 1"`
`composition_constraint_2 = separator.join(lst_name_input[:]) + " <= 1"`
`filler_constraint = separator.join(lst_name_filler[:]) + " <= 0.54"`

Do you think this double inequality constraint is a good choice?

Next experiment

When I run the code with 5 parallel trials, almost the same experiment is suggested and the sum of the xᵢ ≠ 1 (see #1635 for this last point).

Data for the next suggested experiments:

| A | B | C | D | E | F |
| -- | -- | -- | -- | -- | -- |
| 0.399488 | 0.129256 | 0.224701 | 0.036112 | 0.148114 | 0.062328 |
| 0.399444 | 0.129214 | 0.224676 | 0.036406 | 0.147938 | 0.062322 |
| 0.398700 | 0.129202 | 0.224635 | 0.038193 | 0.147237 | 0.062033 |
| 0.399479 | 0.129226 | 0.224751 | 0.035881 | 0.148285 | 0.062378 |
| 0.399233 | 0.129165 | 0.224703 | 0.036857 | 0.147757 | 0.062284 |

For more detail, here is the Google Colab code: https://colab.research.google.com/drive/1TwJjgfPaw4oo1xb0XIAZOa_nBIdLPgPL?usp=sharing

Balandat commented 1 year ago

After a discussion with @sgbaird, I was wondering whether it is better to let the system infer the noise or to run the code with all available data (3 repetitions for each formulation)?

Hmm I'm not sure I understand this question - You'd want to use all the data that is available, in which case the model can use the repeated observations to better infer the noise level.

Do you think this double inequality constraint is a good choice?

Not really, this is essentially a hack around the fact that we're not exposing exact equality constraints in the Ax APIs (Feature Request is here: https://github.com/facebook/Ax/issues/510). Doing so is going to cause issues as @sgbaird observed here: https://github.com/facebook/Ax/issues/510#issuecomment-974538098. The proper thing to do would in fact be to hook equality constraints up to Ax, but we unfortunately haven't had the bandwidth to work on this. But if you or someone else wanted to take a stab we'd be happy to help :)
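For reference, the underlying BoTorch optimizer already accepts exact equality constraints, so "hooking this up" would boil down to passing something like the following when optimizing the acquisition function (a sketch; `acqf` and the 4-parameter search space are assumed for illustration):

```python
import torch
from botorch.optim import optimize_acqf

# enforce x_0 + x_1 + x_2 + x_3 == 1 over a 4-parameter mixture
equality_constraints = [
    (
        torch.tensor([0, 1, 2, 3]),         # parameter indices
        torch.ones(4, dtype=torch.double),  # coefficients
        1.0,                                # right-hand side
    )
]
candidates, acq_value = optimize_acqf(
    acq_function=acqf,  # an already-constructed acquisition function
    bounds=torch.stack(
        [torch.zeros(4, dtype=torch.double), torch.ones(4, dtype=torch.double)]
    ),
    q=5,
    num_restarts=20,
    raw_samples=512,
    equality_constraints=equality_constraints,
)
```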

When I run the code with 5 parallel trials, almost the same experiment is suggested and the sum of the xᵢ ≠ 1

Hmm this is interesting; it may be that this is a result of the above hack; so I'd want to address that first before digging much deeper into this.

Eric-verret commented 1 year ago

Hmm I'm not sure I understand this question - You'd want to use all the data that is available, in which case the model can use the repeated observations to better infer the noise level.

Ok! Using the reparameterization as an inequality constraint (making one variable "hidden") makes more sense then.

Hmm I'm not sure I understand this question - You'd want to use all the data that is available, in which case the model can use the repeated observations to better infer the noise level.

For the first step I have 20 points, each with 3 runs (plot below). Does it make sense to train the model using all points (so 60 points), or to train it just using the means (20 points) and SEMs?

[Figure: barplot_THR]

Balandat commented 1 year ago

For the first step I have 20 points, each with 3 runs (plot below). Does it make sense to train the model using all points (so 60 points), or to train it just using the means (20 points) and SEMs?

Either should be fine in this case.

If you use the raw points you won't have any variance estimates on the individual observations so we'd have the model infer a noise level. The variance looks relatively consistent across your evaluations so that should work fine.

But alternatively you can also pass in the means and SEMs. Doing this would be useful if your noise had high heteroskedasticity - i.e., varied across the different points. The error bars may suggest this, but as Eytan mentioned above the main issue here is that the SEM estimate itself is going to have very high variance with only 3 observations, so this may itself just be noise in the data.
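Concretely, the two options could look roughly like this in the Service API (the metric name "thr" and the numbers are hypothetical):

```python
# (a) aggregated: report the mean and the SEM computed from the 3 repetitions
ax_client.complete_trial(trial_index=0, raw_data={"thr": (12.3, 1.4)})

# (b) let the model infer the noise level: report the mean with SEM set to None
ax_client.complete_trial(trial_index=1, raw_data={"thr": (12.3, None)})
```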

Eric-verret commented 1 year ago

Ok! Thank you for your comments. It's time for me to go to the lab and test this active learning loop on my coating. I will let you know the results.

sgbaird commented 1 year ago

@Eric-verret excited to hear how it goes!

lena-kashtelyan commented 1 year ago

Looking forward to learning how this went @Eric-verret : )

Eric-verret commented 1 year ago

Looking forward to learning how this went @Eric-verret : )

Hello !

I'm still in the lab; I will have the results of the first loop at the end of August (after the university's summer break)!

Eric-verret commented 1 year ago

Hi everyone !

Here are the results of the first iteration; I still need to run one more test.

The first iteration looks like exploration rather than exploitation, especially when looking at the distant points in the design space (in the figure, the red points are the 1st iteration). So we are planning to run at least 3 more iterations of 5 coatings in parallel.

lena-kashtelyan commented 1 year ago

cc @eytan, @Balandat, @esantorella (current M&O oncall)

esantorella commented 7 months ago

Closing as inactive, but feel free to keep commenting or reopen as needed.