facebook / Ax

Adaptive Experimentation Platform
https://ax.dev

Specifying distances for categorical variables #991

Closed: bernardo-suez closed this issue 2 years ago

bernardo-suez commented 2 years ago

I am trying to optimize a mix of categorical, discrete (ordered) and continuous variables. Each of my categorical variables can take one of 500 to 1,000 values, so one-hot encoding is probably out of the question.

The good thing is that these two categorical variables are points in space (possible sites for building elevated tanks for a water utility), so the physical distance between them can be a good measure of similarity. Would using a custom kernel be a good approach for this? If so, is there a tutorial about how to implement this type of problem in either Ax or BoTorch?

My decision variables are:

  1. junction_name_1
  2. tank_ground_elev_1
  3. diameter_1
  4. junction_name_2
  5. tank_ground_elev_2
  6. diameter_2

I've looked into this issue and this issue but wasn't able to get enough out of them to solve my problem.

Balandat commented 2 years ago

Do you have the spatial coordinates of the locations (I assume these are described by the junction_name_* variables)? In that case I would build the model over the spatial location (x/y coordinates), and then deal with the discreteness of the options (water tower NIMBYs...) by just enumerating the options during the optimization.

Can you specify the problem in some more detail? What exactly are the types and ranges of the decision variables? How many locations are you trying to select (or is that in itself a variable, i.e. you could decide to build 2 large towers or 4 small ones)?

bernardo-suez commented 2 years ago

@Balandat, I do have the spatial coordinates of the junctions (i.e., the locations or sites where a tank could be built), but they are very unevenly distributed and not laid out on a grid. Here's a bit more detail:

  1. junction_name_1: the junction (a named location with x,y coordinates) at which to build the first tank. Categorical variable with 548 options in this particular application.
  2. tank_ground_elev_1: elevation of the base of the first elevated tank. Continuous variable ranging from 0 to 100 ft.
  3. diameter_1: diameter of the first elevated tank. Discrete variable that can take three values: 74, 86, 93 ft.
  4. junction_name_2: the junction (a named location with x,y coordinates) at which to build the second tank. Categorical variable with 548 options in this particular application.
  5. tank_ground_elev_2: elevation of the base of the second elevated tank. Continuous variable ranging from 0 to 100 ft.
  6. diameter_2: diameter of the second elevated tank. Discrete variable that can take three values: 74, 86, 93 ft.

There are large pockets of land with no viable sites and others with several. I could use x_1, y_1, x_2, and y_2 as continuous variables replacing the categorical junction_name variables and just choose the site closest to each x,y pair, which would be equivalent to using Voronoi polygons (see the snippet below). This wouldn't be much of a problem in regions with many sites, but it may make the algorithm get stuck in pockets with few, sparsely distributed sites, no? At the same time, the distances between the points are a good measure of similarity (points close together probably have similar water demands, ground elevations, etc.), so it's better not to distort them.
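For concreteness, here is a minimal sketch of that nearest-site lookup; site_coords (an (N, 2) array of junction x/y coordinates) and site_names (the matching list of junction names) are placeholders for my actual data:

```python
# Sketch of the "snap to the nearest site" idea (equivalent to a Voronoi
# assignment). site_coords and site_names are assumed to already exist.
from scipy.spatial import cKDTree

tree = cKDTree(site_coords)  # site_coords: (N, 2) array of junction x/y

def snap_to_site(x, y):
    """Map a continuous (x, y) proposal to the closest viable junction."""
    _, idx = tree.query([x, y])
    return site_names[idx], site_coords[idx]
```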

We are planning on building either one or two tanks, but I thought it would be easier to run these as two different optimization problems, unless you think it wouldn't be much harder to have the number of tanks (one or two) as a decision variable as well. The simplest form of the function I'm trying to optimize would be simulate_pair_of_tanks(junction_name_1, tank_ground_elev_1, diameter_1, junction_name_2, tank_ground_elev_2, diameter_2), or simulate_pair_of_tanks(x_1, y_1, tank_ground_elev_1, diameter_1, x_2, y_2, tank_ground_elev_2, diameter_2) with x_1, y_1, x_2, and y_2 continuous (a rough sketch of this setup follows the list of objectives below).

I have four objectives, which are:

  1. The summation of the volumes of both tanks (maximize)
  2. The total cost (minimize)
  3. The minimum pressure in the system (maximize)
  4. The distance between tanks (maximize)
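As a rough sketch of what the x/y formulation could look like with the Ax Service API (the bounding box numbers and metric names are just placeholders, and simulate_pair_of_tanks is assumed to return a dict keyed by those metric names):

```python
# Rough sketch only: bounding box values, metric names, and
# simulate_pair_of_tanks() are placeholders.
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="two_tank_siting",
    parameters=[
        {"name": "x_1", "type": "range", "bounds": [0.0, 10_000.0]},
        {"name": "y_1", "type": "range", "bounds": [0.0, 10_000.0]},
        {"name": "tank_ground_elev_1", "type": "range", "bounds": [0.0, 100.0]},
        {"name": "diameter_1", "type": "choice", "values": [74.0, 86.0, 93.0], "is_ordered": True},
        {"name": "x_2", "type": "range", "bounds": [0.0, 10_000.0]},
        {"name": "y_2", "type": "range", "bounds": [0.0, 10_000.0]},
        {"name": "tank_ground_elev_2", "type": "range", "bounds": [0.0, 100.0]},
        {"name": "diameter_2", "type": "choice", "values": [74.0, 86.0, 93.0], "is_ordered": True},
    ],
    objectives={
        "total_volume": ObjectiveProperties(minimize=False),
        "total_cost": ObjectiveProperties(minimize=True),
        "min_pressure": ObjectiveProperties(minimize=False),
        "tank_distance": ObjectiveProperties(minimize=False),
    },
)

# One iteration of the optimization loop; each (x, y) pair would be snapped
# to the nearest junction inside the simulation, as described above.
parameters, trial_index = ax_client.get_next_trial()
results = simulate_pair_of_tanks(**parameters)  # dict keyed by the metric names above
ax_client.complete_trial(trial_index=trial_index, raw_data=results)
```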

PS: The NIMBY issue is for high management to deal with based on the Pareto set I give them. They're paid more than me exactly to deal with this kind of stuff ;)

Balandat commented 2 years ago

Cool, this is a really interesting problem. How quickly can you run the simulate_pair_of_tanks function? I'm asking because, due to the complexity of the optimization domain and the number of objectives, the acquisition function optimization in multi-objective Bayesian Optimization will likely take quite a while. So if the simulation runs really fast, it could be the case that some other approaches (such as evolutionary algorithms) might be the better choice here.

This wouldn't be much of a problem in regions with many sites, but may make the algorithm get stuck in pockets with few, sparsely distributed sites, no?

Yup, this is a concern.

We are planning on building either one or two tanks, but I thought it would be easier if I ran these as two different optimization problems

Yeah that makes sense. The two-tank design problem is quite a bit harder (see below) so I'd start with the single tank one first.

The NIMBY issue is for high management to deal with

Glorious. Usually shit rolls downhill, glad to see it going the other way for once :)

If the simulation is indeed costly and using MOO BO makes sense, then my recommendation for the one-tank problem would be the following:

  1. Build the model based on the x/y coordinates; that way the model can properly account for the spatial correlation (after all, that's what the underlying Gaussian Process models we use under the hood were originally developed for). To do this, just define the parameters as continuous parameters with the ranges given by the bounding box over your area.
  2. Optimize the acquisition function over the diameter and tank elevation for each of the 548 location choices. Essentially, if you're using the developer API, this amounts to looping through gen() on the modelbridge 548 times, each time with fixed_features selecting a different x/y location, and then picking the configuration with the highest acquisition value (I'm not actually sure we return this as part of the GeneratorRun object; worst case you have to re-evaluate it by sticking the respective parameterizations into the evaluate_acquisition_function API), roughly as in the sketch after this list. [note to self: we may want to make this easier in the future by allowing a list of fixed_features to be passed that we loop over internally]
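A minimal sketch of step 2 with the developer API, assuming an existing experiment, data, and a dict junction_xy mapping junction names to (x, y) coordinates; the parameter names and the exact evaluate_acquisition_function signature are assumptions to check against your Ax version:

```python
# Sketch only: experiment, data, and junction_xy are assumed to exist, and
# the model is built over continuous "x"/"y" parameters as in step 1.
from ax.core.observation import ObservationFeatures
from ax.modelbridge.registry import Models

model_bridge = Models.MOO(experiment=experiment, data=data)

best_name, best_run, best_acq = None, None, float("-inf")
for name, (x, y) in junction_xy.items():
    # Fix the spatial coordinates to this junction; the acquisition
    # optimization then only runs over diameter and tank elevation.
    run = model_bridge.gen(
        n=1,
        fixed_features=ObservationFeatures(parameters={"x": x, "y": y}),
    )
    # Re-score the proposed arm, since the acquisition value may not be
    # attached to the GeneratorRun itself.
    (acq,) = model_bridge.evaluate_acquisition_function(
        observation_features=[
            ObservationFeatures(parameters=run.arms[0].parameters)
        ]
    )
    if acq > best_acq:
        best_name, best_run, best_acq = name, run, acq

# best_run.arms[0] is the candidate to evaluate next, sited at junction best_name.
```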

The two-tank problem is trickier. Presumably you can't really use the simulation results from the single-tank simulation, so you'd have to re-simulate anyway, and now add not just a single pair of x/y parameters but two of them. From a modeling perspective that shouldn't cause any issues, but if you want to optimize the acquisition function in the same way, you'd now have to optimize over on the order of 550^2 (~300K) possible configurations in each step (for each of which you'd have to solve a continuous optimization problem), which is clearly not scalable.

I haven't thought much about this, but in that case it might be most reasonable to resort to discretizing the continuous parameters as well and then solving a discrete problem (one could always fine-tune around a discrete optimum at the end). To do this you could use an approach that combines random sampling with local refinement search. We have some version of this implemented in BoTorch (https://github.com/pytorch/botorch/blob/0afcf355177fa15c34af3475a00811a3628eda31/botorch/optim/optimize.py#L710), but it wouldn't directly apply here since the x/y locations aren't on a regular grid. Basically, one could do the same thing by implementing two components: (i) a random sampler of discrete configurations, and (ii) a method for generating neighbors; a rough sketch is below. Hooking all this up into Ax would require a lot of work, though.
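Purely to illustrate those two components (this is not the BoTorch routine linked above), a sketch of random sampling plus greedy neighbor refinement over the discretized two-tank space might look like this; acq(), the junction list, and the precomputed nearest-neighbor map are placeholders:

```python
# Illustrative sketch only. acq(config) stands in for whatever scoring
# function (e.g. an acquisition value) you plug in; junctions is the list of
# junction names and nearest[j] a precomputed list of junctions closest to j.
import random

DIAMETERS = [74, 86, 93]
ELEVATIONS = list(range(0, 101, 10))  # tank_ground_elev, discretized to 10 ft steps


def sample_config(junctions):
    """Component (i): draw a random discrete two-tank configuration."""
    return {
        "junction_1": random.choice(junctions),
        "elev_1": random.choice(ELEVATIONS),
        "diam_1": random.choice(DIAMETERS),
        "junction_2": random.choice(junctions),
        "elev_2": random.choice(ELEVATIONS),
        "diam_2": random.choice(DIAMETERS),
    }


def neighbors(config, nearest):
    """Component (ii): configurations that change one variable at a time.

    Junctions move to spatially nearby sites; elevations and diameters
    move to any other allowed value.
    """
    moves = [
        ("junction_1", nearest[config["junction_1"]]),
        ("junction_2", nearest[config["junction_2"]]),
        ("elev_1", ELEVATIONS),
        ("elev_2", ELEVATIONS),
        ("diam_1", DIAMETERS),
        ("diam_2", DIAMETERS),
    ]
    for key, choices in moves:
        for value in choices:
            if value != config[key]:
                yield {**config, key: value}


def local_search(acq, junctions, nearest, n_starts=20, n_steps=50):
    """Random restarts followed by greedy hill climbing over neighbors."""
    best, best_val = None, float("-inf")
    for _ in range(n_starts):
        cur = sample_config(junctions)
        cur_val = acq(cur)
        for _ in range(n_steps):
            cand = max(neighbors(cur, nearest), key=acq, default=None)
            if cand is None:
                break
            cand_val = acq(cand)
            if cand_val <= cur_val:
                break  # local optimum reached
            cur, cur_val = cand, cand_val
        if cur_val > best_val:
            best, best_val = cur, cur_val
    return best, best_val
```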

Overall, it's just a pretty challenging problem you've got on your hands here.

cc @dme65

bernardo-suez commented 2 years ago

@Balandat, just to let you know that I haven't ditched the issue. I'm still implementing and evaluating different options and will get back here when I'm done. Thanks again for the help.

lena-kashtelyan commented 2 years ago

Marking this issue as needing more info for now until we hear back from @bernardo-suez!

bernardo-suez commented 2 years ago

Hey guys, sorry about the delay. In the end I could not get the GP model to approximate my objective functions, so I simplified the decision space and did a full search (I also couldn't find an MOEA that was parallelizable enough). I guess my problem is just not smooth or continuous enough to work well with GPs.

Also, I (or my boss) judged that waiting three days for the runs to complete while I did other tasks made more sense than getting into BoTorch now, given that we were in a pinch and this was a one-time problem.

Thanks for the help and for all the time you spent writing suggestions, though! I really appreciate it.