QUVA-Lab / COMBO


Applying COMBO to own dataset #4

Open nataliebarcickikas opened 4 years ago

nataliebarcickikas commented 4 years ago

I would like to try your method on my own dataset with my own objective function. There isn't much documentation in the repo about the format and structure needed to do this. Could you please provide some more details on how to set this up?

ChangYong-Oh commented 4 years ago

Hi.

'pestcontrol' will be a good example to start with.

https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/main.py#L28 In order to run COMBO, you need to call this function with a properly defined 'objective' argument.

In order to define your own objective, you should define a class similar to https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/experiments/test_functions/multiple_categorical.py#L123 An object (not a class), e.g. PestControl(), is given as the 'objective' argument.

As shown in https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/experiments/test_functions/multiple_categorical.py#L123 the objective class should have the members 1) n_vertices, 2) adjacency_mat, 3) fourier_freq, 4) fourier_basis. For example, if your problem has 5 categorical variables C1, ..., C5 and each Ci has Ki categories, then self.n_vertices = np.array([K1, K2, K3, K4, K5]).

The initial points with which COMBO begins can be chosen using prior knowledge if it is available; otherwise, you can set them randomly. Likewise, if you have prior knowledge about the relations between category values, it can be encoded in the adjacency matrices. You can use weighted undirected graphs, even though all experiments in the paper assume UNWEIGHTED complete graphs for nominal variables and UNWEIGHTED path graphs for ordinal variables.
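As a rough sketch (the member names n_vertices, adjacency_mat, fourier_freq, fourier_basis follow the PestControl example; the evaluate method and exact tensor layouts are assumptions based on that example, so check multiple_categorical.py for the authoritative signatures), a custom objective might look like:

```python
import numpy as np
import torch

class MyObjective:
    """Sketch of a COMBO objective with a few categorical variables.

    Member names follow the PestControl example; the evaluate method
    and tensor layouts are assumptions based on that example.
    """

    def __init__(self, n_categories=(3, 4, 5)):
        self.n_vertices = np.array(n_categories)
        self.adjacency_mat = []
        self.fourier_freq = []
        self.fourier_basis = []
        for k in self.n_vertices:
            k = int(k)
            # Unweighted complete graph on k vertices (nominal variable)
            adj = torch.ones(k, k) - torch.eye(k)
            self.adjacency_mat.append(adj)
            # Graph Laplacian L = D - A; its eigenvalues are the Fourier
            # frequencies and its eigenvectors the Fourier basis
            laplacian = torch.diag(adj.sum(dim=0)) - adj
            freq, basis = torch.linalg.eigh(laplacian)
            self.fourier_freq.append(freq)
            self.fourier_basis.append(basis)

    def evaluate(self, x):
        # x: 1-D tensor of category indices; placeholder objective value
        return torch.sum(x.float()).view(1, 1)
```

An instance (e.g. MyObjective((3, 4, 5))) would then be passed as the 'objective' argument, just as PestControl() is in the pestcontrol example.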


nataliebarcickikas commented 4 years ago

Where is the number of categorical variables accounted for? In the case of the pest control problem, would that correspond to PESTCONTROL_N_STAGES = 5 in the following line? It is a little unclear because your paper uses both the notion of stations and that of stages for this application.

self.n_vertices = np.array([PESTCONTROL_N_CHOICE] * PESTCONTROL_N_STAGES)

ChangYong-Oh commented 4 years ago

In pestcontrol, there are 25 (PESTCONTROL_N_STAGES) categorical variables, each of which has 5 (PESTCONTROL_N_CHOICE) categories. Thus self.n_vertices = np.array([5, 5, ..., 5]) with self.n_vertices.shape == (25,). In this example, the numbers of categories for each categorical variable are the same, but different numbers of categories are possible.

As another example, if you have 3 categorical variables with, say, 3, 7, and 4 categories, then self.n_vertices = np.array([3, 7, 4]) with shape (3,).

I used 'stage' for contamination control and 'station' for pestcontrol. Both are basically the same.
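To make the shapes concrete, here is a short sanity check (the constant values are copied from the pestcontrol example):

```python
import numpy as np

# Constants as in the pestcontrol example
PESTCONTROL_N_CHOICE = 5    # categories per variable
PESTCONTROL_N_STAGES = 25   # number of categorical variables

n_vertices = np.array([PESTCONTROL_N_CHOICE] * PESTCONTROL_N_STAGES)
print(n_vertices.shape)  # (25,)

# Variables with different numbers of categories are equally valid:
mixed = np.array([3, 7, 4])
print(mixed.shape)       # (3,)
```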


nataliebarcicki commented 4 years ago

I am still working with your code and would like to modify the initialization of the experiment. Specifically, I would like to initialize everything to "on", i.e. a vector of all 1s, and then explore the input space from there, rather than randomly switching some nodes on and some off at the beginning. I assume I control this in "experiment_configuration.py". I wrote a new function below, but it constantly generates all-"on" vectors during the optimization, rather than only the first input being all "on". Can you point me to how to modify the starting point? I think my confusion stems from my lack of understanding of the argument "n_points" and the corresponding for-loop.

def allon_init_points(n_vertices, n_points, random_seed=None):
    """

    :param n_vertices: 1D array
    :param n_points:
    :param random_seed:
    :return:
    """
    if random_seed is not None:
        rng_state = torch.get_rng_state()
        torch.manual_seed(random_seed)
    init_points = torch.empty(0).long()
    for _ in range(n_points):
        init_points = torch.cat([init_points, torch.cat([torch.randint(1, 2, (1, 1)) for elm in n_vertices], dim=1)], dim=0)  # silly way to make it a vector of all 1s
    if random_seed is not None:
        torch.set_rng_state(rng_state)
    return init_points

This function then gets called in my class definition as follows:

self.suggested_init = torch.empty(0).long()
self.suggested_init = torch.cat([self.suggested_init, allon_init_points(self.n_vertices, 20 - self.suggested_init.size(0), random_seed).long()], dim=0)

But I don't understand the code entirely.

ChangYong-Oh commented 4 years ago

Hi, Natalie.

If you want to set the first categorical variable to '1' in all your initial points and leave the others random (for example, in the Ising problem), then you can simply insert the line self.suggested_init[:, 0] = 1 (setting the first categorical variable to '1' in every initial point) after https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/experiments/test_functions/binary_categorical.py#L100 where the random initial points are generated.

Hope this helps. Please let me know if you have further questions.

Changyong Oh


nataliebarcicki commented 4 years ago

Thanks for your quick response. I think you misunderstood.

Let's say I have 10 categorical variables. I want them all to take the value '1' initially, and then for the Bayesian optimization to explore the space from there. So it would be:

Step 0: [1 1 1 1 1 1 1 1 1 1]
Step 1: [1 0 1 1 1 0 1 1 1 0] (for example)
Step 2: [1 1 0 1 1 1 1 0 0 1]

and so on...

Would it simply be adding the line self.suggested_init[0, 1] = 1 (making ALL categorical variables '1') after https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/experiments/test_functions/binary_categorical.py#L100

I guess part of my confusion stems from the fact that I don't understand why we specify all the initializations anyway; shouldn't the search be guided by the acquisition function? I want the algorithm to "learn" from the initial starting point of all 1s and be more inclined to stay at all 1s, since I know that the global optimum lies close to all 1s in my case. Shouldn't the acquisition function guide the next set of points to explore, instead of the user?

ChangYong-Oh commented 4 years ago

In order to set the whole first initial point to '1', you can just call self.suggested_init[0, :] = 1 (making the first row have all '1' values).

For example, here https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/experiments/test_functions/binary_categorical.py#L100 you can see 20 - self.suggested_init.size(0), which makes 20 initial points in total. In the line above (line 99), self.suggested_init is just an empty tensor, so I didn't specify any preferred initial points.

As you said, you can provide your preferred initial points in self.suggested_init (e.g. just one initial point [1, ..., 1]). In that case, I guess the first several points suggested by BO will behave quite randomly, because BO needs some information (bad or good) to construct a model of the function to be optimized. So it is quite common to use a certain number of random points plus (if there are any) some preferred points as initial points. However, as long as you provide at least one initial point, it is OK.
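One way such a mix of preferred and random initial points could be assembled is sketched below. Note that sample_init_points here is a simplified stand-in for the helper of the same name in experiment_configuration.py; treat its exact signature as an assumption and check the actual file.

```python
import torch

def sample_init_points(n_vertices, n_points, random_seed=None):
    # Simplified stand-in: one uniformly random category per variable
    if random_seed is not None:
        rng_state = torch.get_rng_state()
        torch.manual_seed(random_seed)
    init_points = torch.stack(
        [torch.randint(0, int(k), (n_points,)) for k in n_vertices], dim=1
    ).long()
    if random_seed is not None:
        torch.set_rng_state(rng_state)
    return init_points

n_vertices = [2] * 10  # e.g. 10 binary variables

# One preferred initial point [1, ..., 1] followed by 19 random points
suggested_init = torch.cat(
    [torch.ones(1, len(n_vertices)).long(),
     sample_init_points(n_vertices, 19, random_seed=0)],
    dim=0,
)
print(suggested_init.shape)  # torch.Size([20, 10])
```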

COMBO does not provide a way to encourage a search in a certain region. Conceptually, the acquisition function will suggest a point with the potential to be optimal, no matter how close it is to [1, ..., 1].

This can be indirectly encouraged by making the acquisition function start its optimization near a certain point. For example, at https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/acquisition/acquisition_optimizers/starting_points.py#L12 I use spray points, which are points near the currently available optimum, to make BO search around that optimum. You may tweak this part to make optim_init (not BO's initial points, but the initial points from which the acquisition optimization starts) pick points in the region of your interest.

Thanks.

Changyong Oh


nataliebarcickikas commented 4 years ago

Hi Changyong,

One bottleneck in my simulations seems to be related to posterior sampling, specifically done in this line: https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/main.py#L100 The sampling takes a long time, as shown in the timestamps below:

(11:11:52) Sampling
(11:16:58) 10% (1 of 10)
(11:22:16) 20% (2 of 10)
(11:27:31) 30% (3 of 10)
(11:32:46) 40% (4 of 10)
(11:38:01) 50% (5 of 10)
(11:43:22) 60% (6 of 10)
(11:48:32) 70% (7 of 10)
(11:53:45) 80% (8 of 10)
(11:58:56) 90% (9 of 10)
(12:04:08) 100% (10 of 10)

Can you explain what is happening in this series of steps and how one might be able to speed it up? When I open the corresponding function "posterior_sampling", I see there is a comment about what is good practice or not, but I am having trouble deciphering it. Thanks a lot!

Best, Natalie

ChangYong-Oh commented 4 years ago

Hi, Natalie

You can ignore that comment; it is about learning the underlying graph structure, which is not supported.

In your experiment, collecting 10 samples takes almost an hour?

COMBO uses elementwise slice sampling; "elementwise" means that each beta is updated while the others are fixed (like Gibbs sampling). Therefore, if you have K categorical variables, the complexity is O(K^2); this needs to be checked, but I think it is more or less true. If your problem has a large number of categorical variables, like several hundred, it may take that long (this is a limitation of COMBO). Also, if your Python is set up without MKL, the sampling can be sped up further by using MKL. The largest number of categorical variables tested in the paper is 60.

If you want to speed up the sampling process itself, that is another piece of work and requires some significant changes to the current COMBO implementation. A possible option would be marginal likelihood optimization instead of sampling, but sampling is known to perform better in terms of BO performance. You could also consider HMC (or a variant), or non-elementwise slice sampling.

Thanks


nataliebarcickikas commented 4 years ago

Hi ChangYong,

Thanks a lot for the clarification. Unfortunately, I am indeed using a couple of hundred categorical variables, so this step is a real bottleneck... Can the elementwise slice sampling loop be parallelized using multiprocessing? I see that in each sampler, model.kernel's fourier_freq_list and fourier_basis_list are updated, so I assume not.

Best, Natalie

nataliebarcickikas commented 4 years ago

Just following up on the above: is parallelizing the elementwise slice sampling loop mathematically sound within the algorithm?

Best, Natalie

ChangYong-Oh commented 4 years ago

The elementwise slice sampling is nothing but Gibbs sampling using a sequence of one-dimensional slice samplers. For the sampling to form a valid Markov chain, the one-dimensional slice samplers cannot be parallelized, because each one-dimensional sampler uses the updated values of the previous ones. One possible option is not to use the elementwise version, though this may require some modification of the code. For example, when you have parameters b1, ..., bK, the elementwise slice sampler, assuming the parameters are updated in the given order, proceeds as (b1, b2, ..., bK) -> (b'1, b2, ..., bK) -> (b'1, b'2, ..., bK) -> ... -> (b'1, b'2, ..., b'K). Another option is first to choose a random direction d in R^K and then perform a single one-dimensional slice sampling step to update (b1, b2, ..., bK) -> (b1, b2, ..., bK) + \theta d, where \theta is chosen by the one-dimensional slice sampler.

This will reduce the sampling cost from O(K^2) to O(K), but I am not sure how good the collected sampling with this method will be compared to the elementwise slice sampling.

https://github.com/QUVA-Lab/COMBO/blob/e16358d706525874fb3b1a79c36b6200d4689ed9/COMBO/graphGP/sampler/sample_posterior.py#L54 This is where the elementwise slice sampling is used; 'shuffled_beta_ind' shuffles the order in which the one-dimensional slice sampling is applied. You can modify this part to use so-called random-direction slice sampling.
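The random-direction variant described above could be sketched as follows. This is a simplified illustration (standard stepping-out and shrinkage slice sampling), not COMBO's actual sampler, and log_post stands for whatever unnormalized log posterior of the betas is used.

```python
import numpy as np

def random_direction_slice_update(log_post, beta, rng, w=1.0, max_steps=20):
    """One slice-sampling update of beta (shape (K,)) along a random direction.

    Replaces K elementwise one-dimensional updates per sweep with a
    single one-dimensional update along a random unit direction d in R^K.
    """
    K = beta.shape[0]
    d = rng.standard_normal(K)
    d /= np.linalg.norm(d)                           # random unit direction

    log_y = log_post(beta) + np.log(rng.uniform())   # slice level under current density
    # Step out: grow an interval [lo, hi] in theta around 0 along d
    lo = -w * rng.uniform()
    hi = lo + w
    for _ in range(max_steps):
        if log_post(beta + lo * d) < log_y:
            break
        lo -= w
    for _ in range(max_steps):
        if log_post(beta + hi * d) < log_y:
            break
        hi += w
    # Shrink: sample theta until the proposal lies inside the slice
    while True:
        theta = rng.uniform(lo, hi)
        proposal = beta + theta * d
        if log_post(proposal) >= log_y:
            return proposal
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
```

Each call performs a single one-dimensional update instead of K of them, matching the O(K^2) -> O(K) reduction mentioned above; how the sample quality compares to the elementwise version would need to be checked empirically.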

Thanks

Changyong Oh
